# Effective Data Visualization for other Humans

### Choosing the Right Visual Matters

However accurate the underlying data, the wrong chart can send a misleading and potentially harmful message to your audience. We have a responsibility to uphold the integrity of the messages we send. The wrong visual can also misrepresent otherwise accurate information. We can minimize these errors by matching our data to the correct type of visualization.

#### Bar Charts for Comparing Numeric Quantities

A practical way to compare quantities across categories is a bar chart. In our data set, we can compare how many occupations fall under each occupation type.

Bar charts are easy to interpret and allow the end user to quickly compare the number or frequency of different discrete categories of data. Discrete data refers to something we count rather than measure. For example, we have six occupations of type wc. We can’t have 6.3333 of them.

Here’s the code used to generate the bar graph in Python 3.

`from matplotlib import pyplot as plt`
`typeCounts = df['type'].value_counts()`
`typeCounts.plot(kind='bar', title='Type Counts')`
`plt.xlabel('Type')`
`plt.ylabel('Number of Occupation Type')`
`plt.show()`

#### Histograms for Visualizing Distributions for Continuous Data

Continuous data refers to measurements along a scale. In contrast to discrete data, continuous data is represented by values that are measured rather than counted. In our data set, income, education, and prestige are recorded as percentages, each quantifying a measurement for an occupation in the sample.

A histogram shows the distribution, or shape, of a set of continuous data. Here, it shows our viewers which ranges of the high school graduation percentage occur most frequently among the occupations.

Here’s the code used to generate the histogram in Python 3.

`from matplotlib import pyplot as plt`
`df['education'].plot.hist(title='High School Graduates', bins=19)`
`plt.xlabel('Percentage')`
`plt.ylabel('Frequency')`
`plt.show()`
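To make the binning concrete, the sketch below counts how many values fall into each bin, which is what the bars of a histogram display. The percentages are made-up sample values (an assumption, not the article's data), and five bins are used only to keep the printout short.

```python
import numpy as np

# Hypothetical education percentages, one per occupation (not the article's data).
education = np.array([12, 18, 22, 25, 31, 38, 44, 47, 55, 72, 86, 91])

# Split the 0-100 range into 5 equal-width bins and count the values in each.
counts, edges = np.histogram(education, bins=5, range=(0, 100))

# Print each bin's range next to the number of values it holds.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f'{left:5.1f} to {right:5.1f}: {count}')
```

Changing `bins` changes how coarse or fine the shape looks, so it is worth trying a few values before settling on one.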

#### Pie Charts for Comparing Relative Quantities

Pie charts are useful when you want to compare parts of a whole. Where a bar chart compares categories against each other, a pie chart shows each category’s share of the full data set. We can use this to compare relative quantities by separating the values into categories. Used well, a pie chart clearly displays each proportion of the whole.

Taking the occupation type frequencies from the bar chart, we can generate a more effective visualization of proportions using a pie chart.
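A minimal sketch of how such a pie chart could be drawn with pandas and matplotlib, reusing the `type` column name from the bar chart example. The small `df` built here is made-up sample data (an assumption, including the `prof` and `bc` labels) so the snippet runs on its own.

```python
import pandas as pd
from matplotlib import pyplot as plt

# Hypothetical stand-in for the occupations data: one row per occupation.
df = pd.DataFrame({'type': ['prof', 'prof', 'prof', 'bc', 'bc', 'bc', 'bc', 'wc']})

# Count occupations per type, then convert the counts to shares of the whole.
type_counts = df['type'].value_counts()
type_shares = type_counts / type_counts.sum()

# autopct prints each slice's percentage on the chart.
ax = type_counts.plot(kind='pie', title='Occupation Types', autopct='%1.1f%%')
ax.set_ylabel('')  # hide the default y-axis label pandas adds for pie charts
plt.show()
```

The `type_shares` series holds the same proportions the slices display, which is handy when you want to quote the numbers alongside the chart.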

#### Scatter Plots for Comparing Quantitative Values

Quantitative data is data expressed as numeric values that you can count or measure. This contrasts with qualitative data, which describes qualities you use to sort a data set into categories.

Scatter plots are compelling when you want to visualize and explore relationships between numeric features. They are also useful for identifying outliers, data points that lie significantly outside the range of the rest of your observations. Let’s take a look at the occupations data by visualizing how prestige compares to income levels.

A scatter plot shows how two variables are correlated. You can visualize the strength of a relationship and highlight to a viewer how one variable may affect another.

Here is the corresponding code to generate the scatter plot:

`from matplotlib import pyplot as plt`
`occupations = df[['prestige', 'income']]`
`occupations.plot(kind='scatter', title='Occupations', x='prestige', y='income')`
`plt.xlabel('Prestige')`
`plt.ylabel('Avg Income %')`
`plt.show()`
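To put a number on the relationship a scatter plot shows, one option is the Pearson correlation coefficient, available in pandas as `Series.corr`. This is a minimal sketch that uses made-up prestige and income values (an assumption) in place of the article's `df`.

```python
import pandas as pd

# Hypothetical values standing in for the 'prestige' and 'income' columns.
occupations = pd.DataFrame({
    'prestige': [20, 35, 45, 60, 75, 90],
    'income':   [15, 30, 42, 55, 70, 80],
})

# Pearson correlation: +1 is a perfect positive linear relationship,
# 0 is no linear relationship, and -1 is a perfect negative one.
r = occupations['prestige'].corr(occupations['income'])
print(f'Pearson r = {r:.3f}')
```

A value of `r` close to 1 matches points that rise together along a line, but it does not by itself show that one variable causes the other.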

#### Line Charts for Changes in Values

We often want to evaluate how our data changes over time. Line charts are a useful way to show how values change along a series.

For this example, we’ll analyze US macroeconomic variables for the years 1947–1962 from the Longley dataset in the statsmodels datasets package.

`import statsmodels.api as sm`
`df = sm.datasets.longley.load_pandas().data`
`df`

|    | TOTEMP  | GNPDEFL | GNP      | UNEMP  | ARMED  | POP      | YEAR   |
|---:|--------:|--------:|---------:|-------:|-------:|---------:|-------:|
| 0  | 60323.0 | 83.0    | 234289.0 | 2356.0 | 1590.0 | 107608.0 | 1947.0 |
| 1  | 61122.0 | 88.5    | 259426.0 | 2325.0 | 1456.0 | 108632.0 | 1948.0 |
| 2  | 60171.0 | 88.2    | 258054.0 | 3682.0 | 1616.0 | 109773.0 | 1949.0 |
| 3  | 61187.0 | 89.5    | 284599.0 | 3351.0 | 1650.0 | 110929.0 | 1950.0 |
| 4  | 63221.0 | 96.2    | 328975.0 | 2099.0 | 3099.0 | 112075.0 | 1951.0 |
| 5  | 63639.0 | 98.1    | 346999.0 | 1932.0 | 3594.0 | 113270.0 | 1952.0 |
| 6  | 64989.0 | 99.0    | 365385.0 | 1870.0 | 3547.0 | 115094.0 | 1953.0 |
| 7  | 63761.0 | 100.0   | 363112.0 | 3578.0 | 3350.0 | 116219.0 | 1954.0 |
| 8  | 66019.0 | 101.2   | 397469.0 | 2904.0 | 3048.0 | 117388.0 | 1955.0 |
| 9  | 67857.0 | 104.6   | 419180.0 | 2822.0 | 2857.0 | 118734.0 | 1956.0 |
| 10 | 68169.0 | 108.4   | 442769.0 | 2936.0 | 2798.0 | 120445.0 | 1957.0 |
| 11 | 66513.0 | 110.8   | 444546.0 | 4681.0 | 2637.0 | 121950.0 | 1958.0 |
| 12 | 68655.0 | 112.6   | 482704.0 | 3813.0 | 2552.0 | 123366.0 | 1959.0 |
| 13 | 69564.0 | 114.2   | 502601.0 | 3931.0 | 2514.0 | 125368.0 | 1960.0 |
| 14 | 69331.0 | 115.7   | 518173.0 | 4806.0 | 2572.0 | 127852.0 | 1961.0 |
| 15 | 70551.0 | 116.9   | 554894.0 | 4007.0 | 2827.0 | 130081.0 | 1962.0 |

Reviewing the types of data available here, we have a set of 16 observations containing the following seven variables:

• TOTEMP: Total Employment
• GNPDEFL: GNP deflator
• GNP: GNP (Gross National Product)
• UNEMP: Number of unemployed
• ARMED: Size of armed forces
• POP: Population
• YEAR: Year (1947–1962)

Let’s visualize the population change in our sample data. A line chart lets you quickly convey trends or patterns in your data to your audience.

Here is the corresponding code to generate the line graph:

`from matplotlib import pyplot as plt`
`df.plot(title='Population', x='YEAR', y='POP')`
`plt.xlabel('Year')`
`plt.ylabel('Population')`
`plt.show()`

#### Box Plots for Visualizing a Distribution

Also known as a box-and-whisker plot, the box plot suits data sets that have only one variable. This kind of data is referred to as univariate data.

Let’s look at the unemployment data from the Longley dataset as plain text and as a box plot.

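|    | Unemployment |
|---:|-------------:|
| 0  | 2356.0 |
| 1  | 2325.0 |
| 2  | 3682.0 |
| 3  | 3351.0 |
| 4  | 2099.0 |
| 5  | 1932.0 |
| 6  | 1870.0 |
| 7  | 3578.0 |
| 8  | 2904.0 |
| 9  | 2822.0 |
| 10 | 2936.0 |
| 11 | 4681.0 |
| 12 | 3813.0 |
| 13 | 3931.0 |
| 14 | 4806.0 |
| 15 | 4007.0 |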

Box plots divide data into quartiles. The box spans the 25th to the 75th percentile, so it contains the middle 50% of the values, and the median is shown as the green line inside the box.

Here is the corresponding code for the box plot:

`import pandas as pd`
`import statsmodels.api as sm`
`from matplotlib import pyplot as plt`
`dff = sm.datasets.longley.load_pandas().data`
`df = pd.DataFrame({'Unemployment': dff['UNEMP']})`
`print(df) # table`
`plt.figure()`
`df['Unemployment'].plot(kind='box', title='Unemployment')`
`plt.show()`
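The quartiles that a box plot draws can also be computed directly, which helps when you want to check the chart against the numbers. A minimal sketch using the unemployment values listed above and the pandas `Series.quantile` method with its default linear interpolation; the variable names are illustrative.

```python
import pandas as pd

# UNEMP values copied from the Longley table shown in the article.
unemployment = pd.Series([2356.0, 2325.0, 3682.0, 3351.0, 2099.0, 1932.0,
                          1870.0, 3578.0, 2904.0, 2822.0, 2936.0, 4681.0,
                          3813.0, 3931.0, 4806.0, 4007.0])

# 25th, 50th (median), and 75th percentiles: the box edges and the middle line.
q1, median, q3 = unemployment.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1  # interquartile range, the height of the box

print(f'Q1={q1}, median={median}, Q3={q3}, IQR={iqr}')
```

The interquartile range is a compact way to describe how spread out the middle half of the data is.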