Effective Data Visualization for Other Humans

Choosing the Right Visual Matters

Credit: Harry Quan — Unsplash

Even when the underlying data is accurate, choosing the wrong chart can convey a misleading and potentially harmful message to your audience. We have a responsibility to uphold the integrity of the messages we send. We can minimize these errors by matching each kind of data to a visualization suited to it.

Bar Charts for Comparing Numeric Quantities

A practical way to compare quantities between categories is to use a bar chart. In our data set, we can compare the number of occupations of each type.

The x-axis represents the different categories, and the y-axis indicates the unit of measurement.

Bar charts are easy to interpret and allow the end user to quickly compare the number or frequency of different discrete categories of data. Discrete data refers to something we count, rather than measure. For example, we have six occupations of type wc. We don’t have 6.3333 occupations of type wc.

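Here’s the code used to generate the bar graph in Python 3.

from matplotlib import pyplot as plt

# df is assumed to be a pandas DataFrame of occupations with a 'type' column
typeCounts = df['type'].value_counts()  # number of occupations per type
typeCounts.plot(kind='bar', title='Type Counts')
plt.ylabel('Number of Occupations')
plt.show()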

Histograms for Visualizing Distributions for Continuous Data

Continuous data refers to measurements along a scale. In contrast to discrete data, continuous data is represented by values that we measure rather than count. In our data set, income, education, and prestige are each recorded as a percentage for every occupation.

Using a histogram, we can show viewers which range of high school graduation percentages occurs most often across occupations. More generally, a histogram shows the distribution, or shape, of a set of continuous data.

Our distribution of high school graduates shows that occupations were more common in the lower percentage groups of high school graduates.

Here’s the code used to generate the histogram in Python 3.

from matplotlib import pyplot as plt
# split the 'education' percentages into 19 equal-width bins
df['education'].plot.hist(title='High School Graduates', bins=19)
plt.show()
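
To make the binning behind a histogram concrete, here is a minimal sketch that counts how many values fall into each range. The percentages and the bin edges are hypothetical, chosen only for illustration; they are not the article's data.

```python
import pandas as pd

# Hypothetical percentages of high school graduates per occupation
# (illustrative values, not taken from the article's data set).
education = pd.Series([7, 12, 18, 22, 25, 28, 34, 45, 56, 72, 86, 91])

# Five equal-width bins covering 0-100; each bin includes its left edge.
edges = [0, 20, 40, 60, 80, 100]
binned = pd.cut(education, bins=edges, right=False)

# Count the occupations in each bin and keep the bins in ascending order.
bin_counts = binned.value_counts().sort_index()
print(bin_counts)
```

Each count is the height of one bar in the histogram. Using more bins, such as the 19 used for the chart, gives a finer view of the same distribution.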

Pie Charts for Comparing Relative Quantities

Pie charts can be useful when you are trying to compare parts of a whole. Where a bar chart compares categories against each other, a pie chart shows each category as a share of the total. Used well, it displays each category's proportion of the whole data set at a glance.

Taking the occupation-type frequencies from the bar chart, we can show their proportions more effectively with a pie chart.

In comparison to the bar chart, proportions are easier to identify in a pie chart.
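
The article does not include code for the pie chart, so here is one way it might be drawn with pandas. This is a sketch under assumptions: only the six wc occupations come from the text, while the other type labels and counts are hypothetical placeholders.

```python
import pandas as pd
from matplotlib import pyplot as plt

# Occupation counts per type. Only wc = 6 is stated in the text;
# the 'bc' and 'prof' labels and their counts are illustrative.
type_counts = pd.Series({'bc': 20, 'prof': 14, 'wc': 6}, name='type')

# Share of each type as a percentage of all occupations.
type_share = type_counts / type_counts.sum() * 100
print(type_share)

# autopct prints each slice's percentage on the chart.
ax = type_counts.plot(kind='pie', autopct='%1.1f%%', title='Type Share')
ax.set_ylabel('')  # drop the default y-axis label that pandas adds
```

Call plt.show() afterwards to display the figure when running this as a script. With a real data set, df['type'].value_counts() would take the place of the hand-written counts.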

Scatter Plots for Comparing Quantitative Values

Quantitative data is information you can measure and express as numbers, as opposed to qualities you sort into categories. When each observation carries two numeric values, you can plot one against the other.

Scatter plots are compelling when you want to visualize and explore relationships between numeric features. Additionally, scatter plots are useful in identifying data points that are outliers, values that are significantly outside the range of your observed data set. Let’s take a look at the occupations data by visualizing how prestige compares to income levels.

Notice the top left and bottom right are mostly empty. Higher-income occupations tended to receive higher prestige ratings, and low-income occupations received lower ones.

A scatter plot lets you demonstrate how two variables are correlated. You can visualize the strength of the relationship and highlight to a viewer how one variable may change along with another.

Here is the corresponding code to generate the scatter plot:

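from matplotlib import pyplot as plt

occupations = df[['prestige', 'income']]
# one point per occupation: prestige on the x-axis, income on the y-axis
occupations.plot(kind='scatter',
                 title='Occupations', x='prestige', y='income')
plt.ylabel('Avg Income %')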
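
If you want a number to go with the picture, the strength of a linear relationship can be summarized with a Pearson correlation coefficient. The sketch below uses pandas' Series.corr on hypothetical prestige and income values; the numbers are illustrative and not the article's data.

```python
import pandas as pd

# Hypothetical prestige and income percentages for six occupations
# (illustrative values, not the article's data).
sample = pd.DataFrame({
    'prestige': [10, 25, 40, 55, 70, 90],
    'income':   [15, 20, 45, 50, 75, 80],
})

# Pearson correlation: +1 is a perfect increasing linear relationship,
# 0 is no linear relationship, -1 is a perfect decreasing one.
r = sample['prestige'].corr(sample['income'])
print(f"Pearson r = {r:.2f}")
```

A value close to +1 matches the pattern described for the chart: points running from the bottom left to the top right. Correlation alone does not show that one variable causes the other.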

Line Charts for Changes in Values

We often want to evaluate how our data changes over time. Line charts are a useful technique for visualizing how values change along a series.

For this example, we’ll analyze US macroeconomic variables between the years 1947–1962 from the Longley dataset in the statsmodels datasets package.

import statsmodels.api as sm
df = sm.datasets.longley.load_pandas().data
print(df)

   TOTEMP GNPDEFL GNP UNEMP ARMED POP YEAR
0 60323.0 83.0 234289.0 2356.0 1590.0 107608.0 1947.0
1 61122.0 88.5 259426.0 2325.0 1456.0 108632.0 1948.0
2 60171.0 88.2 258054.0 3682.0 1616.0 109773.0 1949.0
3 61187.0 89.5 284599.0 3351.0 1650.0 110929.0 1950.0
4 63221.0 96.2 328975.0 2099.0 3099.0 112075.0 1951.0
5 63639.0 98.1 346999.0 1932.0 3594.0 113270.0 1952.0
6 64989.0 99.0 365385.0 1870.0 3547.0 115094.0 1953.0
7 63761.0 100.0 363112.0 3578.0 3350.0 116219.0 1954.0
8 66019.0 101.2 397469.0 2904.0 3048.0 117388.0 1955.0
9 67857.0 104.6 419180.0 2822.0 2857.0 118734.0 1956.0
10 68169.0 108.4 442769.0 2936.0 2798.0 120445.0 1957.0
11 66513.0 110.8 444546.0 4681.0 2637.0 121950.0 1958.0
12 68655.0 112.6 482704.0 3813.0 2552.0 123366.0 1959.0
13 69564.0 114.2 502601.0 3931.0 2514.0 125368.0 1960.0
14 69331.0 115.7 518173.0 4806.0 2572.0 127852.0 1961.0
15 70551.0 116.9 554894.0 4007.0 2827.0 130081.0 1962.0

Reviewing the types of data available here, we have a set of 16 observations containing the following seven variables:

  • TOTEMP: Total Employment
  • GNPDEFL: GNP deflator
  • GNP: Gross National Product
  • UNEMP: Number of unemployed
  • ARMED: Size of armed forces
  • POP: Population
  • YEAR: Year (1947–1962)

Let’s create our visualization of the population change for our sample data. With this approach, you can quickly convey to your audience trends or patterns within your data.

From this chart, we can see that the population increases from year to year, showing a steady upward trend.

Here is the corresponding code to generate the line graph:

df.plot(title='Population', x='YEAR', y='POP')
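
The upward trend can also be checked numerically. As a small sketch, the snippet below copies the POP column from the table shown earlier into a standalone pandas Series (so it runs without statsmodels) and computes the change from one year to the next.

```python
import pandas as pd

# POP column of the Longley table shown earlier, indexed by YEAR.
pop = pd.Series(
    [107608, 108632, 109773, 110929, 112075, 113270, 115094, 116219,
     117388, 118734, 120445, 121950, 123366, 125368, 127852, 130081],
    index=range(1947, 1963),
    name='POP',
)

# Change from the previous year; the first year has no predecessor.
yearly_change = pop.diff().dropna()

# True only if every year is higher than the one before it.
always_increasing = bool((yearly_change > 0).all())
print(yearly_change)
print("Increased every year:", always_increasing)
```

Plotting yearly_change as its own line chart would show whether the growth is speeding up or slowing down, which is harder to read from the population line alone.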

Box Plots for Visualizing a Distribution

Also known as the box-and-whisker plot, this approach suits data sets that have only one variable. Such data is referred to as univariate data.

Let’s look at the unemployment data from the Longley dataset as plain text and as a box plot.

Transforming patternless text into a perceptible visual form.
0 2356.0
1 2325.0
2 3682.0
3 3351.0
4 2099.0
5 1932.0
6 1870.0
7 3578.0
8 2904.0
9 2822.0
10 2936.0
11 4681.0
12 3813.0
13 3931.0
14 4806.0
15 4007.0

Box plots divide the data into quartiles. The box spans the 25th to the 75th percentile, so it contains the middle 50% of the values, and the green line inside it marks the median.

Here is the corresponding code for the box plot:

import pandas as pd
import statsmodels.api as sm
from matplotlib import pyplot as plt

dff = sm.datasets.longley.load_pandas().data
df = pd.DataFrame({'Unemployment': dff['UNEMP']})
print(df)  # table
df.plot(kind='box')  # box plot of the same column
plt.show()
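
The quartiles that a box plot draws can also be computed directly. The sketch below copies the unemployment values listed above into a standalone pandas Series and uses Series.quantile with its default linear interpolation; other tools may use slightly different quartile conventions.

```python
import pandas as pd

# UNEMP values from the Longley table listed above.
unemp = pd.Series(
    [2356, 2325, 3682, 3351, 2099, 1932, 1870, 3578,
     2904, 2822, 2936, 4681, 3813, 3931, 4806, 4007],
    name='Unemployment',
)

# 25th, 50th and 75th percentiles: the bottom edge of the box,
# the median line, and the top edge of the box.
q1, median, q3 = unemp.quantile([0.25, 0.5, 0.75])

# The interquartile range is the height of the box.
iqr = q3 - q1
print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
```

Half of the sixteen observations lie between Q1 and Q3, which is the region the box covers.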