Lecture 11: Data Visualization

STATS 60 / STATS 160 / PSYCH 10

The Goal of Data Visualization

The purpose of data visualization is to communicate data to others. It is as much psychology and art as it is math.

The following data set consists of flights departing from NYC airports in 2013.

A data set contains too much information to show all at once. We have to decide what to show and how to show it.

Categorical Variables

Pie Charts

One way to visualize a categorical variable, such as carrier, is to make a pie chart, which depicts the percentage of the whole that each category makes up.
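As a rough sketch of the arithmetic behind a pie chart, the snippet below turns category counts into percentages of the whole and the matching slice angles. The carrier codes are a made-up handful of flights, and the function name `pie_slices` is only illustrative.

```python
from collections import Counter

# Hypothetical carrier codes for ten flights (not taken from the real data).
carriers = ["UA", "UA", "DL", "B6", "UA", "DL", "AA", "B6", "UA", "DL"]

def pie_slices(values):
    """Return {category: (percent of whole, slice angle in degrees)}."""
    counts = Counter(values)
    total = sum(counts.values())
    return {
        category: (100 * n / total, 360 * n / total)
        for category, n in counts.most_common()
    }

slices = pie_slices(carriers)
for category, (percent, angle) in slices.items():
    print(f"{category}: {percent:.0f}% of flights, {angle:.0f} degree slice")
```

Each slice's angle is the same fraction of 360 degrees as its category is of the whole, so the percentages always add up to 100.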

The following graphs depict the categorical variable carrier at each of the three origin airports.

What interesting insights can you draw from these pie charts?

Lying with Pie Charts

What’s wrong with the following “stylish” version of the above pie chart?

3D plots distort the numbers.

The proportional ink principle: “When a shaded region is used to represent a numerical value, the area of that shaded region should be directly proportional to the corresponding value.”

The Problem with Pie Charts

Humans are bad at judging angles and areas, so even “good” pie charts can mislead.

Here is a pie chart of the origin airports from the flights data.

Which airport had the most flights? Which had the fewest?

Bar Charts

On the other hand, humans are very good at judging lengths, so consider making a bar chart instead, where each value is represented by the length of a bar.

With a bar chart, it is dead obvious which airport had the most flights and which had the fewest.
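To see why lengths are easy to compare, here is a minimal text-only sketch in which each bar is a run of `#` characters whose length is proportional to the count. The counts are illustrative placeholders (in thousands), not the actual totals from the flights data, and `text_bars` is a hypothetical helper.

```python
# Hypothetical flight counts in thousands (illustrative, not the real totals).
flights = {"EWR": 120, "JFK": 110, "LGA": 105}

def text_bars(counts, units_per_char=5):
    """Return one line per category; bar length is proportional to its count."""
    lines = []
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    for name, n in ordered:
        bar = "#" * round(n / units_per_char)
        lines.append(f"{name} {bar} {n}")
    return lines

bars = text_bars(flights)
for line in bars:
    print(line)
```

Because every bar starts at the same left edge, the longest and shortest bars can be picked out at a glance.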

Grouped Bar Charts

Unlike pies, bars can be plotted side by side for easy comparison.
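A grouped bar chart starts from a two-way table of counts. The sketch below builds such a table from made-up (origin, carrier) pairs and prints one group of side-by-side bars per carrier. The data and the name `grouped_counts` are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical (origin, carrier) pairs, one per flight (not real data).
flights = [
    ("EWR", "UA"), ("EWR", "UA"), ("EWR", "DL"),
    ("JFK", "DL"), ("JFK", "B6"), ("JFK", "B6"),
    ("LGA", "DL"), ("LGA", "UA"),
]

def grouped_counts(pairs):
    """Count flights per origin within each carrier: {carrier: {origin: n}}."""
    table = {}
    for (origin, carrier), n in Counter(pairs).items():
        table.setdefault(carrier, {})[origin] = n
    return table

table = grouped_counts(flights)
origins = ["EWR", "JFK", "LGA"]
for carrier in sorted(table):
    # One group per carrier; the bars within a group sit side by side.
    group = "  ".join(f"{o}:{'#' * table[carrier].get(o, 0)}" for o in origins)
    print(f"{carrier}  {group}")
```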

Lying with Bar Charts

How many problems can you spot in the following bar chart?

Lying with Bar Charts

Here is a more truthful visualization.

Tips for Reading Bar Charts

  • Check the axes. The bars should always start at 0 to satisfy the “proportional ink principle”.
  • Check for extraneous variation, such as both height and width varying.
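The first tip can be checked with a little arithmetic. The sketch below compares how much longer one bar is drawn than another when the axis starts at 0 and when it starts at some other baseline. The values 120 and 105 and the baseline of 100 are made up, and `drawn_ratio` is a hypothetical helper.

```python
def drawn_ratio(a, b, baseline=0.0):
    """Ratio of the drawn bar lengths for values a and b on an axis starting at `baseline`."""
    return (a - baseline) / (b - baseline)

# Axis starts at 0: the bars are drawn in the same ratio as the values.
honest_ratio = drawn_ratio(120, 105)

# Axis starts at 100: the first bar is drawn four times as long as the second.
truncated_ratio = drawn_ratio(120, 105, baseline=100)

print(f"axis from 0:   {honest_ratio:.2f}x")
print(f"axis from 100: {truncated_ratio:.2f}x")
```

With the truncated axis, a difference of about 14% is drawn as a fourfold difference in ink, which is exactly what the proportional ink principle rules out.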

Data Visualization in History

Florence Nightingale (1820-1910)

  • Nightingale is best known as the founder of modern nursing.
  • But she was also a statistician: in 1858 she was elected the first female member of the Royal Statistical Society (then the Statistical Society of London).
  • She drew public attention to the importance of nursing by making visualizations, such as the one on the next slide, which depicts deaths during the Crimean War.

Diagram of the causes of mortality in the army in the East

“The blue wedges measured from the centre of the circle represent area for area the deaths from Preventable…diseases, the red wedges measured from the centre the deaths from wounds, & the black wedges measured from the centre the deaths from all other causes.”

Time Series Data

Time Series

Data collected over time is called a time series.

The change is completely obscured by the “correct” bar chart on the right!

Line Charts

Line charts can be a good compromise. They do not need to be anchored at 0. (Why not? Doesn’t the proportional ink principle apply to them?)

Lying with Line Charts

In fact, anchoring line charts at zero can be misleading, as shown in this graphic tweeted by the National Review.

Lying with Line Charts

A line chart of the same data drawn by the Washington Post is more alarming.

Combining Bar Charts and Line Charts

Sometimes line charts and bar charts are combined in a single visualization, as in the following climograph (of Kolkata, India), a visualization of a location’s basic climate.

Quantitative Variables

Quantitative Data

  • This is Old Faithful, a geyser in Yellowstone National Park.
  • It erupts every 40 to 100 minutes, with each eruption lasting between 1.5 and 5 minutes.
  • Let’s look at some data about the length of the eruptions and the time between eruptions.

Bar Chart for Quantitative Data?

What is wrong with making a bar chart of a quantitative variable, like eruptions?

What kind of visualization would you make instead?

Histograms

A histogram is a more appropriate visualization for a quantitative variable. First, the values are sorted into bins. Then, the number of values in each bin is plotted as the height of a bar.
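The binning step can be sketched in a few lines. The eruption lengths below are made-up values in the 1.5 to 5 minute range, not the real measurements, and the bin width of half a minute is just one possible choice. `histogram_counts` is a hypothetical helper.

```python
def histogram_counts(values, start, width, n_bins):
    """Count the values in each equal-width bin [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        i = int((v - start) // width)
        if v == start + n_bins * width:
            i = n_bins - 1  # a value on the right edge goes in the last bin
        if 0 <= i < n_bins:
            counts[i] += 1
    return counts

# Hypothetical eruption lengths in minutes (not the real measurements).
eruptions = [1.8, 2.0, 1.9, 2.2, 3.4, 4.1, 4.5, 4.3, 3.9, 4.7, 4.0, 1.7]
counts = histogram_counts(eruptions, start=1.5, width=0.5, n_bins=7)

for i, n in enumerate(counts):
    left = 1.5 + 0.5 * i
    print(f"{left:.1f}-{left + 0.5:.1f} min  {'#' * n}")
```

Unlike the categories in a bar chart, the bins are adjacent intervals on a number line, so their order and width carry meaning.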

What interesting insights can you draw from this histogram?

A Histogram is Not a Bar Chart!

How is a histogram different from a bar chart?

Lying with Histograms

The following histogram depicts the same eruptions data.

How does it mislead?

Be wary of histograms with unequal bin widths!
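One common remedy, sketched here under made-up bin edges and counts, is to set each bar's height to its count divided by its bin width, so that the area of the bar (rather than its height) matches the count. This is in the spirit of the proportional ink principle. `bar_heights` is a hypothetical helper.

```python
def bar_heights(edges, counts):
    """Height of each bar such that height * bin width equals the bin's count."""
    widths = [right - left for left, right in zip(edges, edges[1:])]
    return [n / w for n, w in zip(counts, widths)]

# Hypothetical bins: three narrow bins followed by one wide bin.
edges = [1.5, 2.0, 2.5, 3.0, 5.0]
counts = [3, 2, 0, 7]
heights = bar_heights(edges, counts)

for left, right, n, h in zip(edges, edges[1:], counts, heights):
    print(f"{left}-{right} min: count {n}, height {h} per minute")
```

Plotting the raw counts would make the wide last bin the tallest bar, even though its values are spread over four times as much of the axis as each narrow bin.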

Relationships between Variables

We can also make a histogram of the waiting time between eruptions.

But how do we understand the relationship between two quantitative variables?

Scatterplots

In a scatterplot, each observation is represented by a point \((x, y)\). The \(x\)-coordinate represents the value of one variable, while the \(y\)-coordinate represents the value of the other.
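As a rough illustration of how observations become points, the sketch below pairs up two lists of values and marks each pair on a small character grid. The eruption lengths and waiting times are made up, and `text_scatter` is a hypothetical helper that assumes each variable takes at least two distinct values.

```python
def text_scatter(xs, ys, cols=20, rows=8):
    """Return the plot as a list of strings; each observation (x, y) is a '*'."""
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * cols for _ in range(rows)]
    for x, y in zip(xs, ys):
        # Rescale each coordinate to a column or row index on the grid.
        col = round((x - x_min) / (x_max - x_min) * (cols - 1))
        row = round((y - y_min) / (y_max - y_min) * (rows - 1))
        grid[rows - 1 - row][col] = "*"  # row 0 is the top of the plot
    return ["".join(line) for line in grid]

# Hypothetical (eruption length, waiting time) pairs in minutes.
eruptions = [1.8, 2.0, 2.2, 3.4, 4.0, 4.3, 4.5, 4.7]
waiting = [54, 51, 58, 70, 79, 82, 85, 88]

plot = text_scatter(eruptions, waiting)
for line in plot:
    print("|" + line)
```

The two lists must be in the same order, since the i-th entry of each list belongs to the same observation.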

What interesting insights can you draw from this scatterplot?

Recap

  • In section tomorrow, you will learn to make these visualizations (by describing them using AI)!
  • Keep in mind the following as you make visualizations.
    • Is the variable categorical or quantitative?
    • Proportional ink principle