STATS 60 / STATS 160 / PSYCH 10
The purpose of data visualization is to communicate data to others. It is as much psychology and art as it is math.
The following data set consists of flights departing from NYC airports in 2013.
A data set contains too much information to show all at once. We have to decide what to show and how to show it.
One way to visualize a categorical variable, such as carrier, is to make a pie chart, which depicts the percentage of the whole that each category makes up.
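As a minimal sketch of what a pie chart encodes, the snippet below turns a list of category labels into each category's percentage of the whole and the matching slice angle. The carrier codes are a small made-up sample, not the flights data, and the helper name pie_slices is hypothetical.

```python
from collections import Counter

# Made-up carrier codes for ten flights (illustration only, not the real data).
carriers = ["UA", "UA", "UA", "UA", "DL", "DL", "DL", "AA", "AA", "B6"]


def pie_slices(values):
    """Return {category: (percent_of_whole, slice_angle_in_degrees)}."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: (100 * n / total, 360 * n / total) for cat, n in counts.items()}


slices = pie_slices(carriers)
for cat, (pct, angle) in slices.items():
    print(f"{cat}: {pct:.0f}% of flights, slice angle {angle:.0f} degrees")
```

Each slice's angle is the category's share of the full 360 degrees, so the percentages always sum to 100.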
The following graphs depict the categorical variable carrier at each of the three origin airports.
What interesting insights can you draw from these pie charts?
What’s wrong with the following “stylish” version of the above pie chart?
3D plots distort the numbers.
The proportional ink principle: “When a shaded region is used to represent a numerical value, the area of that shaded region should be directly proportional to the corresponding value.”
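One way to check the principle numerically is to compare the ratio of two values with the ratio of the ink used to draw them. The sketch below assumes the simplest case, two shaded rectangles of equal width measured from a shared baseline. The values and the helper name drawn_ratio are hypothetical.

```python
def drawn_ratio(a, b, baseline=0.0):
    """Ratio of shaded lengths for values a and b when shading starts at baseline."""
    return (a - baseline) / (b - baseline)


a, b = 120.0, 100.0                          # made-up values; a is 1.2 times b
true_ratio = a / b                           # 1.2
honest = drawn_ratio(a, b, baseline=0)       # 1.2: ink is proportional to value
truncated = drawn_ratio(a, b, baseline=90)   # 3.0: ink exaggerates the gap

print(true_ratio, honest, truncated)
```

When the shading starts at zero, the ink ratio matches the value ratio. Starting it at 90 makes a 20% difference look like a threefold one.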
Humans are bad at judging angles and areas, so even “good” pie charts can mislead.
Here is a pie chart of the origin airports from the flights data.
Which airport had the most flights? Which had the fewest?
On the other hand, humans are very good at judging lengths, so consider making a bar chart instead, where each value is represented by the length of a bar.
With a bar chart, it is dead obvious which airport had the most flights and which had the fewest.
Unlike pies, bars can be plotted side by side for easy comparison.
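To make the length encoding concrete, here is a small text-only sketch in which each bar's length is proportional to its count. The flight counts are made up for illustration, and text_bars is a hypothetical helper. A plotting library would draw the same thing with rectangles.

```python
# Made-up flight counts per origin airport (illustration only).
flights = {"EWR": 1200, "JFK": 1100, "LGA": 1050}


def text_bars(counts, width=40):
    """Return one line per category; bar length is proportional to the count."""
    largest = max(counts.values())
    lines = []
    # Sort from largest to smallest so the ranking can be read top to bottom.
    for name, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        bar = "#" * round(width * n / largest)
        lines.append(f"{name} {bar} {n}")
    return lines


bars = text_bars(flights)
print("\n".join(bars))
```

Because all bars start from the same left edge, even counts that are close together end at visibly different positions.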
How many problems can you spot in the following bar chart?
Here is a more truthful visualization.
Tips for Reading Bar Charts
Florence Nightingale (1820-1910)
Diagram of the causes of mortality in the army in the East
“The blue wedges measured from the centre of the circle represent area for area the deaths from Preventable…diseases,
the red wedges measured from the centre the deaths from wounds, &
the black wedges measured from the centre the deaths from all other causes.”
Data collected over time is called a time series.
The change is completely obscured by the “correct” bar chart on the right!
Line charts can be a good compromise. They do not need to be anchored at 0. (Why not? Doesn’t the proportional ink principle apply to them?)
In fact, anchoring line charts at zero can be misleading, as shown in this graphic tweeted by the National Review.
A line chart of the same data drawn by the Washington Post is more alarming.
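A rough way to see why the choice of axis matters is to compute how much of the plot's height the data actually occupies. The sketch below uses a made-up series and hypothetical axis limits. It is not the data from either graphic.

```python
def fraction_of_axis(values, axis_min, axis_max):
    """Fraction of the y-axis height spanned by the data's own range."""
    return (max(values) - min(values)) / (axis_max - axis_min)


# Made-up measurements that rise steadily from 57.0 to 58.5.
series = [57.0, 57.2, 57.5, 57.9, 58.5]

wide = fraction_of_axis(series, 0, 100)        # axis anchored at zero
tight = fraction_of_axis(series, 56.5, 59.0)   # axis fitted to the data

print(f"zero-anchored axis: change fills {wide:.1%} of the height")
print(f"fitted axis:        change fills {tight:.1%} of the height")
```

On the zero-anchored axis the whole rise fills under 2% of the plot height, so the line looks flat. On the fitted axis the same rise fills more than half of it.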
Sometimes line charts and bar charts are combined in a single visualization, as in the following climograph (of Kolkata, India), a visualization of a location’s basic climate.
What is wrong with making a bar chart of a quantitative variable, like eruptions?
What kind of visualization would you make instead?
A histogram is a more appropriate visualization for a quantitative variable. First, values are sorted into bins, and the number of values in each bin is plotted as a bar.
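The binning step can be sketched in a few lines. This version assumes equal-width bins that are closed on the left and open on the right. The durations are made up for illustration, and histogram_counts is a hypothetical helper.

```python
def histogram_counts(values, start, width, n_bins):
    """Count how many values fall in each of n_bins equal-width bins.

    Bin i covers [start + i*width, start + (i+1)*width).
    Values outside every bin are ignored.
    """
    counts = [0] * n_bins
    for v in values:
        i = int((v - start) // width)
        if 0 <= i < n_bins:
            counts[i] += 1
    return counts


# Made-up eruption durations in minutes (illustration only).
eruptions = [1.8, 1.9, 2.0, 2.2, 3.4, 3.9, 4.1, 4.3, 4.5, 4.6, 4.7, 4.9]
counts = histogram_counts(eruptions, start=1.5, width=0.5, n_bins=7)

for i, c in enumerate(counts):
    lo = 1.5 + 0.5 * i
    print(f"[{lo:.1f}, {lo + 0.5:.1f}) {'#' * c}")
```

Each printed row is one bin, and the number of marks is the height of that bin's bar.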
What interesting insights can you draw from this histogram?
How is a histogram different from a bar chart?
The following histogram depicts the same eruptions data.
How does it mislead?
Be wary of histograms with unequal bin widths!
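One common remedy, consistent with the proportional ink principle, is to draw each bar with height equal to count divided by bin width, so that the bar's area, not its height, represents the count. The sketch below uses made-up bin edges and counts, and bar_heights is a hypothetical helper.

```python
def bar_heights(edges, counts):
    """Height = count / bin width, so each bar's area equals its count."""
    return [c / (hi - lo) for c, lo, hi in zip(counts, edges, edges[1:])]


edges = [1.5, 2.0, 2.5, 5.0]   # the last bin is five times wider than the others
counts = [20, 10, 60]          # made-up counts per bin

heights = bar_heights(edges, counts)
print(heights)
```

Plotting the raw counts would make the wide bin look like the tallest peak. After dividing by width, the first bin is the tallest, because its values are packed into a much narrower range.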
We can also make a histogram of the waiting time between eruptions.
But how do we understand the relationship between two quantitative variables?
In a scatterplot, each observation is represented by a point \((x, y)\). The \(x\)-coordinate represents the value of one variable, while the \(y\)-coordinate represents the value of the other.
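As a rough illustration of the mapping from observations to points, the sketch below places one mark per (x, y) pair on a character grid. The paired values are made up, text_scatter is a hypothetical helper, and it assumes that neither variable is constant.

```python
def text_scatter(xs, ys, cols=30, rows=10):
    """Place one '*' per (x, y) observation on a cols-by-rows character grid."""
    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)
    grid = [[" "] * cols for _ in range(rows)]
    for x, y in zip(xs, ys):
        # Rescale each coordinate to a column or row index.
        col = round((x - x_lo) / (x_hi - x_lo) * (cols - 1))
        row = round((y - y_lo) / (y_hi - y_lo) * (rows - 1))
        grid[rows - 1 - row][col] = "*"   # row 0 is the top of the plot
    return ["".join(line) for line in grid]


# Made-up (eruption duration, waiting time) pairs in minutes.
eruptions = [1.8, 2.0, 2.2, 3.5, 4.0, 4.3, 4.6, 4.9]
waiting = [52, 55, 58, 70, 78, 80, 85, 90]

plot = text_scatter(eruptions, waiting)
print("\n".join(plot))
```

The observation with the smallest values lands in the bottom-left corner and the one with the largest values in the top-right, which is what an upward trend looks like in a scatterplot.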
What interesting insights can you draw from this scatterplot?