December 8, 2017 | Xing Brew
The Datasaurus Dozen

Understanding the importance of data visualization

In an increasingly data-driven society and with the growing prominence of data science, data visualization has become an art form of its own. There are university degrees, jobs, international competitions, and conferences centered around the art of data visualization. But as researchers Justin Matejka and George Fitzmaurice creatively illustrate with the “Datasaurus Dozen”, visualizing data is more than just aesthetics.

In a paper published earlier this year, Matejka and Fitzmaurice demonstrate how 12 datasets that share some of the same statistical properties (like mean, standard deviation, and Pearson’s correlation) look completely different when plotted. Their research builds on a principle demonstrated by statistician F.J. Anscombe in 1973 in what is known as Anscombe’s Quartet, a group of four datasets that all share the same summary statistics yet look completely different when graphed.

Anscombe's Quartet

Source: Wikipedia
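To make the point concrete, here is a minimal Python sketch that computes a few summary statistics for the four Anscombe datasets. The values are the commonly reproduced ones from Anscombe's 1973 paper. The helper names (`mean`, `pearson`, `summarize`) are illustrative and not taken from any particular library.

```python
from math import sqrt

def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Datasets I-III share the same x values; dataset IV has its own.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(xs, ys):
    """Summary statistics rounded to two decimal places."""
    return {
        "mean_x": round(mean(xs), 2),
        "mean_y": round(mean(ys), 2),
        "r": round(pearson(xs, ys), 2),
    }

SUMMARIES = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}

for name, stats in SUMMARIES.items():
    print(name, stats)
```

At two decimal places all four datasets report the same means and the same correlation, even though their scatter plots look nothing alike.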

What is Datasaurus?
The researchers were inspired by Alberto Cairo’s “Datasaurus”, an image that the data viz expert created using DrawMyData to remind people of the importance of visualizing data. Cairo’s dataset looked normal when examining summary statistics, but when plotted, formed the image of a dinosaur. His message? Don't trust summary statistics alone.

Matejka and Fitzmaurice started from the dataset Cairo used to create Datasaurus and transformed it into a series of 12 different shapes and patterns. Across all of them, the summary statistics remain constant to two decimal places even as the points and patterns shift drastically.

Source: J. Matejka and G. Fitzmaurice (2017)
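This post does not spell out the researchers' procedure, so the following is only a rough sketch of one way to move points while holding summary statistics fixed to two decimal places: nudge one point at a time by a small random amount and keep the nudge only if the rounded statistics are unchanged. The starting data, the function names (`summary`, `perturb`) and the parameters (`steps`, `scale`, `seed`) are all assumptions made for illustration, and the sketch leaves out any steering toward a target shape.

```python
import random
from math import sqrt

def summary(points):
    """Mean and standard deviation of x and y, plus Pearson's r, rounded to 2 decimals."""
    n = len(points)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    stats = (mx, my, sqrt(sxx / (n - 1)), sqrt(syy / (n - 1)), sxy / sqrt(sxx * syy))
    return tuple(round(s, 2) for s in stats)

def perturb(points, steps=2000, scale=0.1, seed=0):
    """Randomly nudge points, accepting a nudge only if the rounded summary is unchanged."""
    rng = random.Random(seed)
    current = list(points)
    target = summary(current)
    for _ in range(steps):
        i = rng.randrange(len(current))
        x, y = current[i]
        candidate = current.copy()
        candidate[i] = (x + rng.gauss(0, scale), y + rng.gauss(0, scale))
        if summary(candidate) == target:
            current = candidate
    return current

# Hypothetical starting cloud of 100 points (not the Datasaurus data).
_rng = random.Random(42)
START = [(_rng.uniform(0, 100), _rng.uniform(0, 100)) for _ in range(100)]
MOVED = perturb(START)

print("before:", summary(START))
print("after: ", summary(MOVED))
```

Because every accepted step preserves the rounded statistics, the final point cloud reports the same summary as the starting one by construction, even though individual points have drifted.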

The researchers further illustrated the importance of visualizing data using Simpson’s Paradox, in which a certain trend appears in several different clusters of data but disappears, or is reversed, when these groups are combined. Both datasets A and B (below) have the same overall Pearson's correlation coefficient (+0.81). However, after coercing the data towards the pattern of downward-sloping lines (B), we can see that each cluster of data has an individually negative correlation even as the correlation of the entire dataset remains positive.

Source: Adapted from J. Matejka and G. Fitzmaurice (2017)
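The same effect can be reproduced with a small synthetic example. The sketch below builds five made-up clusters (not the data from the paper), each lying on a downward-sloping line, with cluster centres arranged along an upward diagonal. The `pearson` helper and the `CLUSTERS` layout are assumptions chosen for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Five synthetic clusters centred at (10k, 10k). Inside each cluster,
# y falls as x rises, so the within-cluster trend is negative.
CLUSTERS = [
    [(10 * k + t, 10 * k - t) for t in range(-2, 3)]
    for k in range(5)
]

# Correlation inside each cluster.
WITHIN = [pearson(*zip(*cluster)) for cluster in CLUSTERS]

# Correlation across all points pooled together.
all_points = [point for cluster in CLUSTERS for point in cluster]
OVERALL = pearson(*zip(*all_points))

print("within clusters:", [round(r, 2) for r in WITHIN])
print("overall:", round(OVERALL, 2))
```

Every cluster has a strongly negative correlation on its own, while the pooled data has a strongly positive one, because the upward arrangement of the cluster centres dominates the overall trend.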

Matejka and Fitzmaurice’s Datasaurus Dozen demonstrates that data visualization is more than just a superfluous addition to analysis or a way to make data look pretty. Rather, visualizing data is a crucial component of analysis because it can reveal surprising or hidden patterns that bring us closer to the truth. Their research also shows how easily data can be manipulated to show or conceal certain trends or information, and thus serves as a good reminder of why we should look beyond summary statistics when drawing conclusions about data.

Beyond this educational message, the researchers hope that their approach for creating visually dissimilar datasets that are equal over a range of statistical properties may be a starting point for new data anonymization techniques.
