This is a short article about an illustration called Anscombe’s Quartet. It is an extreme example of how blind statistical analysis can trick you. It is also another reminder of the importance of visualising data in your EDA (Exploratory Data Analysis).
Below is a set of four distinct charts, collectively called Anscombe’s Quartet. These charts represent four different sets of data which, at first glance, have little in common. The first clue that they might be related is that all four share a similar-looking trend line, which will be explained later on.
Chart I) Evenly distributed data points showing a clear linear trend;
Chart II) Evenly distributed data points showing a clear polynomial trend;
Chart III) A clear linear trend with one outlier that looks like it is skewing the line-of-best-fit; and
Chart IV) Grouped data points along the x-axis with no trend
Even from a quick glance, it is obvious that these charts represent diverse datasets and should be understood differently. From a predictive modelling point of view, the charts suggest which kinds of analysis could be used to generate predictions about new data points, e.g. Chart III looks like it could be modelled using linear regression and Chart IV could be a classification problem.
However, the statistical information about the same four charts paints a different and counterintuitive picture:
As can be seen in the table above, the sum, average and standard deviation of the data points for each chart are identical, or very nearly so.
So despite the fact that the data looks very diverse when visualised, the four data sets share practically the same summary statistics.
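The comparison can be reproduced with a few lines of Python. The sketch below is an illustration rather than the method used to build the table: it uses the published quartet values, and the helper `least_squares` and the `summary` dictionary are names introduced here. It also fits a line of best fit to each chart, which is one way to see why the trend lines look so alike.

```python
from statistics import mean, stdev

# Published Anscombe's Quartet values; charts I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def least_squares(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line (illustrative helper)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


summary = {}
for name, (xs, ys) in quartet.items():
    slope, intercept = least_squares(xs, ys)
    summary[name] = {
        "sum_x": sum(xs), "sum_y": sum(ys),
        "mean_x": mean(xs), "mean_y": mean(ys),
        "sd_x": stdev(xs), "sd_y": stdev(ys),  # sample standard deviations
        "slope": slope, "intercept": intercept,
    }
    # One row per chart, rounded to two decimal places for comparison.
    print(name, {key: round(value, 2) for key, value in summary[name].items()})
```

Rounded to two decimal places, the four rows are almost indistinguishable, even though the charts look nothing alike.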
This phenomenon crops up ‘in the field’ regularly, and in many different guises. The most common occurrence is relying on averages without understanding the distribution of the underlying data, e.g. where two data sets have similar averages but wildly different distributions (in this case, the data can be visualised in a boxplot or violin plot to compare both the centre and the spread of each data set).
A very simple illustration of this would be measuring the average monthly revenue for two businesses, a woolly coat shop and an ice cream shop. Both could have the same average monthly revenue over a year but completely different seasonal patterns.
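To make the shop example concrete, here is a minimal sketch with made-up numbers. The revenue figures (in thousands, January to December) are purely hypothetical and chosen so that both shops end the year with the same average; the variable names are introduced only for this illustration.

```python
from statistics import mean, stdev

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Hypothetical monthly revenue in thousands (invented for illustration only).
coat_shop = [12, 11, 9, 7, 5, 3, 2, 3, 6, 9, 11, 12]
# The ice cream shop has the same figures shifted by six months.
ice_cream_shop = coat_shop[6:] + coat_shop[:6]

for name, revenue in [("Coat shop", coat_shop), ("Ice cream shop", ice_cream_shop)]:
    peak_month = months[revenue.index(max(revenue))]
    print(f"{name}: mean={mean(revenue):.2f}, "
          f"st.dev={stdev(revenue):.2f}, first peak month={peak_month}")
```

Both shops report the same mean and standard deviation, yet one peaks in winter and the other in summer. Only a month-by-month view (or a plot) reveals the difference.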
Another cause of this phenomenon is outliers. Outliers can skew a data set in a way that is hidden in its statistical information but obvious when the data is visualised (see Chart III above).
Using data visualisation, as shown in Chart III, is a solid starting point for checking for hidden outliers. However, when dealing with extremely large data sets, where individual outliers may be too small a share of the data to stand out in a chart, another approach is to flag them numerically, for example by using z-scores.
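As a rough sketch of the z-score approach, the snippet below flags any point that lies more than a chosen number of standard deviations from the mean. The function name `zscore_outliers` and the 2.5 cut-off are assumptions made for this example, not a standard; the data is the y-values of Chart III.

```python
from statistics import mean, stdev


def zscore_outliers(values, threshold=3.0):
    """Return (index, value, z) for points whose absolute z-score exceeds threshold."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    flagged = []
    for i, v in enumerate(values):
        z = (v - mu) / sigma
        if abs(z) > threshold:
            flagged.append((i, v, z))
    return flagged


# y-values of Chart III from the published quartet.
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]

# The 2.5 threshold is an illustrative choice for this small sample.
for index, value, z in zscore_outliers(y3, threshold=2.5):
    print(f"index={index}, value={value}, z={z:.2f}")
```

With the threshold at 2.5, only the 12.74 point is flagged. With the commonly quoted threshold of 3.0 nothing is flagged, because in a sample this small the outlier itself inflates the standard deviation. This is one reason a z-score check works best alongside a chart rather than instead of one.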
Anscombe’s Quartet is another quick, useful reminder of why it’s important to use visualisations when exploring data, and of how high-level statistical information can be misleading when the data underlying those statistics is not fully understood. Remembering the message of Anscombe’s Quartet can help reduce critical errors when analysing and modelling data, particularly errors caused by misunderstanding the distribution of a dataset or by overlooking outliers.