If you picked a random country and measured its population, there’s roughly a 30.1% chance the first digit of that number is a ‘1’ and a 17.6% chance that it is a ‘2’. The distribution of these first digits is known as Benford’s Law.
Benford’s Law, also known as the ‘leading digit rule’, appears everywhere in economics, human geography, nature and sports, but few people have ever heard of it.
Why does this mathematical phenomenon exist? Where does it appear? So what?
What is Benford’s Law?
Think of all the numerical data that you have created over the last 7 days: bank transactions, journey times, journey distances, doorways walked through, milliliters of water drunk, how many words you’ve spoken.
For each of these numbers, Benford’s Law says that the likelihood of the first digit being a ‘1’ is not 11.1% (1/9), but is in fact around 30.1%. Benford’s Law sets out a distribution curve of leading digits, with ‘1’ being the most likely and ‘9’ being the least likely, as follows:
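The curve has a simple closed form: the probability that the leading digit is d equals log10(1 + 1/d). As a minimal Python sketch (the variable name `benford` is just illustrative), the nine probabilities can be computed directly:

```python
import math

# Benford's Law: P(d) = log10(1 + 1/d) for each leading digit d = 1..9
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

for d, p in benford.items():
    print(f"{d}: {p:.1%}")
```

The first two lines printed are the 30.1% and 17.6% quoted at the start, and the probabilities fall steadily to about 4.6% for ‘9’.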
This distribution has been found to occur in a spooky number of datasets. For example, below is the distribution of leading digits for the population of 240 countries. Although there are small divergences from Benford’s Law, the pattern is clear:
Here is the same analysis using annual GDP (US$):
When does this rule work?
Fully explaining why this phenomenon occurs is at the difficult difficult lemon difficult end of mathematics.
However, in general the rule will work with most statistical data that spans several orders of magnitude, e.g. values ranging from 10^1 to 10^4 (10 to 10,000).
Counter-intuitively, it’s probably easiest to understand the conditions where the distribution won’t be found:
Sequential numbers: Any set of numbers that forms a consecutive or formulaic pattern will not work, e.g. invoice numbers, dates. This is because these data sets can, and probably will, have arbitrary cut-offs, starting points and end-points. There is no reason why a list of invoice numbers can’t start at ‘2...’ or ‘3...’ or always start with a ‘2002...’.
Max/min conditions: If the number set has limits and thresholds, it can skew the distribution. For example, if you asked people to ‘pick a number between 150 and 550’, you would not get a Benford’s Law distribution, because no answer could start with a 6, 7, 8 or 9.
Artificial clusters: Sometimes humans like to make their life easy and measure things with convenient scales. Take human height as an example. The average UK woman is 162 cm tall, with very few being under 100 cm or over 199 cm and none being over 299 cm, so when height is recorded in centimetres almost every leading digit is a ‘1’.
Human bias: Data created by human decision making can carry inherent biases. For example, humans are influenced by price thresholds when making purchasing decisions, which is why many prices are £x.99. This is also true when humans pick ‘random’ numbers, where people will disproportionately choose 3 or 7 when picking a number between 1 and 10.
‘Building block’ numbers: Some recorded numbers are the result of combining other numbers. If you were looking at the distribution of leading digits of bets in a poker competition, the results might not follow a Benford’s Law distribution because each bet is built up from fixed chip values. If the smallest chip size is $25, bets of $10, $20 and $30 would be impossible. Another example might be a fast food restaurant with a small menu, where there are relatively few combinations of items that can form a transaction value.
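To make the first two conditions concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: the invoice numbers are hypothetical, the 150 to 550 range treats every whole number as equally likely (which ignores human bias), and the ‘wide range’ data is simulated rather than real. It contrasts those two cases with values spread evenly on a log scale across four orders of magnitude:

```python
import random
from collections import Counter

def leading_digit(n: int) -> int:
    """First digit of a positive integer, e.g. 2002417 -> 2."""
    return int(str(n)[0])

def digit_shares(values):
    """Share of values starting with each digit 1-9."""
    counts = Counter(leading_digit(v) for v in values)
    total = sum(counts.values())
    return {d: counts[d] / total for d in range(1, 10)}

# Sequential numbers: hypothetical invoice numbers that all start with '2002'
invoices = digit_shares(range(2002000, 2003000))

# Max/min conditions: every whole number from 150 to 550, each picked once
bounded = digit_shares(range(150, 551))

# Several orders of magnitude: simulated values spread evenly on a log scale
# between 10 and 100,000 (seeded so the run is repeatable)
rng = random.Random(0)
wide = digit_shares(int(10 ** rng.uniform(1, 5)) for _ in range(10_000))

for name, shares in [("invoices", invoices), ("150-550", bounded), ("wide range", wide)]:
    print(name, {d: round(s, 3) for d, s in shares.items()})
```

Every invoice number starts with ‘2’, the bounded range never produces a leading 6 to 9, and only the wide-ranging sample lands near the Benford curve.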
How can I use Benford’s Law?
When it was initially discovered in the 19th century and rediscovered in the mid-20th century, Benford’s Law was filed away under ‘interesting but not useful’. In the pre-computer era, data collection and analysis were slow and painful.
Today, it is extremely straightforward to check whether a dataset has a Benford’s Law distribution of leading digits. I think that Benford’s Law is an incredibly useful tool that can quickly help you understand the characteristics of a data set and identify hidden biases.
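As a minimal sketch of such a check in Python (the function names and the sample amounts are hypothetical, and a real check would need far more than ten values), you can tabulate the observed share of each leading digit next to the Benford expectation and summarise the gap as a mean absolute deviation:

```python
import math

def first_digit(x: float) -> int:
    """Leading non-zero digit of a non-zero number, e.g. 0.0042 -> 4."""
    # Scientific notation always puts the leading digit first
    return int(f"{abs(x):.15e}"[0])

def benford_table(values):
    """Rows of (digit, observed share, expected Benford share) for digits 1-9."""
    digits = [first_digit(v) for v in values if v != 0]
    n = len(digits)
    return [(d, digits.count(d) / n, math.log10(1 + 1 / d)) for d in range(1, 10)]

def mean_abs_deviation(rows):
    """Average absolute gap between observed and expected shares."""
    return sum(abs(obs - exp) for _, obs, exp in rows) / len(rows)

# Hypothetical amounts for illustration only, not real data
amounts = [12.50, 1830.00, 0.99, 245.10, 1.75, 310.00, 17.20, 96.40, 1420.00, 2.30]

rows = benford_table(amounts)
for d, obs, exp in rows:
    print(f"{d}: observed {obs:.1%}, expected {exp:.1%}")
print(f"mean absolute deviation: {mean_abs_deviation(rows):.3f}")
```

The smaller the deviation, the closer the dataset sits to the Benford curve. What counts as ‘close enough’ depends on the sample size and is a judgement call this sketch does not make.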
The most common uses of Benford’s Law in data and analytics are:
Fraud: Benford’s Law became fashionable again when people realised its use in fraud detection. This follows from the human bias point made above: humans make strange decisions when choosing ‘random’ numbers.
When I was in forensic accounting, we used Benford’s Law to sense-check accounting software entries. If large numbers of transactions had been entered ‘randomly’ by a human, there would be a distinct divergence away from a Benford’s Law distribution.
This is an example of what the leading digit distribution pattern of a human entered dataset might look like versus Benford’s Law:
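One common way to put a number on that divergence is a chi-square test against the Benford expectation. The Python sketch below is illustrative only: it assumes, purely for the sake of example, a ledger of 900 human-entered amounts whose leading digits are spread evenly across 1 to 9. The function and variable names are hypothetical.

```python
import math

# Expected Benford share for each leading digit 1-9
EXPECTED = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def chi_square(counts):
    """Pearson chi-square statistic of observed digit counts against Benford."""
    n = sum(counts.values())
    return sum((counts.get(d, 0) - n * p) ** 2 / (n * p) for d, p in EXPECTED.items())

# Hypothetical 'human-entered' ledger: 900 entries with leading digits
# spread evenly, 100 per digit (an assumption for illustration only)
human_counts = {d: 100 for d in range(1, 10)}

# 5% critical value of the chi-square distribution with 8 degrees of freedom
CRITICAL_5PCT = 15.507

stat = chi_square(human_counts)
print(f"chi-square = {stat:.1f}, diverges from Benford: {stat > CRITICAL_5PCT}")
```

A statistic far above the critical value is a prompt to look more closely at the entries, not proof of fraud in itself.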
Bias: This is the most useful application of Benford’s Law. It can help detect hidden biases in a data set, possibly caused by some of the conditions outlined above under which Benford’s Law won’t work.
This is incredibly useful for data analysis and for creating machine learning models because it helps remove unidentified biases from datasets, biases that can lead to models that are misleading at best and fundamentally flawed at worst.
For example, suppose you take a dataset of credit card transactions and compare the leading-digit distribution to Benford’s Law. The analysis shows that there is a higher than expected proportion of ‘5’s and ‘6’s as the leading digit of the transactions. On further investigation, you find out that in December 2018 the credit card company ran a promotion for transactions on electronic goods of £500 or above. You can then exclude these transactions from the model where appropriate.
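A rough Python sketch of that first step, flagging which leading digits are over-represented, might look like this. The counts are invented to mimic the scenario, and the function name and the z-score threshold of 3 are assumptions rather than a standard:

```python
import math

def overrepresented_digits(counts, z_threshold=3.0):
    """Leading digits whose share is well above the Benford expectation."""
    n = sum(counts.values())
    flagged = []
    for d in range(1, 10):
        expected = math.log10(1 + 1 / d)
        observed = counts.get(d, 0) / n
        # Standard error of a proportion under the Benford expectation
        std_err = math.sqrt(expected * (1 - expected) / n)
        if (observed - expected) / std_err > z_threshold:
            flagged.append(d)
    return flagged

# Hypothetical leading-digit counts for 10,700 card transactions,
# with extra '5's and '6's added to imitate the promotion
card_counts = {1: 3010, 2: 1761, 3: 1249, 4: 969, 5: 1192,
               6: 969, 7: 580, 8: 512, 9: 458}

print(overrepresented_digits(card_counts))
```

With these invented counts the function returns `[5, 6]`, which is the cue to go and find out what happened to transactions starting with those digits.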
Benford’s Law, or the ‘leading digit rule’, is a solid combination of being interesting and useful. It’s also much more widely applicable than it is given credit for.
Not only is it a quick, high-level data exploration tool that can help you understand a dataset, it can also be a robust forensic technique that can be, and has been, used as legal evidence.
Understanding the characteristics of datasets under which Benford’s Law won’t work is the key to applying and using it in data and analytics.