Recently, my sister asked the best way to produce some illustrative Lorenz curves for an essay she was writing. Now, I had heard of Lorenz curves but I had no idea what they actually were. Buckle-up for our first mathematical adventure!
Enter Max O. Lorenz, creator of the Lorenz curve, a man mostly interested in railways and such. Whilst railways are surely interesting, and perhaps they were more interesting in the 1910s, Lorenz's most obvious legacy is the curve bearing his name. The Lorenz curve allows us to picture 'clearly in our minds the relative distribution of wealth', according to Willford I. King, who originally coined the term.
The ‘First’ Lorenz Curve
So what is a Lorenz Curve you ask? We’ll come back to that later. (I did warn you this was an adventure). Let’s start with a seemingly unrelated problem from statistics.
Say I gave you the heights of 100 people and asked you to verify that they came from a normal distribution. What would you do?
YOU: Easy! A Shapiro–Wilk test!

ME: *rolls eyes* Sure, fine, yeah that's true, you could do that. How about a more visual way?
YOU: Easy! Plot a histogram!
ME: *sigh* Well yes, you could definitely do that. But how would you know the distribution was definitely normal?
YOU: No clue mate.
ME: *rubs hands* Well my friend, let me introduce you to the QQ plot.
The quantile-quantile plot is pretty much what it says on the tin. Given a theoretical distribution, we can calculate quantiles. You are already very familiar with quantiles: if you are in the bottom decile for height, you are in the shortest 10% of people in the given population. If exactly 10% of people are under 150cm, then 150cm is the 10th percentile. Now, if a normal distribution is a good model for height, and we calculate the quantiles of both the normal distribution and the empirical data, what will happen when we plot them against each other?
Voilà! The points lie on a roughly straight line. This is because the observed (empirical) distribution and the theoretical distribution have the same shape. (Hopefully this will be clear with a bit of thought.)
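Here is a minimal sketch of that check in Python, using NumPy and the standard library's `NormalDist` for the theoretical quantiles. The heights are simulated, so this is purely illustrative:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(42)
heights = rng.normal(loc=170, scale=10, size=100)  # simulated heights in cm

# Empirical quantiles are just the sorted sample.
empirical = np.sort(heights)

# Theoretical standard-normal quantiles at matching plotting positions.
probs = (np.arange(1, 101) - 0.5) / 100
theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])

# If the data really are normal, the QQ points fall on a rough straight line,
# so the correlation between the two sets of quantiles is close to 1.
r = np.corrcoef(theoretical, empirical)[0, 1]
print(round(r, 3))
```

Plotting `empirical` against `theoretical` gives the QQ plot itself; the correlation is just a quick numerical stand-in for "does it look straight?".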
Right so what was that you said about wealth inequality?
It turns out the Lorenz curve is simply a special case of a QQ plot. We still calculate the quantiles of the empirical distribution as before; however, this time we plot them against the quantiles of a uniform distribution. That is, we plot the empirical quantiles at regularly spaced intervals: the cumulative share of the population on one axis against the cumulative share of wealth on the other.
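Constructing the curve is just sorting and cumulatively summing. A minimal sketch with made-up wealth figures:

```python
import numpy as np

wealth = np.array([1, 2, 3, 5, 8, 13, 21, 34, 55, 89], dtype=float)  # toy data

# Sort poorest to richest, then take cumulative shares.
w = np.sort(wealth)
pop_share = np.arange(0, len(w) + 1) / len(w)            # x: cumulative share of people
wealth_share = np.insert(np.cumsum(w), 0, 0) / w.sum()   # y: cumulative share of wealth

# The Lorenz curve always starts at (0, 0), ends at (1, 1),
# and sits on or below the line of equality y = x.
print(wealth_share[-1])
```

Plotting `wealth_share` against `pop_share` gives the Lorenz curve; the more it sags below the diagonal, the more unequal the distribution.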
Why, then, are these curves so popular? If you've ever heard a politician (R.I.P. Jeremy Corbyn) mention something like '10% of people in the U.K. own 40% of its wealth' (a real statistic, by the way), then you will know how powerful a statement it is. Each point on this curve reads off exactly such a statistic, so with Lorenz curves it is incredibly easy to visually compare different wealth distributions on the same scale. Given this, you may naturally wonder if this graph can give us a single summary statistic for inequality. The answer is yes!
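Reading off such a statistic is one line of code. A sketch with a made-up skewed distribution (lognormal, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Skewed toy wealth distribution; the heavy right tail mimics real wealth data.
wealth = rng.lognormal(mean=0.0, sigma=1.2, size=1000)

w = np.sort(wealth)
top_10_share = w[-100:].sum() / w.sum()  # wealth held by the richest 10%
print(f"Top 10% own {top_10_share:.0%} of the wealth")
```

This is exactly the quantity the point at `pop_share = 0.9` on a Lorenz curve encodes (as one minus the curve's height there).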
The Gini in a Bottle
I first came across the Gini coefficient when studying decision trees, a supervised machine learning algorithm designed to classify data. But before we get into that, another digression, this time into ROC (receiver operating characteristic) curves.
The receiver operating characteristic, widely used in Machine Learning for binary classification has a very interesting history:
The ROC curve was first used during World War II for the analysis of radar signals before it was employed in signal detection theory. Following the attack on Pearl Harbor in 1941, the United States army began new research to increase the prediction of correctly detected Japanese aircraft from their radar signals. For these purposes they measured the ability of a radar receiver operator to make these important distinctions, which was called the Receiver Operating Characteristic. – Wikipedia
More recently, the ROC curve has been used to compare binary classifiers, with a key metric being the Area Under the Curve (AUC). The ROC curve is the plot of the True Positive Rate (TPR, also called sensitivity or recall) against the False Positive Rate (FPR, or 1 − specificity) as the classification threshold varies. Or, using the terminology of hypothesis testing, 'power' against 'type I error'. Phew, why do we have so many terms for the same quantities?
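The threshold sweep is easy to sketch directly. With made-up classifier scores (higher score = more confident "positive"), each threshold yields one (FPR, TPR) point:

```python
import numpy as np

# Toy scores and true labels, purely illustrative.
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1])
labels = np.array([1,   1,   0,   1,   1,    0,   0,   1,   0,   0])

# Sweep the threshold over every distinct score, highest first;
# each threshold gives one point on the ROC curve.
points = []
for t in np.sort(np.unique(scores))[::-1]:
    pred = scores >= t
    tpr = (pred & (labels == 1)).sum() / (labels == 1).sum()  # true positive rate
    fpr = (pred & (labels == 0)).sum() / (labels == 0).sum()  # false positive rate
    points.append((fpr, tpr))

print(points[0], points[-1])
```

At the strictest threshold almost nothing is flagged positive (bottom-left of the curve); at the loosest everything is, giving the (1, 1) corner.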
So what does the AUC represent? The AUC is the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one, so it can be used as a general metric for comparing binary classifiers (a score of 1 being a perfect classifier). Alright, so how could all this possibly be related to Lorenz curves? To answer this, let's finally define the Gini coefficient.
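That probabilistic reading of the AUC can be checked directly by enumerating every positive/negative pair (this is the Mann–Whitney U estimate). Scores and labels here are made up:

```python
import numpy as np
from itertools import product

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1])
labels = np.array([1,   1,   0,   1,   1,    0,   0,   1,   0,   0])

pos = scores[labels == 1]
neg = scores[labels == 0]

# AUC = P(randomly chosen positive scores higher than randomly chosen negative),
# estimated by checking every positive/negative pair directly.
wins = sum(p > n for p, n in product(pos, neg))
ties = sum(p == n for p, n in product(pos, neg))
auc = (wins + 0.5 * ties) / (len(pos) * len(neg))
print(auc)
```

For large datasets you would not enumerate all pairs like this, but it makes the "probability of ranking correctly" interpretation concrete.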
The Gini coefficient is defined as twice the area between the Lorenz curve and the line of equality (the 45° line). A high Gini score (the highest being 1) represents a very unequal wealth distribution, with the extreme being that the richest person holds all the wealth.
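From the Lorenz curve's points, the Gini coefficient is a trapezoid-rule integral away. A sketch with deliberately lopsided toy data:

```python
import numpy as np

wealth = np.sort(np.array([1.0, 1.0, 2.0, 2.0, 94.0]))  # very unequal toy data

pop = np.arange(0, len(wealth) + 1) / len(wealth)
lorenz = np.insert(np.cumsum(wealth), 0, 0) / wealth.sum()

# Area under the Lorenz curve by the trapezoid rule; the area under the
# equality line is 1/2, so Gini = 2 * (1/2 - area) = 1 - 2 * area.
area = ((lorenz[1:] + lorenz[:-1]) / 2 * np.diff(pop)).sum()
gini = 1 - 2 * area
print(round(gini, 3))
```

Perfect equality gives a Lorenz curve on the diagonal (area 1/2, Gini 0); one person holding everything pushes the area towards 0 and the Gini towards 1.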
Now, it turns out that the ROC curve is the inverse of the Lorenz curve, and thus the Gini coefficient is very closely related to the AUC score: for a classifier, Gini = 2 × AUC − 1. So, given this new perspective, we can think of the Lorenz curve as implicitly encoding the probability that an individual is wealthier than they 'should' be (if you think wealth should be distributed completely equally, of course).
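The relationship Gini = 2 × AUC − 1 is easy to verify numerically: compute the AUC from an ROC curve by the trapezoid rule, then double the area between the curve and the diagonal. The scores and labels here are made up:

```python
import numpy as np

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1])
labels = np.array([1,   1,   0,   1,   1,    0,   0,   1,   0,   0])

# Build the ROC curve from a threshold sweep, including the (0, 0) endpoint.
fprs, tprs = [0.0], [0.0]
for t in np.sort(np.unique(scores))[::-1]:
    pred = scores >= t
    tprs.append((pred & (labels == 1)).sum() / (labels == 1).sum())
    fprs.append((pred & (labels == 0)).sum() / (labels == 0).sum())

# AUC by the trapezoid rule; Gini is twice the area between the ROC curve
# and the diagonal, i.e. 2 * (AUC - 1/2) = 2 * AUC - 1.
fprs, tprs = np.array(fprs), np.array(tprs)
auc = ((tprs[1:] + tprs[:-1]) / 2 * np.diff(fprs)).sum()
gini = 2 * auc - 1
print(round(auc, 3), round(gini, 3))
```

A random classifier sits on the diagonal (AUC 0.5, Gini 0), mirroring a perfectly equal Lorenz curve; a perfect classifier has AUC 1 and Gini 1, mirroring total inequality.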
To go further down the rabbit hole, see my post on Entropy. Can you separate the wood from the Decision Trees?