Monday, February 1, 2016

The Cumulative Distribution Function

The cumulative distribution function (CDF) is a long name for a simple concept---a concept you should become familiar with if you like to think about data.

One of the most basic visualizations of a set of numbers is the histogram: a plot of how frequently various values appear.  For example, measuring the heights of 100 people might yield a histogram like this:



This technique is taught to kids even in preschool, where teachers often record the weather (cloudy, rainy, sunny, etc.) on a chart each day.  Over several weeks, a picture of the frequency of sunny days, rainy days, etc., naturally emerges.  (Sometimes it seems as if the histogram is the only data visualization kids learn in school.)

The CDF is a different way to visualize the same data.  Instead of recording how often a particular value occurs, we record how often we see that value or less.  We can turn a histogram into a CDF quite simply.  Start at the left side of the height histogram: four people have a height in the 1-1.1 m range, so clearly four people have a height of 1.1 m or less.  Now we move up to the next bin: five people are in the 1.1-1.2 m range, so including the four shorter people we have nine with height 1.2 m or less.  We then add these nine (the "or less" part) to the number in the next bin to obtain the number with height 1.3 m or less.  This total then becomes the number of "or less" people to add to the number of people in the 1.3-1.4 m bin, and so on.  (This procedure is similar to integration in calculus.)  Because there are 100 people in all, each running total is also a percentage.  The final result is:


(Notice that this graph shows smaller details than the histogram; I'll explain that at the end.) What is this graph useful for?  If we want to know the percentage of people over 6 feet (1.8 m), we can now read it straight off the CDF graph!  Just go to 1.8 m, look up until you hit the curve, and then look horizontally to see where you hit the vertical axis.  In our example here, that is about 95%:



This means 95% of people are 6 feet or shorter; in other words 5% are taller than 6 feet.  Compared to the histogram, the CDF makes it blazingly fast to look up the percentage taller than 6 feet, shorter than 5 feet (1.5 m), or anything of that nature.  (Beware: I made up these data as a hypothetical example, so don't take this as an actual comment on human height.)
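Here is a minimal Python sketch of the two steps described above: building the CDF from histogram counts by running totals, then reading a percentage off it. Only the first two bin counts (4 and 5) come from the example; the other counts, and the names `edges`, `counts` and `percent_at_or_below`, are assumptions made up for illustration, chosen so the total is 100 and the CDF passes through roughly 95% at 1.8 m.

```python
from bisect import bisect_right
from itertools import accumulate

# Bin edges in meters: ten bins, each 10 cm wide, from 1.0 m to 2.0 m.
edges = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0]

# People per bin. Only the first two counts (4 and 5) come from the text;
# the rest are hypothetical, chosen to total 100 and give 95% at 1.8 m.
counts = [4, 5, 8, 12, 17, 19, 17, 13, 4, 1]

# Running totals: number of people at or below the right edge of each bin.
cumulative = list(accumulate(counts))
total = cumulative[-1]

# CDF in percent at every bin edge, starting from 0% at the left-most edge.
cdf_percent = [0.0] + [100 * c / total for c in cumulative]


def percent_at_or_below(height_m):
    """Read the CDF at height_m, interpolating linearly between bin edges."""
    if height_m <= edges[0]:
        return 0.0
    if height_m >= edges[-1]:
        return 100.0
    i = bisect_right(edges, height_m) - 1  # index of the bin's left edge
    x0, x1 = edges[i], edges[i + 1]
    y0, y1 = cdf_percent[i], cdf_percent[i + 1]
    return y0 + (y1 - y0) * (height_m - x0) / (x1 - x0)


shorter = percent_at_or_below(1.8)
print(f"1.8 m or shorter: {shorter:.0f}%")
print(f"taller than 1.8 m: {100 - shorter:.0f}%")
```

The lookup is the code version of "go to 1.8 m, look up to the curve, then look across to the vertical axis". Interpolating in a straight line between bin edges is a convenience of this sketch, not something the binned data can confirm.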

Plotting two CDFs against each other is a great way to visualize nonuniformity or inequality.  We often hear that around 20% of the income in the US goes to the top 1% of earners.  A properly constructed graph can tell us not only the percentage that goes to the top 1%, but also the percentage that goes to the top 2%, the top 5%, the bottom 5%, etc---all in a single glance.  Here's how we do it. Get income data from the IRS here: I chose the 2013 link in the first set of tables. Here's a screenshot:


I won't even attempt to turn this into a histogram because if I use a reasonable portion of the screen to represent most people ($0 to $200,000, say), the richest people will have to be very far off the right-hand edge of the screen. But if I squeeze the richest people onto the screen, details about most people will be squeezed into a tiny space. Turning the income axis into a CDF actually solves this problem, because the CDF will allocate screen space according to the share of income. We will be able to simultaneously see the contribution of many low-income people and that of a few high-income people. (I'm going to use "people", "returns" and "families" interchangeably rather than try to break things down to individuals vs. families.)

OK, let's do it.  In the first bin we have 2.1 million returns with no income.  So the first point on the people CDF will be 2.1 million, and the first point on the income CDF will be $0.  Next, we have 10.6 million people (for 12.7 million total on the people CDF) making between $1 and $5000, say $2500 on average.  So these 10.6 million people collectively make 10.6 million x $2500 = $26.5 billion.  The second point on our income CDF is therefore $0 + $26.5 billion = $26.5 billion.  We carry the 12.7 million total returns and $26.5 billion total income over to the next bin, and so on.  At the end of the last bin, we find 147 million returns and $9.9 trillion in total income.  Dividing each CDF by its maximum amount (and multiplying by 100 to show percentage) we get this blue curve:


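The bookkeeping behind the curve can be sketched in Python as follows. The first two bins (2.1 million returns at $0, and 10.6 million returns at an assumed $2500 average) are taken from the walk-through; every other row is a hypothetical placeholder rather than an actual IRS figure, so the printed percentages illustrate the method only.

```python
from itertools import accumulate

# (number of returns, assumed average income in dollars) for each bin.
# The first two rows are from the text. The remaining rows are
# hypothetical placeholders, NOT actual IRS figures.
bins = [
    (2.1e6, 0),
    (10.6e6, 2_500),
    (30e6, 20_000),    # hypothetical
    (40e6, 60_000),    # hypothetical
    (5e6, 400_000),    # hypothetical
]

returns = [n for n, _ in bins]
income = [n * avg for n, avg in bins]  # total dollars earned in each bin

# Running totals carried from one bin to the next.
cum_returns = list(accumulate(returns))
cum_income = list(accumulate(income))

# Divide each CDF by its maximum and multiply by 100 to get percentages.
pct_returns = [100 * r / cum_returns[-1] for r in cum_returns]
pct_income = [100 * i / cum_income[-1] for i in cum_income]

for p, q in zip(pct_returns, pct_income):
    print(f"bottom {p:5.1f}% of returns -> {q:5.1f}% of income")
```

Plotting `pct_income` against `pct_returns` gives a curve of the kind shown above. Each printed pair is one point on it.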
We can now instantly read off the graph that the top 1% of returns have 15% of the income, the top 5% have 35%, the bottom 20% have 2%, and so on.  In a perfectly equal-income society, the bottom 5% would take 5% of the income, the bottom 10% would take 10%, etc---in other words, the curve would follow a diagonal line on this graph.  The more the curve departs from the diagonal line, the more unequal the incomes.  We can measure how far the curve departs from the line by taking the area between the curve and the diagonal as a fraction of the whole area under the diagonal, and use that as a quick summary of the country's inequality---this is called the Gini coefficient.  It is 0 for perfect equality and approaches 1 as all the income goes to a single earner.  (The Wikipedia article linked to here has a nice summary of Gini coefficients measured in different countries and different years, but you have to scroll down quite a bit.)
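As a rough sketch of that measurement, the Gini coefficient can be computed from points on the curve with the trapezoid rule. The function name `gini` and the two toy curves below are illustrative assumptions, not the data plotted above.

```python
def gini(pop_frac, income_frac):
    """Gini coefficient from points on the curve, via the trapezoid rule.

    Both lists hold cumulative fractions (0 to 1), ordered from poorest
    to richest, starting at (0, 0) and ending at (1, 1).
    """
    points = list(zip(pop_frac, income_frac))
    area_under_curve = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area_under_curve += (x1 - x0) * (y0 + y1) / 2
    # The area under the diagonal is 1/2, so the gap between diagonal and
    # curve, as a fraction of 1/2, works out to 1 - 2 * area_under_curve.
    return 1 - 2 * area_under_curve


# Toy curves (hypothetical): perfect equality, and a society in which
# the poorer half earns nothing and the richer half shares everything.
print(gini([0, 1], [0, 1]))            # 0.0
print(gini([0, 0.5, 1], [0, 0, 1]))    # 0.5
```

With only a handful of bins the trapezoid rule treats the curve as straight between points, which tends to understate inequality within each bin.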

A few remarks for people who want to go deeper:

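  • the plotting of two CDFs against each other, as in the last plot shown here, is referred to as a P-P plot.  When the two CDFs are cumulative share of people and cumulative share of income, as here, economists call the result a Lorenz curve.  A closely related concept is the Q-Q plot.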
  • I emphasize again that the CDF and the histogram present the same information, just in a different way.  However, there is one advantage to the CDF: the data need not be binned. When making a histogram, we have to choose a bin size, and if we have few data points we need to make these bins rather wide to prevent the histogram from being merely a series of spikes. For the height histogram, for example, I generated 100 random heights and used bins 10 cm (about 4 inches) wide.  Maybe 100 data points would be better shown as a series of spikes than a histogram---but then the spikes in the middle might overlap confusingly.  The CDF solves this problem by presenting the data as a series of steps so we can see the contribution of each point without overlap.  If a CDF has very many data points you can no longer pick out individual steps but the slope of the CDF anywhere still equals the density of data points there.
  • my income numbers won't match a more complete analysis, for at least three reasons.  First, Americans need to file tax returns only if their income exceeds a certain threshold, so some low-income families may be missed in these numbers.  Second, the IRS numbers here contain only a "greater than $10 million" final bin.  I assumed an average income of $20 million in this bin, which is a very rough guess.  To do a better job, economists studying inequality supplement the IRS data I downloaded with additional data on the very rich; they find that the top 1% make more like 20% of the total, so my guess was on the low side.  Finally, I made no attempt to disentangle individual income from family income as a better analysis would.
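The remark above about unbinned data can be made concrete with a short Python sketch: sort the raw values, and let the CDF rise by 1/n at each one. The 100 heights here are freshly generated random numbers, and the mean of 1.5 m and spread of 0.2 m are assumptions for illustration, not the values behind the plots above.

```python
import random
from bisect import bisect_right

random.seed(0)  # fixed seed so the sketch is repeatable

# 100 made-up heights in meters. Mean 1.5 and spread 0.2 are assumptions.
heights = sorted(random.gauss(1.5, 0.2) for _ in range(100))
n = len(heights)

# One step per data point: at the i-th smallest height, the CDF rises
# to (i + 1) / n. No bin width has to be chosen.
steps = [(h, (i + 1) / n) for i, h in enumerate(heights)]


def ecdf(x):
    """Fraction of the data points that are less than or equal to x."""
    return bisect_right(heights, x) / n


print(f"fraction at or below 1.8 m: {ecdf(1.8):.2f}")
```

Plotting `steps` as a staircase reproduces the fine detail that binning hides, and where the data are dense the steps crowd together, which is the "slope equals density" point made above.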


