Friday, September 16, 2011

Data Crunch Quirk!

"Experts often possess more data than judgment." Good ol' Colin Powell.

I must disagree, at least in the case of Benford's Law: one man was really over-analyzing, trying to extract some judgment from his data, and found connections where, in my view, there should be none.

Apparently, in numerical data, the leading digit of each data element should follow a certain distribution, where '1' appears more often than '2', '2' appears more often than '3', and so on down to '9'. In the end, '1' should lead approximately 30% of the time and '9' less than 5%.
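For the curious, the expected share of each leading digit d under Benford's Law is log10(1 + 1/d). Here is a minimal Python sketch of that formula (not the Mathematica code behind this post; the function name benford_expected is just my own label):

```python
import math

def benford_expected(digit: int) -> float:
    """Benford's Law probability that `digit` (1-9) is the leading digit."""
    return math.log10(1 + 1 / digit)

# Expected share for each leading digit, 1 through 9.
expected = [benford_expected(d) for d in range(1, 10)]

print([round(p, 3) for p in expected])
# [0.301, 0.176, 0.125, 0.097, 0.079, 0.067, 0.058, 0.051, 0.046]
```

The nine shares sum to 1, since the logs telescope to log10(10).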

Anyways, the thought intrigued me, as I never would have considered '1' a more popular leading digit in data (regardless of units) than '9'. The math is easily explained by Wikipedia (as always), but what really draws me is the data crunching (which Wikipedia cannot easily show)! I decided to use the GDP per Capita of nearly all the countries in the world (some are not available; North Korea is an obvious example), which is readily available from Mathematica's data, and wapow! It generated several noteworthy graphs:

This code was comparatively easier than that of my former pursuits (Dandelion Cellular Automata, GDP per Capita related to Latitude, Zipf's Law, etc.) so I got creative with coloring the graphs above!

Back to what matters:

The Code

I extracted some CountryData. Using Map, I found the "GDPPerCapita" for all countries in CountryData[]. This processing took very little time, fortunately! There are 239 countries in Mathematica.

After extracting the GDP values, we run into trouble: some entries come back as "Missing[NotAvailable]". We weed out the little buggers with Select and, *drumroll* (this took quite a bit of trouble to find), StringFreeQ, to which we pass "Available", a substring that shows up in the missing entries but never in the numerical data we want. There are now 231 countries in our GDPPerCapita data list.
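For anyone following along outside Mathematica, here is a minimal Python sketch of the same weeding-out step. It is not the code from this post: the country names and values are made up for illustration, and None stands in for Missing[NotAvailable].

```python
# Hypothetical sample data; None plays the role of Missing[NotAvailable].
gdp_per_capita = {
    "Country A": 46350.0,
    "Country B": None,
    "Country C": 1210.5,
    "Country D": 987.2,
}

# Keep only the entries that actually carry a number.
available = {
    country: value
    for country, value in gdp_per_capita.items()
    if value is not None
}

print(len(gdp_per_capita), "->", len(available))
```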

With the data purified, all we need now is the lead digit. Suffice it to say that there's a bit of Flooring and a use of IntegerLength, but altogether, it's simple to do with Map.
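The floor-and-integer-length idea can be sketched in Python like so. This is my own rough equivalent, not the Mathematica from this post: leading_digit is a made-up name, and the sketch assumes every value is at least 1 (true of GDP per capita in dollars).

```python
import math

def leading_digit(value: float) -> int:
    """Return the first digit of a number >= 1."""
    whole = math.floor(value)          # drop the fractional part
    if whole < 1:
        raise ValueError("this sketch only handles values >= 1")
    n_digits = len(str(whole))         # stand-in for IntegerLength
    return whole // 10 ** (n_digits - 1)

# Map the function over a few made-up values.
print([leading_digit(v) for v in [46350.7, 1210.5, 987.2]])
# [4, 1, 9]
```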

Finally, I just Tally the lead digits, SortBy the First element, Map again to get the first value of each sublist, then BarChart it!
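The tally-sort-extract steps look roughly like this in Python, with Counter standing in for Tally. The list of lead digits is made up for illustration, and I also divide by the total to get shares, which is an extra step I'm assuming rather than one described above.

```python
from collections import Counter

# Hypothetical lead digits, as produced by the previous step.
lead_digits = [1, 3, 1, 2, 9, 1, 2, 4]

# Count each digit, then sort the (digit, count) pairs by digit.
tally = sorted(Counter(lead_digits).items())

# Pull the counts out of each pair, ready for a bar chart.
counts = [count for _, count in tally]

# Optional: turn counts into shares of the whole list.
shares = [count / len(lead_digits) for count in counts]

print(tally)
print(counts)
```

Like Tally, Counter only reports digits that actually occur, so a digit with zero hits would be missing from the chart rather than shown as an empty bar.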

Short and Simple!

If interested in the BarChart generator:

You guessed it! Map BarChart over the color schemes, giving it one extra option: ChartStyle -> #. Again, the # marks the slot where each element gets dropped in as Map loops through them; here the elements are the ones listed under ColorData["Gradients"].

Voila!

The actual Bar Graph:


The data works out so that, in numerical order from 1-9: {0.25974, 0.160173, 0.125541, 0.142857, 0.103896, 0.0692641, 0.0519481, 0.0519481, 0.034632}. Benford's Law holds! 1 leads at about 26% (not far from the expected 30%), 9 is < 5%, and the GDP per Capita figures of almost all the countries in the world (minus 8 rebels) fall off in line with the Benford Distribution (except 4...)
A Benford Distribution should follow approx: {0.301, 0.176, 0.125, 0.097, 0.079, 0.067, 0.058, 0.051, 0.046}
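To see where the data strays most, here is a small Python sketch that lines up the observed shares from this post against log10(1 + 1/d). Only the observed list comes from the post; the variable names are my own.

```python
import math

# Observed shares for lead digits 1-9, copied from the results above.
observed = [0.25974, 0.160173, 0.125541, 0.142857, 0.103896,
            0.0692641, 0.0519481, 0.0519481, 0.034632]

# Benford's expected share for each digit d is log10(1 + 1/d).
expected = [math.log10(1 + 1 / d) for d in range(1, 10)]

# Signed gap (observed minus expected) for each digit.
gaps = {d: observed[d - 1] - expected[d - 1] for d in range(1, 10)}

# The digit whose share strays furthest from Benford's prediction.
worst_digit = max(gaps, key=lambda d: abs(gaps[d]))

for d, gap in gaps.items():
    print(d, f"{gap:+.3f}")
print("largest gap at digit", worst_digit)
```

The largest gap lands on 4, which has more than its Benford share, with 1 (a bit under its share) close behind.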