Friday, September 16, 2011

Data Crunch Quirk!

"Experts often possess more data than judgment." Good ol' Colin Powell.

I must disagree. At least in the case of Benford's Law, one man really was over-analyzing his data, trying to extract some judgment from it, and found connections where, in my view, there should be none.

Apparently, in numerical data, the leading digit of each data element should fall into a certain distribution, where '1' appears more often than '2', '2' more often than '3', and so on down to '9'; '1' should account for approximately 30% of leading digits and '9' for less than 5%.
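For reference, Benford's predicted frequency for leading digit d is log10(1 + 1/d); a one-liner computes the whole distribution:

[code]
(* Benford's predicted frequency for each leading digit d = 1..9 *)
N[Table[Log[10, 1 + 1/d], {d, 1, 9}]]
(* {0.30103, 0.176091, 0.124939, 0.09691, 0.0791812, 0.0669468, 0.057992, 0.0511525, 0.0457575} *)
[/code]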

Anyways, the thought intrigued me, as I never would have considered '1' a more popular number in data (regardless of units) than '9'. The math is easily explained by Wikipedia (as always), but what really draws me is the data crunching (which Wikipedia cannot easily show)! I decided to choose the GDP per capita of nearly all the countries in the world (a few are not available--North Korea is an obvious example), which is readily available from Mathematica's data, and wapow! Generated several noteworthy graphs:

This code was far easier than that of my former pursuits (Dandelion Cellular Automata, GDP per Capita related to Latitude, Zipf's Law, etc.), so I got creative with coloring the graphs above!

Back to what matters:

The Code

I extracted some CountryData. Using Map, I found the "GDPPerCapita" for all countries in CountryData[]. This processing took very little time, fortunately! There are 239 countries in Mathematica.
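A minimal sketch of that step (at the time, the values came back as plain numbers; newer versions may wrap them in Quantity objects, which QuantityMagnitude would strip):

[code]
(* Pull GDP per capita for every country Mathematica knows about *)
gdpData = Map[CountryData[#, "GDPPerCapita"] &, CountryData[]];
Length[CountryData[]]  (* 239 countries *)
[/code]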

Following the extraction of the GDP data, we face the trouble of "Missing[NotAvailable]" entries. We weed out the little buggers with Select and, *drumroll* (this took quite a bit of trouble to find), StringFreeQ, telling it to look for "Available"--the string fragment that appears in the missing entries but not in the numerical data we want. There are now 231 countries in our GDPPerCapita data list.
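A sketch of that filter; the ToString wrapper is my assumption about how each entry gets tested as a string (Select[gdpData, NumberQ] would do the same job):

[code]
(* Drop the Missing["NotAvailable"] entries: anything whose printed form contains "Available" *)
cleanGDP = Select[gdpData, StringFreeQ[ToString[#], "Available"] &];
Length[cleanGDP]  (* 231 countries remain *)
[/code]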

With the data purified, all we need now is the lead digit. Suffice it to say that there's a bit of Flooring and a use of IntegerLength, but altogether, it's simple to do with Map.
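One way to write it, along the lines described above:

[code]
(* Leading digit: floor each value, then divide by 10^(digit count - 1) and floor again *)
leadDigits = Map[Floor[Floor[#]/10^(IntegerLength[Floor[#]] - 1)] &, cleanGDP];
(* First[IntegerDigits[Floor[#]]] & is an equivalent one-liner *)
[/code]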

Finally, I just Tally the lead digits, SortBy the First element, Map again to pull the count out of each {digit, count} pair, then BarChart it!
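A sketch of that last stretch; I chart the relative frequencies rather than raw counts, and the ChartLabels option is my own addition:

[code]
(* Count each leading digit, sort by digit, and chart the frequencies *)
tallied = SortBy[Tally[leadDigits], First];
freqs   = N[Map[Last, tallied]/Length[cleanGDP]];
BarChart[freqs, ChartLabels -> Map[First, tallied]]
[/code]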

Short and Simple!

If interested in the BarChart generator:

You guessed it! Map BarChart over ColorData["Gradients"], with ChartStyle -> # as the extra parameter. Again, the # marks the slot where each gradient name gets plugged in as Map loops through the list.
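Something along these lines, reusing the freqs list from above:

[code]
(* One bar chart per built-in gradient color scheme *)
Map[BarChart[freqs, ChartStyle -> #] &, ColorData["Gradients"]]
[/code]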

Voila!

The actual Bar Graph:


The data works out so that--in numerical order from 1 to 9--the frequencies are: {0.25974, 0.160173, 0.125541, 0.142857, 0.103896, 0.0692641, 0.0519481, 0.0519481, 0.034632}. Benford's Law holds! '1' is roughly 26% (near the predicted 30%), '9' is under 5%, and the leading digits of the GDP per capita of almost all the countries in the world (minus the 8 rebels with no data) fall off roughly as the Benford distribution predicts (except '4'...)
A Benford distribution should follow approximately: {0.301, 0.176, 0.125, 0.097, 0.079, 0.067, 0.058, 0.051, 0.046}

Friday, August 12, 2011

SCNN1A Code

I'm going to try to turn the genetic code of the SCNN1A gene (found in Mathematica's [code]GenomeData[/code]) into music. Generally, the accepted method is to translate the various types of bonds into rests, the codons into notes, and larger structures into dynamics. It's pretty weird, but I found some interesting articles on it and on how various scientists have done it. Wish me luck!
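A minimal sketch of the very first step, assuming GenomeData exposes the sequence through a "FullSequence" property:

[code]
(* Fetch the SCNN1A sequence and group its bases into codons (triplets) *)
seq    = GenomeData["SCNN1A", "FullSequence"];
codons = Partition[Characters[seq], 3];
[/code]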

Friday, July 8, 2011

Sentence Variation Affecting Essay Quality?

You hear "Sentence variation!" from well-meaning teachers all too often. The advice is probably helpful, but does it have any measurable effect? Or is it just that sentence variation happens to show up in good writing?

I decided to check it out, and let's just say that the results are ambiguous so far. I'm not exactly your Galileo type, so I didn't do any exhaustive measurements--just enough to exhaust my time. First off, locate some essays of high caliber (objectively) and low caliber (objectively).

My sources: NY Times for good, Bookrags (one of those essay-selling websites for naughty under-performing students).

So I whittled down the process to a few lines of code:

1) Copy and paste the NY Times (good) essay into a cell and put quotes around it, removing all extraneous quotes inside the corpus.

2) Assign the essay to essayNYTimes

3) Now StringSplit it on {".", "!", "?"} and then Map another StringSplit over the result--with no delimiter it splits on whitespace--to break each sentence into words.

Now we're past the tedium.

4) Map Length over the sentences you've gotten (the list of sentences, each itself a list of words).

5) ListPlot it with the Joined option set to True.

6) OPTIONAL: Find the Mean of the words per sentence. Do the above for various corpora; a sketch of steps 3-6 appears just below, followed by the results:
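A minimal sketch of the pipeline, assuming essayNYTimes already holds the pasted text:

[code]
(* Split into sentences, split sentences into words, and plot the words-per-sentence profile *)
sentences = StringSplit[essayNYTimes, {".", "!", "?"}];
words     = Map[StringSplit, sentences];   (* StringSplit with no delimiter splits on whitespace *)
lengths   = Map[Length, words];
ListPlot[lengths, Joined -> True]
N[Mean[lengths]]                           (* optional: average words per sentence *)
[/code]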


I also did the means, and the NY Times articles came out to about 15 words/sentence, whereas the BookRags essays came out to well over 20 words/sentence. The gap is most likely tied to the wider spread of sentence lengths in the weaker pieces (I perused each essay, and the poor essays seemed to have pompous sentences that hardly ever ended. Pithiness is golden.)

Saturday, April 9, 2011

GDP and Latitude!

Apparently, there's some correlation between latitude and GDP. I found out about this recently, and I don't think I've ever posted my results up here (perhaps because the results were disheartening).

Anyways, it wasn't hard at all. It does take a lot of data-pulling, though: CountryData and CityData supply the results.

Method!

Make a Module so you can repeat this process for all the countries--I love how computers can be such assiduous workers. Macs are by far more reliable, though, compared to PCs, which can get corrupted or burn out easily. Anyways, I confess I digress; moving on!

In the Module, you'll want to access the "CapitalCity" of the given country, that city's "Latitude", and the country's "GDPPerCapita".

Following that, make a List with 'x' as "Latitude" and 'y' as "GDPPerCapita".

Now, you might also want to use Tooltip so that you can identify each point on a ListPlot by simply hovering your cursor over it! Fantastic stuff!

Run the Module over all the countries, then filter (using Select) to keep only the countries whose coordinate pairs contain two valid numbers.
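A sketch along those lines; the helper name latGDP is mine, and this assumes the latitude and GDP values come back as plain numbers (newer versions may return Quantity objects):

[code]
(* {latitude of capital, GDP per capita} for one country, wrapped in a Tooltip showing the name *)
latGDP[country_] := Module[{capital, lat, gdp},
  capital = CountryData[country, "CapitalCity"];
  lat     = CityData[capital, "Latitude"];
  gdp     = CountryData[country, "GDPPerCapita"];
  Tooltip[{lat, gdp}, country]
  ];

points = Map[latGDP, CountryData[]];
valid  = Select[points, VectorQ[First[#], NumberQ] &];  (* keep only fully numeric pairs *)
ListPlot[valid]
[/code]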

As always, the above is just the underpinnings rather than my full code; if you really want to know the exact code I used, email me! I believe my address is somewhere on my Blogger profile.




The data isn't so exciting, huh. It's kind of parabolic, but still disheartening, both in the results themselves and in the fact that the great majority of the world belongs to Nixon's "Silent Majority"--unable to raise their voices, oppressed, and stuck in a cycle of poverty. Most countries (more than half) are below even 15,000 GDP per capita.

Monday, January 31, 2011

The Great Return!

Hey all,

I'm back from a hiatus, working on Image Processing (SR 205) under James Choi again and hoping to get a shot at a week-long internship/volunteer program under Dr. Konopka, a neurologist in Chicago, during the summer. This will, again, be a record of my endeavors, the first of which will be a maze-solving algorithm, which the Wolfram Blog introduced and which I want to develop into simpler code for many mazes. We'll see where I end up, and if I fail or find something more interesting, I might just move on!

Vamonos!