Friday, September 16, 2011

Data Crunch Quirk!

"Experts often possess more data than judgment." Good ol' Colin Powell.

I must disagree. At least in the case of Benford's Law, one man really was over-analyzing, trying to extract some judgment from his data, and he found connections where, in my view, there should be none.

Apparently, in many sets of numerical data, the leading digits follow a particular distribution: '1' appears more often than '2', '2' more often than '3', and so on down to '9'. A leading '1' should turn up about 30% of the time, and a leading '9' less than 5%.

Anyways, the thought intrigued me, as I never would have considered '1' a more common leading digit in data (regardless of units) than '9'. The math is easily explained by Wikipedia (as always), but what really draws me is the data crunching (which Wikipedia cannot easily show). I chose GDP per capita for nearly all the countries in the world (some are not available; North Korea is an obvious example), which is readily available from Mathematica's data, and wapow! It generated several noteworthy graphs:

This code was comparatively easier than that of my former pursuits (Dandelion Cellular Automata, GDP per Capita related to Latitude, Zipf's Law, etc.) so I got creative with coloring the graphs above!

Back to what matters:

The Code

I extracted some CountryData. Using Map, I found the "GDPPerCapita" for all countries in CountryData[]. This processing took very little time, fortunately! There are 239 countries in Mathematica.

Following the data extraction of the GDP, we face the trouble of entries that read "Missing[NotAvailable]". We weed out the little buggers with Select and, *drumroll* (this took quite a bit of trouble to find), StringFreeQ, telling it to look for "Available", a string that never appears in the numerical data we want. There are now 231 countries in our GDPPerCapita data list.

With the data purified, all we need now is the lead digit. Suffice it to say that there's a bit of Flooring and a use of IntegerLength, but altogether it's simple to do with Map.

Finally, I just Tally the lead digits, SortBy the First element, Map again to get the first value of each sublist, then BarChart it!

Short and Simple!
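
For anyone following along without Mathematica, here is a minimal Python sketch of the same steps (drop the missing entries, take each lead digit, tally the shares). It is an analogue, not the Mathematica code I ran: the sample values, the name leading_digit, and the use of None for missing data are all made up for illustration.

```python
import math
from collections import Counter

def leading_digit(value):
    """Lead digit of a number >= 1, via floor and digit count
    (the analogue of Floor and IntegerLength)."""
    whole = math.floor(value)
    n_digits = len(str(whole))
    return whole // 10 ** (n_digits - 1)

# Hypothetical GDP-per-capita values; None plays the role of Missing[NotAvailable].
sample = {"A": 48112.3, "B": None, "C": 1320.0, "D": 912.5, "E": 15400.0, "F": 2750.9}

# Keep only the entries that actually have a number (the Select step).
available = [v for v in sample.values() if v is not None]

# Lead digit of each value (the Map step), then tally them.
digits = [leading_digit(v) for v in available]
counts = Counter(digits)

# Share of each digit 1..9, in order, ready for a bar chart.
shares = [counts.get(d, 0) / len(digits) for d in range(1, 10)]
print(shares)
```

With real data, shares would be the list handed to the bar chart.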

If interested in the BarChart generator:

You guessed it! Map BarChart over the color schemes, passing each one in as ChartStyle -> #. Again, the # marks the slot that each element is fed into as Map loops through the list, which here is ColorData["Gradients"].


The actual Bar Graph:

The data works out so that, in numerical order from 1-9, the shares are: {0.25974, 0.160173, 0.125541, 0.142857, 0.103896, 0.0692641, 0.0519481, 0.0519481, 0.034632}. Benford's Law holds, roughly! 1 comes in at about 26% (against an expected ~30%), 9 is < 5%, and the share of each leading digit among the GDP per capita figures of almost all the countries in the world (minus 8 rebels) decreases in line with the Benford distribution (except 4...).
A Benford distribution should follow approx: {0.301, 0.176, 0.125, 0.0969, 0.079, 0.067, 0.058, 0.051, 0.046}
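
The expected shares come from the standard Benford formula, log10(1 + 1/d) for lead digit d. A short Python sketch (again an analogue, not my Mathematica code) that recomputes them and lines them up against the observed shares listed above:

```python
import math

# Observed shares for lead digits 1..9, copied from the list above.
observed = [0.25974, 0.160173, 0.125541, 0.142857, 0.103896,
            0.0692641, 0.0519481, 0.0519481, 0.034632]

# Benford's expected share for lead digit d is log10(1 + 1/d).
expected = [math.log10(1 + 1 / d) for d in range(1, 10)]

for d, (obs, exp) in enumerate(zip(observed, expected), start=1):
    print(f"{d}: observed {obs:.3f}  expected {exp:.3f}  gap {obs - exp:+.3f}")

# Digit whose observed share strays furthest from the expected one.
largest_gap_digit = max(range(1, 10),
                        key=lambda d: abs(observed[d - 1] - expected[d - 1]))
print("largest gap at digit", largest_gap_digit)
```

The biggest gap lands on 4, which matches the odd bar in the chart.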

Friday, August 12, 2011


I'm going to try to turn the genetic code of the SCNN1A gene (found in Mathematica's GenomeData) into music. Generally, the accepted method is to translate the various types of bonds into rests, the codons into notes, and the structures into dynamics. It's pretty weird, but I found some interesting articles on it and on how various scientists did it. Wish me luck!

Friday, July 8, 2011

Sentence Variation Affecting Essay Quality?

It's much too often that you hear "Sentence Variation!" from well-meaning teachers. It's often very helpful too, but has it any real use? Or is it just that sentence variation works well in good writing?

I decided to check it out, and let's just say that the results are ambiguous as of yet. I'm not exactly your Galileo type so I didn't do any exhaustive measurements, but just enough to exhaust my time. First off, locate some essays of high caliber (objectively) and low caliber (objectively).

My sources: NY Times for the good, BookRags (one of those essay-selling websites for naughty under-performing students) for the bad.

So I whittled down the process to a few lines of code:

1) Copy and Paste the NY Times (good) essay into a cell and put quotes around it, remove all extraneous quotes inside the corpus.

2) Assign the essay to essayNYTimes

3) Now you must StringSplit it among {".", "!", "?"} and then Map another StringSplit over the result, this time among spaces (to split the split sentences into words).

Now we're past the tedium.

4) Map Length over the sentences you've gotten (the list of lists, one list of words per sentence).

5) ListPlot it with the Joined option set to True.

6) OPTIONAL Find the Mean of the words/sentences. Do the above for various corpora. Results:

I also did means, and the NY Times articles came out with about 15 words/sentence, whereas BookRags essays came out to well over 20 words/sentence. It's most likely due to the greater range of sentence lengths (I perused each essay and the poor essays seemed to have pompous sentences that hardly ever ended. Pithiness is golden.)
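
For the curious, the steps above boil down to something like this Python sketch. It is an analogue of the Mathematica steps rather than the code I ran: the sample text is made up, sentence_lengths is a name I invented here, and it splits words on whitespace.

```python
import re
from statistics import mean

def sentence_lengths(text):
    """Split text into sentences on . ! ? and return the word count of each."""
    sentences = [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]
    return [len(s.split()) for s in sentences]

# Made-up sample standing in for a pasted essay.
essay = "Pithiness is golden. Does it help? Long sentences that wander on and on rarely do!"

lengths = sentence_lengths(essay)
print(lengths)        # one word count per sentence, in order
print(mean(lengths))  # average words per sentence
```

Plotting lengths as a joined line gives the same kind of picture as the ListPlot in step 5.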

Saturday, April 9, 2011

GDP and Latitude!

Apparently, there's some correlation between Latitude and GDP. I found out about this recently, and I don't think I've ever posted my results up here (potentially since the results were disheartening).

Anyways, it wasn't hard at all. It's a lot of data-pulling though: CountryData, CityData to get the results.


Make a Module so you can repeat this process several times for all the countries--I love how computers can be such assiduous workers. Macs are by far more reliable though, compared to PC's that can get corrupted or burn out easily. Anyways, I confess I digress; moving on!

In the Module, you'll want to access the "CapitalCity" and "Latitude" of the given country, as well as the "GDPPerCapita" of the desired country.

Following that, make a List with 'x' as "Latitude" and 'y' as "GDPPerCapita".

Now, you might also want to use Tooltip so that you can identify each point on a ListPlot by simply hovering your cursor over it! Fantastic stuff!

Run the Module once for each country, then filter the results, using Select to keep only the countries that have valid numerical data for both coordinates.
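
The filtering idea, as a rough Python sketch with made-up records (this is not my Mathematica code; the country names, the numbers, and the helper is_number are placeholders, and None stands in for a missing value):

```python
# Hypothetical (name, capital latitude, GDP per capita) records.
records = [
    ("Country A", 59.9, 52000.0),
    ("Country B", None, 8700.0),
    ("Country C", -1.3, 1400.0),
    ("Country D", 35.7, None),
    ("Country E", 14.6, 12000.0),
]

def is_number(x):
    """True for real numeric values, False for None or anything else."""
    return isinstance(x, (int, float)) and not isinstance(x, bool)

# Keep only countries with both coordinates; the name rides along as a label,
# playing the role of the Tooltip.
points = [(lat, gdp, name) for name, lat, gdp in records
          if is_number(lat) and is_number(gdp)]

# How many of the surviving countries sit below 15000 GDP per capita.
below_15000 = sum(1 for _, gdp, _ in points if gdp < 15000)
print(points)
print(below_15000, "of", len(points), "below 15000")
```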

As always, I didn't reveal my whole code, but just the underpinnings and if you really want to know the code I used, email me! I believe it's somewhere on my Blogger Profile.

The data isn't so exciting, huh. It's kind of parabolic, but still disheartening, in both results and the fact that the great majority of the world are of Nixon's "Silent Majority"--unable to raise their voices, oppressed, and stuck in a cycle of poverty. Most countries (more than half) are below even 15000 GDP Per Capita.

Monday, January 31, 2011

The Great Return!

Hey all,

I'm back from a hiatus, working on Image Processing (SR 205) under James Choi again, hoping to get a shot at a week-long internship/volunteer program under Dr. Konopka, neurologist, in Chicago during the summer. This will, again, be a record of my endeavors, the first of which will be a maze-solving algorithm, which the Wolfram blog introduced, and which I want to develop into simpler code that works for many mazes. We'll see where I end up, and if I fail or find something more interesting, I might just move on!


Thursday, December 2, 2010

Zipf Exploration Part II

Hi all,

Just a quick second edition to the original Zipf Exploration (below, or click the title).

I redid the code this time in order to minimize uncertainties. Specifically, I removed as many forms of punctuation as I could:

, . ! " ' ; : ? -

using a terrific combo of StringPosition, StringDrop, and Map. I got the position of all the above punctuation marks in Homer's The Iliad using StringPosition, which takes two parameters: one for the foundation text, the other for the snippets to find (you can have multiple on your "hit list," just make sure they're in String format and comma-delimited). Afterwards, use a While loop to subtract from each position its zero-based index in the list (counting Java-style, from 0), since once you start deleting punctuation marks from the text, every later character is jerked back one spot. For example, the second value in your position list (I called them targets) is going to be wrong once you delete the first comma. Thus, you need to subtract 1 from the second value, 2 from the third value, and so on. Then, use another While loop to remove all the punctuation marks!

That should take care of a whole bunch of the boogers, but we still have the problem of case-sensitivity! So you gotta use ToUpperCase (or ToLowerCase if you so prefer) so you can have a very uniform word count. And the rest is basics, which I don't want to re-explain, so check out the initial Zipf Exploration post if you're interested.
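
Here's the position-shifting trick as a small Python sketch. It is an analogue of the StringPosition/StringDrop approach, not my actual code; the sample line and the function name are made up.

```python
# The punctuation "hit list" from above: , . ! " ' ; : ? -
PUNCTUATION = set(',.!"\';:?-')

def strip_punctuation_by_position(text):
    """Delete punctuation one character at a time, correcting each target
    position for the deletions that came before it."""
    # Positions of every punctuation mark in the original text.
    targets = [i for i, ch in enumerate(text) if ch in PUNCTUATION]
    # The k-th target (counting from 0) has shifted left by k by the time we reach it.
    adjusted = [pos - k for k, pos in enumerate(targets)]
    for pos in adjusted:
        text = text[:pos] + text[pos + 1:]
    return text

# Made-up sample line, not a quotation from any translation.
line = 'Sing, goddess; "rage" - now!'

# Strip the punctuation, then uppercase so the word counts are case-insensitive.
cleaned = strip_punctuation_by_position(line).upper()
print(cleaned.split())
```

In Python you could skip the bookkeeping entirely and just filter out the characters, which gives the same result; the version above is only there to show why the offsets need adjusting.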

BTW, here's The Iliad's breakdown:

Top-ranked words: {THE, AND, OF, TO, HE, HIS, IN, HIM, YOU, A, WITH, THAT, FOR, AS, I}

Sunday, October 24, 2010

Zipf Exploration

During my Wikipedia trawling, I found this very very cool concept from a linguist, Mr. Zipf. He related a word's frequency to its popularity ranking in any corpus of text. Think about that. I've given it a very unassuming definition, but it's ridiculous to think that a simple discrete ranking (e.g. 1st most popular, 2nd most popular) can predict a word's frequency of usage (e.g. 0.10823 probability of running into word X).

Just goes to show the deep roots between Numbers and Letters!

Anyways, so I was pretty psyched, so I figured to do some Mathematicking and constructed a graph and module to compare the differences between the expected frequency of a word and the actual frequency of a word in various corpora of the ages.

First, I'll go over the code!:

1) Find a plain text or html file of the whole corpus (must be large) to be tested. I used Aeschylus' The Eumenides and Stevenson's Treasure Island. I found them via OTA, the Oxford Text Archive.

2) Import into Mathematica the whole file.

3) Separate each word so they're individual elements of one list, using StringSplit.

4) Tally the list, SortBy the Last element, then finally Reverse it to get the most frequent word on top of the list rather than at bottom, just to satiate your burning desires of which word is most frequent. I'll allow you to figure what word would be most common.

We want to make a ListPlot of the differences in probability of running into any word, as a function of its popularity ranking. Luckily, Mathematica's ListPlot will automatically number each entry in a uni-level list so we can just worry about getting the probabilities for both Zipf's expected values and the Author's actual values.

5) Form two empty lists, expectedValList and actualValList.

Make a Module which takes one parameter rank_.

The Module will Append Zipf's expected probability for that rank, calculated from r, the usage ranking (popularity of use in the corpus), and N, the number of individual words in the corpus, to the expectedValList.

It will also Append the actual probability of running into the word, n/W, where n is the number of times the word is repeated and W is the number of words in the corpus, to the actualValList.


6) Now run a While loop for however many top words you want to compare Zipf's expectations to, and...


7) Now it's a matter of ListPlot!
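
A rough Python sketch of steps 3 through 6, for anyone without Mathematica. It is an analogue, not my code, and it rests on one loud assumption: the expected probability uses the textbook form of Zipf's law with exponent 1, P(r) = 1/(r * H), where H is the sum of 1/k over the N distinct words. The sample sentence and the name zipf_comparison are made up too.

```python
from collections import Counter

def zipf_comparison(words, top_n):
    """Return (ranked words, expected Zipf probabilities, actual probabilities)
    for the top_n most frequent words."""
    counts = Counter(words)                 # the Tally step
    ranked = counts.most_common(top_n)      # most frequent word first
    total_words = len(words)
    distinct_words = len(counts)

    # ASSUMPTION: Zipf's law with exponent 1, normalized over the distinct words.
    harmonic = sum(1 / k for k in range(1, distinct_words + 1))
    expected = [1 / (rank * harmonic) for rank in range(1, len(ranked) + 1)]

    # Actual probability: times the word appears over total words in the corpus.
    actual = [count / total_words for _, count in ranked]
    return ranked, expected, actual

# Tiny made-up corpus; a real run needs a whole book.
words = "the cat and the dog and the bird saw the cat".split()

ranked, expected, actual = zipf_comparison(words, top_n=3)
for (word, _), exp, act in zip(ranked, expected, actual):
    print(f"{word}: expected {exp:.3f}  actual {act:.3f}")
```

The expected and actual lists are the two series that go into the plot.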


A comparison of Stevenson's frequency of word choice (red) and Zipf's expected frequency of word choice (blue). This is for Treasure Island.

Words in Rank: {the, and, a, of, I, to, was, in, had, he, that, his, with, my, as, you, for, on, it, at, we, but, not, were, me, by, have, all, said, be, this, one, from, so, out}

Similar graph for Aeschylus' The Eumenides:

Words in Rank: {the, to, of, a, in, and, I, you, my, your, this, for, with, all, his, on, CHORUS, our, who, that, LEADER, no, from, is, by, will, he, as, You, not, we, those, their, ATHENA, have}

There are quite a few discrepancies in the Aeschylus Zipf comparison, compared to Stevenson Zipf comparison. I'm assuming this is because it was translated from Greek, and thus not applicable to English, even though written in English syntax. Also, the code I've written does not account for discrepancies in punctuation, so the counts may be off, but I wouldn't believe it to significantly affect the count. The discrepancies get noticeably less with each ranking increase, percentage-wise.