Sunday, October 24, 2010

Zipf Exploration

During my Wikipedia trawling, I found a very cool concept from a linguist, Mr. Zipf. He described the probability of running into a word as a function of its popularity ranking in any corpus of text. Think about that. I've given it a very unassuming definition, but it's ridiculous to think that a simple discrete ranking (e.g. 1st most popular, 2nd most popular) can predict a word's frequency of usage (e.g. 0.10823 probability of running into word X).

Just goes to show the deep roots between Numbers and Letters!

Anyways, I was pretty psyched, so I did some Mathematicking: I built a module and a graph to compare the expected frequency of a word with its actual frequency in various corpora of the ages.



First, I'll go over the code!

1) Find a plain text or html file of the whole corpus (must be large) to be tested. I used Aeschylus' The Eumenides and Stevenson's Treasure Island. I found them via OTA, the Oxford Text Archive.

2) Import the whole file into Mathematica.

3) Separate the text into words so that each word is an individual element of one list, using StringSplit.

4) Tally the list, SortBy the Last element, then finally Reverse it to get the most frequent word at the top of the list rather than at the bottom, just to satiate your burning desire to know which word is most frequent. I'll allow you to figure out what word would be most common.
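
Steps 2 to 4 can be sketched roughly like this. It is a minimal sketch, not my exact notebook: the short inline string is a stand-in so the snippet runs without a file, and the file name in the comment is hypothetical.

```mathematica
(* Stand-in text so the sketch runs on its own. For a real corpus, use
   something like text = Import["treasure_island.txt", "Text"]
   (the file name here is hypothetical). *)
text = "the cat and the dog and the bird saw the cat";

(* Step 3: split on whitespace into one flat list of words. *)
words = StringSplit[text];

(* Step 4: Tally gives {word, count} pairs; sort by the count (the Last
   element), then Reverse so the most frequent word comes first. *)
ranked = Reverse[SortBy[Tally[words], Last]];

First[ranked]  (* the most frequent {word, count} pair: {"the", 4} *)
```

Note that splitting on whitespace alone leaves punctuation attached to words, which matches the limitation mentioned under UNCERTAINTIES.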

MODULE:
We want to make a ListPlot of the differences in probability of running into any word, as a function of its popularity ranking. Luckily, Mathematica's ListPlot will automatically number each entry in a uni-level list so we can just worry about getting the probabilities for both Zipf's expected values and the Author's actual values.

5) Form two empty lists, expectedValList and actualValList.

Make a Module which takes one parameter rank_.

The Module will Append Zipf's expected probability, calculated by:

$$P_{\text{expected}}(r) = \frac{1/r}{\sum_{n=1}^{N} 1/n}$$

where $r$ is the usage ranking (popularity of use in the corpus) and $N$ is the number of individual (distinct) words in the corpus,

to the expectedValList, and we Append the actual probability of running into the word:

$$P_{\text{actual}} = \frac{c}{T}$$

where $c$ is the number of times the word is repeated and $T$ is the total number of words in the corpus,

to the actualValList.
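
Here is a hedged sketch of what such a Module might look like. It assumes the expected probability is the normalized 1/rank form of Zipf's law (the harmonic number is the sum of 1/n over all distinct words). The toy tally and the names addRank, distinctWords and totalWords are made up for illustration.

```mathematica
(* Toy ranked tally standing in for the real one:
   {word, count} pairs, most frequent first. *)
ranked = {{"the", 4}, {"and", 2}, {"cat", 2},
          {"saw", 1}, {"dog", 1}, {"bird", 1}};

distinctWords = Length[ranked];        (* number of distinct words *)
totalWords = Total[ranked[[All, 2]]];  (* total number of words *)

expectedValList = {};
actualValList = {};

(* addRank is a hypothetical name. It appends one expected and one
   actual probability for the word at the given rank. *)
addRank[rank_] := Module[{expected, actual},
  (* Assumed Zipf form: (1/rank) divided by the sum of 1/n, n = 1..N. *)
  expected = 1/(rank HarmonicNumber[distinctWords]);
  (* Actual probability: this word's count over the total word count. *)
  actual = ranked[[rank, 2]]/totalWords;
  AppendTo[expectedValList, N[expected]];
  AppendTo[actualValList, N[actual]];
]

addRank[1];
{expectedValList, actualValList}
```

The variable names deliberately avoid a bare N, since N is a built-in Mathematica function for numerical evaluation.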

DATA:

6) Now run a While loop for however many top words you want to compare Zipf's expectations to, and...

PLOT:

7) Now it's a matter of ListPlot!
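
A rough sketch of the loop and the plot together, again on a toy tally rather than a real corpus. The names zipfExpected and topWords are hypothetical, and the expected value assumes the normalized 1/rank form of Zipf's law. The colors follow the graphs in this post: blue for Zipf, red for the author.

```mathematica
(* Toy ranked tally: {word, count} pairs, most frequent first. *)
ranked = {{"the", 4}, {"and", 2}, {"cat", 2},
          {"saw", 1}, {"dog", 1}, {"bird", 1}};
distinctWords = Length[ranked];
totalWords = Total[ranked[[All, 2]]];

(* Assumed Zipf form: (1/r) divided by the sum of 1/n for n = 1..nWords. *)
zipfExpected[r_, nWords_] := 1/(r HarmonicNumber[nWords]);

expectedValList = {};
actualValList = {};

topWords = 5;  (* how many top-ranked words to compare *)
rank = 1;
While[rank <= topWords,
  AppendTo[expectedValList, N[zipfExpected[rank, distinctWords]]];
  AppendTo[actualValList, N[ranked[[rank, 2]]/totalWords]];
  rank++
];

(* ListPlot numbers the entries of each flat list 1, 2, 3, ...,
   so the x axis is the rank. *)
ListPlot[{expectedValList, actualValList},
  PlotStyle -> {Blue, Red},
  Joined -> True,
  AxesLabel -> {"rank", "probability"}]
```

A Table over the ranks would do the same job as the While loop; the loop is kept here only because that is how the steps are described.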

RESULTS:

A comparison of Stevenson's frequency of word choice (red) and Zipf's expected frequency of word choice (blue). This is for Treasure Island.

Words in Rank: {the, and, a, of, I, to, was, in, had, he, that, his, with, my, as, you, for, on, it, at, we, but, not, were, me, by, have, all, said, be, this, one, from, so, out}


Similar graph for Aeschylus' The Eumenides:


Words in Rank: {the, to, of, a, in, and, I, you, my, your, this, for, with, all, his, on, CHORUS, our, who, that, LEADER, no, from, is, by, will, he, as, You, not, we, those, their, ATHENA, have}

UNCERTAINTIES:
There are quite a few more discrepancies in the Aeschylus comparison than in the Stevenson one. I'm assuming this is because the play was translated from Greek, so its word choices may not follow English patterns even though it is written in English syntax. Also, the code I've written does not account for punctuation, so the counts may be off, but I don't believe it significantly affects them. Percentage-wise, the discrepancies get noticeably smaller as the rank increases.
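
One possible way to reduce the punctuation problem, sketched under assumptions: lowercase the text and keep only runs of letters and digits before tallying. The Eumenides ranking above lists "you" and "You" separately, so merging case looks worthwhile too. This is not what the original code did, and the sample sentence is made up.

```mathematica
(* Made-up sample with punctuation and mixed case. *)
text = "You said, \"You will!\" and you did.";

(* Lowercase first, then extract maximal runs of word characters
   (letters and digits). Punctuation and whitespace are dropped. *)
words = StringCases[ToLowerCase[text], WordCharacter ..];

(* Tally as before; "you" is now counted three times. *)
ranked = Reverse[SortBy[Tally[words], Last]];

First[ranked]  (* {"you", 3} *)
```

The trade-off is that apostrophes count as punctuation here, so a word like "don't" would be split into two pieces, and speaker labels such as CHORUS would still be counted as words.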