During my Wikipedia trawling, I found this very cool concept from a linguist, Mr. Zipf. He related how often a word is used to its popularity ranking in any corpus of text. Think about that. I've given it a very unassuming definition, but it's ridiculous to think that a simple discrete ranking (e.g. 1st most popular, 2nd most popular) can predict a word's frequency of usage (e.g. a 0.10823 probability of running into word X).
Just goes to show the deep roots between Numbers and Letters!
Anyways, I was pretty psyched, so I figured I'd do some Mathematicking: I constructed a module and a graph to compare the expected frequency of a word with its actual frequency in various corpora through the ages.
First, I'll go over the code!
1) Find a plain text or HTML file of the whole corpus to be tested (it must be large). I used Aeschylus' The Eumenides and Stevenson's Treasure Island. I found them via OTA, the Oxford Text Archive.
2) Import the whole file into Mathematica.
3) Separate the words so that each one is an individual element of a single list, using StringSplit.
4) Tally the list, SortBy the Last element, then finally Reverse it to get the most frequent word at the top of the list rather than at the bottom, just to satisfy your burning curiosity about which word is most frequent. I'll let you figure out which word is most common.
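Steps 2 to 4 can be sketched outside Mathematica too. Here is a minimal Python version, not the notebook code itself: `rank_words` and the sample sentence are made up for illustration, and `Counter.most_common` stands in for the Tally / SortBy / Reverse chain.

```python
from collections import Counter

def rank_words(text):
    """Return (word, count) pairs, most frequent first."""
    words = text.split()        # step 3: one list element per whitespace-separated word
    tally = Counter(words)      # step 4: count each distinct word
    return tally.most_common()  # sorted by count, largest first

# A tiny stand-in corpus; for a real run you would read the whole file instead.
sample = "the cat and the dog and the bird"
ranked = rank_words(sample)
print(ranked[:3])  # [('the', 3), ('and', 2), ('cat', 1)]
```

Like the steps above, this splits on whitespace only, so punctuation stays glued to the words.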
MODULE:
We want to make a ListPlot of the probability of running into any word, as a function of its popularity ranking, so that the expected and actual values can be compared. Luckily, Mathematica's ListPlot will automatically number each entry in a flat (single-level) list, so we can just worry about getting the probabilities for both Zipf's expected values and the Author's actual values.
5) Form two empty lists, expectedValList and actualValList, to feed to ListPlot. Make a Module which takes one parameter, rank_. The Module will Append Zipf's expected probability, calculated by

$$P_{\text{Zipf}}(r) = \frac{1}{r \sum_{n=1}^{N} \frac{1}{n}}$$

where $r$ is the usage ranking (popularity of use in the corpus) and $N$ is the number of individual words in the corpus, to the expectedValList, and it will Append the actual probability of running into the word,

$$P_{\text{actual}} = \frac{c}{W}$$

where $c$ is the number of times the word is repeated and $W$ is the number of words in the corpus, to the actualValList.
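To make the two probabilities concrete, here is a small Python sketch. It assumes the classic exponent-1 form of Zipf's law, normalized by the harmonic number over the N distinct words, and it reads "number of individual words" as the count of distinct words; both readings are my assumptions. The function names are hypothetical and are not the Mathematica Module.

```python
def harmonic(n):
    """Harmonic number H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / k for k in range(1, n + 1))

def zipf_expected(rank, n_distinct):
    """Expected probability of the word at this rank (assumed exponent-1 Zipf form)."""
    return 1.0 / (rank * harmonic(n_distinct))

def actual_probability(count, n_total):
    """Observed probability: occurrences of the word over all words in the corpus."""
    return count / n_total

# Toy numbers: 5 distinct words, 8 words in total, top word seen 3 times.
print(zipf_expected(1, 5))       # expected share of the top-ranked word
print(actual_probability(3, 8))  # 0.375
```

With this normalization the expected probabilities over all N ranks add up to 1, which is what makes them comparable to the actual frequencies.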
DATA:
6) Now run a While loop for however many top words you want to compare Zipf's expectations to, and...
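The loop in step 6 might look roughly like this in Python. Again this is a sketch under assumptions rather than the original notebook: `compare_top_words` is a made-up name, and the expected value uses the exponent-1 Zipf form normalized by the harmonic number over the distinct words.

```python
from collections import Counter

def compare_top_words(words, top_k):
    """Build the expected and actual probability lists for the top_k ranked words."""
    tally = Counter(words).most_common()  # (word, count) pairs, most frequent first
    n_total = len(words)                  # all words in the corpus
    n_distinct = len(tally)               # distinct words in the corpus
    h = sum(1.0 / k for k in range(1, n_distinct + 1))  # harmonic normalizer (assumed)

    expected_vals, actual_vals = [], []
    rank = 1
    while rank <= min(top_k, n_distinct):
        _, count = tally[rank - 1]
        expected_vals.append(1.0 / (rank * h))  # Zipf's prediction for this rank
        actual_vals.append(count / n_total)     # what the author actually did
        rank += 1
    return expected_vals, actual_vals

words = "the cat and the dog and the bird".split()
expected_vals, actual_vals = compare_top_words(words, 3)
print(actual_vals)  # [0.375, 0.25, 0.125]
```

Both lists are indexed by rank, so they can be plotted directly against each other.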
PLOT:
7) Now it's a matter of ListPlot!
RESULTS:
A comparison of Stevenson's frequency of word choice (red) and Zipf's expected frequency of word choice (blue). This is for Treasure Island.
Words in Rank:
{the, and, a, of, I, to, was, in, had, he, that, his, with, my, as, you, for, on, it, at, we, but, not, were, me, by, have, all, said, be, this, one, from, so, out}
Similar graph for Aeschylus' The Eumenides:
Words in Rank:
{the, to, of, a, in, and, I, you, my, your, this, for, with, all, his, on, CHORUS, our, who, that, LEADER, no, from, is, by, will, he, as, You, not, we, those, their, ATHENA, have}
UNCERTAINTIES:
There are quite a few more discrepancies in the Aeschylus Zipf comparison than in the Stevenson Zipf comparison. I'm assuming this is because the play was translated from Greek, so its word choices may not follow typical English usage even though it is written in English syntax. Also, the code I've written does not account for punctuation or capitalization (note that both "you" and "You" show up in the Eumenides list, along with speaker labels like CHORUS), so the counts may be off, but I don't believe this affects them significantly. Percentage-wise, the discrepancies get noticeably smaller as the rank number increases.
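If you wanted to check how much punctuation and capitalization matter, one option is to normalize the words before tallying. Here is a hedged Python sketch; `normalized_words` and the sample sentence are invented for illustration, and this is not what the counts above were produced with.

```python
import re
from collections import Counter

def normalized_words(text):
    """Lowercase the text and keep only runs of letters (with inner apostrophes)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

sample = 'You said it. "You," I said; you!'

raw_counts = Counter(sample.split())                # whitespace split only
clean_counts = Counter(normalized_words(sample))    # punctuation and case removed

print(raw_counts["You"])            # 1
print(clean_counts.most_common(2))  # [('you', 3), ('said', 2)]
```

Lowercasing would also fold speaker labels such as CHORUS into the ordinary word "chorus", so a play script would still need those labels stripped separately.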