Science Garden Log: Language Comparisons

Thursday, December 2, 2010

Zipf Exploration Part II

Hi all,

Just a quick follow-up to the original Zipf Exploration (below).

I redid the code this time in order to minimize uncertainties. Specifically, I removed as many forms of punctuation as I could:

, . ! " ' ; : ? -

using a terrific combo of StringPosition, StringDrop, and Map. I got the positions of all the above punctuation marks in Homer's The Iliad using StringPosition, which takes two parameters: the text to search and the snippets to find (you can have several on your "hit list"; just make sure they're Strings in a comma-delimited list). Afterwards, use a While loop to subtract from each position its zero-based index in the list (Java-style counting), because every time you delete a punctuation mark, each later character is jerked back one spot. For example, the second value in your position list (I called them targets) will be wrong once you delete the first mark, so you need to subtract 1 from the second value, 2 from the third value, and so on. Then run another While loop to remove all the punctuation marks!
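The index-shifting trick can be sketched outside Mathematica too. What follows is a minimal Python illustration of the idea, not the original notebook code; the names `PUNCTUATION` and `strip_punctuation` and the sample sentence are made up for this example.

```python
# Hypothetical sketch of the "shift the targets" idea, in Python rather than Mathematica.
PUNCTUATION = set(',.!"\';:?-')

def strip_punctuation(text):
    # Positions of every punctuation mark in the untouched text (the "targets").
    targets = [i for i, ch in enumerate(text) if ch in PUNCTUATION]
    # Every earlier deletion pulls later characters back by one spot, so the
    # k-th target (counting from 0) has to be reduced by k.
    shifted = [pos - k for k, pos in enumerate(targets)]
    # Delete one character at a time at the corrected positions.
    for pos in shifted:
        text = text[:pos] + text[pos + 1:]
    return text

if __name__ == "__main__":
    print(strip_punctuation("Sing, O goddess, the anger of Achilles!"))
```

Deleting from the last target backwards would avoid the subtraction altogether, but the version above follows the procedure described in the post.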

That should take care of a whole bunch of the boogers, but we still have the problem of case-sensitivity! So you gotta use ToUpperCase (or ToLowerCase if you prefer) to get a uniform word count. The rest is basics, which I don't want to re-explain, so check out the initial Zipf Exploration post if you're interested.
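To see why the case step matters, here is a small Python sketch (again, not the Mathematica code from the post). The function name `word_counts` and the sample sentence are assumptions made for illustration.

```python
from collections import Counter

def word_counts(text, fold_case=True):
    # Upper-casing first plays the role of ToUpperCase: "The", "the" and "THE"
    # all end up tallied as the same word.
    if fold_case:
        text = text.upper()
    return Counter(text.split())

if __name__ == "__main__":
    sample = "The wrath of the son and the anger of THE king"
    print(word_counts(sample, fold_case=False))  # case variants counted separately
    print(word_counts(sample))                   # case variants merged
```

Without folding, the three spellings of "the" are split across separate entries, which drags the top word's count down.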

BTW, here's The Iliad's breakdown:

Top-ranked words: {THE, AND, OF, TO, HE, HIS, IN, HIM, YOU, A, WITH, THAT, FOR, AS, I}

Sunday, October 24, 2010

Zipf Exploration

During my Wikipedia trawling, I found this very, very cool concept from a linguist, Mr. Zipf. He expressed the probability of running into a word as a function of its popularity ranking in any corpus of text. Think about that. I've given it a very unassuming definition, but it's ridiculous to think that a simple discrete ranking (e.g. 1st most popular, 2nd most popular) can predict a word's frequency of usage (e.g. a 0.10823 probability of running into word X).

Just goes to show the deep roots between Numbers and Letters!

Anyways, I was pretty psyched, so I figured I'd do some Mathematicking: I constructed a graph and a module to compare the expected frequency of a word with its actual frequency in various corpora of the ages.

First, I'll go over the code:

1) Find a plain text or html file of the whole corpus (must be large) to be tested. I used Aeschylus' The Eumenides and Stevenson's Treasure Island. I found them via OTA, the Oxford Text Archive.

2) Import the whole file into Mathematica.

3) Separate the words so each one is an individual element of one list, using StringSplit.

4) Tally the list, SortBy the Last element (the count), then finally Reverse it so the most frequent word sits at the top of the list rather than at the bottom, just to satisfy your burning curiosity about which word is most frequent. I'll let you figure out which word would be most common.
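Steps 3 and 4 can be sketched in Python as follows. This is only an illustration of the split, tally, sort, reverse pipeline, not the original Mathematica; `rank_words` is a hypothetical name, and the input is assumed to be a string that has already been read from the file.

```python
def rank_words(text):
    # Step 3: one list element per word (split on whitespace).
    words = text.split()
    # Step 4: tally into (word, count) pairs ...
    tally = {}
    for w in words:
        tally[w] = tally.get(w, 0) + 1
    # ... sort by the last element of each pair (the count) ...
    pairs = sorted(tally.items(), key=lambda pair: pair[-1])
    # ... and reverse so the most frequent word comes first.
    pairs.reverse()
    return pairs

if __name__ == "__main__":
    for word, count in rank_words("a b a c a b"):
        print(word, count)
```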

MODULE:
We want to make a ListPlot of the differences in probability of running into any word, as a function of its popularity ranking. Luckily, Mathematica's ListPlot will automatically number each entry in a uni-level list so we can just worry about getting the probabilities for both Zipf's expected values and the Author's actual values.

5) Form two empty lists, expectedValList and actualValList.

Make a Module which takes one parameter rank_.

The Module will Append Zipf's expected probability, calculated by:

$\frac{1}{r \times \ln(1.78R)}$

where $r$ is the usage ranking (popularity of use in the corpus) and $R$ is the number of distinct (individual) words in the corpus,

to the expectedValList, and Append the actual probability of running into the word:

$\frac{n}{T}$

where $n$ is the number of times the word is repeated and $T$ is the total number of words in the corpus,

to the actualValList.
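The two formulas can be sketched in Python like this. It is an illustration only, not the Module from the post; the function names are assumptions, and the input is assumed to be a list of counts already sorted from most to least frequent. One note on the constant: $\ln(1.78R)$ is close to the harmonic sum $1 + \frac{1}{2} + \dots + \frac{1}{R}$ (because $e^{\gamma} \approx 1.78$), so the expected probabilities over all $R$ ranks add up to roughly 1.

```python
import math

def zipf_expected(rank, distinct_words):
    # Expected probability 1 / (r * ln(1.78 R)) for the word at this rank.
    return 1.0 / (rank * math.log(1.78 * distinct_words))

def actual_probability(count, total_words):
    # Observed probability n / T.
    return count / total_words

def compare(ranked_counts, top=10):
    # ranked_counts: word counts sorted from most to least frequent (assumed input).
    R = len(ranked_counts)   # number of distinct words
    T = sum(ranked_counts)   # total number of words
    expected, actual = [], []
    for rank in range(1, min(top, R) + 1):
        expected.append(zipf_expected(rank, R))
        actual.append(actual_probability(ranked_counts[rank - 1], T))
    return expected, actual

if __name__ == "__main__":
    expected, actual = compare([5, 3, 2], top=3)
    for rank, (e, a) in enumerate(zip(expected, actual), start=1):
        print(rank, round(e, 4), round(a, 4))
```

The two returned lists correspond to expectedValList and actualValList, ready to be plotted against rank.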

DATA:

6) Now run a While loop for however many top words you want to compare Zipf's expectations to, and...

PLOT:

7) Now it's a matter of ListPlot!

RESULTS:

A comparison of Stevenson's frequency of word choice (red) and Zipf's expected frequency of word choice (blue). This is for Treasure Island.

Words in Rank:

{the, and, a, of, I, to, was, in, had, he, that, his, with, my, as, you, for, on, it, at, we, but, not, were, me, by, have, all, said, be, this, one, from, so, out}

Similar graph for Aeschylus' The Eumenides:

Words in Rank:

{the, to, of, a, in, and, I, you, my, your, this, for, with, all, his, on, CHORUS, our, who, that, LEADER, no, from, is, by, will, he, as, You, not, we, those, their, ATHENA, have}

UNCERTAINTIES:
There are quite a few more discrepancies in the Aeschylus Zipf comparison than in the Stevenson one. I'm assuming this is because the text was translated from Greek, so the word choices aren't native English even though they follow English syntax. Also, the code I've written does not account for punctuation, so the counts may be off, though I wouldn't expect that to affect them significantly. Percentage-wise, the discrepancies shrink noticeably as you move down the ranking.

Sunday, January 24, 2010

Language Comparisons

Recently I have been comparing languages:

-avg. word length
-longest word
-number of words
-plot of how many words (x) have a certain length (y)

My algorithm:

I first defined a function, built around a Module, that takes two arguments: the two languages to compare.

In the module:
Using DictionaryLookup[], I got all the words (that Mathematica has) in each of the two languages. Inside, I created another function that makes rules pairing each word in the language with its length (you'll soon see the whole point of that function). Then I made it print the number of words, the average word length, and the longest words, followed by a plot of word counts against word length for both languages.
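A rough Python sketch of the same statistics is below. It is not the Mathematica module: it takes plain word lists instead of calling DictionaryLookup, the names `language_stats` and `language_comparisons` are hypothetical, and the tiny word lists are made-up samples rather than real dictionaries.

```python
from collections import Counter

def language_stats(words):
    # Pair each word with its length (the "rules" described above).
    pairs = [(w, len(w)) for w in words]
    lengths = [n for _, n in pairs]
    longest = max(lengths)
    return {
        "count": len(words),
        "average_length": sum(lengths) / len(lengths),
        "longest_words": sorted(w for w, n in pairs if n == longest),
        # How many words have each length: the data behind the plot.
        "length_histogram": dict(sorted(Counter(lengths).items())),
    }

def language_comparisons(words_a, words_b):
    # Return the statistics for both languages side by side.
    return language_stats(words_a), language_stats(words_b)

if __name__ == "__main__":
    # Made-up sample lists standing in for full dictionaries.
    stats_a, stats_b = language_comparisons(
        ["sol", "lua", "estrela"], ["sol", "luna", "estrella"]
    )
    print(stats_a)
    print(stats_b)
```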

Out of the module:
Using the module I created, I could compare any two of the following languages

Arabic, BrazilianPortuguese, Breton, BritishEnglish, Catalan, Croatian, Danish, Dutch, English, Esperanto, Faroese, Finnish, French, Galician, German, Hebrew, Hindi, Hungarian, IrishGaelic, Italian, Latin, Polish, Portuguese, Russian, ScottishGaelic, Spanish, Swedish

simply by typing:

languageComparisons["Portuguese", "Spanish"]

Isn't it tantalizing?
