Thursday, December 2, 2010

Zipf Exploration Part II

Hi all,

Just a quick second edition to the original Zipf Exploration (below, or click the title).

I redid the code this time to cut down on noise in the word counts. Specifically, I removed as many forms of punctuation as I could:

, . ! " ' ; : ? -

using a terrific combo of StringPosition, StringDrop, and Map. I got the positions of all the above punctuation marks in Homer's The Iliad using StringPosition, which takes two parameters: the text to search, and the snippets to look for, i.e. the ones you want to remove (you can have several on your "hit list," just make sure they're Strings in a comma-delimited list). Afterwards, use a While loop to subtract from each position its zero-based index in the list (Java-style counting), since every time you delete a punctuation mark from the text, every character after it slides back one spot. For example, the second value in your position list (I called them targets) is going to be wrong once you delete the first comma. Thus, you need to subtract 1 from the second value, 2 from the third value, and so on. Then, use another While loop to remove all the punctuation marks!
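If the offset bookkeeping is hard to picture, here is a minimal sketch of the same idea. It is written in Python rather than Mathematica, so it is an analogue and not my actual code; the helper names find_targets and strip_by_position and the sample sentence are made up purely for illustration.

```python
# The punctuation "hit list" from the post.
PUNCTUATION = [",", ".", "!", '"', "'", ";", ":", "?", "-"]

def find_targets(text, marks):
    """Return the 0-based positions of every character of `text` found in `marks`."""
    mark_set = set(marks)
    return [i for i, ch in enumerate(text) if ch in mark_set]

def strip_by_position(text, targets):
    """Delete the characters at `targets` (sorted ascending), one at a time."""
    # After k deletions, every later target has slid left by k,
    # so subtract each target's 0-based index in the list from its position.
    adjusted = [pos - k for k, pos in enumerate(targets)]
    chars = list(text)
    k = 0
    while k < len(adjusted):
        del chars[adjusted[k]]  # remove one punctuation mark per pass
        k += 1
    return "".join(chars)

sample = "Sing, O goddess, the anger of Achilles!"  # hypothetical sample text
targets = find_targets(sample, PUNCTUATION)
cleaned = strip_by_position(sample, targets)
print(targets)
print(cleaned)
```

Deleting from the last target back to the first would avoid the subtraction altogether, but the version above mirrors the front-to-back approach described in the post.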

That should take care of a whole bunch of the boogers, but we still have the problem of case-sensitivity! So you gotta use ToUpperCase (or ToLowerCase if you prefer) so that "The" and "the" count as the same word and you get a uniform word count. The rest is basics, which I don't want to re-explain, so check out the initial Zipf Exploration post if you're interested.
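To see why the case step matters, here is a rough Python sketch (again an analogue, not the Mathematica from the original post). The function name rank_words and the sample sentence are hypothetical, and the text is assumed to have had its punctuation stripped already.

```python
from collections import Counter

def rank_words(text, top_n=5):
    """Uppercase the text, split on whitespace, and return the top_n most frequent words."""
    words = text.upper().split()
    return [word for word, _ in Counter(words).most_common(top_n)]

sample = "the king and THE gods and The men"  # hypothetical sample text

# Without normalizing case, "the", "THE" and "The" are counted separately.
raw_counts = Counter(sample.split())

# With normalization they collapse into a single entry.
ranked = rank_words(sample, top_n=2)
print(raw_counts["the"], raw_counts["THE"], raw_counts["The"])
print(ranked)
```

In the raw count each spelling of "the" shows up once, so "and" would wrongly come out on top; after uppercasing, THE takes first place with three occurrences.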

BTW, here's The Iliad's breakdown:



Top-ranked words: {THE, AND, OF, TO, HE, HIS, IN, HIM, YOU, A, WITH, THAT, FOR, AS, I}