
Friday, September 16, 2011

Data Crunch Quirk!

"Experts often possess more data than judgment." Good ol' Colin Powell.

I must disagree. At least in the case of Benford's Law, one man really was over-analyzing, trying to squeeze some judgment out of his data, and he found connections where, in my view, there should be none.

Apparently, in numerical data, the leading digit of each data element tends to follow a particular distribution: '1' appears more often than '2', '2' more often than '3', and so on down to '9'. Specifically, digit d should lead about Log10[1 + 1/d] of the time, so '1' should be approximately 30% and '9' should be under 5%.

Anyways, the thought intrigued me, as I never would have considered '1' a more popular leading digit in data (regardless of units) than '9'. The math is easily explained by Wikipedia (as always), but what really draws me is the data crunching, which Wikipedia cannot easily show! I decided to use the GDP per capita of nearly all the countries in the world (a few are not available--North Korea is an obvious example), which is readily available from Mathematica's data, and wapow! Generated several noteworthy graphs:

This code was comparatively easier than that of my former pursuits (Dandelion Cellular Automata, GDP per Capita related to Latitude, Zipf's Law, etc.) so I got creative with coloring the graphs above!

Back to what matters:

The Code

I extracted some CountryData. Using Map, I found the "GDPPerCapita" for all countries in CountryData[]. This processing took very little time, fortunately! There are 239 countries in Mathematica.
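Something along these lines (the variable name is mine; in newer Mathematica versions the values come back wrapped in Quantity, which QuantityMagnitude would strip, but here they are plain numbers):

  gdpData = CountryData[#, "GDPPerCapita"] & /@ CountryData[];   (* one value per country *)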

Following the GDP extraction, we face the trouble of "Missing[NotAvailable]" entries. We weed out the little buggers with Select and, *drumroll* (this took quite a bit of trouble to find), StringFreeQ, telling it to look for "Available"--a string present in the missing entries but never in the numerical data we want. That leaves 231 countries in our GDPPerCapita data list.
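Roughly like this (the ToString wrapper is my guess at how the entries were fed to StringFreeQ as strings; Select[gdpData, NumberQ] gets you to the same place):

  gdpClean = Select[gdpData, StringFreeQ[ToString[#], "Available"] &];   (* numbers survive; Missing["NotAvailable"] doesn't *)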

With the data purified, all we need now is the lead digit. Suffice it to say that there's a bit of Flooring and a use of IntegerLength, but altogether it's simple to do with Map.
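One way to write that step, as a sketch:

  leadDigits = Floor[#/10^(IntegerLength[Floor[#]] - 1)] & /@ gdpClean;   (* e.g. 4521.7 -> 4; First[IntegerDigits[Floor[#]]] & would also work *)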

Finally, I just Tally the lead digits, SortBy the First element to put them in digit order, Map again to pull the count out of each {digit, count} pair, then BarChart it!
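Sketched out (dividing by the total gives the proportions quoted below):

  tallied = SortBy[Tally[leadDigits], First];          (* {digit, count} pairs, in digit order *)
  freqs = N[(Last /@ tallied)/Length[gdpClean]];       (* proportion of each leading digit *)
  BarChart[freqs, ChartLabels -> tallied[[All, 1]]]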

Short and Simple!

If interested in the BarChart generator:

You guessed it! Map BarChart, with ChartStyle -> # as an extra option, over ColorData["Gradients"]. Again, the # marks the slot where each gradient name gets plugged in as Map loops through the list.
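In other words, something like:

  BarChart[freqs, ChartStyle -> #] & /@ ColorData["Gradients"]   (* one chart per named gradient *)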

Voila!

The actual Bar Graph:


The data works out so that, in numerical order from 1-9: {0.25974, 0.160173, 0.125541, 0.142857, 0.103896, 0.0692641, 0.0519481, 0.0519481, 0.034632}. Benford's Law holds! 1 is approximately 30%, 9 is under 5%, and the GDP per capita figures of almost all the countries in the world (minus 8 rebels with no data) decrease roughly according to the Benford distribution (except 4...)
A Benford distribution should follow approximately: {0.301, 0.176, 0.125, 0.097, 0.079, 0.067, 0.058, 0.051, 0.046}
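Those expected probabilities come straight from Log10[1 + 1/d], so Mathematica can generate them in one line:

  N[Table[Log10[1 + 1/d], {d, 1, 9}]]
  (* {0.30103, 0.176091, 0.124939, 0.09691, 0.0791812, 0.0669468, 0.0579919, 0.0511525, 0.0457575} *)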

Saturday, April 9, 2011

GDP and Latitude!

Apparently, there's some correlation between Latitude and GDP. I found out about this recently, and I don't think I've ever posted my results up here (potentially since the results were disheartening).

Anyways, it wasn't hard at all, though it does take a lot of data-pulling: CountryData and CityData supply everything needed.

Method!

Make a Module so you can repeat this process several times for all the countries--I love how computers can be such assiduous workers. Macs are by far more reliable though, compared to PC's that can get corrupted or burn out easily. Anyways, I confess I digress; moving on!

In the Module, you'll want to access the given country's "CapitalCity" and that city's "Latitude", as well as the country's "GDPPerCapita".

Following that, make a List with 'x' as "Latitude" and 'y' as "GDPPerCapita".

Now, you might also want to use Tooltip so that you can identify each point on a ListPlot by simply hovering your cursor over it! Fantastic stuff!

Run the Module for every country, then filter with Select to keep only the countries whose points have two valid numerical coordinates.
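Here's a minimal sketch of that pipeline--the property names are the ones above, but the plumbing is a reconstruction, not my original notebook code:

  latGdpPoint[country_] := Module[{capital, lat, gdp},
    capital = CountryData[country, "CapitalCity"];
    lat = CityData[capital, "Latitude"];
    gdp = CountryData[country, "GDPPerCapita"];
    Tooltip[{lat, gdp}, country]       (* hover over a point to see which country it is *)
    ]

  points = latGdpPoint /@ CountryData[];
  valid = Select[points, VectorQ[First[#], NumberQ] &];   (* keep only points with two numeric coordinates *)
  ListPlot[valid, AxesLabel -> {"Latitude", "GDP per capita"}]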

As always, I didn't reveal my whole code, just the underpinnings; if you really want to know the code I used, email me! I believe my address is somewhere on my Blogger profile.




The data isn't so exciting, huh. It's kind of parabolic, but still disheartening, in both the results and the fact that the great majority of the world is Nixon's "Silent Majority"--unable to raise their voices, oppressed, and stuck in a cycle of poverty. Most countries (more than half) are below even $15,000 GDP per capita.

Thursday, December 2, 2010

Zipf Exploration Part II

Hi all,

Just a quick second edition to the original Zipf Exploration (below, or click the title).

I redid the code this time in order to minimize uncertainties. Specifically, I removed as many forms of punctuation as I could:

, . ! " ' ; : ? -

using a terrific combo of StringPosition, StringDrop, and Map. StringPosition takes two parameters--the foundation text and the snippets to hunt for (you can put several marks on your "hit list"; just make sure they're in String form and comma-delimited)--and it gave me the position of every punctuation mark in Homer's The Iliad. Then comes the subtle part: once you start deleting marks, every later character gets pulled back one spot, so the stored positions go stale. The second value in your position list (I called them targets) will be wrong after you delete the first comma, so you need to subtract 1 from the second value, 2 from the third, and so on; a While loop handles that adjustment. Then another While loop runs through the adjusted targets and removes all the punctuation marks!
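In code, the position-shifting idea looks roughly like this (file and variable names are placeholders, and Do stands in for the While loops; a one-line StringReplace[text, punct -> ""] would honestly do the same job):

  text = Import["iliad.txt", "Text"];                       (* placeholder file name *)
  punct = {",", ".", "!", "\"", "'", ";", ":", "?", "-"};
  targets = First /@ StringPosition[text, punct];           (* position of every punctuation mark *)
  adjusted = MapIndexed[#1 - (First[#2] - 1) &, targets];   (* each deletion pulls later characters back one spot *)
  Do[text = StringDrop[text, {pos}], {pos, adjusted}];
  words = ToUpperCase /@ StringSplit[text];                 (* case-fold so "The" and "the" tally together *)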

That should take care of a whole bunch of the boogers, but we still have the problem of case-sensitivity! So you gotta use ToUpperCase (or ToLowerCase if you prefer) to get a uniform word count. The rest is basics, which I don't want to re-explain, so check out the initial Zipf Exploration post if you're interested.

BTW, here's The Iliad's breakdown:



Top-ranked words: {THE, AND, OF, TO, HE, HIS, IN, HIM, YOU, A, WITH, THAT, FOR, AS, I}

Sunday, October 24, 2010

Zipf Exploration

During my Wikipedia trawling, I found this very, very cool concept from a linguist, Mr. Zipf. He related a word's frequency to its popularity ranking in any corpus of text. Think about that. I've given it a very unassuming definition, but it's ridiculous to think that a simple discrete ranking (e.g. 1st most popular, 2nd most popular) can predict a word's frequency of usage (e.g. a 0.10823 probability of running into word X).

Just goes to show the deep roots between Numbers and Letters!

Anyways, I was pretty psyched, so I figured I'd do some Mathematicking and constructed a graph and module to compare the expected frequency of a word with its actual frequency in various corpora of the ages.



First, I'll go over the code:

1) Find a plain text or html file of the whole corpus (must be large) to be tested. I used Aeschylus' The Eumenides and Stevenson's Treasure Island. I found them via OTA, the Oxford Text Archive.

2) Import the whole file into Mathematica.

3) Separate the words so each is an individual element of one list, using StringSplit.

4) Tally the list, SortBy the Last element, then finally Reverse it to get the most frequent word on top of the list rather than at bottom, just to satiate your burning desires of which word is most frequent. I'll allow you to figure what word would be most common.
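Steps 1 through 4, sketched out (the file name is a placeholder):

  text = Import["treasureisland.txt", "Text"];          (* placeholder file name *)
  words = StringSplit[text];                            (* one word per element *)
  wordTally = Reverse[SortBy[Tally[words], Last]];      (* most frequent word first *)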

MODULE:
We want to make a ListPlot of the differences in probability of running into any word, as a function of its popularity ranking. Luckily, Mathematica's ListPlot will automatically number each entry in a uni-level list so we can just worry about getting the probabilities for both Zipf's expected values and the Author's actual values.

5) Form two empty lists, expectedValList and actualValList.

Make a Module which takes one parameter rank_.

The Module will Append Zipf's expected probability, calculated by

P(k) = (1/k) / (1/1 + 1/2 + ... + 1/N)

where k is the usage ranking (popularity of use in the corpus) and N is the number of individual words in the corpus,

to the expectedValList, and we Append the actual probability of running into the word,

P(word) = r / T

where r is the number of times the word is repeated and T is the total number of words in the corpus,

to the actualValList.
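Put together, the Module might look something like this (names are mine; HarmonicNumber adds up the 1/1 + 1/2 + ... + 1/N series, and wordTally and words come from the sketch above):

  expectedValList = {};
  actualValList = {};
  zipfProbe[rank_] := Module[{total = Length[words], distinct = Length[wordTally]},
    AppendTo[expectedValList, N[(1/rank)/HarmonicNumber[distinct]]];
    AppendTo[actualValList, N[wordTally[[rank, 2]]/total]]
    ]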

DATA:

6) Now run a While loop for however many top words you want to compare Zipf's expectations to, and...

PLOT:

7) Now it's a matter of ListPlot!
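The driver and the plot, sketched (35 top words here, to match the plots below; red is the actual frequency, blue is Zipf's expectation):

  i = 1;
  While[i <= 35, zipfProbe[i]; i++];
  ListPlot[{actualValList, expectedValList},
   PlotStyle -> {Red, Blue}, AxesLabel -> {"rank", "probability"}]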

RESULTS:

A comparison of Stevenson's frequency of word choice (red) and Zipf's expected frequency of word choice (blue). This is for Treasure Island.

Words in Rank: {the, and, a, of, I, to, was, in, had, he, that, his, with, my, as, you, for, on, it, at, we, but, not, were, me, by, have, all, said, be, this, one, from, so, out}


Similar graph for Aeschylus' The Eumenides:


Words in Rank: {the, to, of, a, in, and, I, you, my, your, this, for, with, all, his, on, CHORUS, our, who, that, LEADER, no, from, is, by, will, he, as, You, not, we, those, their, ATHENA, have}

UNCERTAINTIES:
There are quite a few more discrepancies in the Aeschylus-Zipf comparison than in the Stevenson-Zipf comparison. I'm assuming this is because the text was translated from Greek, and so it doesn't behave quite like native English even though it's written in English syntax. Also, the code I've written does not account for punctuation, so the counts may be off, though I wouldn't expect that to affect them significantly. The discrepancies get noticeably smaller, percentage-wise, as the ranking increases.

Tuesday, July 6, 2010

EKG File Processing

I got an EKG file from Dr. Konopka via Dr. Choi and, with it, I was to chart the change in heart rate. The patient was administered a drug meant to elevate heart rate, and the EKG ran for around 5 minutes. The easiest way I thought of to process the file was an If statement nested in a For loop that checks whether a point passes a certain threshold; I counted anything below -800 as a "beat". Since the SA node doesn't fire another action potential until roughly 500 data points later (the EKG was taken at 500 Hz, so about one electrical signal from the SA node per second), I stayed on the safe side and, after each detected beat, jumped my For-loop counter i ahead by 400 as a refractory period. Each detection also increments a count value, which tallies the number of beats. The rest was easy-peasy: multiply count by (60/period specified by the user in the parameters).
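A sketch of that counter--the data is assumed to be a plain list of samples called ekg, and the threshold and skip values are the ones described above:

  countBeats[ekg_, threshold_: -800, skip_: 400] :=
   Module[{count = 0, i},
    For[i = 1, i <= Length[ekg], i++,
     If[ekg[[i]] < threshold,
      count++;
      i += skip                    (* crude refractory period: jump past the rest of this beat *)
      ]
     ];
    count
    ]

  (* at 500 samples per second, a stretch of n samples lasts n/500 seconds *)
  beatsPerMinute[ekg_] := countBeats[ekg]*60./(Length[ekg]/500)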

The EKG:


First Channel:


Second Channel:


Indeed, there is a definite positive slope, so the drug, at least for those five minutes, elevated heart rate.

By the way, the graphs above, although for two different channels, have the exact same output of: {55,57,58,60,62} in regards to BPM.

Saturday, March 20, 2010

Graph Crank

Well, I've had a pinch of time so here's my latest in Mathematica:

I received a bunch of data from a Google Survey my teacher put up for Korean moms to look at (users of MissyUSA, who are mostly Korean women immigrants in the US). At first, I was attempting to find very unrelated variables and graph them by hand (typing out the actual graph labels, where to find the data, etc.). However, that took a long time and was ineffective, so I made a module to crank out graphs in 2D ListPlot form for numerical data points. My algorithm is as such:

Before the Module:
Since I'm using Google Surveys, I Import the "FullData" from the published spreadsheet's web address.
After that's imported, we have to remove the question titles (I kept these handy since I needed them for the next part of my code), Flatten the data, and Partition the flattened data into chunks the length of one person's full response: the number of questions plus 2 (in this case, 37 questions + 1 timestamp + 1 bullet = 39 boxes). Then Map a Take over the rows to keep only the last 37 boxes--or whatever the question count is--by using -37. Finally, for this module, you have to Select the integer answers using, well, Integers as the filter.
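Roughly, and with heavy guessing about the exact nesting of the imported data (the URL is a placeholder):

  raw = Import["http://spreadsheets.google.com/pub?key=...", "FullData"];   (* published survey; key elided *)
  rows = Rest[Flatten[raw, 1]];                   (* drop the row of question titles, kept aside for labels *)
  perPerson = Partition[Flatten[rows], 39];       (* 37 questions + 1 timestamp + 1 bullet per respondent *)
  answers = Take[#, -37] & /@ perPerson;          (* keep only the 37 question columns *)
  intQuestions = Select[Range[37], VectorQ[answers[[All, #]], IntegerQ] &];   (* questions answered with integers *)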

In the Module:
This module is solely to make ONE graph, not several, yet. We will utilize this module to make hundreds of graphs by using a for loop.
Anyways, in the module, we have five parameters: the x axis label, the y axis label, the x-axis variable number (in the survey, what question are you seeking to put on the x-axis), the y-axis variable number, and the data set to analyze (in this case, from Google Surveys). We use a for loop to make a point with one number from the x var. number question and one from the y var. number question.
We Append that point to an empty list we declare in the beginning of the Module.
After the for loop, fit a line/curve (with, obviously, Fit) to the no-longer-empty list that collected all the points in the previous for loop.

Then, Plot the line and ListPlot the data points. Use the x var. number and the y var. number with a "vs." in between as the title to show what you are comparing.
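In code, the Module might look roughly like this (parameter order and names are my invention):

  listPlotMaker[xLabel_, yLabel_, xVar_, yVar_, data_] :=
   Module[{pts = {}, line},
    Do[AppendTo[pts, {data[[i, xVar]], data[[i, yVar]]}], {i, Length[data]}];
    line = Fit[pts, {1, x}, x];                  (* best-fit line through the points *)
    Show[
     ListPlot[pts, AxesLabel -> {xLabel, yLabel},
      PlotLabel -> ToString[xVar] <> " vs. " <> ToString[yVar]],
     Plot[line, {x, Min[pts[[All, 1]]], Max[pts[[All, 1]]]}]
     ]
    ]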

For Looping:
We use a doubly nested loop to traverse the questions with integer responses (which we had filtered beforehand). Set one variable to be the counter for the x var. and the other, nested inside, to be the y var. This way, we can get ALL possible combinations. Then, inside the nested for loop but before calling the listPlotMaker module, put in an if statement to ensure that the x and y var. numbers are NOT the same (that way, we don't have unnecessary repeats of shockingly perfect correlations). Then just use the listPlotMaker module inside the loop to create each graph.
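And the driver, sketched (Print makes each graph actually display from inside the loop):

  Do[
   If[xVar != yVar,
    Print[listPlotMaker["Q" <> ToString[xVar], "Q" <> ToString[yVar],
      xVar, yVar, answers]]
    ],
   {xVar, intQuestions}, {yVar, intQuestions}
   ]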

ALGORITHM FIN

Anyways, that was a really long-winded explanation, but it cranked out a huge number of graphs. I can't seem to find any good correlation though...

Tuesday, February 9, 2010

School Life Survey 1 Wrap Up

School life survey 1:

Question: Are the claims of parents that social networking is harming today's generation true? How about the assumption that more "friends" really means more friends?

Hypotheses:
1. If more time is spent on social networking sites, then there will be more emphasis on social standing. (This hypothesis was just Null--no proof for or against since there was not enough data and I was attempting to use a Boolean-style approach, where Social Standing was 1 and Academic Standing 0. Came out like 2 straight lines.)
2. If more time is spent on social networking sites, then there will be less trustworthy friends.
3. If there are few friends on social networking sites, then there will be more hours spent on social networking sites.
4. If more time is spent on social networking sites, then there will be a lower weighted GPA.

These were hypotheses added on later (I made the above hypotheses on the spot as I posted my School Life Survey 1):
5. If there are more friends then there will be a lower GPA.
6. If there are more trustworthy friends then there will be less FB friends.

Data:

Hypothesis 1: Null.

Hypothesis 2:

Hypothesis 3:

Hypothesis 4:

Hypothesis 5:
(FB Friends vs. GPA)

(Trustworthy Friends vs. GPA)

Hypothesis 6:


Conclusion:
The data given suggests a strong negative correlation between the time spent on social networking sites and GPA. Also, there is a very slight negative correlation between the number of trustworthy friends/facebook friends and GPA. Otherwise, the data is too sparse to draw anything even semi-definitive.

Limitations/Uncertainties:
This data was taken largely from students from Mission San Jose High School in Fremont, CA. One or two points each come from Illinois and the East Coast (I do not know which points, obviously, as this was an anonymous survey, but my peers from those parts said they took the survey). Mission has a very high emphasis on academia, which leads to the relatively high amount of people who said they appreciated academic standing over social standing. Data is here: http://spreadsheets.google.com/pub?key=trtorCi2YDWnRugPcM3tkdg&single=true&gid=0&output=html

Monday, January 25, 2010

Global Warming

This is a response to Andrew's "Research: Global Warming Hoax?" from his blog: http://www.path-of-a-songbird.blogspot.com/

Hi Andrew,

Your rhetoric was impressive but I found the opinion you held of global warming alarming to say the least. (And this post is in no way meant to attack you, but rather to challenge your mindset on global warming).

Global warming is a positive feedback loop.
Every degree the world warms--or even a fraction of a degree--the water rises with it. This in itself is not so worrisome since, well, a few centimeters can't hurt... But the bigger problem is that ice reflects much of the incoming sunlight, which gives our planet a relatively hospitable clime. Unfortunately, when we lose some ice, we lose some of that reflecting surface area, and Earth's bodies of water are forced to absorb the extra energy. At first, water temperatures don't rise much because of water's high specific heat, but with more heating than normal, the temperature still climbs, which then melts more ice through the combined factors of warmer water and the greenhouse effect (I recognize that the greenhouse effect is necessary for human survival, but humans have augmented its deteriorating effects with carbon dioxide--a byproduct of gasoline combustion--and other forms of pollution like nitrous oxide).

From "HowStuffWorks":
In 1995 the Intergovernmental Panel on Climate Change issued a report which contained various projections of the sea level change by the year 2100. They estimate that the sea will rise 50 centimeters (20 inches) with the lowest estimates at 15 centimeters (6 inches) and the highest at 95 centimeters (37 inches). The rise will come from thermal expansion of the ocean and from melting glaciers and ice sheets. Twenty inches is no small amount -- it could have a big effect on coastal cities, especially during storms.

Also, according to my data, which was much smaller than yours (I used the WeatherData from 1935, not 1940), 19 out of 20 stations showed definitive warming and only 1 showed a sort of gradual decline. (BTW, I also used a 3650-point running average instead of 365 to smooth the graph more.) Of course, all of this is occurring at a slow rate, but on the geologic time scale it is happening at warp speed, and we must act to stop global warming.
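For what it's worth, the smoothing step is just a longer MovingAverage window--here's a sketch, assuming dailyTemps is the list of daily mean temperatures pulled from WeatherData for one station:

  smoothed = MovingAverage[Select[dailyTemps, NumberQ], 3650];   (* ten-year window instead of one year *)
  ListLinePlot[smoothed, AxesLabel -> {"day", "10-year mean temperature"}]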


Just as Elie Wiesel warned against indifference toward the global catastrophe of WWII, that warning applies to one of the next: global warming.

Now why did I write this? Wiesel also said, “Words can sometimes, in moments of grace, attain the quality of deeds.”
It's because I was too lazy to fight against it so I decided to write against it.

Cheers,

Will

Sunday, January 24, 2010

Language Comparisons

Recently I have been comparing languages:

-avg. word length
-longest word
-number of words
-plot of how many words (x) have certain length (y)

My algorithm:

I first assigned a module to a function which took two arguments: the two compared languages.

In the module:
Using DictionaryLookup[], I got all the words Mathematica knows in each of the two languages. Inside, I created another function that made rules associating each word in the language with its length (you'll soon see the whole point of that function). Then, I made the module print the number of words, the average word length, and the longest words, followed by a plot of how many words have each length, for both languages.
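A simplified sketch of that module (the output formatting is pared down; red is the first language, blue the second):

  languageComparisons[lang1_, lang2_] :=
   Module[{summarize},
    summarize[lang_] := Module[{words = DictionaryLookup[{lang, All}]},
      Print[lang, ": ", Length[words], " words, average length ",
       N[Mean[StringLength /@ words]], ", longest: ",
       Last[SortBy[words, StringLength]]];
      Tally[StringLength /@ words]                 (* {word length, how many words have it} pairs *)
      ];
    ListPlot[{summarize[lang1], summarize[lang2]},
     PlotStyle -> {Red, Blue}, AxesLabel -> {"word length", "number of words"}]
    ]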

Out of the module:
Using the module I created, I could compare any two languages of Arabic, BrazilianPortuguese, Breton, BritishEnglish, Catalan, Croatian, Danish, Dutch, English, Esperanto, Faroese, Finnish, French, Galician, German, Hebrew, Hindi, Hungarian, IrishGaelic, Italian, Latin, Polish, Portuguese, Russian, ScottishGaelic, Spanish, Swedish simply by typing:

languageComparisons["Portuguese", "Spanish"]

Isn't it tantalizing?