Thursday, December 2, 2010

Zipf Exploration Part II

Hi all,

Just a quick second edition to the original Zipf Exploration (below, or click the title).

I redid the code this time in order to minimize uncertainties. Specifically, I removed as many forms of punctuation as I could:

, . ! " ' ; : ? -

using a terrific combo of StringPosition, StringDrop, and Map. I got the positions of all the above punctuation marks in Homer's The Iliad using StringPosition, which takes two parameters: the foundation text and the snippets to hunt for (you can have multiple on your "hit list," just make sure they're in String format and comma-delimited). Afterwards, use a While loop to adjust each position by subtracting the number of characters already deleted before it, since once you start deleting punctuation marks, every later character gets jerked back one spot. For example, the second value in your position list (I called them targets) is going to be wrong after you delete the first comma, so you subtract 1 from the second value, 2 from the third value, and so on. Then, another While loop to remove all the punctuation marks!
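Here's a minimal sketch of that idea, with my own variable names and a hypothetical file name for the text:

text = Import["iliad.txt", "Text"]; (* hypothetical file *)
punct = {",", ".", "!", "\"", "'", ";", ":", "?", "-"};
targets = StringPosition[text, punct][[All, 1]]; (* starting position of every hit, in order *)
i = 1;
While[i <= Length[targets],
  targets[[i]] = targets[[i]] - (i - 1); (* earlier deletions shift later hits back *)
  i++];
i = 1;
While[i <= Length[targets],
  text = StringDrop[text, {targets[[i]]}]; (* delete one punctuation character *)
  i++];
text = ToUpperCase[text];

(StringReplace[text, Map[# -> "" &, punct]] would do the same job in one line, by the way.)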

That should take care of a whole bunch of the boogers, but we still have the problem of case-sensitivity! So you gotta use ToUpperCase (or ToLowerCase if you so prefer) so you get a uniform word count. The rest is basics, which I don't want to re-explain, so check out the initial Zipf Exploration post if you're interested.

BTW, here's The Iliad's breakdown:



Top-ranked words: {THE, AND, OF, TO, HE, HIS, IN, HIM, YOU, A, WITH, THAT, FOR, AS, I}

Sunday, October 24, 2010

Zipf Exploration

During my Wikipedia trawling, I found this very, very cool concept described by a linguist, Mr. Zipf. He related a word's frequency of occurrence to its popularity ranking in any corpus of text. Think about that. I've given it a very unassuming definition, but it's ridiculous to think that a simple discrete ranking (e.g. 1st most popular, 2nd most popular) can predict a word's frequency of usage (e.g. 0.10823 probability of running into word X).

Just goes to show the deep roots between Numbers and Letters!

Anyways, I was pretty psyched, so I figured I'd do some Mathematicking and constructed a graph and module to compare the expected frequency of a word with its actual frequency in various corpora of the ages.



First, I'll go over the code!:

1) Find a plain text or html file of the whole corpus (must be large) to be tested. I used Aeschylus' The Eumenides and Stevenson's Treasure Island. I found them via OTA, the Oxford Text Archive.

2) Import into Mathematica the whole file.

3) Separate each word so they're individual elements of one list, using StringSplit.

4) Tally the list, SortBy the Last element, then finally Reverse it to get the most frequent word at the top of the list rather than the bottom, just to satisfy your burning desire to know which word is most frequent. I'll let you guess which word that turns out to be. (A quick sketch of steps 2-4 is below.)
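Here's a rough sketch of steps 2 through 4, with a hypothetical file name:

corpus = Import["treasure_island.txt", "Text"]; (* hypothetical path *)
words = StringSplit[corpus];
sorted = Reverse[SortBy[Tally[words], Last]]; (* most frequent word first *)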

MODULE:
We want to make a ListPlot of the differences in probability of running into any word, as a function of its popularity ranking. Luckily, Mathematica's ListPlot will automatically number each entry in a uni-level list so we can just worry about getting the probabilities for both Zipf's expected values and the Author's actual values.

5) Form two empty lists, expectedValList and actualValList.

Make a Module which takes one parameter rank_.

The Module will Append Zipf's expected probability, calculated by

expected(k) = (1/k) / (1/1 + 1/2 + ... + 1/N)

where k is the usage ranking (popularity of use in the corpus) and N is the number of individual words in the corpus,

to the expectedValList, and we Append the actual probability of running into the word,

actual = c / T

where c is the number of times the word is repeated and T is the total number of words in the corpus,

to the actualValList.

DATA:

6) Now run a While loop for however many top words you want to compare Zipf's expectations to, and...

PLOT:

7) Now it's a matter of ListPlot!
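Putting steps 5-7 together, here's a minimal sketch that continues from the snippet under step 4; the variable names are my own:

expectedValList = {}; actualValList = {};
totalWords = Length[words];
harmonic = Sum[1./n, {n, Length[sorted]}]; (* 1/1 + 1/2 + ... + 1/N *)
zipfCompare[rank_] := Module[{},
  AppendTo[expectedValList, (1./rank)/harmonic]; (* Zipf's expected probability *)
  AppendTo[actualValList, sorted[[rank, 2]]/totalWords]]; (* actual probability *)
i = 1;
While[i <= 35, zipfCompare[i]; i++]; (* however many top words you want *)
ListPlot[{expectedValList, actualValList}, PlotStyle -> {Blue, Red}]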

RESULTS:

A comparison of Stevenson's frequency of word choice (red) and Zipf's expected frequency of word choice (blue). This is for Treasure Island.

Words in Rank: {the, and, a, of, I, to, was, in, had, he, that, his, with, my, as, you, for, on, it, at, we, but, not, were, me, by, have, all, said, be, this, one, from, so, out}


Similar graph for Aeschylus' The Eumenides:


Words in Rank: {the, to, of, a, in, and, I, you, my, your, this, for, with, all, his, on, CHORUS, our, who, that, LEADER, no, from, is, by, will, he, as, You, not, we, those, their, ATHENA, have}

UNCERTAINTIES:
There are quite a few more discrepancies in the Aeschylus-Zipf comparison than in the Stevenson-Zipf comparison. I'm assuming this is because the text was translated from Greek, so even though it's written in English syntax, its word-frequency pattern isn't really native English. Also, the code I've written does not account for punctuation, so the counts may be off, but I wouldn't expect that to significantly affect the count. Percentage-wise, the discrepancies get noticeably smaller as the ranking increases.

Tuesday, July 6, 2010

EKG File Processing

I got an EKG file from Dr. Konopka via Dr. Choi and, with it, I was to chart the change in heart rate. The patient was administered a drug which would elevate heart rate, and the EKG lasted for around 5 minutes. Anyways, the easiest way I thought of to process the file was to use an If statement nested in a For loop to check whether a point passed a certain threshold value: I counted a "beat" whenever a point dropped below -800, and then used the refractory period to avoid counting the same beat twice. Seeing that the SA node doesn't fire another action potential until around 500 data points later (the EKG file was taken at 500 Hz, so that's about one electrical signal from the SA node per second), I stayed on the safe side and incremented the i value (my For loop counter) by 400. That done, I incremented my count value, which counts the number of beats. The rest was easy-peasy: multiply count by (60/period specified by user in param).
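For the curious, here's a rough sketch of the thresholding idea; the file name and single-channel layout are assumptions, and it computes BPM over the whole trace rather than per user-specified period:

ekg = Import["ekg.csv"][[All, 1]]; (* hypothetical file: one channel's samples at 500 Hz *)
count = 0;
For[i = 1, i <= Length[ekg], i++,
  If[ekg[[i]] < -800, (* a dip below the threshold counts as a beat *)
    count++;
    i += 400]]; (* skip ahead ~0.8 s so the same beat isn't counted twice *)
bpm = count*(60/(Length[ekg]/500.))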

The EKG:


First Channel:


Second Channel:


Indeed, there is a definite positive slope, so the drug, at least for those five minutes, elevated heart rate.

By the way, the graphs above, although for two different channels, give the exact same BPM output: {55, 57, 58, 60, 62}.

Monday, April 5, 2010

Cellular Automaton Part II

Well! It kind of worked!

My algorithm actually acted like dandelions! Heh. This is just the start.

I put a limit on how much the dandelions could cluster before I reassigned that grid box as zero (a box keeps its dandelions only while 1 <= value < 3; anything outside that range gets reset to zero). Wonderful! A rough sketch of the update step is below.
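Here's a rough sketch of one update step, following the rules spelled out in Part I (the next post down) plus the new cluster cap; the names are my own:

step[grid_] := Module[{g = grid, rows, cols},
  {rows, cols} = Dimensions[grid];
  Do[
    If[grid[[r, c]] >= 1,
      g[[r - 1, c]]++; g[[r + 1, c]]++; (* seed the squares above and below *)
      g[[r, c - 1]]++; g[[r, c + 1]]++; (* ...and to the left and right *)
      g[[r, c]]--], (* the spreading dandelion dies back by one *)
    {r, 2, rows - 1}, {c, 2, cols - 1}]; (* inner boxes only, as in Part I *)
  Map[If[1 <= # < 3, #, 0] &, g, {2}]] (* the cluster cap: outside 1 <= value < 3, reset to zero *)

grid = Table[If[RandomInteger[{0, 15}] == 1, 1, 0], {20}, {20}]; (* seed with ~1/16 probability *)
grid = step[grid];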

Images:







PS. I understand there is some kind of function in Mathematica for making CA. By the time I found out, I had already coded most of my CA and didn't want to change so I just went along with my code. Plus, this is a much more enlightening experience than just writing: CellularAutomaton[...].

Thursday, April 1, 2010

Cellular Automaton Part I

I've been working on cellular automata and have been miserably failing.

I'm trying to do what I thought to be a "basic" automaton (I'm going to abbreviate to ca).

The basic ca runs as such:
1. Dandelions live for one step
2. Dandelions spread their seeds into the grid squares directly above, to the right, to the left, and below. We record this data in a list where each dandelion is any number greater than or equal to one. Spreading seeds is modeled by incrementing each "seeded" square's number by one.
3. The dandelion that spread its seeds dies, which is modeled by decrementing its square by one. If there was a "cluster" of dandelions, the decrement simply brings the cluster down by one.
4. Update and go back up to step one.

This "basic" code turned out to be much more challenging than I had thought.

I actually coded all the statements to randomly seed dandelions with 1/16 probability (get a random integer from 0-15 and let, say, 1 indicate the presence of a dandelion). Then, I used a For loop to evaluate only the inner boxes (all the boxes not on the outer edges), to avoid the complications and ugly coding I really didn't want to do until I knew my algorithm as a whole worked (which it, unfortunately, didn't). The body of the For loop was relatively simple: if the grid box being evaluated at that moment has a value greater than or equal to one, then increment the boxes around it by one; else, give it a value of zero and move on.

Well, the result? The dandelions proliferated in far greater numbers than even GM dandelions on steroids would have... In essence, the grid was completely yellow. Not a speck of green. Why? Well, I didn't put a check on how much the dandelions could cluster before they died. So squares could accumulate massive numbers, and no matter how many times a square was decremented, it was simply incremented right back up to chaotic values.



And, well, the complete yellow one is just a waste of space.

<4/5/10: I actually fixed the code. Look one post after, labelled Cellular Automaton Part II>

Saturday, March 20, 2010

Graph Crank

Well, I've had a pinch of time so here's my latest in Mathematica:

I received a bunch of data from a Google Survey my teacher put up for Korean moms to look at (all users of MissyUSA, who are mostly Korean women immigrants in the US). At first, I was attempting to find very unrelated variables and graph them by hand (as in typing the actual labels of the graph, the place to find the data, etc.). However, that took a long time and was ineffective, so I made a module to crank out graphs in 2D ListPlot form for numerical data points. My algorithm is as such:

Before the Module:
Since I'm using Google Surveys, I import the "FullData" from the published spreadsheet page.
After that's imported, we have to remove the question titles (though I kept these handy since I needed them for the next part of my code), Flatten the data, and Partition the flattened data into the length of one person's full response, which is the number of questions plus 2 (in this case, 37 questions + 1 timestamp + 1 bullet = 39 boxes). Then, take the last 37 boxes, or whatever the right number is, by mapping Take with -37 over the respondents. Afterwards, for this module, you have to Select the Integers using, well, Integers, as the filter. (A rough sketch of this pre-processing is below.)
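A rough sketch of that pre-processing, assuming the question-title row has already been stripped out:

raw = Import[surveyURL, "FullData"]; (* surveyURL: the published Google Survey spreadsheet, elided here *)
answers = Map[Take[#, -37] &, Partition[Flatten[raw], 39]]; (* 39 boxes per respondent, keep the last 37 *)
numeric = Map[Select[#, IntegerQ] &, answers]; (* keep only the integer responses *)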

In the Module:
This module is solely to make ONE graph, not several, yet. We will utilize this module to make hundreds of graphs by using a for loop.
Anyways, in the module, we have five parameters: the x axis label, the y axis label, the x-axis variable number (in the survey, what question are you seeking to put on the x-axis), the y-axis variable number, and the data set to analyze (in this case, from Google Surveys). We use a for loop to make a point with one number from the x var. number question and one from the y var. number question.
We Append that point to an empty list we declare in the beginning of the Module.
After the for loop, fit a line/curve (with, obviously, Fit) to the "empty list" that had all the points appended to it in the previous for loop.

Then, Plot the line and ListPlot the data points. Use the x var. number and the y var. number with a "vs." in between as the title to show what you are comparing.

For Looping:
We use a double nested loop to traverse the questions with integer responses (which we had filtered beforehand). Set one variable to be the counter of the x var. and the other, nested, to be the y var. This way, we can get ALL possible combinations. Afterwards, inside the nested for loop but before using the listPlotMaker module, put in an if statement to ensure that the x and y var. numbers are NOT the same (that way, we don't have unnecessary repeats of shockingly perfect correlations). Then, just use the listPlotMaker module inside the loop to create each graph (sketched below).
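Here's a hedged sketch of listPlotMaker and the double loop; the parameter order, question numbers, and data layout are assumptions:

listPlotMaker[xLabel_, yLabel_, xVar_, yVar_, data_] := Module[{pts = {}, i, line},
  For[i = 1, i <= Length[data], i++,
    AppendTo[pts, {data[[i, xVar]], data[[i, yVar]]}]]; (* one point per respondent *)
  line = Fit[pts, {1, x}, x]; (* fit a line to the appended points *)
  Show[
    ListPlot[pts, AxesLabel -> {xLabel, yLabel},
      PlotLabel -> ToString[xVar] <> " vs. " <> ToString[yVar]],
    Plot[line, {x, Min[pts[[All, 1]]], Max[pts[[All, 1]]]}]]]

intQs = {3, 8, 15}; (* hypothetical: the question numbers with integer responses *)
Do[
  If[xq != yq,
    listPlotMaker["Q" <> ToString[xq], "Q" <> ToString[yq], xq, yq, answers]],
  {xq, intQs}, {yq, intQs}]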

ALGORITHM FIN

Anyways, that was a really long-winded explanation, but it cranked out a huge number of graphs. I can't seem to find any good correlation though...

Sunday, March 7, 2010

Quick Idea Crank

Long time no see! Anyways, I've been incredibly busy lately but I should probably record possible science fair ideas:

-Computer models of biological mechanisms (Dr. Choi was talking about tree growth)
-Using tools available to society to make very odd connections (using Mathematica to crank out the graphs, I'll try looking for any correlation)
-Using Mathematica to find an author's tendency toward a specific set of vocabulary to hopefully identify works easily and maybe even write sentences (which has been shown before on Wolfram Demonstrations).

I'm out, but I'll update later if I think or receive any more interesting science fair ideas (or interesting project ideas at all).

Tuesday, February 9, 2010

School Life Survey 1 Wrap Up

School life survey 1:

Question: Are the claims of parents that social networking is harming today's generation true? How about the assumption that more "friends" really means more friends?

Hypotheses:
1. If more time is spent on social networking sites, then there will be more emphasis on social standing. (This hypothesis was just Null--no proof for or against since there was not enough data and I was attempting to use a Boolean-style approach, where Social Standing was 1 and Academic Standing 0. Came out like 2 straight lines.)
2. If more time is spent on social networking sites, then there will be fewer trustworthy friends.
3. If there are few friends on social networking sites, then there will be more hours spent on social networking sites.
4. If more time is spent on social networking sites, then there will be a lower weighted GPA.

These were hypotheses added on later (I made the above hypotheses on the spot as I posted my School Life Survey 1):
5. If there are more friends, then there will be a lower GPA.
6. If there are more trustworthy friends, then there will be fewer FB friends.

Data:

Hypothesis 1: Null.

Hypothesis 2:

Hypothesis 3:

Hypothesis 4:

Hypothesis 5:
(FB Friends to GPA)

(Trustworthy Friends vs. GPA)

Hypothesis 6:


Conclusion:
The data given suggests a strong negative correlation between the time spent on social networking sites and GPA. Also, there is a very slight negative correlation between the number of trustworthy friends/facebook friends and GPA. Otherwise, the data is too sparse to draw anything even semi-definitive.

Limitations/Uncertainties:
This data was taken largely from students at Mission San Jose High School in Fremont, CA. One or two points each come from Illinois and the East Coast (I do not know which points, obviously, as this was an anonymous survey, but my peers from those parts said they took it). Mission places a very high emphasis on academia, which explains the relatively high number of people who said they valued academic standing over social standing. Data is here: http://spreadsheets.google.com/pub?key=trtorCi2YDWnRugPcM3tkdg&single=true&gid=0&output=html

Wednesday, February 3, 2010

PEtS

Last week, Andrew Song and I expended a sizable amount of our beauty-sleep time to design five web pages and write a competition paper, all for Exploravision. I named it PEtS, or Photosynthetic Ethanol Synthetes, and it was on how to make ethanol fuels viable by internal cellulosic degradation in the plant itself (which I had previously alluded to in my paper on ethanol fuels, The Biofuel Fantasy).

The Idea in Development:

My idea was based on computer-programmer logic :). I was thinking about how to make plants pretty much degrade themselves for the betterment of mankind, and I got to thinking about 'if' statements after I saw this graph that linked the proteins of Arabidopsis thaliana so cleanly with outer stimuli (it shows the signal cascade with light as the stimulus).

Further Research & Finish up:

After my epiphany, I set to work to find a protein which could easily be linked to cellulose degradation at the end of the plant's life (because, for obvious reasons, the plant cannot degrade its cellulosic support while trying to develop into a plant). After one of the longest lapses in brain function I've ever experienced, I realized I could link the cellulose degradation to the Flowering Locus T protein (which is transcribed and translated near the end of the plant's life, when it will flower). Afterwards, it was a relatively simple idea. Get the promoter code of the Flowering Locus T protein using the corresponding sticky ends, and insert it before the cellulase gene egl (from a bacterium inside a cow's rumen; this transgenic modification of plants with egl has been done before in tobacco, as described by Kawazu, Sun, Shibatu, et al. in their article "Expression of a bacterial endoglucanase gene in tobacco increases digestibility of its cell wall fibers" in the Journal of Bioscience and Bioengineering). Then insert the modified gene into a "tamed" Agrobacterium tumefaciens carrying a specific antibiotic-resistance gene, either to penicillin or kanamycin (Agrobacterium tumefaciens is commonly used to insert new genes into Arabidopsis thaliana), infect Arabidopsis thaliana protoplasts, and finally, select the modified protoplasts with the antibiotic they now resist. When the Flowering Locus T protein is transcribed, the egl gene is also transcribed (it has the Flowering Locus T promoter right before it as well), so as the plant flowers, it degrades its own cellulosic material. This would theoretically obviate the cellulose-degrading step in ethanol production, moving ethanol fuels one step further toward sustainability.

That was pretty much the bulk of the project. In the submission itself, I covered the technology thus far, the history of ethanol fuels, and what is necessary in order for PEtS to succeed, but otherwise, I summarized my idea quite thoroughly above.

NOTE: 2/26/2010 Andrew and I received an Honorable Mention for this project. I think it was impressive considering I spent a little less than 24 hours on the project (less than some of my school projects!). The title links to the Honorable Mentions list.

Monday, February 1, 2010

The Future Atlantes Part II

I recently posted "The Future Atlantes Part I" and promised a new installment. Well, here it is and it took a great deal of thinking and hair-pulling and teeth-gnashing:

The idea:

Make a 3D graph of the cities (x, y) with their population (z) shown as sticks scaled by 450,000 (the maximum population of the cities under 10 meter elevation was around 12,000,000). This way, it can easily show visual learners which parts of the world have the most people in danger of having to emigrate from their home place due to global warming (either with fair-warning or with a rough storm which inundates the whole place). This dilemma poses another question: how will all these displaced persons find a home or money (who wants a home stuck under the sea)?

My seemingly simple algorithm:

I'll start assuming you know what I did for Part I.
I made some extra rules mapping the coordinates of cities to their corresponding populations. Then, I used the points of the cities under 10 meters of elevation as my x and y coordinates and the population as my z coordinate (scaled down by 450,000). (All of this had to be filtered against the not-available data as well.) Then, I graphed it with Graphics3D[].

A little more in-depth:

I created one module, about 3 lines of code, to convert a 2D point (the coordinates of a city) into a 3D line (which includes its scaled population and goes straight down, similar to the Filling option of ListPlot). That one module probably did the bulk of the work (well, with Map's help as well).
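A minimal sketch of that module, assuming the rules have the form {longitude, latitude} -> population:

cityStick[({x_, y_} -> pop_), scale_: 450000] :=
  Line[{{x, y, 0}, {x, y, pop/scale}}] (* a vertical stick whose height is the scaled population *)

sticks = Map[cityStick, popRules]; (* popRules: hypothetical list of coordinate -> population rules from Part I *)
Graphics3D[sticks, Axes -> True]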

(Hopefully, I can show full functionality with a video on Wolfram Demonstrations or YouTube, as recommended by Dr. Choi).

The Graph without Labels and Scaled by 450,000:


Graph with Labels and Scaled by 100,000:


Future:

Scale the populations not to a stick length but to a colored point or maybe colored tubes.

Sunday, January 31, 2010

The Future Atlantes Part I

I made a module to plot the locations around the world with the greatest probability of inundation as sea levels rise.

My simple algorithm:

Select all cities which have Coordinates and Elevation; use those to create a rule with the following format: Coordinates->Elevation (in this part of the code, I highly suggest that you reverse the coordinate points since it's given in (latitude, longitude) or (y, x) when you need (x, y)). Then, create a module which accepts the rule as its one and only parameter.

In the module:
Take the argument and wrap it in a list. Then, using an if statement, Append the city's point to a results list (declared empty beforehand) only if its elevation is under 10 meters. Else, do Null.

Out of the module:
Map the module over the rules of all cities' Coordinates->Elevation. Then, plot the points on top of the plotted world map (I recommend using the "Polygon" property of CountryData). A rough sketch is below.
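Here's a hedged sketch of the whole thing; the CityData/CountryData property names are what I recall from this era of Mathematica and should be treated as assumptions:

rules = Map[Reverse[CityData[#, "Coordinates"]] -> CityData[#, "Elevation"] &, CityData[]];
rules = Select[rules, FreeQ[#, _Missing] &]; (* drop cities with not-available data *)
atRisk = Cases[rules, (pt_ -> elev_) /; elev < 10 :> pt]; (* coordinates of cities under 10 meters *)
worldMap = Graphics[CountryData[#, "Polygon"] & /@ CountryData[]];
Show[worldMap, Graphics[{Red, PointSize[Tiny], Point[atRisk]}]]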

The Map:



Credits: CityData[], CountryData[], Map[], Graphics[], FreeQ.

Part II: When I get some more time, I'll post up the final, stunning update!

Wednesday, January 27, 2010

School Life Survey 1

This is a survey I made in order to see if there is any validity in what many adults claim to be correlations in teenagers' social lives. I don't want to show my hypotheses here, but I'm saving a page of them right now. By the way, there are several questions, but I'm going to use them in pairs to find correlations. Also, this is a completely anonymous survey, so please respond with all honesty.



Thanks.

Monday, January 25, 2010

The Biofuel Fantasy

The title above links to my final research paper, written for a biotech class I took at a nearby community college. It explains the improbability of ethanol as a biofuel. However, nowadays, biofuels look much more feasible from what I've read recently, from jatropha to sugar cane; corn is a bygone option since it is a complete loss of food, and, well, all of that is in the paper.

It's a bit on the long side and it is definitely dry. Enjoy. :)

Global Warming

This is a response to Andrew's "Research: Global Warming Hoax?" from his blog: http://www.path-of-a-songbird.blogspot.com/

Hi Andrew,

Your rhetoric was impressive but I found the opinion you held of global warming alarming to say the least. (And this post is in no way meant to attack you, but rather to challenge your mindset on global warming).

Global warming is a positive feedback loop.
Every degree the world's temperature goes up--or even a fraction of a degree--the water rises too. This in itself is not so worrisome since, well, a few centimeters can't hurt... But the bigger problem is that ice reflects much sunlight, which gives our planet a relatively hospitable clime. Unfortunately, when we lose some ice, we lose some of that reflecting surface area, and Earth's bodies of water are forced to absorb what the ice used to reflect. At first, water temperatures don't rise that much because of water's high specific heat, but with more heating than normal, the temperature still rises, which then causes more ice to melt due to the combined factors of warmer water and the greenhouse effect (I recognize that the greenhouse effect is necessary for human survival, but humans have augmented its deteriorating effects with carbon dioxide--a byproduct of gasoline combustion--and other forms of pollution like nitrous oxide).

From "HowStuffWorks":
In 1995 the Intergovernmental Panel on Climate Change issued a report which contained various projections of the sea level change by the year 2100. They estimate that the sea will rise 50 centimeters (20 inches) with the lowest estimates at 15 centimeters (6 inches) and the highest at 95 centimeters (37 inches). The rise will come from thermal expansion of the ocean and from melting glaciers and ice sheets. Twenty inches is no small amount -- it could have a big effect on coastal cities, especially during storms.

Also, according to my data, which covered less than yours (I used the WeatherData from 1935, not 1940), 19 out of 20 stations showed definitive warming and only 1 showed a sort of gradual decline. (BTW, I also used a 3650-point running average instead of 365 to smooth the graph more.) Of course, all this is occurring at a slow rate, but relative to the geologic time scale it is happening at warp speed, and we must act to stop global warming.


Just as Elie Wiesel warned against indifference toward the global catastrophe of WWII, the same warning applies to one of the next: global warming.

Now why did I write this? Wiesel also said, “Words can sometimes, in moments of grace, attain the quality of deeds.”
It's because I was too lazy to fight against it so I decided to write against it.

Cheers,

Will

Sunday, January 24, 2010

The Entrepreneur

I've just finished reading "The Sure Thing" by Malcolm Gladwell in The New Yorker, about entrepreneurs. It completely redefines "entrepreneurial." An entrepreneur, one thinks, would be a gambler: one who takes enormous risks and, with great charisma, pulls off a stunt we'd never have dared even to think about. This is almost never the case, however. Ted Turner, the television mogul who started CNN, seemed like a nonchalant guy who threw money into whatever he wanted, but in actuality he took the safest bets and--the one true defining trait of an entrepreneur--had the mindset of a predator. Predators are not gamblers; they like the safest, most successful route, even at a lower profit than gambling might bring. Sociologists Hongwei Xu and Martin Ruef took a large sample of entrepreneurs and non-entrepreneurs and asked them to choose between: a) a business with a potential profit of 5 million dollars and a 20% chance of success, b) a profit of 2 million dollars with a 50% chance of success, and c) a profit of only 1.25 million but an 80% chance of success. Entrepreneurs, the study suggested, were more likely to take the third, safest, lowest-potential-profit choice. Why? They're predators, and "a bird in the hand is better than two in the bush."

Earthy Explorations

My teacher a few weeks back talked about an Economist writer who had noted a positive correlation of latitude to GDP per capita, so I decided to put the idea to the test (Mathematica has functions which can give the most recent, up-to-date data on all countries that will share it). I hypothesized the same, because when I thought about it, Africa and South America suffered much the same at the hands of colonizers; their wealth of resources made them prime targets for the European, resource-depleted powers (in the 19th and 20th centuries).

My algorithm:

Make a Module assigned to a function to do this first part, where the one variable is a country.

Take the first part of each of the country's coordinates (the latitude) and average them, then take the absolute value (it's quite important to take the absolute value after the averaging: otherwise, for places right on the equator, the positive and negative latitudes would no longer negate each other but pile up, giving the wrong average latitude). Then, after getting the average, get the country's GDP per capita. Finally, create the country's point, {Latitude, GDP per Capita}.

Map the Function over all the countries from CountryData[], which will take a wee while (I had to run it when I went to sleep).

ListPlot[] the points and use Tooltip[] to label each and every point.
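A hedged sketch of the whole pipeline; the CountryData property names ("Coordinates", "GDP", "Population") are assumptions, and GDP per capita is computed as GDP divided by population:

latGDP[country_] := Module[{coords, gdp, pop, lats},
  coords = CountryData[country, "Coordinates"]; (* border coordinates as {lat, lon} pairs *)
  gdp = CountryData[country, "GDP"];
  pop = CountryData[country, "Population"];
  lats = Cases[coords, {la_?NumberQ, lo_?NumberQ} :> la, Infinity];
  If[FreeQ[{gdp, pop}, _Missing] && lats =!= {},
    Tooltip[{Abs[Mean[lats]], gdp/pop}, country]]] (* point plus a tooltip naming the country *)

pts = DeleteCases[latGDP /@ CountryData[], Null]; (* countries with missing data return Null *)
ListPlot[pts, AxesLabel -> {"average |latitude|", "GDP per capita"}]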

Data:


Conclusion: There's definitely some correlation, which supports my hypothesis that Latitude and GDP per Capita are related. There are a few obvious outliers, like Greenland, which sits near the pole but has a relatively low GDP per capita, and Singapore, which is near the equator but has a high GDP per capita.

Credits: Map[], ListPlot[], Tooltip[], Module[], CountryData[]

Language Comparisons

Recently have been comparing languages:

-avg. word length
-longest word
-number of words
-plot of how many words (x) have certain length (y)

My algorithm:

I first assigned a module to a function which took two arguments: the two compared languages.

In the module:
Using DictionaryLookup[], I got all the words (that Mathematica has) in each of the two languages. Inside, I created another function which made rules associating each word in a language with its length (you'll soon see the whole point of that function). Then, I made the module print the number of words, the average word length, and the longest word, and then plot the number of words against word length for both languages. (A rough sketch is below.)
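A hedged sketch of languageComparisons; exactly what gets printed and plotted is my own choice, not necessarily what the original module did:

languageComparisons[lang1_, lang2_] := Module[{words1, words2, report},
  words1 = DictionaryLookup[{lang1, All}];
  words2 = DictionaryLookup[{lang2, All}];
  report[lang_, words_] := Print[lang, ": ", Length[words], " words, average length ",
    N[Mean[StringLength /@ words]], ", longest word: ",
    Last[SortBy[words, StringLength]]];
  report[lang1, words1];
  report[lang2, words2];
  ListPlot[{Tally[StringLength /@ words1], Tally[StringLength /@ words2]},
    AxesLabel -> {"word length", "number of words"}]]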

Out of the module:
Using the module I created, I could compare any two languages of Arabic, BrazilianPortuguese, Breton, BritishEnglish, Catalan, Croatian, Danish, Dutch, English, Esperanto, Faroese, Finnish, French, Galician, German, Hebrew, Hindi, Hungarian, IrishGaelic, Italian, Latin, Polish, Portuguese, Russian, ScottishGaelic, Spanish, Swedish simply by typing:

languageComparisons["Portuguese", "Spanish"]

Isn't it tantalizing?

Wolfram Mathematica

I'm assuming that this blog will be devoted primarily to my traverse through Mathematica. I'll be posting up my code, and unless stated otherwise, it's Wolfram Mathematica.