Math Stats Blog - Spring 2010

Dungeness Crab Growth

May 18, 2010

In this lab, you will examine the relationship between premolt and postmolt carapace size and summarize your results both numerically and graphically.

  • Begin by considering the problem of predicting the premolt size of a crab given only its postmolt size. Develop a procedure for doing this, and derive an expression for the average squared error you expect in such a prediction.

The data for this lab were collected as part of a study of the adult female Dungeness crab. Two sets of data are provided. The first consists of premolt and postmolt shell widths of 427 female Dungeness crabs. A mixture of laboratory data and capture-recapture data, they were obtained over three fishing seasons: the first two in 1981 and 1982, the third in 1992. The data are arranged in five columns: Premolt, the size of the carapace before molting; Postmolt, the size of the carapace after molting; Increment, postmolt – premolt; Year, the collection year (not provided for recaptured crabs); and Source, 1=lab, 0=capture-recapture. The second set of data was collected in late May 1983, after the molting season, and consists of 362 adult female crabs of all sizes. The carapace width was recorded, along with whether the crab had molted in the most recent molting season. These data are arranged in two columns: Postmolt and Molt Classification, 1=clean carapace, 0=fouled carapace.

The first thing I did was clean up the data in Word so that it could be easily read and analyzed in Excel. After the data was successfully entered into Excel, I used the Regression tool under Data Analysis to generate a graph of postmolt vs. premolt size to see if there was a visual relationship.

This graph shows a strong linear relationship between pre and postmolt shell sizes. The blue markers represent the crab data, with the red showing the least-squares regression line. The points on the scatter plot are closely bunched around the regression line. This linear association is measured by the correlation coefficient, which gives a unitless measure of how well scatter plot data may be fit to a line. Positive correlation coefficients indicate that above average values in one variable are generally associated with above average values in the second variable, and the same with below average and below average. Negative correlation coefficients indicate that above average values in one variable are generally associated with below average values in the other variable. To compute the correlation coefficient, let (x_{1}, y_{1})...(x_{n}, y_{n}) be the pairs of postmolt and premolt sizes for all laboratory crabs. Then for \bar{x} the average postmolt size, \bar{y} the average premolt size, and SD(x) and SD(y) the corresponding standard deviations, the sample correlation coefficient r is computed as follows:

r=\frac{1}{n} \displaystyle\sum\limits_{i=1}^n \frac{x_{i}-\bar{x}}{SD(x)} \cdot \frac{y_{i}-\bar{y}}{SD(y)}.

In this case, Excel calculates the correlation coefficient for us and we find it to be 0.98, a good indication that there is a strong linear relationship between pre and postmolt shell size that follows the equation Premolt Size=1.073*(Postmolt Size)-25.214. Given a correlation coefficient so close to 1, we may attempt to predict a crab’s premolt carapace size by plugging its postmolt carapace size into this equation.
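As a rough sketch of what Excel is doing here, the correlation coefficient formula above and the least-squares line can be written out in Python. The function name and the four data points are made up for illustration and are not the crab data; the last returned value, (1 - r^2) SD(y)^2, is the standard expression for the average squared error of the least-squares prediction, which is what the first bullet asks for.

```python
import math

def correlation_and_line(x, y):
    """Return (r, slope, intercept, mean_sq_error) for predicting y from x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # SDs divide by n so that r matches the 1/n formula in the text.
    sd_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / n)
    sd_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / n)
    r = sum((xi - x_bar) / sd_x * (yi - y_bar) / sd_y
            for xi, yi in zip(x, y)) / n
    slope = r * sd_y / sd_x              # least-squares slope
    intercept = y_bar - slope * x_bar    # line passes through the point of averages
    mean_sq_error = (1 - r ** 2) * sd_y ** 2
    return r, slope, intercept, mean_sq_error

# Hypothetical (postmolt, premolt) widths in mm, for illustration only.
postmolt = [130.0, 140.0, 150.0, 160.0]
premolt = [114.5, 124.8, 135.9, 146.3]
r, slope, intercept, mse = correlation_and_line(postmolt, premolt)
print(r, slope, intercept, mse)
```

With the real 427 crabs in place of the toy lists, the slope and intercept should be comparable to the 1.073 and -25.214 reported by Excel.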

  • Examine a subset of the data collected, say those crabs with postmolt carapace width between 147.5 and 152.5 mm. Compare the predictions of premolt size for this subset with the actual premolt size distribution of the subset. Do this for one or two other small groups of crabs.

By inserting 147.5 and 152.5 into our previous linear equation, we get expected values of 133.05 mm and 138.42 mm, so we would expect the premolt sizes of these crabs to fall close to this range. In actuality, they range from 129.8 mm to 142.5 mm, somewhat wider than expected and indicative of more varied data. When we fit a regression line to just this band, the results are less than desirable:

Coefficient of determination R^2 = 0.546 with regression line Premolt Size=1.256*(Postmolt Size)-52.334.

When attempted again for range 130 mm to 135 mm, expected value range (114.28, 119.64):

Coefficient of determination R^2 = 0.498 with regression line Premolt Size=1.213*(Postmolt Size)-43.991.

And a final time for range 154 mm to 159 mm, expected value range (140.03, 145.39):

Coefficient of determination R^2 = 0.388 with regression line Premolt Size=0.994*(Postmolt Size)-12.311.

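The weaker linear fit on a small scale would seem to indicate that, while postmolt size narrows the range of expected values, the prediction for any individual crab’s premolt size remains inexact. Part of the drop in R^2 is expected, since restricting postmolt size to a 5 mm band removes most of the variation that the regression line explains.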
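The expected value ranges quoted for the three bands come from plugging the band endpoints into the fitted line. A small Python check of that arithmetic is below; the function name is my own, and the slope and intercept are the ones reported above.

```python
def predict_premolt(postmolt_mm, slope=1.073, intercept=-25.214):
    """Predicted premolt width (mm) from the full-data regression line."""
    return slope * postmolt_mm + intercept

# Postmolt bands (mm) examined in the text.
bands = [(147.5, 152.5), (130.0, 135.0), (154.0, 159.0)]
for low, high in bands:
    print(low, high,
          round(predict_premolt(low), 2),
          round(predict_premolt(high), 2))
```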

  • Use your procedure to describe the premolt size distribution of the molted crabs collected immediately following the 1983 molting season. Make a histogram for the size distribution prior to the molting season of the crabs caught in 1983. Use shading to distinguish the crabs that molted from those that did not molt.

In this data, it is assumed that a captured crab with a clean shell has molted in the most recent molting period, while a shell that has had time to collect marks or barnacles, said to be “fouled,” would not have been shed in the most recent period.

I first separated the data based on shell condition. I then used Excel to generate a histogram distribution of each data set:

Because these graphs both have the same bin range, they may be visually compared, and it is quite obvious that the molted crab data is far closer to normally distributed, while the unmolted data is highly skewed to the left, with a long tail of small crabs. We can compare these graphs further with a numerical analysis:

Statistic             Fouled      Clean
Min (mm)              95.4        116.8
Max (mm)              168         165.1
Mean (mm)             149.1099    142.1134
Median (mm)           150.6       140.6
Mode (mm)             150.6       141.4
Kurtosis              6.2893      -0.81419
Standard Deviation    11.27       11.398
Skewness              -2.066      0.041


Patterns in DNA

April 2, 2010

How do we find clusters of palindromes? How do we determine whether a cluster is just a chance occurrence or a potential replication site?

  • Random Scatter: To begin, pursue the point of view that structure in the data is indicated by departures from a uniform scatter of palindromes across the DNA. To look for structure, examine the locations of the palindromes, the spacings between palindromes, and the counts of palindromes in nonoverlapping regions of the DNA. One starting place might be to see first how random scatter looks by using a computer to simulate it, then the real data can be compared to the simulated data.

The data represent the locations of the 296 palindromes in the CMV DNA that are each at least 10 base pairs long. After importing the data into Excel, this graph was generated using the Histogram function under the Data Analysis tab. This is a graph of each palindrome’s location. When testing for linearity, the R^2 of the line of best fit is 0.998. This indicates that the data are very nearly linear. Because a uniform random sample of locations, once sorted, should also be nearly linear, the comparison suggests that the original data are consistent with random scatter.

  • Locations and Spacings: Use graphical methods to examine the spacings between consecutive palindromes and sums of consecutive pairs, triplets, etc., spacings. Compare what you find for the CMV DNA to what you would expect to see in a random scatter. Also, consider graphical techniques for examining the locations of the palindromes.

Distribution of palindrome spacings: This graph is a histogram of the distribution of spacings between consecutive palindromes in the sample data. It was created in Excel by subtracting successive palindrome locations and then using the histogram function.

Random Spacings: This graph is a histogram of the distribution of spacings between consecutive random numbers generated between 177 and 228953 using Mathematica.

There is an observable similarity in the structures of both the random and sample data, though there does seem to be a higher incidence of spacings less than 500 in the sample data as compared to the random data. However, the similarities go further to support our initial hypothesis that the sample data is consistent with a uniform scatter. What follows are histograms of consecutive pair spacings and triple spacings for both sets of data, where we can see the similarities between the two data sets continue.

Distribution of Consecutive Pair Spacings:

Random Consecutive Pair Spacings:

Distribution of Consecutive Triple Spacings:

Random Consecutive Triple Spacings:

These graphs would indicate that the CMV DNA data is closely related to what we would expect to see in a random scatter though we do continue to see a higher incidence of grouping in the sample data as compared to the random scatter.
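The random comparison data was generated in Mathematica; as a stand-in, the same idea can be sketched in Python. This assumes 296 distinct integer locations drawn uniformly between 177 and 228953 (the range quoted above), and the function names and seed are my own. Taking differences between locations k positions apart gives the consecutive spacings (k=1), the sums of consecutive pairs (k=2), and the sums of triplets (k=3).

```python
import random

def simulate_locations(n_sites=296, low=177, high=228953, seed=0):
    """Uniform random scatter: n_sites distinct sorted locations in [low, high]."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(low, high + 1), n_sites))

def summed_spacings(locations, k=1):
    """Distances between sorted sites that are k positions apart.

    k=1 gives consecutive spacings, k=2 sums of consecutive pairs of
    spacings, k=3 sums of consecutive triplets.
    """
    return [locations[i + k] - locations[i] for i in range(len(locations) - k)]

random_sites = simulate_locations()
single = summed_spacings(random_sites, 1)
pairs = summed_spacings(random_sites, 2)
triples = summed_spacings(random_sites, 3)
print(len(single), len(pairs), len(triples))
```

Histograms of these three lists, using the same bins as for the CMV data, give the random panels to compare against.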

  • Counts: Use graphical displays and more formal statistical tests to investigate the counts of palindromes in various regions of the DNA. Split the DNA into nonoverlapping regions of equal length to compare the number of palindromes in an interval to the number that you would expect from uniform random scatter. The counts for shorter regions will be more variable than those for longer regions. Also consider classifying the regions according to their number of counts.

The Poisson process model gives the chance that there are k points in a unit interval as \frac{\lambda^k}{k!} e^{-\lambda} for k = 0, 1,… where \lambda is the rate of hits per unit area. In our case, we want to use the homogeneous Poisson process as a reference model against which to seek an excess of palindromes, and we can do this because our data fits the uniform random scatter well, as shown above.

First, we find the palindrome counts in the first 57 nonoverlapping intervals of 4000 base pairs of CMV DNA and tally the number of complementary palindromes in each segment. However, these segments only cover the first 228,000 base pairs, and so we now consider only a total of 294 palindromes. The distribution of these counts is shown in the following table:

Palindrome count    Number of intervals
                    Observed    Expected
0 – 2               7           6.4
3                   8           7.5
4                   10          9.7
5                   9           10.0
6                   8           8.6
7                   5           6.3
8                   4           4.1
9+                  6           4.5
Total               57          57

The last column gives the expected number of segments containing the specified number of palindromes as computed from the Poisson distribution. The expected number of intervals with 0, 1, or 2 palindromes is 57 x the probability of 0, 1, or 2 hits in an interval = 57e^{-\lambda}[1+\lambda+\frac{\lambda^2}{2}]. However, the rate \lambda is not known. There are 294 palindromes in the 57 intervals of length 4000, so the sample rate is 5.16 per 4000 base pairs. Plugging this estimate into the calculation above yields a 0.112 chance that a length of 4000 base pairs contains 0, 1, or 2 palindromes. Then the approximate expected number of segments containing 0, 1, or 2 palindromes is 57 x 0.112, or 6.4. The remaining expectations were calculated in the same way.

When we compare the observed counts to the expected counts, we use the \chi^2 distribution. The resulting p-value is 0.98, meaning that deviations at least as large as those observed are very likely under the model, and so it appears that the Poisson is a reasonable initial model.
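The expected counts and the \chi^2 comparison can be reproduced with a few lines of Python. This is a sketch under one assumption that the post does not state: the test uses 6 degrees of freedom (8 categories, minus 1, minus 1 for the estimated rate). For an even number of degrees of freedom the upper tail of the \chi^2 distribution has a closed form, so no statistics library is needed.

```python
import math

observed = [7, 8, 10, 9, 8, 5, 4, 6]   # intervals with 0-2, 3, 4, 5, 6, 7, 8, 9+ palindromes
n_intervals = 57
lam = 294 / n_intervals                # estimated rate per 4000 base pairs

def poisson_pmf(k, lam):
    """Chance of exactly k hits in a unit interval."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

probs = [sum(poisson_pmf(k, lam) for k in range(3))]   # 0, 1, or 2 hits
probs += [poisson_pmf(k, lam) for k in range(3, 9)]    # exactly 3 through 8 hits
probs.append(1 - sum(probs))                           # 9 or more hits
expected = [n_intervals * p for p in probs]

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi_sq_upper_tail_df6(x):
    """P(chi-square with 6 degrees of freedom > x), closed form for even df."""
    h = x / 2
    return math.exp(-h) * (1 + h + h ** 2 / 2)

p_value = chi_sq_upper_tail_df6(chi_sq)
print([round(e, 1) for e in expected], round(chi_sq, 2), round(p_value, 2))
```

The rounded expected counts match the Expected column of the table, and the p-value comes out at about 0.98.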

  • The Biggest Cluster: Does the interval with the greatest number of palindromes indicate a potential origin of replication? Be careful in making your intervals, for any small, but significant, deviation from random scatter, such as a tight cluster of a few palindromes, could easily go undetected if the regions examined are too large. Also, if the regions are too small, a cluster of palindromes may be split between adjacent intervals and not appear as a high-count interval.


Each of these graphs was generated using Excel by setting the bin size in the Histogram tool. The first graph is the CMV DNA data grouped by a bin size of 10,000. With such a large bin size the counts look roughly uniform across the DNA, as would be expected in a random scatter and as can be seen in the same graph of the random data. Here we can even see a similar spike in the data around 100,000 – 110,000.

Palindrome Data:

Random Data:

Now, as the lab suggests, we break the bins down into smaller intervals (5,000) so that we may better see potential spikes in the data.

Palindrome Data:

Random Data:

We can see with this representation of the data that, while the random data has two or three potential spikes, our palindrome data appears to have a very obvious spike somewhere in the region of 95,000 to 110,000. As a class, we discovered that by sliding a box of a fixed length along this range of data, we could find where the spike in data was, and better identify the region as a possible replication site. Using the Mathematica code found here (http://mth332s09.wordpress.com/11-mathematica-code-for-palindrome-simulation/) we were able to identify the largest cluster at roughly 92,000, with a count of 8 palindromes in a window of 500 base pairs. And according to Kristi’s blog, “Further fiddling shows we can get a cluster of 8 palindromes in a smaller window of just size 350 occurring around 92,500.”

We are now interested in whether a cluster of this size is statistically significant. After repeated simulations of random data sets, it is clear that a cluster of 8 palindromes is very unlikely to occur by chance. The cluster is thus statistically significant, and we may consider this region as a potential replication site for further investigation.
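The sliding-box search and the repeated simulations were done with the linked Mathematica code. As a hedged Python sketch of the same idea (not a translation of that code), the functions below find the largest number of sites in any window of a given width and estimate how often uniform random scatter produces a cluster at least that large. The names, the default of 200 trials, and the 177 to 228953 range reused from the spacing simulation are my assumptions.

```python
import random

def max_window_count(locations, width):
    """Largest number of sites falling in any window [start, start + width)."""
    locs = sorted(locations)
    best, left = 0, 0
    for right in range(len(locs)):
        # Shrink the window from the left until it spans less than `width`.
        while locs[right] - locs[left] >= width:
            left += 1
        best = max(best, right - left + 1)
    return best

def simulated_tail_probability(threshold, width, n_sites=296,
                               low=177, high=228953, trials=200, seed=0):
    """Fraction of random scatters whose biggest cluster reaches `threshold`."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sites = rng.sample(range(low, high + 1), n_sites)
        if max_window_count(sites, width) >= threshold:
            hits += 1
    return hits / trials

# Estimated chance of seeing 8 or more palindromes in some 500-base-pair window.
print(simulated_tail_probability(threshold=8, width=500))
```

Running `max_window_count` on the real palindrome locations with widths of 500 and 350 is the check that corresponds to the clusters described above.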



Who Plays Video Games?

February 19, 2010

The objective of this lab is to investigate the responses of the participants in the study with the intention of providing useful information about the students to the designers of the new computer labs.

  • Begin by providing an estimate for the fraction of students who played a video game in the week prior to the survey. Provide an interval estimate as well as a point estimate for this proportion.

The original data was recorded in 15 columns with entries across a row corresponding to answers of a single person. The parameters measured were Number of hours of video games played in the week prior to the survey (Time), Like to play (Like), Where play (Where), How often (Freq), Play if busy (Busy), Playing educational (Educ), Sex, Student’s age in years (Age), Computer at home (Home), Hate math (Math), Number of hours worked the week prior to the survey (Work), Own PC (Own), PC has CD-Rom (CDRom), Have email (Email), and Grade Expected (Grade).

First, I imported the data into MSWord to make sure the data would be represented correctly in an Excel sheet. A random sample of 95 students was taken from the class of 314, of whom 91 completed the survey; thus we will take our sample size to be 91. Of the 91 students that responded to the survey, 34 (37.36%) played video games in the last week and 57 (62.64%) did not. If we take our sample data as representative of the entire class, it would be reasonable to estimate that 37.36% of the whole class, or about 117 students, had played video games in the week prior and that 62.64% (197) had not.

I then calculated the 95% confidence interval for the proportion of students who had played video games in the last week. According to the text, this measure considers “if we were to take many simple random samples over and over, where for each example sample we compute the sample average and make a confidence interval, then we expect about 95% of the 95% confidence intervals to contain the mean.” The 95% confidence interval is the sample proportion +/- 2SD/sqrt(n) and is calculated as (0.27, 0.47) for our sample size. In other words, intervals built this way from repeated samples would contain the true proportion about 95% of the time.
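A minimal Python sketch of this interval calculation follows. The function name is mine. I assume the normal multiplier 1.96, which reproduces the quoted (0.27, 0.47) after rounding (a multiplier of exactly 2 puts the upper end slightly higher), and no correction is applied for sampling 91 students out of a finite class of 314.

```python
import math

def proportion_interval(successes, n, z=1.96):
    """Point estimate and approximate 95% interval for a proportion."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)   # standard error of the proportion
    return p_hat, p_hat - z * se, p_hat + z * se

p_hat, lower, upper = proportion_interval(34, 91)
print(round(p_hat, 4), round(lower, 2), round(upper, 2))
```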

  • Check to see how the amount of time spent playing video games in the week prior to the survey compares to the reported frequency of play (i.e., daily, weekly, etc.). How might the fact that there was an exam in the week prior to the survey affect your previous estimates and this comparison?

Frequency Played    Average Time (in hours)
Daily               4.44
Weekly              2.54
Monthly             0.06
Semesterly          0.04

To obtain this data table, I first sorted the data within Excel according to how often the students reported playing video games. I then found the average amount of time (in hours) played in the week prior for each frequency. Obviously, there would be no frequency for those that reported never playing video games.

When analyzing the accuracy of the reported data we must consider any events that may have occurred within the time frame we are observing. For example, if the week prior to the survey had been a school vacation week, the responses would be much different than if it were a finals week and students had less free time. Because there was an exam given in the class the week prior to the survey, the data may be biased downward: students may have devoted time to studying that would otherwise have been spent playing video games. This may also mean that the 95% confidence interval for the proportion of students who play video games is lower than it would be in a typical week and that a larger proportion of students do play video games.

However, upon closer consideration, among those students who did not play video games in the last week, only 6 out of the 57 (10.5%) reported playing ‘daily’ or ‘weekly’, compared to 35 (61.4%) who reported ‘semesterly’ or gave no answer, which may be interpreted as not at all. Among those that did play, 31 out of the 34 (91.2%) said they did so ‘daily’ or ‘weekly’, with the majority (24) reporting a ‘weekly’ frequency of play. From this data we can assume that most of the students who did not play video games in the week before the survey were not affected by the exam. It is less clear what impact the test had on those students who do usually play.

  • Consider making an interval estimate for the average amount of time spent playing video games in the week prior to the survey. Keep in mind the overall shape of the sample distribution. A simulation study may help determine the appropriateness of an interval estimate.

Using the same method as before, the interval for the mean time spent playing video games was calculated to be (0.8469, 1.6388) hours. This is the sample mean of about 1.24 hours plus or minus one standard error (about 0.40 hours); with the multiplier of 2 used above, the 95% interval widens to roughly (0.45, 2.03) hours. Intervals built this way from repeated random samples would contain the true mean time about 95% of the time. This interval, however, is rather wide and not very precise.

The following graph was generated using the Histogram function in Excel and is a visual representation of the reported time spent playing video games in the last week.

A simulation study can be helpful in determining if our sample data is an accurate representation of the whole population and what we should expect were we to take repeated trials of selecting 91 students from a population of 314. First, we bootstrap the data to create a population the size of the whole class. This table is from the book and was created by multiplying each sample count by the ratio 314/91 to get the proper population size.

Time (hours)    Count    Bootstrap Population
0               57       197
0.1             1        3
0.5             5        17
1               5        17
1.5             1        4
2               14       48
3               3        11
4               1        3
5               1        4
14              2        7
30              1        3
Total           91       314

This graph is a distribution of 1,000,000 trials of choosing 91 samples from our bootstrapped 314 population.

1,000,000 Trials of choosing 91 from 314
Skewness = 0.307812
Kurtosis = 2.8323
Mean = 1.21231
Standard Deviation = 0.318951

For the sample means to be normally distributed we would expect a skewness of 0 and a kurtosis near 3. The simulated values (0.31 and 2.83) show that the distribution of sample means is not exactly normal, but the departure is slight enough that we can accept it and form a new confidence interval. Because the standard deviation of the simulated sample means (0.319) already plays the role of the standard error, the 95% confidence interval for the mean time spent playing video games is 1.212 +/- 2(0.319), or about (0.57, 1.85) hours. Dividing 0.319 by sqrt(91) a second time would give the far narrower (1.1789, 1.2458), which overstates the precision of the estimate.
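The simulation can be sketched in Python using the bootstrap population from the table above. This is an illustration rather than the code that produced the numbers in the post: the function names and seed are mine, and it runs 2,000 trials instead of 1,000,000 to keep it quick. Each trial draws 91 students without replacement from the 314 and records the sample mean.

```python
import random
import statistics

# Bootstrap population from the table: playing time in hours -> number of students.
bootstrap_counts = {0: 197, 0.1: 3, 0.5: 17, 1: 17, 1.5: 4, 2: 48,
                    3: 11, 4: 3, 5: 4, 14: 7, 30: 3}
population = [t for t, c in bootstrap_counts.items() for _ in range(c)]

def simulate_sample_means(population, sample_size=91, trials=2000, seed=0):
    """Means of repeated samples drawn without replacement."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.sample(population, sample_size))
            for _ in range(trials)]

def skewness_and_kurtosis(values):
    """Standardized third and fourth moments (kurtosis is 3 for a normal)."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    n = len(values)
    skew = sum(((v - m) / sd) ** 3 for v in values) / n
    kurt = sum(((v - m) / sd) ** 4 for v in values) / n
    return skew, kurt

means = simulate_sample_means(population)
print(statistics.fmean(means), statistics.pstdev(means),
      skewness_and_kurtosis(means))
```

The standard deviation of `means` is the simulated standard error of the sample mean, which is the quantity to compare with the 0.318951 reported above.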

  • Next consider the “attitude” questions. In general, do you think the students enjoy playing video games? If you had to make a short list of the most important reasons why students like (or dislike) video games, what would you put on the list? Don’t forget that those students who say that they have never played a video game or do not at all like to play video games are asked to skip over some of these questions. So, there may be many nonresponses to the questions as to whether they think video games are educational, where they play video games, etc.

When asked if they liked to play, 69 out of the 91 students (75.8%) that responded to the question indicated that they like to play video games “very much” or “somewhat”, with 23 out of the 69 (33%) reporting “very much” (25.3% overall). These statistics would lead me to the conclusion that the students do enjoy playing video games. Students may like to play video games because they are stress-relieving, relaxing, fun, or because they enjoy the competition. Reasons for dislike could be that video games are time consuming, too difficult or hard to play, or that the students are simply not interested.

  • Look for differences between those who like to play video games and those who do not. To do this, use the questions in the last part of the survey, and make comparisons between male and female students, those who work for pay and those who do not, those who own a computer and those who do not, or those who expect A’s in the class and those who do not. Graphical displays and cross-tabulations are particularly helpful in making these kinds of comparisons. Also, you may want to collapse the range of responses to a question down to two or three possibilities before making these comparisons.

Male vs Female: 38 of the 91 responses were female (41.76%), 53 male. Females accounted for 28.5 hours or 25% of the total time played in the last week, a significant departure from what would be expected if males and females played the same amount. Males averaged about 1.6 hours in the last week, while females averaged 0.75 hrs. Thus, one could conclude that males, on average, spend twice as much time as females playing video games. However, among those students who reported playing in the last week, females played an average of 3.2 hrs and males an average of 3.38 hrs. This would indicate that players of both sexes play roughly the same amount when they are active video gamers, though males are more likely to play than females. When it comes to frequency, males also seem to play more often with 53.8% reporting playing daily or weekly compared to only 23.7% of females reporting daily or weekly.

Work vs Not: Of those students polled, 88 gave a response to whether they worked or not within the last week, with the distribution of workers and non-workers being split 50-50. Overall, the average number of hours worked by the students in the week prior to the survey was 7.4 hrs, with an average of 14.7 hrs for those that worked and 20 out of the 44 working 15 or more hours per week. The average number of hours spent playing video games for the workers was 1.08 versus 1.45 hours for those who did not work. Among those working 15 or more hours per week, the average play time was 1.1 hours. This would appear to indicate that the amount of time spent working in the last week had no bearing on how much time was spent playing. However, there is one outlier, a student who reported working 35 hrs the week prior and also played 14 hours of video games. When we remove this outlier, the average play time among the workers drops to 0.78 hrs, with an average of 0.42 hrs among those who worked 15 or more hours.



Maternal Smoking and Infant Health

February 5, 2010

What is the difference in weight between babies born to mothers who smoked during pregnancy and those who did not? Is this difference important to the health of the baby?

  • Summarize numerically the two distributions of birth weight for babies born to women who smoked during their pregnancy and for babies born to women who did not smoke during their pregnancy.

Numerical Data

Statistic (birth weight in ounces)    Smokers      NonSmokers
Min BWT                               58           55
Max BWT                               163          176
Mean                                  114.1095     123.0472
Median                                115          123
Lower Quartile                        102          113
Upper Quartile                        126          134
Standard Deviation                    18.09895     17.39869
Skewness                              -0.0337      -0.18736
Kurtosis                              0.00408      1.05221

The original data was in an unsorted list with birth weights given in ounces and a corresponding coded value of “0” for “not now”, “1” for “yes now”, or “9” for unknown as the reported smoking status of the mother.

First, I imported the data into Excel and then sorted the data according to the smoking status of the mother. I disregarded any responses of “9” and then separated the two sets of data into columns and sorted by ascending birth weight. I then generated the numerical data using statistical functions within Excel.

From this numerical analysis it appears that the data for the NonSmoking mothers is slightly skewed to lower birth weights. However, the kurtosis measures indicate that the data for the Smokers is closer to normally distributed and that the NonSmokers data is more peaked around the mean. It is also clear to see that the NonSmokers had a higher mean birth weight than the Smokers.

  • Use graphical methods to compare the two distributions of birth weight. If you make separate plots for smokers and nonsmokers, be sure to scale the axes identically for both graphs.

These histograms were created in Excel with the bin size adjusted to be the same for each graph for easier comparison. Visually, both graphs appear to be normally distributed, though from the kurtosis measure we know that they are not. These graphs make it easier to see the slight skewness in the Smokers data.

  • Compare the frequency, or incidence, of low-birth-weight babies for the two groups. How reliable do you think your estimates are? That is, how would the incidence of low birth weight change if a few more or fewer babies were classified as low birth weight?

First, we must consider how to interpret the claim that “Babies born at term that weigh under 5.5 pounds are considered small for their gestational age.” If we take this definition to mean strictly less than 5.5 pounds (88 ounces), then the frequency of low-birth-weight babies is 2.96% for the NonSmokers and 7.44% for the Smokers. This difference is already substantial. However, if we expand our definition to be inclusive and claim that those babies born at or under 5.5 pounds are low-weight, then the incidence of low-birth-weight babies rises less than 0.2% for the NonSmokers, to 3.10%, but the incidence among the Smokers jumps nearly a whole percentage point within our data, to 8.26%. This occurrence would give more strength to an argument that Smokers are more likely to have low-birth-weight babies.

  • Assess the importance of the differences you found in your three types of comparisons (numerical, graphical, incidence).

With the axes properly scaled, it appears that those mothers who smoked had babies with a higher frequency of low birth weight. However, given that the data set for the NonSmoking mothers was 50% larger than that of the Smokers, the incidence would be clearer if the sample sizes had been the same. Also, the visual impression that both data sets were nearly normal proved to be false, and there was far more clustering around the mean for the NonSmokers. Lastly, the skewness of the Smokers data becomes even more obvious under scrutiny of the incidence of low birth weight, which was much less obvious under numerical analysis.
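The sensitivity of the incidence to the cutoff definition can be illustrated with a short Python function. The function name and the ten weights below are made up for illustration and are not the study data; the point is only to show how babies weighing exactly 88 ounces move the estimate when the cutoff becomes inclusive.

```python
def low_weight_incidence(weights_oz, cutoff_oz=88, inclusive=False):
    """Fraction of babies below (or at or below) the low-birth-weight cutoff."""
    if inclusive:
        low = sum(1 for w in weights_oz if w <= cutoff_oz)
    else:
        low = sum(1 for w in weights_oz if w < cutoff_oz)
    return low / len(weights_oz)

# Hypothetical birth weights in ounces, for illustration only.
toy_weights = [72, 85, 88, 88, 95, 110, 120, 123, 131, 140]
print(low_weight_incidence(toy_weights))                  # strictly under 88 oz
print(low_weight_incidence(toy_weights, inclusive=True))  # at or under 88 oz
```

Applying the function separately to the Smokers and NonSmokers columns with both settings gives the four incidence figures discussed above.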



About author

I am a current resident of Westport, MA and senior math major at the University of Massachusetts Dartmouth. While my history with the UMD math department and CSUMS has been brief, it has been no less rewarding or exciting. I came to UMD as a business major in the Fall of 2006, promptly switched majors to Civil Engineering for half a second before finally settling on Mathematics as of Fall 2008. Thus, I had a lot of catching up to do both in my rusty math skills and familiarity with the department. I came upon the CSUMS program through a chance meeting with a fellow female mathematician here at school at the end of the last school year and ever since it’s been a whirlwind of decisions, expectations, and uncertainty. My project, as it currently stands, will consist of gathering and data mining information to create a graphical representation of a terrorist network. I then plan on applying different methods of analysis to extract useful information about the human network that the graph represents.
