In this lab, you will examine the relationship between premolt and postmolt carapace size and summarize your results both numerically and graphically.
The data for this lab were collected as part of a study of the adult female Dungeness crab. Two sets of data are provided. The first consists of premolt and postmolt shell widths of 427 female Dungeness crabs. These data, a mixture of laboratory and capture-recapture measurements, were obtained over three fishing seasons: the first two in 1981 and 1982, the third in 1992. The data are arranged in five columns: Premolt, the size of the carapace before molting; Postmolt, the size of the carapace after molting; Increment, postmolt minus premolt; Year, the collection year (not provided for recaptured crabs); and Source, 1 = lab, 0 = capture-recapture. The second set of data was collected in late May 1983, after the molting season, and consists of 362 adult female crabs of all sizes. The carapace width was recorded, along with whether or not the crab had molted in the most recent molting season. These data are arranged in two columns: Postmolt and Molt Classification, 1 = clean carapace, 0 = fouled carapace.
The first thing I did was massage the data in Word so that it could be easily read and analyzed in Excel. Once the data was in Excel, I used the Regression tool in Data Analysis to generate a graph of postmolt vs. premolt size to see whether there was a visual relationship.
This graph shows a strong linear relationship between premolt and postmolt shell sizes. The blue markers represent the crab data, and the red line is the least-squares regression line. The points on the scatter plot are closely bunched around the regression line. This linear association is measured by the correlation coefficient, a unitless measure of how well scatter plot data may be fit by a line. Positive correlation coefficients indicate that above-average values in one variable are generally associated with above-average values in the second variable, and below-average values with below-average values. Negative correlation coefficients indicate that above-average values in one variable are generally associated with below-average values in the other variable. To compute the correlation coefficient, let $(x_1, y_1), \ldots, (x_n, y_n)$ be the pairs of postmolt and premolt sizes for all laboratory crabs. Then for $\bar{x}$ the average postmolt size, $\bar{y}$ the average premolt size, and $SD(x)$ and $SD(y)$ the corresponding standard deviations, the sample correlation coefficient $r$ is computed as follows:
$$r = \frac{1}{n} \sum_{i=1}^{n} \frac{x_i - \bar{x}}{SD(x)} \cdot \frac{y_i - \bar{y}}{SD(y)}.$$
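As a minimal sketch of the correlation formula, the function below standardizes each variable and averages the products. The crab widths used here are hypothetical values made up for illustration, not the lab data, and the function name `correlation` is my own.

```python
import math

def correlation(xs, ys):
    """Sample correlation: the average product of standardized values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Standard deviations with divisor n, matching the 1/n in the formula.
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return sum(((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
               for x, y in zip(xs, ys)) / n

# Hypothetical (postmolt, premolt) widths in mm, for illustration only.
postmolt = [130.0, 138.5, 145.2, 150.1, 158.4]
premolt = [114.6, 123.0, 130.9, 135.8, 144.3]
r = correlation(postmolt, premolt)
print(round(r, 3))
```

Because both variables are standardized, the result does not depend on whether the widths are measured in millimeters or any other unit.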
In this case, Excel calculates the correlation coefficient for us, and we find it to be 0.98, a good indication that there is a strong linear relationship between premolt and postmolt shell size, one that follows approximately the equation $\text{premolt} \approx 1.073 \times \text{postmolt} - 25.21$. Given a correlation coefficient so close to 1, we may attempt to predict a crab’s premolt carapace size by plugging its postmolt carapace size into this equation.
By inserting 147.5 and 152.5 into our linear equation, we get expected values of 133.05 mm and 138.42 mm, so we would expect the premolt sizes of crabs with postmolt sizes between 147.5 mm and 152.5 mm to fall close to this range. In actuality, the observed range is from 129.8 mm to 142.5 mm, wider than expected and indicative of more varied data. When we fit a similar regression line to just this range of the data, we find less than desirable results:
Correlation coefficient with regression line
.
Repeating this for postmolt sizes from 130 mm to 135 mm, with expected premolt range (114.28, 119.64):
Correlation coefficient with regression line
.
And a final time for postmolt sizes from 154 mm to 159 mm, with expected premolt range (140.03, 145.39):
Correlation coefficient with regression line
.
The apparent lack of linearity on a small scale would seem to indicate that while postmolt size can contribute a range of expected values, accurately predicting a crab’s premolt size is highly inexact.
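The predicted premolt values quoted above all come from the same regression line, so they can serve as a consistency check on it. The sketch below refits a least-squares line through the six quoted (postmolt, predicted premolt) pairs. It uses only numbers that appear in the text, not the raw crab data, and the helper names are my own.

```python
# (postmolt size, predicted premolt size) pairs quoted in the text, in mm.
quoted = [(147.5, 133.05), (152.5, 138.42),
          (130.0, 114.28), (135.0, 119.64),
          (154.0, 140.03), (159.0, 145.39)]

def least_squares_line(points):
    """Return (slope, intercept) of the least-squares line through points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = least_squares_line(quoted)

def predict_premolt(postmolt):
    """Predicted premolt size (mm) for a given postmolt size (mm)."""
    return slope * postmolt + intercept

print(round(slope, 3), round(intercept, 2))
print(round(predict_premolt(147.5), 2), round(predict_premolt(152.5), 2))
```

A slope slightly above 1 together with a negative intercept means the predicted molt increment shrinks as the crab gets larger.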
In this data, it is assumed that a captured crab with a clean shell molted in the most recent molting period, while a shell that has had time to collect marks or barnacles, said to be “fouled,” was not shed in the most recent period.
I first separated the data based on shell condition. I then used Excel to generate a histogram distribution of each data set:
Because both graphs use the same bin range, they may be compared visually, and it is quite obvious that the molted (clean) crab data is far closer to normally distributed, while the unmolted (fouled) data is highly skewed to the left, with a long tail of small crabs. We can compare these graphs further with a numerical analysis:
| | Fouled | Clean |
| Min | 95.4 | 116.8 |
| Max | 168 | 165.1 |
| Mean | 149.1099 | 142.1134 |
| Median | 150.6 | 140.6 |
| Mode | 150.6 | 141.4 |
| Kurtosis | 6.2893 | -0.81419 |
| Standard Deviation | 11.27 | 11.398 |
| Skewness | -2.066 | 0.041 |
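The skewness and kurtosis rows are easier to read once the formulas are spelled out. The sketch below assumes the table values came from Excel's `SKEW` and `KURT` worksheet functions, which apply small-sample corrections and report kurtosis as an excess over the normal value (so a normal sample gives roughly 0, not 3). The widths in the example are hypothetical.

```python
import math

def sample_sd(xs):
    """Sample standard deviation (divisor n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))

def excel_skew(xs):
    """Skewness with the small-sample correction used by Excel's SKEW."""
    n = len(xs)
    mean, s = sum(xs) / n, sample_sd(xs)
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in xs)

def excel_kurt(xs):
    """Excess kurtosis with the correction used by Excel's KURT."""
    n = len(xs)
    mean, s = sum(xs) / n, sample_sd(xs)
    fourth = sum(((x - mean) / s) ** 4 for x in xs)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * fourth
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

# Hypothetical carapace widths in mm with one unusually small crab.
widths = [95.4, 141.0, 146.5, 149.8, 150.6, 152.3, 155.0, 158.1]
print(round(excel_skew(widths), 2), round(excel_kurt(widths), 2))
```

A single very small crab pulls the skewness negative, which is the pattern the fouled column shows.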
How do we find clusters of palindromes? How do we determine whether a cluster is just a chance occurrence or a potential replication site?
The data gives the locations of the 296 CMV palindromes that are each at least 10 base pairs long. After importing the data into Excel, I generated this graph using the Histogram function under the Data Analysis tab. It is a graph of each palindrome’s location. When testing for linearity, the R-squared value of the line of best fit is 0.998. This indicates that the data is very nearly linear. Because a uniform random sample of locations should also be nearly linear when plotted this way, a comparison of the two graphs suggests that the original data is consistent with a random scatter.
Distribution of palindrome spacings: This graph is a histogram of the distribution of spacings between consecutive palindromes in the sample data. It was created in Excel by subtracting successive palindrome locations and then using the Histogram function.
Random Spacings: This graph is a histogram of the distribution of spacings between consecutive random numbers generated between 177 and 228953 using Mathematica.
There is an observable similarity in the structure of the random and sample data, though there does seem to be a higher incidence of spacings less than 500 in the sample data than in the random data. Still, the similarities further support our initial hypothesis that the sample data departs little from a uniform scatter. What follows are the distributions of consecutive pair spacings and triple spacings for both sets of data, where the similarities between the two data sets continue.
Distribution of Consecutive Pair Spacings:
Random Consecutive Pair Spacings:
Distribution of Consecutive Triple Spacings:
Random Consecutive Triple Spacings:
These graphs would indicate that the CMV DNA data is closely related to what we would expect to see in a random scatter though we do continue to see a higher incidence of grouping in the sample data as compared to the random scatter.
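The random comparison data above can be reproduced with a short sketch. It assumes that a "pair spacing" is the distance between palindromes two positions apart (the sum of two adjacent spacings) and a "triple spacing" the distance between palindromes three apart. The seed and the function name `spacings` are my own choices, and Python stands in for the Mathematica used in the lab.

```python
import random

def spacings(locations, k=1):
    """Distances between each location and the one k positions later."""
    ordered = sorted(locations)
    return [ordered[i + k] - ordered[i] for i in range(len(ordered) - k)]

rng = random.Random(0)  # fixed seed (an arbitrary choice) for repeatability
# 296 uniform random locations over the same range as the palindrome data.
random_locations = [rng.randint(177, 228953) for _ in range(296)]

consecutive = spacings(random_locations, k=1)
pairs = spacings(random_locations, k=2)    # sums of two adjacent spacings
triples = spacings(random_locations, k=3)  # sums of three adjacent spacings

# Share of consecutive spacings shorter than 500 base pairs.
short_share = sum(1 for s in consecutive if s < 500) / len(consecutive)
print(round(short_share, 3))
```

Running the same `spacings` function on the real palindrome locations would give the sample histograms to compare against.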
The Poisson process model gives the chance that there are $k$ points in a unit interval as $\frac{\lambda^k}{k!} e^{-\lambda}$ for $k = 0, 1, \ldots$, where $\lambda$ is the rate of hits per unit area. In our case, we want to use the homogeneous Poisson process as a reference model against which to seek an excess of palindromes, and we can do this because our data fits the uniform random scatter well, as shown above.
First, we divide the first 228,000 base pairs of CMV DNA into 57 nonoverlapping intervals of 4000 base pairs and tally the number of complementary palindromes in each segment. Because these segments cover only the first 228,000 base pairs, we now consider a total of only 294 palindromes. The distribution of these counts is shown in the following table:
| Palindrome Count | Intervals Observed | Intervals Expected |
| 0 – 2 | 7 | 6.4 |
| 3 | 8 | 7.5 |
| 4 | 10 | 9.7 |
| 5 | 9 | 10.0 |
| 6 | 8 | 8.6 |
| 7 | 5 | 6.3 |
| 8 | 4 | 4.1 |
| 9+ | 6 | 4.5 |
| Total | 57 | 57 |
The last column gives the expected number of segments containing the specified number of palindromes, as computed from the Poisson distribution. The expected number of intervals with 0, 1, or 2 palindromes is 57 × the probability of 0, 1, or 2 hits in an interval $= 57 \times e^{-\lambda}\left(1 + \lambda + \frac{\lambda^2}{2}\right)$. However, the rate $\lambda$ is not known. There are 294 palindromes in the 57 intervals of length 4000, so the sample rate is 5.16 per 4000 base pairs. Plugging this estimate into the calculation above yields a 0.112 chance that a length of 4000 base pairs contains 0, 1, or 2 palindromes. Then the approximate expected number of segments containing 0, 1, or 2 palindromes is 57 × 0.112, or 6.4. The remaining expectations were calculated in the same way.
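The expected column can be recomputed directly from the Poisson formula. This is a minimal sketch using the rate 294/57 from the text; the grouping into "0 to 2" and "9 or more" follows the table, and the variable names are my own.

```python
import math

def poisson_pmf(k, lam):
    """Chance of exactly k hits in an interval when the rate is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

n_intervals = 57
lam = 294 / n_intervals  # estimated rate per 4000 base pairs, about 5.16

p_low = sum(poisson_pmf(k, lam) for k in range(3))   # 0, 1, or 2 palindromes
p_mid = [poisson_pmf(k, lam) for k in range(3, 9)]   # exactly 3 through 8
p_high = 1.0 - p_low - sum(p_mid)                    # 9 or more

# Expected number of intervals in each row of the table.
expected = [n_intervals * p for p in [p_low, *p_mid, p_high]]
print([round(e, 1) for e in expected])
```

Taking the last category as one minus the others guarantees that the expected counts add up to 57.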
When we compare the observed counts to the expected counts, we use the $\chi^2$ (chi-square) distribution. With a probability of 0.98, we see that deviations as large as those observed are very likely, and so it appears that the Poisson is a reasonable initial model.
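The goodness-of-fit comparison can be sketched from the table alone. The degrees of freedom are an assumption on my part: 8 categories, minus 1, minus 1 more for the estimated rate, giving 6. The upper-tail helper uses a closed form that is valid only for an even number of degrees of freedom, which keeps the example free of external libraries.

```python
import math

observed = [7, 8, 10, 9, 8, 5, 4, 6]
expected = [6.4, 7.5, 9.7, 10.0, 8.6, 6.3, 4.1, 4.5]

# Chi-square statistic: squared deviations scaled by the expected counts.
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Assumed degrees of freedom: categories - 1 - (one estimated parameter).
df = len(observed) - 1 - 1

def chi_square_upper_tail(x, df):
    """P(chi-square with df degrees of freedom > x); even df only."""
    if df % 2 != 0:
        raise ValueError("closed form requires an even number of degrees of freedom")
    half = x / 2
    return math.exp(-half) * sum(half ** j / math.factorial(j)
                                 for j in range(df // 2))

p_value = chi_square_upper_tail(chi_square, df)
print(round(chi_square, 2), round(p_value, 2))
```

A probability this close to 1 says that random Poisson counts would deviate from their expectations at least this much almost every time.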
Each of these graphs was generated in Excel by specifying the bin size in the Histogram tool. The first graph is the CMV DNA data grouped with a bin size of 10,000. With such a large bin size we find a relative uniformity across the data, as would be expected in a random scatter and as can be seen in the same graph of the random data. Here we can even see a similar spike in the data around 100,000 to 110,000.
Random Data:
Now, as the lab suggests, we break the bins down into smaller intervals (5,000) so that we may better see potential spikes in the data.
We can see with this representation of the data that, while the random data has two or three potential spikes, our palindrome data appears to have a very obvious spike somewhere in the region of 95,000 to 110,000. As a class, we discovered that by sliding a box of a fixed length along this range of data, we could find where the spike in the data was and better identify the region as a possible replication site. Using the Mathematica code found here (http://mth332s09.wordpress.com/11-mathematica-code-for-palindrome-simulation/) we were able to identify the largest cluster at roughly 92,000, with a count of 8 palindromes in a window of 500 base pairs. And according to Kristi’s blog, “Further fiddling shows we can get a cluster of 8 palindromes in a smaller window of just size 350 occurring around 92,500.”
We are now interested in whether a cluster of this size is statistically significant. After repeated simulations with random data sets, it is clear that a cluster of 8 is very unlikely to have occurred by chance. It is thus statistically significant, and we may consider this region a potential replication site for further investigation.
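The sliding-box search and the significance check can be sketched as follows. This is not the class's Mathematica code. It is a Python stand-in under my own assumptions: a window is the closed range from a location to that location plus 500, the random scatters use 296 uniform locations between 177 and 228953, and 200 trials with a fixed seed are enough for a rough estimate.

```python
import random

def max_window_count(locations, width):
    """Largest number of locations falling in any window [x, x + width]."""
    ordered = sorted(locations)
    best, start = 0, 0
    for end, right in enumerate(ordered):
        # Advance the left edge until the window spans at most `width`.
        while right - ordered[start] > width:
            start += 1
        best = max(best, end - start + 1)
    return best

def chance_of_cluster(cluster_size, width, n_points=296,
                      low=177, high=228953, trials=200, seed=0):
    """Fraction of random scatters whose largest cluster reaches cluster_size."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        scatter = [rng.randint(low, high) for _ in range(n_points)]
        if max_window_count(scatter, width) >= cluster_size:
            hits += 1
    return hits / trials

estimate = chance_of_cluster(cluster_size=8, width=500)
print(estimate)
```

Calling `max_window_count` on the real palindrome locations with widths of 500 and 350 would reproduce the cluster search itself.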
The objective of this lab is to investigate the responses of the participants in the study with the intention of providing useful information about the students to the designers of the new computer labs.
The original data was recorded in 15 columns with entries across a row corresponding to answers of a single person. The parameters measured were Number of hours of video games played in the week prior to the survey (Time), Like to play (Like), Where play (Where), How often (Freq), Play if busy (Busy), Playing educational (Educ), Sex, Student’s age in years (Age), Computer at home (Home), Hate math (Math), Number of hours worked the week prior to the survey (Work), Own PC (Own), PC has CD-Rom (CDRom), Have email (Email), and Grade Expected (Grade).
First, I imported the data into MS Word to make sure the data would be represented correctly in an Excel sheet. A random sample of 95 students was taken from the class of 314, of whom 91 completed the survey; thus we will take our sample size to be 91. Of the 91 students who responded to the survey, 34 (37.36%) played video games in the last week and 57 (62.64%) did not. If we take our sample as representative of the entire class, it is reasonable to estimate that 37.36% of the whole class, or about 117 students, had played video games in the week prior and that 62.64% (197) had not.
I then calculated the 95% confidence interval for the proportion of students who had played video games in the last week. According to the text, this measure considers “if we were to take many simple random samples over and over, where for each example sample we compute the sample average and make a confidence interval, then we expect about 95% of the 95% confidence intervals to contain the mean.” The 95% confidence interval is the mean +/- 2SD/sqrt(n) and is calculated as (0.27, 0.47) for our sample size. That is, intervals constructed this way would contain the true proportion about 95% of the time.
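The interval for the proportion can be worked through in a few lines. This sketch follows the formula as stated, mean +/- 2SD/sqrt(n), where for a 0/1 variable the SD is sqrt(p(1 - p)). It applies no correction for sampling from a finite class of 314, which is my reading of the method rather than something the text spells out.

```python
import math

n = 91        # students who completed the survey
played = 34   # students who played in the week before the survey

p_hat = played / n                    # sample proportion, about 0.374
sd = math.sqrt(p_hat * (1 - p_hat))   # SD of a 0/1 variable
half_width = 2 * sd / math.sqrt(n)    # two standard errors

lower, upper = p_hat - half_width, p_hat + half_width
print(round(lower, 3), round(upper, 3))
```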
| Frequency Played | Time (in hours) |
| Daily | 4.44 |
| Weekly | 2.54 |
| Monthly | 0.06 |
| Semesterly | 0.04 |
To obtain this table I first sorted the data in Excel according to how often respondents reported playing video games. I then found the average amount of time (in hours) played in the week prior for each frequency. Naturally, there is no row for those who reported never playing video games.
When analyzing the accuracy of the reported data, we must consider any events that may have occurred within the time frame we are observing. For example, if the week prior to the survey had been a school vacation week, the responses would be much different than if it were a finals week and students had less free time. Because there was an exam given in the class the week prior to the survey, we should consider the possibility of bias in the data: students may have devoted time to studying that would otherwise have been spent playing video games. This may also mean that our 95% confidence interval for the proportion of students who play video games is too low and that a larger proportion of students do play video games.
However, upon closer consideration, among those students who did not play video games in the last week, only 6 of the 57 (10.5%) reported playing ‘daily’ or ‘weekly’, compared to 35 (61.4%) who reported ‘semesterly’ or gave no answer, which may be interpreted as not at all. Among those who did play, 31 of the 34 (91.2%) said they did so ‘daily’ or ‘weekly’, with the majority (24) reporting a ‘weekly’ frequency of play. From this data we can assume that those students who did not play video games in the week before the survey were not affected by the exam; however, it is less clear what impact the test had on those students who do usually play.
Using the same method as before, the 95% confidence interval for the mean time spent playing video games was calculated to be (0.8469, 1.6388) hours. That is, with repeated random sampling, intervals constructed this way would contain the true mean time spent playing video games in the last week about 95% of the time. This interval, however, is rather wide and not very specific.
The following graph was generated using the Histogram function in Excel and is a visual representation of the reported time spent playing video games in the last week.
A simulation study can be helpful in determining whether our sample data is an accurate representation of the whole population and what we should expect were we to take repeated samples of 91 students from a population of 314. First, we bootstrap the data to create a population the size of the whole class. This table is from the book and was created by multiplying each sample count by the ratio 314/91 to get the proper population size.
| Time | Count | Bootstrap Population |
| 0 | 57 | 197 |
| 0.1 | 1 | 3 |
| 0.5 | 5 | 17 |
| 1 | 5 | 17 |
| 1.5 | 1 | 4 |
| 2 | 14 | 48 |
| 3 | 3 | 11 |
| 4 | 1 | 3 |
| 5 | 1 | 4 |
| 14 | 2 | 7 |
| 30 | 1 | 3 |
| Total | 91 | 314 |
This graph is the distribution of the sample averages from 1,000,000 trials of choosing 91 students from our bootstrapped population of 314.
1,000,000 Trials of choosing 91 from 314
Skewness = 0.307812
Kurtosis = 2.8323
Mean = 1.21231
Standard Deviation = 0.318951
For our data to be normally distributed, we would expect a skewness of 0 and a kurtosis near 3. It is clear from the numerical analysis of our simulation and bootstrapping that the data is not exactly normally distributed, but we can accept the slight departure and compute a new confidence interval. The 95% confidence interval for the mean time spent playing video games per week is (1.1789, 1.2458) hours. It is pertinent to point out that these values fall within the original confidence interval but are more precise, which would indicate that they are a better estimate.
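The simulation described above can be sketched with the bootstrap population table. This version draws 91 students without replacement from the 314 and records the sample average each time. It runs 5,000 trials rather than 1,000,000 to keep it quick, and the seed and function name are my own choices.

```python
import random
import statistics

# Bootstrap population from the table: hours played -> number of students.
bootstrap_counts = {0: 197, 0.1: 3, 0.5: 17, 1: 17, 1.5: 4, 2: 48,
                    3: 11, 4: 3, 5: 4, 14: 7, 30: 3}
population = [hours for hours, count in bootstrap_counts.items()
              for _ in range(count)]

def simulate_sample_means(population, sample_size=91, trials=5000, seed=0):
    """Average hours in each of `trials` samples drawn without replacement."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.sample(population, sample_size))
            for _ in range(trials)]

means = simulate_sample_means(population)
print(round(statistics.fmean(means), 3), round(statistics.pstdev(means), 3))
```

The spread of `means` is itself an estimate of the standard error of the sample average, since each entry is one possible sample average.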
When asked if they liked to play, 69 of the 91 students who responded (75.8%) indicated that they like to play video games “very much” or “somewhat,” with 23 of the 69 (33%) reporting “very much” (25.3% overall). These statistics lead me to conclude that the students do enjoy playing video games. Students may like to play video games because they are stress-relieving, relaxing, fun, or because they enjoy the competition. Reasons for dislike could be that video games are time consuming, too difficult or hard to play, or that they are simply not interested.
Male vs Female: 38 of the 91 responses were female (41.76%), 53 male. Females accounted for 28.5 hours or 25% of the total time played in the last week, a significant departure from what would be expected if males and females played the same amount. Males averaged about 1.6 hours in the last week, while females averaged 0.75 hrs. Thus, one could conclude that males, on average, spend twice as much time as females playing video games. However, among those students who reported playing in the last week, females played an average of 3.2 hrs and males an average of 3.38 hrs. This would indicate that players of both sexes play roughly the same amount when they are active video gamers, though males are more likely to play than females. When it comes to frequency, males also seem to play more often with 53.8% reporting playing daily or weekly compared to only 23.7% of females reporting daily or weekly.
Work vs Not: Of those students polled, 88 gave a response to whether they worked or not within the last week, with workers and non-workers split 50-50. Overall, the average number of hours worked by the students in the week prior to the survey was 7.4 hrs, with an average of 14.7 hrs for those that worked and 20 of the 44 working 15 or more hours per week. The average number of hours spent playing video games was 1.08 for the workers versus 1.45 for those who did not work. Among those working 15 or more hours per week, the average play time was 1.1 hours. This would appear to indicate that the amount of time spent working in the last week had no bearing on how much time was spent playing. However, there is one outlier, a student who reported working 35 hrs the week prior and also played 14 hours of video games. When we remove this outlier, the average play time among the workers drops to 0.78 hrs, with an average of 0.42 hrs among those who worked 15 or more hours.
What is the difference in weight between babies born to mothers who smoked during pregnancy and those who did not? Is this difference important to the health of the baby?
Numerical Data
| | Smokers | NonSmokers |
| Min BWT | 58 | 55 |
| Max BWT | 163 | 176 |
| Mean | 114.1095 | 123.0472 |
| Median | 115 | 123 |
| Lower Quartile | 102 | 113 |
| Upper Quartile | 126 | 134 |
| Standard Deviation | 18.09895 | 17.39869 |
| Skewness | -0.0337 | -0.18736 |
| Kurtosis | 0.00408 | 1.05221 |
The original data was an unsorted list with birth weights given in ounces and a corresponding code for the reported smoking status of the mother: “0” for “not now”, “1” for “yes now”, or “9” for unknown.
First, I imported the data into Excel and then sorted the data according to the smoking status of the mother. I disregarded any responses of “9” and then separated the two sets of data into columns and sorted by ascending birth weight. I then generated the numerical data using statistical functions within Excel.
From this numerical analysis it appears that the data for the NonSmoking mothers is slightly skewed to lower birth weights. However, the kurtosis measures indicate that the data for the Smokers is closer to normally distributed and that the NonSmokers data is more peaked around the mean. It is also clear to see that the NonSmokers had a higher mean birth weight than the Smokers.
These histograms were created in Excel with the bin size adjusted to be the same for each graph for easier comparison. Visually, both graphs appear to be normally distributed, though from the kurtosis measure we know that they are not. These graphs make it easier to see the slight skewness in the Smokers data.

First, we must consider how to interpret the claim that “Babies born at term that weigh under 5.5 pounds are considered small for their gestational age.” If we take this definition to mean strictly less than 5.5 pounds (88 ounces), then the frequency of low-birth-weight babies is 2.96% for the NonSmokers and 7.44% for the Smokers. This difference is already significant; however, if we expand our definition to be inclusive and count babies born at or under 5.5 pounds as low-weight, then the incidence of low-birth-weight babies rises by less than 0.2 percentage points for the NonSmokers, to 3.10%, but the incidence among the Smokers jumps nearly a whole percentage point within our data, to 8.26%. This occurrence gives more strength to the argument that Smokers are more likely to have low-birth-weight babies.
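The effect of the strict versus inclusive definition is easy to see in a small sketch. The records below are hypothetical (birth weight in ounces, smoking code) pairs made up for illustration, not the lab data; the codes 0, 1, and 9 follow the description of the original file.

```python
OUNCES_PER_POUND = 16
threshold = 5.5 * OUNCES_PER_POUND  # 5.5 pounds is 88 ounces

def low_weight_rate(weights, threshold, inclusive=False):
    """Share of babies under (or, if inclusive, at or under) the threshold."""
    if inclusive:
        low = [w for w in weights if w <= threshold]
    else:
        low = [w for w in weights if w < threshold]
    return len(low) / len(weights)

# Hypothetical (birth weight in ounces, smoking code) records.
records = [(120, 0), (88, 1), (75, 1), (130, 0),
           (88, 0), (110, 9), (101, 1), (86, 0)]

# Drop unknown status (code 9) and split by smoking status.
smokers = [w for w, code in records if code == 1]
nonsmokers = [w for w, code in records if code == 0]

for label, group in [("Smokers", smokers), ("NonSmokers", nonsmokers)]:
    strict = low_weight_rate(group, threshold)
    loose = low_weight_rate(group, threshold, inclusive=True)
    print(label, round(strict, 3), round(loose, 3))
```

Only babies weighing exactly 88 ounces move between categories, so the gap between the two definitions depends entirely on how many weights sit right at the threshold.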
With the axes properly scaled, it appears that mothers who smoked had a higher frequency of low-birth-weight babies. However, given that the NonSmoking sample was 50% larger than that of the Smokers, the incidence would be clearer if the sample sizes had been the same. Also, the visual impression that both data sets were nearly normal proved to be false, and there was far more clustering around the mean for the NonSmokers. Lastly, the skewness of the Smokers data becomes even more obvious under scrutiny of the incidence of low birth weight, which was much less obvious under numerical analysis.