Lab 6: Sampling distributions and the Central Limit Theorem
In this lab, you will explore the Central Limit Theorem, a result in probability theory that is the basis for most of the statistical inference techniques that we will study. The Central Limit Theorem states that if n is large and X1, X2, ..., Xn are independent and identically distributed (i.i.d.) random variables with expected value μ and standard deviation σ, then the mean of these random variables has approximately a normal distribution, with mean μ and standard deviation σ/√n. You will observe the Central Limit Theorem both by simulating random variables and by taking a sample from a real population. This lab should also help you recognize when you should expect a distribution to be approximately normal.
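As a quick numerical check of the statement above, here is a minimal Python sketch. It is not part of the Minitab procedure. The choice of the Uniform(0, 1) distribution, the seed, n = 30, and the 5000 repetitions are all illustrative assumptions. The sketch compares the observed standard deviation of the simulated sample means with the predicted value σ/√n.

```python
import random
import statistics


def sample_means(draw, n, reps, rng):
    """Return `reps` sample means, each computed from `n` i.i.d. draws."""
    return [statistics.fmean(draw(rng) for _ in range(n)) for _ in range(reps)]


rng = random.Random(6)  # fixed seed (arbitrary) so the run is repeatable

# Uniform(0, 1) has mu = 1/2 and sigma = sqrt(1/12).
mu = 0.5
sigma = (1 / 12) ** 0.5
n = 30  # illustrative sample size

means = sample_means(lambda r: r.random(), n, reps=5000, rng=rng)

observed_mean = statistics.fmean(means)
observed_sd = statistics.stdev(means)
predicted_sd = sigma / n ** 0.5  # the sigma / sqrt(n) from the theorem

print(f"mean of sample means: {observed_mean:.4f} (theory {mu})")
print(f"SD of sample means:   {observed_sd:.4f} (theory {predicted_sd:.4f})")
```

The two printed pairs should agree closely but not exactly, since the observed values come from a finite simulation.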
Begin this lab by opening up a blank Minitab worksheet. You will not look at data until a bit later on.
Normal Probability Plots
Normal probability plots are useful for determining whether a distribution is approximately normal. To see how they work, first take a sample of size 100 from a normal distribution by going to Calc --> Random Data --> Normal. Generate 100 rows of data and store the results in, say, C1. Make a histogram, and you should see approximately a bell-shaped curve, although it will not be perfectly bell-shaped because it is a relatively small sample. If you wish, you can get a normal curve superimposed on your histogram by drawing a histogram "With Fit" rather than a "Simple" histogram.
In Minitab Express, you get this sample by going to Data --> Generate Random Data. Enter 100 for the number of rows in each column, and leave the distribution as "Normal", then click "OK". To get a normal curve superimposed on the histogram, click inside the histogram, then click the plus sign to the right of the histogram and check the box for "Distribution Fit".
Now make a normal probability plot. Go to Graph --> Probability Plot, then click "OK" and select the appropriate variable to graph. There will be a point on the graph for each of the 100 simulated values. The x-coordinate of the point is the value itself. The y-coordinate is the percentile (which will be, say, 30 if 30 percent of the 100 numbers are below the number in question), but plotted on a scale chosen so that the graph should look like a straight line if the data come from a normal distribution.
Next try simulating 100 values from the exponential distribution by going to Calc --> Random Data --> Exponential. Again make a histogram and a normal probability plot. This time the histogram should not look bell-shaped, and the normal probability plot should not look straight. Try the same thing for a uniform distribution, from which you can simulate using Calc --> Random Data --> Uniform.
Finally, to see how things look when the data are close to normal but not quite normal, simulate 500 values from the chi-square distribution (don't worry that you don't know yet what this is) by going to Calc --> Random Data --> Chi-Square. Enter 15 for the Degrees of Freedom. Then plot a histogram (you should see a little skewness) and a normal probability plot. The curvature in the normal probability plot will likely be rather apparent, perhaps easier to see than the skewness in the histogram.
In Minitab Express, you make the normal probability plot by going to Graphs --> Probability Plot, then selecting "Simple" under "Single Y Variable". Then select the appropriate variable to graph and click "OK". You can perform the simulations by going to Data --> Generate Random Data and selecting the appropriate distribution (either Exponential, Uniform, or Chi-Square).
The normal probability plot is very helpful for deciding whether data come from a distribution that is approximately normal. A good way to check for normality is to see whether the normal probability plot looks straight. If it does not, then it is typically easiest to look at a histogram to determine in what way the data differ from normality.
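If you would like to see the idea behind a normal probability plot outside of Minitab, the following Python sketch computes the plot's coordinates by hand. It pairs each sorted value with the standard normal quantile of its percentile, which is equivalent to plotting the percentile on a normal probability scale. The plotting position (i - 0.5)/n is one common convention and is an assumption here, so Minitab's points may differ slightly. The correlation used as a "straightness" score, the helper names, and the seed are likewise illustrative choices, not something Minitab reports in this form.

```python
import random
import statistics


def normal_plot_points(data):
    """Pair each sorted value with the normal quantile of its plotting position."""
    std_normal = statistics.NormalDist()
    ordered = sorted(data)
    n = len(ordered)
    # (i - 0.5) / n is an assumed plotting position for the i-th smallest value.
    return [(x, std_normal.inv_cdf((i - 0.5) / n))
            for i, x in enumerate(ordered, start=1)]


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def straightness(data):
    """Correlation between the values and their normal scores (1 = straight line)."""
    points = normal_plot_points(data)
    return pearson([x for x, _ in points], [z for _, z in points])


rng = random.Random(100)  # arbitrary fixed seed
normal_data = [rng.gauss(0, 1) for _ in range(100)]
expo_data = [rng.expovariate(1.0) for _ in range(100)]

r_normal = straightness(normal_data)
r_expo = straightness(expo_data)
print(f"normal sample:      r = {r_normal:.3f}")
print(f"exponential sample: r = {r_expo:.3f}")
```

A value of r very close to 1 corresponds to a plot that looks straight. The exponential sample should give a noticeably smaller value, matching the curvature you see in Minitab.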
You do not need to submit any of the graphs that you made in this section, but you will need to understand how to interpret normal probability plots to do the rest of the lab.
Observing the Central Limit Theorem by simulation
You have already taken a sample of 100 values from the exponential distribution and examined a histogram. Of course, it did not look like a normal distribution, which should not have been a surprise because the numbers came from an exponential distribution, not a normal distribution. However, the Central Limit Theorem tells us that the average of a large number of i.i.d. random variables from any distribution with a finite standard deviation should be approximately normally distributed. Here you will investigate this by simulating averages of exponentially distributed random variables.
First take 100 samples of size 2 from the exponential distribution. To do this, go to Calc --> Random Data --> Exponential as before, but generate only 2 rows of data. Also, give a range of 100 columns in which to store the data by typing C1-C100. To find the means of your 100 samples, go to Stat --> Basic Statistics --> Store Descriptive Statistics, type C1-C100 in the "Variables" box (if those were the columns you used), then click on the "Statistics" box and make sure only the "Mean" box is checked, then click "OK" twice. Minitab will store the sample means in different columns, probably C101-C200. To put them in one column, go to Data --> Transpose Columns, type in the column numbers of the columns in which the means are stored, and then click on the bubble "after last column in use". Now your 100 sample means are in a column (probably C202) and can be treated as a variable.
In Minitab Express, the easiest way to do this is to go to Data --> Generate Random Data. Generate 2 columns of data with 100 rows in each column. This means you will think of each row, rather than each column, as a sample of size 2. Select the Exponential distribution, and leave the scale as 1 and the threshold as 0. Then to obtain the 100 sample means, go to Statistics --> Summary Statistics --> Row Statistics, highlight both columns that you created and double click to get them into the box "Store row statistics for the following columns". Then under "Statistics:", make sure the Mean is checked but nothing else. Then click "OK". You should get a column containing your 100 sample means.
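For comparison, the same simulation can be sketched in Python. This is only an illustration under stated assumptions: the scale of 1 matches the Minitab Express default mentioned above, while the seed, the function names, and the use of a skewness coefficient as a numerical summary of shape are choices made for this sketch.

```python
import random
import statistics


def exponential_sample_means(n, reps, rng):
    """Means of `reps` samples, each of size `n`, from Exponential(scale = 1)."""
    return [statistics.fmean(rng.expovariate(1.0) for _ in range(n))
            for _ in range(reps)]


def skewness(values):
    """Sample skewness coefficient; 0 for a perfectly symmetric data set."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean(((v - m) / s) ** 3 for v in values)


rng = random.Random(2)  # arbitrary fixed seed
n = 2                   # sample size; rerun with other values to compare
means = exponential_sample_means(n, reps=100, rng=rng)

# For Exponential(scale = 1), mu = 1 and sigma = 1, so the sample mean
# has expected value 1 and standard deviation 1 / sqrt(n).
print(f"n = {n}")
print(f"mean of the 100 sample means: {statistics.fmean(means):.3f}")
print(f"SD of the 100 sample means:   {statistics.stdev(means):.3f}")
print(f"skewness of the sample means: {skewness(means):.3f}")
```

Changing `n` and rerunning gives the corresponding summaries for other sample sizes. The histogram and normal probability plot for your write-up should still come from Minitab.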
- Provide a histogram and a normal probability plot of the 100 sample means. Is the distribution of the sample mean approximately normal? Is it closer to being bell-shaped than the exponential distribution itself?
- Repeat the above steps with 100 samples of size 5 and 100 samples of size 30. How does the shape of the distribution of your 100 sample means change as the sample size gets larger? Relate your conclusions to the Central Limit Theorem. Provide some plots to support your conclusions.
Unfortunately, it is inconvenient to obtain the means of the 100 samples of size 30 in Minitab Express, because the Statistics --> Summary Statistics --> Row Statistics command will only handle up to 12 columns of data. To get around this problem, you can first average columns 1-10, then average columns 11-20, then average columns 21-30. Finally, take the average of the three columns of averages, which will give you the same result as if you had simply averaged columns 1-30 in one step.
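The two-step averaging works only because the three blocks have the same size. The short Python sketch below checks this with made-up numbers (the values 1 to 30 stand in for one row of 30 simulated values, and the variable names are illustrative). It also shows that the shortcut gives the wrong answer when the blocks are unequal.

```python
import statistics

row = list(range(1, 31))  # made-up stand-in for one sample of size 30

# Equal blocks of 10: the average of the block averages equals the overall average.
block_means = [statistics.fmean(row[i:i + 10]) for i in range(0, 30, 10)]
two_step = statistics.fmean(block_means)
one_step = statistics.fmean(row)
print(two_step, one_step)

# Unequal blocks (10, 10, and 5 values): the unweighted shortcut no longer works.
short_row = row[:25]
uneven_means = [statistics.fmean(short_row[0:10]),
                statistics.fmean(short_row[10:20]),
                statistics.fmean(short_row[20:25])]
uneven_two_step = statistics.fmean(uneven_means)
uneven_one_step = statistics.fmean(short_row)
print(uneven_two_step, uneven_one_step)
```

If you ever need blocks of different sizes, weight each block average by its block size before combining them.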
The sampling distribution of a sample mean
Now open the data set MSNBC, which is available in TritonEd.
We have data on the length of visits to the msnbc.com web site. Each number in the first column is an estimate of the total number of pages seen by a user during the visit to the site.
These data were obtained from the Journal of Statistics Education Data Archive. The data set available there (which itself is just a subset of the data used for the full study) contains 50,000 observations, but we will work with just 2000. For the purposes of this activity, we will pretend that these 2000 observations constitute the entire population of visits to the web site, rather than just a sample.
- Examine the data on the 2000 web site visits. What is the maximum value (the largest number of web pages visited)? What are the mean and standard deviation? (You will think of these numbers as a population mean and a population standard deviation throughout this activity.) How would you describe the shape of the distribution? Is it approximately normal?
We will investigate what happens when we take samples from this "population". Let's first take a sample of size 10. To do this, go to Calc --> Random Data --> Sample from Columns. Type that you will sample 10 rows, select the variable "Pages" to sample from, and store your data in, for example, column C4. (Note: you will be sampling without replacement, but this does not matter much because the population size is so much larger than the sample size.) Then take two more samples of size 10.
In Minitab Express, you get your samples by going to Data --> Sample from Columns. Select the variable Pages, input 10 for the number of rows in each sample, and click "OK".
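The sampling step can also be sketched in Python. Because the MSNBC data are not reproduced here, the sketch uses a synthetic, right-skewed stand-in "population" of 2000 page counts. That population, the seed, and the variable names are hypothetical and do not come from the real data set.

```python
import random
import statistics

rng = random.Random(10)  # arbitrary fixed seed

# Hypothetical stand-in for the "Pages" column: 2000 skewed counts, each >= 1.
population = [1 + int(rng.expovariate(0.4)) for _ in range(2000)]
population_mean = statistics.fmean(population)

# random.sample draws without replacement, like "Sample from Columns".
samples = [rng.sample(population, 10) for _ in range(3)]
sample_means = [statistics.fmean(s) for s in samples]

print(f"population mean: {population_mean:.2f}")
for i, m in enumerate(sample_means, start=1):
    print(f"mean of sample {i}: {m:.2f}")
```

Each run with a different seed gives three different sample means, which is exactly the sample-to-sample variability this section asks you to study.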
To get a better sense of the sampling distribution of the sample mean, we will examine the means of 100 samples of size 10. Because this is tedious to do by hand, the means of 97 samples of size 10 have already been provided for you in the column labeled "Mean10". The first three cells in this column are left blank. Compute the means of the three samples of size 10 that you took above (you will report them in the question below) and enter them into those three cells by hand. You will then have a column consisting of the means of 100 samples of size 10.
- What are the sample means of your three samples? How do these compare to the population mean?
- Consider your 100 sample means from the samples of size 10. What are the mean and standard deviation of these numbers?
- How do the numbers that you calculated in the previous question compare to what you would theoretically expect? (Hint: because you know the population mean and standard deviation, you should be able to compute the expected value and standard deviation of the mean of a sample of size 10.)
- Is the distribution of the 100 sample means approximately normal? If not, is it closer to normal than the distribution of the original population? Examine both a histogram and a normal probability plot of the data to arrive at your answer. Make sure to provide the plots with your write-up.
Next, instead of samples of size 10, we will work with samples of size 60. Start by taking three samples of size 60. Put the means of these three samples into the first three cells of the column labeled "Mean60". The means of 97 other samples of size 60 are already provided in this column, so you will end up with a column consisting of the means of 100 samples of size 60.
- What are the sample means of the three samples of size 60 that you took?
- Now consider all 100 sample means. What are the mean and standard deviation of these numbers?
- How do the mean and standard deviation of your 100 sample means compare to what you would theoretically expect?
- Do the sample means of the samples of size 60 tend to be closer to the population mean or further from the population mean than the sample means of the samples of size 10? Explain your answer. You may wish to examine side-by-side boxplots of the means of the samples of size 10 and the means of the samples of size 60.
- This time, is the distribution of the 100 sample means approximately normal? Explain your answer, and discuss in a few sentences how your answer relates to the Central Limit Theorem.
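To see how the spread of the sample mean depends on the sample size, the sketch below repeats the sampling experiment in Python for samples of size 10 and size 60. It again uses a hypothetical synthetic population in place of the MSNBC data, so its numbers will not match yours. The seed and the function names are also assumptions of this sketch.

```python
import random
import statistics


def means_of_samples(population, n, reps, rng):
    """Means of `reps` samples of size `n`, each drawn without replacement."""
    return [statistics.fmean(rng.sample(population, n)) for _ in range(reps)]


rng = random.Random(60)  # arbitrary fixed seed

# Hypothetical stand-in population of 2000 skewed counts (not the real data).
population = [1 + int(rng.expovariate(0.4)) for _ in range(2000)]
mu = statistics.fmean(population)       # population mean
sigma = statistics.pstdev(population)   # population standard deviation

results = {}
for n in (10, 60):
    means = means_of_samples(population, n, reps=100, rng=rng)
    # Store (observed mean, observed SD, theoretical SD sigma / sqrt(n)).
    results[n] = (statistics.fmean(means), statistics.stdev(means), sigma / n ** 0.5)

print(f"population: mu = {mu:.3f}, sigma = {sigma:.3f}")
for n, (obs_mean, obs_sd, theory_sd) in results.items():
    print(f"n = {n}: mean = {obs_mean:.3f}, SD = {obs_sd:.3f}, sigma/sqrt(n) = {theory_sd:.3f}")
```

Because 60 is six times 10, the theoretical standard deviation for samples of size 60 is smaller by a factor of √6, or about 2.45. Strictly speaking, sampling without replacement from a population of size N multiplies σ/√n by √((N - n)/(N - 1)), but with N = 2000 this factor is close to 1 for both sample sizes.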
Deciding when a distribution should be approximately normal
By now, you should have a good idea of when you should expect a distribution to be approximately normal as a result of the Central Limit Theorem.
Recall the following rules of thumb that we have learned in class:
- The sum or average of n i.i.d. random variables should have approximately a normal distribution if n ≥ 30.
- A binomial distribution can be well approximated by a normal distribution if np ≥ 10 and n(1-p) ≥ 10. The same rule applies to a sample proportion.
Recall also that individual observations could have many different distributions (such as binomial, geometric, Poisson, exponential, uniform, or normal). If we make a histogram of data that come from, say, an exponential distribution, then the histogram will show an exponential shape, not a normal distribution, regardless of how many observations there are. Only averages (and therefore sums) of large numbers of observations are guaranteed to have approximately a normal distribution by the Central Limit Theorem.
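The binomial rule of thumb can be checked numerically. The Python sketch below tests the two conditions and compares an exact binomial probability with its normal approximation. The example values of n, p, and k, the function names, and the continuity correction of 0.5 are illustrative assumptions added for this sketch.

```python
import math
from statistics import NormalDist


def normal_approx_ok(n, p, threshold=10):
    """Rule of thumb: both n*p and n*(1 - p) must be at least `threshold`."""
    return n * p >= threshold and n * (1 - p) >= threshold


def binomial_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def normal_approx_cdf(k, n, p):
    """Normal approximation to P(X <= k), using a continuity correction."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    return NormalDist(mean, sd).cdf(k + 0.5)


# (n, p, k) triples chosen only for illustration.
for n, p, k in [(200, 0.3, 55), (40, 0.1, 2)]:
    exact = binomial_cdf(k, n, p)
    approx = normal_approx_cdf(k, n, p)
    print(f"n = {n}, p = {p}: rule satisfied = {normal_approx_ok(n, p)}, "
          f"exact = {exact:.4f}, normal approx = {approx:.4f}")
```

When the rule is satisfied, the exact and approximate probabilities should be close. When it is not, the gap is typically larger.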
For each of the questions below, a histogram is described. Indicate in each case whether you think the histogram should look approximately bell-shaped (normal), and give a brief explanation why (one sentence is probably sufficient). There are no data for these questions, so you will not need to use the computer to answer them.
- A police department records the number of 911 calls made each day of the year, and the 365 values are plotted in a histogram.
- The day before an election, fifty different polling organizations each sample 500 people and record the percentage who say they will vote for the Democratic candidate. The 50 values are plotted in a histogram.
- The fifty polling organizations also record the average age of the 500 people in their sample, and the 50 averages are plotted in a histogram.
- One hundred batteries are tested, and the lifetimes of the batteries are plotted in a histogram.
- Two hundred students in a statistics class each flip a coin 50 times and record the number of heads. The numbers of heads are plotted in a histogram.
- Two hundred students in a statistics class each roll a die 40 times and record the sum of the numbers they got on the 40 rolls. They make a histogram of the 200 sums.
- One thousand randomly chosen people report their annual salaries, and these salaries are plotted in a histogram.