Lab 6: Sampling distributions and the Central Limit Theorem


In this lab, you will explore the Central Limit Theorem, a result in probability theory that is the basis for most of the statistical inference techniques that we will study. The Central Limit Theorem states that if n is large and X1, X2, ..., Xn are independent and identically distributed (i.i.d.) random variables with expected value μ and standard deviation σ, then the mean of these random variables has approximately a normal distribution, with mean μ and standard deviation σ/√n. You will observe the Central Limit Theorem both by simulating random variables and by taking a sample from a real population.
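In symbols, the statement above can be restated as follows. The notation for the sample mean is introduced here only for convenience, and the numbers in the short example afterward are made up for illustration.

```latex
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i ,
\qquad
\bar{X}_n \ \text{is approximately}\ N\!\left(\mu,\ \frac{\sigma^{2}}{n}\right),
\qquad
\operatorname{SD}\!\left(\bar{X}_n\right) = \frac{\sigma}{\sqrt{n}} .
% Illustrative numbers only: with \sigma = 2 and n = 25,
% \operatorname{SD}(\bar{X}_{25}) = 2/\sqrt{25} = 0.4 .
```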

This lab should also help you understand when to expect a distribution to be normal. We have seen that individual observations could have many different distributions (such as binomial, geometric, Poisson, exponential, uniform, or normal). Only averages (and therefore sums) of large numbers of observations are guaranteed to be approximately normal because of the Central Limit Theorem.

Begin this lab by opening up a blank MINITAB worksheet. You will not look at data until a bit later on.

Normal Probability Plots

Normal probability plots are useful for determining whether a distribution is approximately normal. To see how they work, first take a sample of size 100 from a normal distribution by going to Calc --> Random Data --> Normal. Generate 100 rows of data and store the results in, say, C1. Make a histogram, and you should see approximately a bell-shaped curve, although it will not be perfectly bell-shaped because it is a relatively small sample. If you wish, you can get a normal curve superimposed on your histogram by drawing a histogram "With Fit" rather than a "Simple" histogram.

Now make a normal probability plot. Go to Graph --> Probability Plot, then click "OK" and select the appropriate variable to graph. There will be a point on the graph for each of the 100 simulated values. The x-coordinate of the point is the value itself. The y-coordinate is the percentile (which will be, say, 30 if 30 percent of the 100 numbers are below the number in question), but plotted on a scale so that the graph should look like a straight line if the data come from a normal distribution.

Next try simulating 100 values from the exponential distribution by going to Calc --> Random Data --> Exponential. Again make a histogram and a normal probability plot. This time the histogram should not look bell-shaped, and the normal probability plot should not look straight. Try the same thing for a uniform distribution, from which you can simulate using Calc --> Random Data --> Uniform. Finally, to see how things look when the data are closer to normal but not quite normal, simulate 500 values from the gamma distribution (don't worry that you don't know exactly what this is) by going to Calc --> Random Data --> Gamma. Enter 7 for the shape parameter and 1 for the scale parameter. Then plot a histogram (you should see a little skewness) and a normal probability plot. The curvature in the normal probability plot will likely be rather apparent, perhaps easier to see than the skewness in the histogram.

The normal probability plot is very helpful for deciding whether data come from a distribution that is approximately normal. A good way to check for normality is to see whether the normal probability plot looks straight. If it does not, then it is typically easiest to look at a histogram to determine in what way the data differ from normality.
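For readers who want to see the idea outside MINITAB, the construction of a normal probability plot can be sketched in Python. This is only an illustration under stated assumptions: the function names are hypothetical, the plotting positions (i - 0.5)/n are one common convention and may not match MINITAB's default, and the correlation of the plotted points is used here as a rough numerical stand-in for judging straightness by eye.

```python
import math
import random
from statistics import NormalDist, mean


def normal_plot_points(values):
    """Pair each sorted value with the standard normal quantile of its percentile."""
    xs = sorted(values)
    n = len(xs)
    # Assumption: plotting positions (i - 0.5) / n; other conventions exist.
    zs = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(xs, zs))


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def straightness(values):
    """Correlation of the plot's points; values near 1 suggest a straight line."""
    points = normal_plot_points(values)
    return pearson([x for x, _ in points], [z for _, z in points])


rng = random.Random(6)  # arbitrary seed so the run is repeatable
normal_sample = [rng.gauss(0, 1) for _ in range(100)]
expo_sample = [rng.expovariate(1.0) for _ in range(100)]

print("normal sample straightness:     ", round(straightness(normal_sample), 4))
print("exponential sample straightness:", round(straightness(expo_sample), 4))
```

A curved plot shows up as a correlation noticeably below 1. The number says nothing about the kind of departure from normality, which is why a histogram is still worth a look.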

You do not need to submit any of the graphs that you made in this section, but you will need to understand how to interpret normal probability plots to do the rest of the lab.

Observing the Central Limit Theorem by simulation

You have already taken a sample of 100 values from the exponential distribution and examined a histogram. Of course, it did not look like a normal distribution, which should not have been a surprise because the numbers came from an exponential distribution, not a normal distribution. However, the Central Limit Theorem tells us that averages of large numbers of i.i.d. random variables from any distribution with a finite standard deviation should be approximately normally distributed. Here you will investigate this by simulating averages of exponentially distributed random variables.

First take 100 samples of size 2 from the exponential distribution. To do this, go to Calc --> Random Data --> Exponential as before, but generate only 2 rows of data. Also, give a range of 100 columns in which to store the data by typing C1-C100. To find the means of your 100 samples, go to Stat --> Basic Statistics --> Store Descriptive Statistics, type C1-C100 in the "Variables" box (if those were the columns you used), then click on the "Statistics" box and make sure only the "Mean" box is checked, then click "OK" twice. MINITAB will store the sample means in different columns, probably C101-C200. To put them in one column, go to Data --> Transpose Columns, type in the column numbers of the columns in which the means are stored, and then click on the bubble "after last column in use". Now your 100 sample means are in a column (probably C202) and can be treated as a variable.
  1. Provide a histogram and a normal probability plot of the 100 sample means. Is the distribution of the sample mean approximately normal? Is it closer to being bell-shaped than the exponential distribution itself?

  2. Repeat the above steps with samples of size 5, 25, and 50. How does the shape of the distribution of sample means change as the sample size gets larger? Relate your conclusions to the Central Limit Theorem. Provide some plots to support your conclusions. (Remember that in MINITAB you can delete columns you no longer need by highlighting the columns and going to Edit --> Delete Cells.)
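The simulation described above can also be sketched in Python as a cross-check on the MINITAB steps. This is a minimal sketch, not part of the required lab work. It assumes an exponential distribution with mean 1 (rate 1.0), and the function names are hypothetical. Sample skewness is used as a rough numerical summary of how far the shape is from symmetric.

```python
import random
from statistics import mean, stdev


def simulate_sample_means(sample_size, num_samples=100, rate=1.0, seed=None):
    """Draw num_samples samples of the given size and return their means."""
    rng = random.Random(seed)
    return [
        mean([rng.expovariate(rate) for _ in range(sample_size)])
        for _ in range(num_samples)
    ]


def skewness(values):
    """Moment-based sample skewness; 0 for a perfectly symmetric sample."""
    m = mean(values)
    m2 = mean([(v - m) ** 2 for v in values])
    m3 = mean([(v - m) ** 3 for v in values])
    return m3 / m2 ** 1.5


# Sample sizes used in this section of the lab.
means_by_size = {n: simulate_sample_means(n, seed=n) for n in (2, 5, 25, 50)}

for n, means in means_by_size.items():
    print(
        f"n={n:2d}  mean of means={mean(means):.3f}  "
        f"SD of means={stdev(means):.3f}  "
        f"theory SD={1 / n ** 0.5:.3f}  skewness={skewness(means):.2f}"
    )
```

With rate 1.0 the population mean and standard deviation are both 1, so the theoretical standard deviation of the sample mean is 1/√n. The printed values will vary with the seed.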

Normal approximation to the binomial

At first glance, it may not appear that a binomial random variable should be approximately normal, because it is a single random variable rather than a sum or average. However, if X is the number of successes in n trials, then X = X1 + ... + Xn, where each Xi equals 1 if the ith trial is a success and 0 if the ith trial is a failure. Therefore, the Central Limit Theorem tells us that the binomial distribution should be well approximated by the normal distribution when the sample size is large.
  1. Simulate 500 random variables from a binomial distribution with n = 15 and p = .1. (To do this, go to Calc --> Random Data --> Binomial, and enter 500 for the number of rows of data to generate, 15 for the number of trials, and .1 for the event probability.) Based on your simulation, does the binomial distribution appear to be approximately normal?

  2. Repeat the above procedures with n = 40, n = 100, and n = 1000 while keeping p = .1. Comment on the validity of the normal approximation as n gets larger. Provide plots to support your conclusions.

  3. The textbook says that the normal approximation to the binomial distribution is fairly accurate when np ≥ 10 and n(1-p) ≥ 10. Based on your investigations in the previous question, does this rule of thumb seem to be reasonable?
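As a complement to the simulation, the quality of the normal approximation can be measured exactly. The sketch below is in Python. It compares the exact binomial cumulative probabilities with a normal distribution that has the same mean np and standard deviation √(np(1-p)). It uses a continuity correction (evaluating the normal curve at k + 0.5), which is an assumption added here and is not mentioned in this lab. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def binomial_pmf(k, n, p):
    """Exact P(X = k) for X ~ Binomial(n, p), computed on the log scale."""
    log_pmf = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log(1 - p)
    )
    return math.exp(log_pmf)


def max_cdf_error(n, p):
    """Largest gap between P(X <= k) and the normal approximation over all k."""
    approx = NormalDist(mu=n * p, sigma=math.sqrt(n * p * (1 - p)))
    cdf = 0.0
    worst = 0.0
    for k in range(n + 1):
        cdf += binomial_pmf(k, n, p)
        # Continuity correction: compare with the normal CDF at k + 0.5.
        worst = max(worst, abs(cdf - approx.cdf(k + 0.5)))
    return worst


p = 0.1
for n in (15, 40, 100, 1000):
    print(
        f"n={n:4d}  np={n * p:6.1f}  n(1-p)={n * (1 - p):6.1f}  "
        f"max CDF error={max_cdf_error(n, p):.4f}"
    )
```

Reading the np column next to the error column gives one way to judge the textbook's rule of thumb for these values of n.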

The sampling distribution of a sample mean

To load the data set NBC, click here. We have data on the length of visits to the msnbc.com web site. Each number in the first column is an estimate of the total number of pages seen by a user during the visit to the site. These data were obtained from the Journal of Statistics Education Data Archive. The data set available there (which itself is just a subset of the data used for the full study) contains 50,000 observations, but we will work with just 2000. For the purposes of this activity, we will pretend that these 2000 observations constitute the entire population of visits to the web site, rather than just a sample.
  1. Examine the data on the 2000 web site visits. What is the maximum value (the largest number of web pages visited)? What are the mean and standard deviation? (You will think of these numbers as a population mean and a population standard deviation throughout this activity). How would you describe the shape of the distribution? Is it approximately normal?

We will investigate what happens when we take samples from this "population". Let's first consider taking samples of size 10. In the column C2, already labeled "10Samp1", take a sample from this population of size 10. To do this, go to Calc --> Random Data --> Sample from Columns. Type that you will sample 10 rows, select the variable "Pages" to sample from, and store your data in column C2. Obtain two more samples this way, storing the values in columns C3 and C4.
  1. What are the sample means of your three samples? How do these compare to the population mean?
To get a better sense of the sampling distribution of the sample mean, we will examine 100 samples of size 10. This is rather tedious to do by hand, so samples 4 through 100 are already provided for you in columns C5 through C101. Put the 100 sample means in one column by following the same steps you followed above for your 100 samples from an exponential distribution.
  1. Consider your 100 sample means. What are the mean and standard deviation of these numbers? How do these numbers compare to what you would theoretically expect? (Hint: because you know the population mean and standard deviation, you should be able to compute the expected value and standard deviation of the mean of a sample of size 10.)

  2. Is the distribution of the 100 sample means approximately normal? If not, is it closer to normal than the distribution of the original population? Examine both a histogram and a normal probability plot of the data to arrive at your answer. Make sure to provide the plots with your write-up.
Next, instead of samples of size 10, we will work with samples of size 60. Start by putting three samples of size 60 in the columns C102-C104. Additional samples are already located in columns C105-C201.
  1. What are the sample means of the three samples you took?

  2. Now consider all 100 sample means. What are the mean and standard deviation of these numbers? How do these numbers compare to what you would theoretically expect? Do the sample means of the samples of size 60 tend to be closer to the population mean than the sample means of the samples of size 10?

  3. This time, is the distribution of the 100 sample means approximately normal? Discuss your answer in the context of the Central Limit Theorem.
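The comparison between observed and theoretical values in this section can be sketched in Python. This is an illustration only: the "population" below is synthetic, right-skewed data generated for the example and is NOT the msnbc.com data, and the function and variable names are hypothetical. Sampling is done without replacement, and with a population of 2000 the finite-population effect on the standard deviation is small for these sample sizes.

```python
import math
import random
from statistics import mean, pstdev, stdev

rng = random.Random(2000)  # arbitrary seed so the run is repeatable

# Synthetic stand-in population: 2000 skewed positive integer "page counts".
population = [1 + int(rng.expovariate(0.5)) for _ in range(2000)]
mu = mean(population)
sigma = pstdev(population)  # population SD, since we treat this as the population


def sample_means(pop, sample_size, num_samples=100):
    """Means of num_samples random samples drawn without replacement."""
    return [mean(rng.sample(pop, sample_size)) for _ in range(num_samples)]


results = {n: sample_means(population, n) for n in (10, 60)}

print(f"population mean={mu:.3f}  population SD={sigma:.3f}")
for n, means in results.items():
    print(
        f"n={n:2d}  mean of sample means={mean(means):.3f}  "
        f"SD of sample means={stdev(means):.3f}  "
        f"theory: mean={mu:.3f}, SD={sigma / math.sqrt(n):.3f}"
    )
```

The same pattern applies to the real data: compute μ and σ from the 2000 visits, then compare the mean and standard deviation of the 100 sample means with μ and σ/√n.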

Deciding when a distribution should be approximately normal

By now you should have a good idea of when you should expect a distribution to be approximately normal as a result of the Central Limit Theorem. For each of the questions below, a histogram is described. Indicate in each case whether you think the histogram should look approximately like a bell-shaped (normal) curve, and give a brief explanation why (one sentence is probably sufficient). There are no data for these questions, so you will not need to use the computer to answer them.
  1. Two hundred students in a statistics class each flip a coin 40 times and record the number of heads. The numbers of heads are plotted in a histogram.

  2. Two hundred students in a statistics class each roll a die 40 times and record the sum of the numbers they got on the 40 rolls. They make a histogram of the 200 sums.

  3. One hundred CEOs report their annual salaries, and these salaries are plotted in a histogram.

  4. A police department in a major city keeps track of the number of murders in the city each day for a year and records the 365 values in a histogram.

  5. The day before an election, fifty different polling organizations each sample 1000 people and record the percentage who say they will vote for the Democratic candidate. The 50 values are plotted in a histogram.

  6. The fifty polling organizations also record the average age of the 1000 people in their sample, and the 50 averages are plotted in a histogram.

Remember that if you discussed this assignment with anyone other than your instructor or TA, then you should add a section called "Acknowledgments" at the end of your report, indicating from whom you received help.