Lab 6: Sampling distributions and the Central Limit Theorem

In this lab, you will explore the Central Limit Theorem, a result in probability theory that is the basis for most of the statistical inference techniques that we will study. The Central Limit Theorem states that if n is large and X1, X2, ..., Xn are independent and identically distributed (i.i.d.) random variables with expected value μ and standard deviation σ, then the mean of these random variables has approximately a normal distribution, with mean μ and standard deviation σ/√n. You will observe the Central Limit Theorem both by simulating random variables and by taking samples from a real population. This lab should also give you experience in judging when to expect a distribution to be normal.
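
The statement above can be checked numerically outside Minitab. The following is a minimal Python sketch, not part of the lab procedure: the helper name sample_means, the choice of the exponential distribution with mean 1 and standard deviation 1, the sample size of 50, and the 2000 repetitions are all assumptions made for illustration. It compares the mean and standard deviation of many simulated sample means with μ and with σ divided by the square root of n.

```python
import random
import statistics

def sample_means(draw, n, reps, rng):
    """Return `reps` sample means, each computed from `n` i.i.d. draws of draw(rng)."""
    return [statistics.fmean([draw(rng) for _ in range(n)]) for _ in range(reps)]

rng = random.Random(0)   # fixed seed so the run is repeatable
mu, sigma = 1.0, 1.0     # exponential with rate 1 has mean 1 and sd 1
n = 50                   # size of each sample (illustrative choice)

means = sample_means(lambda r: r.expovariate(1.0), n, 2000, rng)
clt_sd = sigma / n ** 0.5  # standard deviation predicted by the theorem

print(f"mean of sample means: {statistics.fmean(means):.3f} (theory: {mu:.3f})")
print(f"sd of sample means:   {statistics.stdev(means):.3f} (theory: {clt_sd:.3f})")
```

The two printed pairs should agree closely, though not exactly, because the simulation itself is random.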

Begin this lab by opening up a blank Minitab worksheet. You will not look at data until a bit later on.

Normal Probability Plots

Normal probability plots are useful for determining whether a distribution is approximately normal. To see how they work, first take a sample of size 100 from a normal distribution by going to Calc --> Random Data --> Normal. Generate 100 rows of data and store the results in, say, C1. Make a histogram, and you should see approximately a bell-shaped curve, although it will not be perfectly bell-shaped because it is a relatively small sample. If you wish, you can get a normal curve superimposed on your histogram by drawing a histogram "With Fit" rather than a "Simple" histogram.

Now make a normal probability plot. Go to Graph --> Probability Plot, then click "OK" and select the appropriate variable to graph. There will be a point on the graph for each of the 100 simulated values. The x-coordinate of the point is the value itself. The y-coordinate is the percentile (which will be, say, 30 if 30 percent of the 100 numbers are below the number in question), but plotted on a scale chosen so that the graph should look like a straight line if the data come from a normal distribution. Next, try simulating 100 values from the exponential distribution by going to Calc --> Random Data --> Exponential. Again make a histogram and a normal probability plot. This time the histogram should not look bell-shaped, and the normal probability plot should not look straight. Try the same thing for a uniform distribution, from which you can simulate using Calc --> Random Data --> Uniform. Finally, to see how things look when the data are closer to normal but not quite normal, simulate 500 values from the gamma distribution (don't worry that you don't know exactly what this is) by going to Calc --> Random Data --> Gamma. Enter 7 for the shape parameter and 1 for the scale parameter. Then plot a histogram (you should see a little skewness) and a normal probability plot. The curvature in the normal probability plot will likely be rather apparent, perhaps easier to see than the skewness in the histogram.

The normal probability plot is very helpful for deciding whether data come from a distribution that is approximately normal. A good way to check for normality is to see whether the normal probability plot looks straight. If it does not, then it is typically easiest to look at a histogram to determine in what way the data differ from normality.
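
To make the construction of a normal probability plot concrete, here is a rough Python sketch of the idea. It is not Minitab's implementation: the plotting position (i - 0.5)/n is one common convention and is an assumption here, and the helper names normal_plot_points and straightness are hypothetical. Each sorted value is paired with the standard normal quantile of its percentile, and the correlation between the two lists measures how straight the plot is.

```python
import random
from statistics import NormalDist, fmean

def normal_plot_points(data):
    """Pair each sorted value with the standard normal quantile of its percentile."""
    xs = sorted(data)
    n = len(xs)
    # (i - 0.5) / n is an assumed plotting position; other conventions exist.
    zs = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return xs, zs

def straightness(data):
    """Correlation between sorted values and normal quantiles (1.0 = a straight line)."""
    xs, zs = normal_plot_points(data)
    mx, mz = fmean(xs), fmean(zs)
    sxz = sum((x - mx) * (z - mz) for x, z in zip(xs, zs))
    sxx = sum((x - mx) ** 2 for x in xs)
    szz = sum((z - mz) ** 2 for z in zs)
    return sxz / (sxx * szz) ** 0.5

rng = random.Random(1)
normal_data = [rng.gauss(0, 1) for _ in range(100)]
expo_data = [rng.expovariate(1.0) for _ in range(100)]

print(f"normal data:      straightness = {straightness(normal_data):.3f}")
print(f"exponential data: straightness = {straightness(expo_data):.3f}")
```

A value very close to 1 corresponds to a plot that looks straight; the exponential data should give a visibly lower value, matching the curvature you see in Minitab.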

You do not need to submit any of the graphs that you made in this section, but you will need to understand how to interpret normal probability plots to do the rest of the lab.

Observing the Central Limit Theorem by simulation

You have already taken a sample of 100 values from the exponential distribution and examined a histogram. Of course, it did not look like a normal distribution, which should not have been a surprise because the numbers came from an exponential distribution, not a normal distribution. However, the Central Limit Theorem tells us that averages of large numbers of random variables from any distribution with a finite standard deviation should be approximately normally distributed. Here you will investigate this by simulating averages of exponentially distributed random variables.

First take 100 samples of size 3 from the exponential distribution. To do this, go to Calc --> Random Data --> Exponential as before, but generate only 3 rows of data. Also, give a range of 100 columns in which to store the data by typing C1-C100. To find the means of your 100 samples, go to Stat --> Basic Statistics --> Store Descriptive Statistics, type C1-C100 in the "Variables" box (if those were the columns you used), then click on the "Statistics" box and make sure only the "Mean" box is checked, then click "OK" twice. Minitab will store the sample means in different columns, probably C101-C200. To put them in one column, go to Data --> Transpose Columns, type in the column numbers of the columns in which the means are stored, and then click on the bubble "after last column in use". Now your 100 sample means are in a column (probably C202) and can be treated as a variable.
  1. Provide a histogram and a normal probability plot of the 100 sample means. Is the distribution of the sample mean approximately normal? Is it closer to being bell-shaped than the exponential distribution itself?

  2. Repeat the above steps with 100 samples of size 6, 100 samples of size 25, and 100 samples of size 50. How does the shape of the distribution of your 100 sample means change as the sample size gets larger? Relate your conclusions to the Central Limit Theorem. Provide some plots to support your conclusions. (Remember that in Minitab you can delete columns you no longer need by highlighting the columns and going to Edit --> Delete Cells.)
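
The simulation described in this section can also be sketched in Python, as a rough parallel to the Minitab steps rather than a replacement for them. The exponential distribution with mean 1, the seed, and the helper names skewness and exponential_sample_means are assumptions made for this illustration. The sketch summarizes the shape of each set of 100 sample means by its skewness, which is near 0 for a symmetric, bell-shaped distribution.

```python
import random
from statistics import fmean

def skewness(values):
    """Sample skewness: near 0 for a symmetric shape, positive for a long right tail."""
    m = fmean(values)
    m2 = fmean([(v - m) ** 2 for v in values])
    m3 = fmean([(v - m) ** 3 for v in values])
    return m3 / m2 ** 1.5

def exponential_sample_means(size, reps, rng):
    """Means of `reps` samples, each made of `size` exponential draws with mean 1."""
    return [fmean([rng.expovariate(1.0) for _ in range(size)]) for _ in range(reps)]

rng = random.Random(6)
for size in (3, 6, 25, 50):
    means = exponential_sample_means(size, 100, rng)
    print(f"samples of size {size:2d}: skewness of 100 means = {skewness(means):.2f}")
```

With only 100 sample means per size, the skewness values are noisy, so the trend across sizes may not be perfectly monotone in a single run. Plots remain the better tool for judging shape.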

Normal approximation to the binomial

At first glance, it may not appear that a binomial random variable should be approximately normal, because it is a single random variable rather than a sum or average. However, if X is the number of successes in n trials, then X = X1 + ... + Xn, where each Xi equals 1 if the ith trial is a success and 0 if the ith trial is a failure. Therefore, the Central Limit Theorem tells us that the binomial distribution should be well approximated by the normal distribution when the sample size is large.
  1. Simulate 400 random variables from a binomial distribution with n = 20 and p = .1. (To do this, go to Calc --> Random Data --> Binomial, and enter 400 for the number of rows of data to generate, 20 for the number of trials, and .1 for the event probability.) Based on your simulation, does the binomial distribution appear to be approximately normal?

  2. Repeat the above procedures with n = 40, n = 100, and n = 800 while keeping p = .1. Comment on the validity of the normal approximation as n gets larger. Provide plots to support your conclusions.

  3. Our textbook says that the normal approximation to the binomial distribution is fairly accurate when np ≥ 10 and n(1-p) ≥ 10. Based on your investigations in the previous question, does this rule of thumb seem to be reasonable?
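
As a complement to the simulations, the quality of the normal approximation can be measured exactly. The Python sketch below is an illustration under stated assumptions: it uses exact binomial probabilities instead of simulated values, applies the usual continuity correction (comparing at k + 0.5), and the function name max_cdf_error is hypothetical. It reports the largest gap between the binomial cumulative distribution and its normal approximation for the values of n used above with p = .1.

```python
from math import comb
from statistics import NormalDist

def max_cdf_error(n, p):
    """Largest gap between the binomial CDF and its normal approximation."""
    approx = NormalDist(mu=n * p, sigma=(n * p * (1 - p)) ** 0.5)
    cdf, worst = 0.0, 0.0
    for k in range(n + 1):
        cdf += comb(n, k) * p ** k * (1 - p) ** (n - k)  # exact P(X = k)
        # Compare at k + 0.5, the usual continuity correction.
        worst = max(worst, abs(cdf - approx.cdf(k + 0.5)))
    return worst

for n in (20, 40, 100, 800):
    print(f"n = {n:3d}  np = {n * 0.1:5.1f}  max CDF error = {max_cdf_error(n, 0.1):.4f}")
```

Comparing the error at the sample size where np first reaches 10 with the errors at smaller sample sizes gives one way to judge the textbook's rule of thumb.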

The sampling distribution of a sample mean

Now open the data set MSNBC, which is available in TED.

We have data on the length of visits to the msnbc.com web site. Each number in the first column is an estimate of the total number of pages seen by a user during the visit to the site. These data were obtained from the Journal of Statistics Education Data Archive. The data set available there (which itself is just a subset of the data used for the full study) contains 50,000 observations, but we will work with just 2000. For the purposes of this activity, we will pretend that these 2000 observations constitute the entire population of visits to the web site, rather than just a sample.
  1. Examine the data on the 2000 web site visits. What is the maximum value (the largest number of web pages visited)? What are the mean and standard deviation? (You will think of these numbers as a population mean and a population standard deviation throughout this activity). How would you describe the shape of the distribution? Is it approximately normal?
We will investigate what happens when we take samples from this "population". Let's first consider taking samples of size 10. In the column C2, already labeled "10Samp1", take a sample of size 10 from this population. To do this, go to Calc --> Random Data --> Sample from Columns. Enter 10 as the number of rows to sample, select the variable "Pages" to sample from, and store your data in column C2. Obtain two more samples this way, storing the values in columns C3 and C4. (Note: you will be sampling without replacement, but this does not matter much because the population size is so much larger than the sample size.)
  1. What are the sample means of your three samples? How do these compare to the population mean?
To get a better sense of the sampling distribution of the sample mean, we will examine 100 samples of size 10. This is rather tedious to do by hand, so samples 4 through 100 are already provided for you in columns C5 through C101. Put the 100 sample means in one column by following the same steps you followed above for your 100 samples from an exponential distribution.
  1. Consider your 100 sample means. What are the mean and standard deviation of these numbers?

  2. How do these numbers that you calculated in the previous question compare to what you would theoretically expect? (Hint: because you know the population mean and standard deviation, you should be able to compute the expected value and standard deviation of the mean of a sample of size 10.)

  3. Is the distribution of the 100 sample means approximately normal? If not, is it closer to normal than the distribution of the original population? Examine both a histogram and a normal probability plot of the data to arrive at your answer. Make sure to provide the plots with your write-up.
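
The comparison between observed and theoretical values can be sketched in Python as follows. The MSNBC data are not reproduced here, so the list population is a hypothetical right-skewed stand-in for the "Pages" column; the seed and the helper name means_of_samples are likewise assumptions. The theoretical values are the population mean μ and the population standard deviation σ divided by the square root of the sample size.

```python
import random
import statistics

rng = random.Random(2)
# Hypothetical stand-in for the "Pages" column: 2000 right-skewed counts.
population = [1 + int(rng.expovariate(0.25)) for _ in range(2000)]

mu = statistics.fmean(population)
sigma = statistics.pstdev(population)  # population sd: divides by N, not N - 1

def means_of_samples(pop, size, reps, rng):
    """Sample means from `reps` samples, each drawn without replacement."""
    return [statistics.fmean(rng.sample(pop, size)) for _ in range(reps)]

means10 = means_of_samples(population, 10, 100, rng)
theory_sd = sigma / 10 ** 0.5  # sd of the mean of a sample of size 10

print(f"population: mean {mu:.2f}, sd {sigma:.2f}")
print(f"theory:     mean {mu:.2f}, sd {theory_sd:.2f}")
print(f"observed:   mean {statistics.fmean(means10):.2f}, sd {statistics.stdev(means10):.2f}")
```

The observed values come from only 100 sample means, so they should be near the theoretical ones without matching them exactly.
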
Next, instead of samples of size 10, we will work with samples of size 60. Start by putting three samples of size 60 in the columns C102-C104. Additional samples are already located in columns C105-C201.
  1. What are the sample means of the three samples you took?

  2. Now consider all 100 sample means. What are the mean and standard deviation of these numbers?

  3. How do the mean and standard deviation of your 100 sample means compare to what you would theoretically expect?

  4. Do the sample means of the samples of size 60 tend to be closer to the population mean or further from the population mean than the sample means of the samples of size 10? Explain your answer. You may wish to examine side-by-side boxplots of the means of the samples of size 10 and the means of the samples of size 60.

  5. This time, is the distribution of the 100 sample means approximately normal? Discuss your answer in the context of the Central Limit Theorem.

Deciding when a distribution should be approximately normal

By now, you should have a good idea of when you should expect a distribution to be approximately normal as a result of the Central Limit Theorem. Recall the following rules of thumb that we have learned in class: Recall also that individual observations could have many different distributions (such as binomial, geometric, Poisson, exponential, uniform, or normal). If we make a histogram of data that come from, say, an exponential distribution, then the histogram will show an exponential shape, not a normal distribution, regardless of how many observations there are. Only averages (and therefore sums) of large numbers of observations are guaranteed to have approximately a normal distribution by the Central Limit Theorem.
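
The distinction between individual observations and averages can be illustrated with a short Python sketch. The exponential distribution with mean 1, the seed, and the helper name skewness are assumptions chosen for illustration. Collecting more raw observations does not make their histogram more bell-shaped, so their skewness stays large, whereas averages of 50 observations have skewness much closer to 0.

```python
import random
from statistics import fmean

def skewness(values):
    """Sample skewness: near 0 for a symmetric shape, positive for a long right tail."""
    m = fmean(values)
    m2 = fmean([(v - m) ** 2 for v in values])
    m3 = fmean([(v - m) ** 3 for v in values])
    return m3 / m2 ** 1.5

rng = random.Random(4)

# Raw observations: the shape stays exponential however many we collect.
for count in (100, 10_000, 100_000):
    raw = [rng.expovariate(1.0) for _ in range(count)]
    print(f"{count:6d} raw observations: skewness = {skewness(raw):.2f}")

# Averages of 50 observations each: the shape moves toward the normal curve.
averages = [fmean([rng.expovariate(1.0) for _ in range(50)]) for _ in range(10_000)]
print(f" 10000 averages of 50:  skewness = {skewness(averages):.2f}")
```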

For each of the questions below, a histogram is described. Indicate in each case whether you think the histogram should look approximately like a bell-shaped (normal) curve, and give a brief explanation why (one sentence is probably sufficient). There are no data for these questions, so you will not need to use the computer to answer them.
  1. A police department records the number of 911 calls made each day of the year, and the 365 values are plotted in a histogram.

  2. The day before an election, fifty different polling organizations each sample 500 people and record the percentage who say they will vote for the Democratic candidate. The 50 values are plotted in a histogram.

  3. The fifty polling organizations also record the average age of the 500 people in their sample, and the 50 averages are plotted in a histogram.

  4. One hundred batteries are tested, and the lifetimes of the batteries are plotted in a histogram.

  5. Two hundred students in a statistics class each flip a coin 50 times and record the number of heads. The numbers of heads are plotted in a histogram.

  6. Two hundred students in a statistics class each roll a die 40 times and record the sum of the numbers they got on the 40 rolls. They make a histogram of the 200 sums.

  7. One thousand randomly chosen people report their annual salaries, and these salaries are plotted in a histogram.