## Lab 5: Birth times and birth weights (Probability distributions)

In this lab, you will investigate the genders, birth times, and birth weights of babies. One goal of this lab is to help you gain experience in determining when to expect different probability distributions to arise in practice. In the course, we have introduced six probability distributions: geometric, binomial, Poisson, uniform, exponential, and normal. A one-page handout reviewing these six distributions is also available.

### The Data

First, open the data set BABIES, which is available in TritonEd.

We have data on 300 babies born at Salinas Valley Memorial Healthcare System in Salinas, California during the months of January through June, 2009. The data were obtained from the WebNursery web site. Information at the Baby Name Facts web site was used to help with distinguishing male and female names. A few babies for which the gender could not be determined from the name, or other information was unavailable, were excluded from the data set. Twins were also excluded from the data set. The data set includes the following columns. The first eight columns have 300 rows and provide information about each of the 300 babies. The last two columns have 181 rows and give the number of babies born on each of the 181 days between 1/1/2009 and 6/30/2009.

| Variable Name | Description |
|---|---|
| Date | The date the baby was born |
| Time | The time of the day that the baby was born, measured in hours after midnight |
| Weight | The baby's weight, in ounces |
| Gender | B = boy, G = girl |
| NumtoB | For boys, the number of births (including the current one) since the previous boy |
| NumtoG | For girls, the number of births (including the current one) since the previous girl |
| NumGirls | The number of the previous five babies born that were girls |
| Interval | The length of time, in hours, since the previous birth |
| Day | A day between 1/1/2009 and 6/30/2009 |
| Number | The number of births in the data set on that day |

### Waiting for boys or girls (Geometric Distribution)

Suppose we have independent Bernoulli trials, each resulting in success with probability p and failure with probability 1 - p. Then the number of trials that it takes to get a success has the geometric distribution with parameter p. The number of times we have to toss a coin until we get a head (counting the toss that lands heads) has the geometric distribution with p = 1/2. If a baseball player gets a hit 30 percent of the time that he bats, the number of times that he bats up to and including his next hit has the geometric distribution with p = 0.3. If we assume that each baby in our data set is independently a boy or a girl with probability 1/2 each, then the number of births up to and including the next boy (or girl) should have the geometric distribution with p = 1/2.

We will first do a quick check to see if boys and girls appear to be equally likely. Go to Stat --> Tables --> Tally Individual Variables, select the variable "Gender", and click "OK". Minitab outputs how many times each value (in this case B or G) appears in the column. This command will be useful throughout the lab.

In Minitab Express, the command is Statistics --> Summary Statistics --> Tally.

1. How many boys and how many girls are there in the data set?

Now consider the variables "NumtoB" and "NumtoG". To understand how these variables were computed from the "Gender" variable, consider, for example, the eighth baby, which was a boy. The third baby was the previous boy, but then we had to wait for five more babies (numbers 4, 5, 6, 7, and 8) to get the next boy. Therefore, the NumtoB is 5.
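The counting rule just described can be sketched in a few lines of Python. This is only an illustration: the function name `births_since_previous` and the short gender sequence are made up (the sequence is arranged so that babies 3 and 8 are boys with only girls in between), and counting the first boy from the start of the list is an assumption, not something the data set documents.

```python
def births_since_previous(genders, target):
    """For each baby of the target gender, count the births since the
    previous baby of that gender, including the current birth.
    Babies of the other gender get None."""
    result = []
    last_index = -1  # assumption: count the first target birth from the start of the list
    for i, g in enumerate(genders):
        if g == target:
            result.append(i - last_index)
            last_index = i
        else:
            result.append(None)
    return result


# Hypothetical sequence, NOT the BABIES data: babies 3 and 8 are boys,
# and babies 4 through 7 are girls.
example_genders = ["G", "B", "B", "G", "G", "G", "G", "B"]
num_to_b = births_since_previous(example_genders, "B")
print(num_to_b[7])  # value for the eighth baby
```

For the eighth baby the sketch counts babies 4, 5, 6, 7, and 8, which matches the value of 5 worked out above.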

We will investigate whether the variables "NumtoB" and "NumtoG" really do follow approximately a geometric distribution with p = 1/2. To do this, we will compare what we actually observe with what we would expect to observe if the geometric model were correct. To tally what we actually observed, go again to Stat --> Tables --> Tally Individual Variables and this time select the variables "NumtoB" and "NumtoG".

2. How many times did we have to wait for just one baby to get a boy? How many times did we have to wait for two babies to get a boy? Three? Four? Five? Six? Seven? Answer the same questions for girls. It is probably best to display your answers in a table, similar to what Minitab displays.

To figure out what we should expect, type the numbers 1, 2, ..., 7 into one column. Go to Calc --> Probability Distributions --> Geometric, click the bubble at the top that says "Probability", type the value for p in the box labeled "Event probability", enter the column in which you typed the numbers 1, 2, ..., 7 in the box "Input Column", choose another column in the box "Optional storage" to record the output, and click "OK". Notice that, for example, the first three values you get are 1/2, 1/4, and 1/8 because if X has a geometric distribution with p = 1/2, then P(X = 1) = 1/2, P(X = 2) = 1/4, and P(X = 3) = 1/8. Then to figure out how many of each value we expect, we need to multiply these numbers by the total number of boys or girls in our data set, which you can do by hand or by going to Calc --> Calculator. For example, if there were 100 boys, then we would expect to wait for one baby to get a boy 50 times, for two babies 25 times, for three babies 12.5 times, and so on.
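If you want to check the Minitab output by hand, the same calculation can be sketched in Python. The function name `geometric_pmf` is just an illustrative choice, and the total of 100 boys is the hypothetical figure from the example above, not the count in the BABIES data.

```python
def geometric_pmf(k, p):
    """P(X = k): the first success happens on trial k (k = 1, 2, ...)."""
    return (1 - p) ** (k - 1) * p


n_boys = 100  # hypothetical total from the example above; replace with your own tally
p = 0.5

# Expected number of times we wait k births for a boy, for k = 1, ..., 7.
expected_waits = {k: n_boys * geometric_pmf(k, p) for k in range(1, 8)}
for k, count in expected_waits.items():
    print(k, count)
```

The first three expected counts are 50, 25, and 12.5, in agreement with the example in the text.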

In Minitab Express, to figure out what we should expect, you should also type the numbers 1, 2, ..., 7 into one column. Go to Statistics --> Probability Distributions --> Probability Density Function. Under "Form of Input", select "A column of values" and then under "Values in" select the column in which you typed the numbers 1, 2, ..., 7. Then under "Distribution", select "Geometric", and type the value for p in the box labeled "Event probability". Under "Output", choose "Store probability density values in a column", and then click "OK". Recall also that in Minitab Express, you use Data --> Formula in place of Calc --> Calculator.

3. How do the numbers you observed in question 2 compare to what you would expect if these numbers followed a geometric distribution with p = 1/2? To answer this question, make a table similar to the one you made for question 2 but with the expected numbers rather than the actual numbers. Do your data roughly agree with what you expected? (If you wish to look at a graph rather than just comparing the numbers in the table, try typing the 7 observed numbers in one column and the 7 expected numbers in another column. Then go to Graph --> Bar Chart, choose the option that bars represent "Values from a table", then select "Cluster" under "Two-way table" and click OK. Choose the columns in which you placed the observed and expected numbers as "Graph variables" and put the column in which you have the numbers 1 through 7 in the "Row labels" box. Then click OK. You will get side-by-side bar charts of the observed and expected numbers.)

In Minitab Express, you would go to Graphs --> Bar Chart, then select that the values represent "Summarized values for each category in a table", then select "Clustered" under "Two-way table". Choose the columns in which you placed the observed and expected numbers as "Summary variables" and put the column in which you have the numbers 1 through 7 in the "Column of row labels" box. Then click OK.

4. You should have noticed that the NumtoB variable once took the value 10. Do you think that this unusual value means there is something wrong with the geometric model, or do you think this was just a chance event?

### Numbers of boys and girls (Binomial Distribution)

Again, suppose we have independent Bernoulli trials, each resulting in success with probability p and failure with probability 1 - p. Then the number of successes in n trials has the binomial distribution with parameters n and p. The number of heads in 5 coin tosses has the binomial distribution with n = 5 and p = 1/2. If a baseball player gets a hit 30 percent of the time, the number of hits that he gets if he bats 10 times has a binomial distribution with n = 10 and p = 0.3. If we assume that each baby in our data set is independently a boy or a girl with probability 1/2 each, then if we split the babies into groups of size 5, the number of boys or girls in a group should have a binomial distribution with n = 5 and p = 1/2.

We have split the babies into groups of five and counted the number of girls in each group. The relevant numbers appear in the column "NumGirls". Make sure that you understand how these numbers were computed. For example, there is a 2 in row 5 because there were two girls among the first five babies. There is a 2 in row 10 because there were also two girls among the next five babies, and so on.

5. Record the number of groups that have zero, one, two, three, four, and five girls.

6. Compare these numbers to what you would expect if they followed the binomial model. (Hint: to get Minitab to help you compute the expected numbers, type the numbers 0, 1, 2, 3, 4, 5 in one column, and use the Calc --> Probability Distributions --> Binomial command, which you are familiar with from Lab 4.) Do the data match the binomial distribution well? Again, you can either compare the numbers in your tables or make side-by-side bar charts as described in question 3 above.

In Minitab Express, the relevant command is Statistics --> Probability Distributions --> Probability Density Function, and you will select the Binomial distribution.
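As a cross-check on the Minitab calculation, the expected counts under the binomial model can be sketched in Python. The helper name `binomial_pmf` is illustrative, and the total of 32 groups is a hypothetical value chosen so the arithmetic comes out in whole numbers; substitute the number of groups in your own tally.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k): exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n_groups = 32  # hypothetical number of groups of five, NOT the value for the BABIES data

# Expected number of groups containing 0, 1, ..., 5 girls when p = 1/2.
expected_groups = [n_groups * binomial_pmf(k, 5, 0.5) for k in range(6)]
print(expected_groups)
```

With 32 groups the expected counts are 1, 5, 10, 10, 5, and 1, since the probabilities are the binomial coefficients 1, 5, 10, 10, 5, 1 divided by 32.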

### The number of births in a day (Poisson Distribution)

Suppose an event is happening at some constant rate over a period of time. Then the number of times that the event occurs during a particular time interval has the Poisson distribution. Therefore, the number of customers who arrive in a store during a 5-minute interval should have a Poisson distribution. The number of goals scored during a soccer game should also have approximately a Poisson distribution. If babies are born at approximately a constant rate over time, then the number of babies born on a given day should have a Poisson distribution. Here you will investigate whether the Poisson distribution indeed fits the data well.

7. What is the average number of births per day?

8. On how many days were there no births? One? Two? Three? Four? Five? Six?

9. Compare these numbers to what you would expect if they followed a Poisson distribution with the same mean as what you found in question 7. Do the data approximately agree with what would be expected from the Poisson distribution? (To get Minitab to help with the Poisson distribution computation, type the numbers 0, 1, ..., 6 in one column. Then go to Calc --> Probability Distributions --> Poisson, click the bubble that says "Probability", type in the mean you found in question 7, and then proceed as you did for your calculations with the geometric and binomial distributions.)

In Minitab Express, the relevant command is Statistics --> Probability Distributions --> Probability Density Function, and you will select the Poisson distribution.
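The Poisson calculation follows the same pattern and can be sketched in Python. The helper name `poisson_pmf`, the mean of 2.0 births per day, and the total of 100 days are all hypothetical placeholders; use the mean from question 7 and the number of days in the data set instead.

```python
from math import exp, factorial


def poisson_pmf(k, mean):
    """P(X = k) for a Poisson random variable with the given mean."""
    return exp(-mean) * mean**k / factorial(k)


mean_per_day = 2.0  # hypothetical mean, NOT the answer to question 7
n_days = 100        # hypothetical number of days

# Expected number of days with 0, 1, ..., 6 births.
expected_days = [n_days * poisson_pmf(k, mean_per_day) for k in range(7)]
for k, count in enumerate(expected_days):
    print(k, round(count, 2))
```

The expected counts for 0 through 6 births add up to slightly less than the number of days, because the Poisson model also gives a small probability to 7 or more births.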

### Intervals between births (Exponential Distribution)

Suppose an event is happening at some constant rate over a period of time. Then the amount of time that we have to wait before the next occurrence of the event has the exponential distribution. For example, the amount of time before the next customer arrives in a store should have the exponential distribution, as should the amount of time before the next goal in a soccer game. If babies are born at approximately a constant rate over time, then the length of time between births (that is, the amount of time we have to wait for the next birth) should have approximately an exponential distribution. Here you will investigate whether this is the case. Because time is a continuous variable, you cannot proceed by comparing observed and expected counts as before. Instead, you will base your analysis on a histogram of the data. Below are some points to keep in mind:
• Negative times between births are, of course, impossible. However, by default, Minitab makes the first bar of the histogram extend below zero. To fix this, double click on the horizontal axis, click the "Binning" tab, click in the "Cutpoint" bubble, and click "OK". You will have to do this with several histograms in this lab, and it is very important that you do this in order for your graphs to display the data accurately. Remember you can also change the number of bins.

In Minitab Express, you will need to click inside the histogram, then click on the plus sign to the right of the graph and select "Cutpoint" under "Binning".

• A useful technique is to superimpose an exponential curve on top of the histogram. You can do this by going to Graph --> Histogram and then selecting "With Fit" and clicking "OK". You will have to select a variable as usual. Then click on the "Data View" tab, then click on "Distribution", check the "Fit distribution" box, and select "Exponential" in the "Distribution" window. Then click "OK" twice to make the graph.

In Minitab Express, you click inside the histogram, then click the plus sign to the right of the graph. Then check "Distribution fit", click the associated triangle, and select "Exponential".

10. What is the mean time between births?
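To get a feel for what the fitted exponential curve implies for the bars of the histogram, one option is to compute the share of intervals the model places in each bin. The sketch below is a rough illustration in Python, not part of the Minitab workflow: the helper name `exponential_cdf`, the mean of 10 hours, the cutpoints, and the total of 200 intervals are all hypothetical.

```python
from math import exp


def exponential_cdf(t, mean):
    """P(T <= t) for an exponential waiting time with the given mean (t >= 0)."""
    return 1 - exp(-t / mean)


mean_interval = 10.0            # hypothetical mean time between births, in hours
cutpoints = [0, 5, 10, 15, 20]  # hypothetical histogram cutpoints, in hours
n_intervals = 200               # hypothetical number of observed intervals

# Expected number of intervals falling in each bin [a, b).
expected_in_bin = [
    n_intervals * (exponential_cdf(b, mean_interval) - exponential_cdf(a, mean_interval))
    for a, b in zip(cutpoints, cutpoints[1:])
]
for (a, b), count in zip(zip(cutpoints, cutpoints[1:]), expected_in_bin):
    print(f"{a}-{b} hours: {count:.1f}")
```

With equal-width bins, the expected counts shrink by the same factor from each bin to the next, which is the steadily decaying shape to look for in the histogram.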

### Birth Times (Uniform Distribution)

A random variable has a uniform distribution if it has a density that is constant over some interval. If a bus is equally likely to arrive any time during the next three minutes, then the time (in minutes) that you will have to wait for the bus has a uniform distribution on [0, 3]. If babies were equally likely to be born at any time of the day, the distribution of the birth times would be approximately uniform.

11. Make a histogram of the birth times. Does the distribution of birth times appear to be uniform, or do you see a different pattern? Explain your answer. It would be a good idea to simulate 300 values from a uniform distribution a few times to see how much fluctuation would be expected just by chance. You can do this by going to Calc --> Random Data --> Uniform.

In Minitab Express, you go to Data --> Generate Random Data and select "Uniform" as the distribution.
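The same simulation can be sketched in Python if you want to see the chance fluctuation as bin counts rather than as a graph. The function name `simulate_uniform_counts`, the choice of twelve two-hour bins, and the seeds are arbitrary illustrative choices.

```python
import random


def simulate_uniform_counts(n, n_bins, low=0.0, high=24.0, seed=None):
    """Draw n values uniformly on [low, high] and count how many fall
    in each of n_bins equal-width bins."""
    rng = random.Random(seed)
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for _ in range(n):
        x = rng.uniform(low, high)
        index = min(int((x - low) / width), n_bins - 1)  # guard the x == high edge
        counts[index] += 1
    return counts


# Three simulated "days' worth" of 300 birth times, in two-hour bins.
simulated = [simulate_uniform_counts(300, 12, seed=s) for s in (1, 2, 3)]
for counts in simulated:
    print(counts)
```

Each bin has an expected count of 25, so the spread of the simulated counts around 25 gives a sense of how uneven a histogram can look even when the underlying distribution really is uniform.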

### Birth Weights (Normal Distribution)

The normal distribution has a density in the shape of the famous "bell curve". As you will learn later, a famous result called the Central Limit Theorem guarantees that sums and averages of many independent random variables will have approximately a normal distribution. For this reason, normal distributions are ubiquitous in nature. Although we can only be confident of observing a normal distribution when we are taking sums or averages, measurements of complex traits often have approximately a normal distribution. For example, heights and IQ scores of adults are approximately normally distributed, probably because height and IQ are in some sense "averages" of many contributing factors. Here you will investigate whether the birth weights of the babies also follow a normal distribution.

12. Graph the distribution of birth weights. Does the distribution of the birth weights look to be approximately normal? (You could answer this just from the histogram, or you could try superimposing a normal curve on top of the histogram, as you did with the exponential distribution.)
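A numerical complement to the histogram is to compare the fraction of weights within one and two standard deviations of the mean with what a normal curve predicts. The sketch below uses Python's standard `statistics.NormalDist`. The weights are simulated stand-ins (the mean of 120 ounces and standard deviation of 18 ounces are made-up values), not the BABIES data.

```python
import random
from statistics import NormalDist, mean, stdev

rng = random.Random(0)
# Simulated stand-in for the Weight column; parameters are hypothetical.
weights = [rng.gauss(120, 18) for _ in range(300)]

# Normal curve with the same mean and standard deviation as the sample.
fitted = NormalDist(mean(weights), stdev(weights))

within = {}
for k in (1, 2):
    lo, hi = fitted.mean - k * fitted.stdev, fitted.mean + k * fitted.stdev
    observed = sum(lo <= w <= hi for w in weights) / len(weights)
    model = fitted.cdf(hi) - fitted.cdf(lo)
    within[k] = (observed, model)
    print(f"within {k} SD: observed {observed:.3f}, normal model {model:.3f}")
```

If the observed fractions are far from the model values (roughly 68 percent and 95 percent), that is a sign that the normal curve does not describe the data well.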

### Other Examples

In this lab, you have investigated six of the most important distributions in probability theory. You should now have a good idea of when to expect these distributions to appear. For the random variables below, indicate whether you would expect the distribution to be best described as geometric, binomial, Poisson, exponential, uniform, or normal. We do not have data, so you will not need to use the computer for these questions. For each item, give a brief explanation of your answer. A one-sentence explanation should be sufficient.

13. The number of days that we have to wait before the first Daily 4 number drawn in the California State Lottery is a 6. (Each day, this number is equally likely to be any of the 10 digits.)

14. The amount of time before the next plane crash in the United States.

15. The number of typographical errors on a page in the rough draft of a report.

16. The number of times that a rifle shooter hits a target if he shoots 10 times.

17. The number of phone calls that a salesperson gets in the next hour.

18. The number of minutes that the salesperson is waiting before her next phone call.

19. The time of day that the next major earthquake occurs in Southern California.