## Lab 8: Predicting children's growth (Regression inference)

In this lab, you will investigate how fast children grow. You will determine how to predict a child's height or weight at a later age from their height or weight at an earlier age.

### The Data

First, open the data set GROWTH, which is available in TritonEd.

The data come from the Berkeley Guidance Study of children. The study involved 136 children, all born in Berkeley, CA in 1928-1929. These children were measured at ages 2, 9, and 18. The results of the original study were published in [R. D. Tuddenham and M. M. Snyder (1954). Physical growth of California boys and girls from birth to eighteen years. University of California Publications in Child Development, 1, 183-364].

| Variable Name | Description |
|---|---|
| Gender | M = Male, F = Female |
| WT2 | Weight of the child in kilograms at age 2 |
| HT2 | Height of the child in centimeters at age 2 |
| WT9 | Weight of the child in kilograms at age 9 |
| HT9 | Height of the child in centimeters at age 9 |
| WT18 | Weight of the child in kilograms at age 18 |
| HT18 | Height of the child in centimeters at age 18 |

### Minitab Instructions

Recall that you can make scatterplots by going to Graph --> Scatterplot and carry out linear regression by going to Stat --> Regression --> Regression --> Fit Regression Model. The table of output that you get includes not only the regression coefficients but also the information that you need to do inference for the regression slope: the standard errors for the estimated coefficients, and the t-statistic and p-value for the test that the slope is zero against a two-sided alternative.

To make sure you get a residual plot along with your regression, click on "Graphs" and then put the explanatory variable in the box under "Residuals versus the variables". You can also get a histogram or a normal probability plot of the residuals by checking the appropriate boxes.

Minitab will also calculate confidence intervals for the mean response and prediction intervals for an individual response, given a particular value for the explanatory variable. To obtain these intervals, after you have fit the regression model, go to Stat --> Regression --> Regression --> Predict. You can type in a value for the explanatory variable in the boxes provided under "Enter individual values". Then click OK. Minitab will provide you with a prediction (under "Fit"), a 95 percent confidence interval for the mean response for this value of the explanatory variable (under "95% CI"), and a 95 percent prediction interval for an individual response at this value of the explanatory variable (under "95% PI").
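To see where the numbers in the regression output table come from, here is a minimal Python sketch (not part of the Minitab workflow) that computes the least-squares coefficients, the standard error of the slope, and the t-statistic for the test that the slope is zero. The function name `fit_simple_regression` and the six data pairs are hypothetical and are not taken from the GROWTH data set. The p-value is left to Minitab because it requires the t-distribution with n - 2 degrees of freedom.

```python
import math

def fit_simple_regression(x, y):
    """Least-squares fit of y = b0 + b1 * x, plus the quantities used for slope inference."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                      # estimated slope
    b0 = y_bar - b1 * x_bar             # estimated intercept
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in residuals)
    s = math.sqrt(sse / (n - 2))        # residual standard deviation (n - 2 df)
    se_b1 = s / math.sqrt(sxx)          # standard error of the slope
    t_stat = b1 / se_b1                 # t-statistic for H0: slope = 0
    return {"b0": b0, "b1": b1, "s": s, "se_b1": se_b1, "t": t_stat, "df": n - 2}

# Hypothetical heights in centimeters (NOT values from the GROWTH data set).
ht2 = [83.0, 85.5, 87.0, 88.5, 90.0, 92.5]
ht9 = [128.0, 131.5, 133.0, 137.5, 138.0, 143.0]

fit = fit_simple_regression(ht2, ht9)
print(f"HT9 = {fit['b0']:.2f} + {fit['b1']:.3f} * HT2")
print(f"SE(slope) = {fit['se_b1']:.3f}, t = {fit['t']:.2f}, df = {fit['df']}")
```

The t-statistic is simply the estimated slope divided by its standard error, which is why both appear side by side in the output table.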

### Minitab Express Instructions

Recall that you can make scatterplots by going to Graphs --> Scatterplot and carry out linear regression by going to Statistics --> Regression --> Simple Regression. The table of output that you get includes not only the regression coefficients but also the information that you need to do inference for the regression slope: the standard errors for the estimated coefficients, and the t-statistic and p-value for the test that the slope is zero against a two-sided alternative.

To make sure you get a residual plot along with your regression, click on "Graphs", check the "Residuals versus the variables" box, and double click on the explanatory variable to put it in that box. You can also get a histogram and a normal probability plot of the residuals (along with two other plots which you should ignore) by checking the box "Residual plots".

Minitab will also calculate confidence intervals for the mean response and prediction intervals for an individual response, given a particular value for the explanatory variable. To obtain these intervals, after you have fit the regression model, go to Statistics --> Regression --> Predict for Regression. You can type in one or more values for the explanatory variable in the boxes provided under "Enter values for each predictor". Then click OK. Minitab will provide you with a prediction (under "Fit"), a 95 percent confidence interval for the mean response for this value of the explanatory variable (under "95% CI"), and a 95 percent prediction interval for an individual response at this value of the explanatory variable (under "95% PI"). When you fit the regression model, you also have the option of going to "Options" and checking the two boxes, which gives you a graph in which curves representing the boundaries of these confidence intervals and prediction intervals are plotted.
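The lab does not state the formulas behind these two intervals. For reference, the standard textbook formulas for simple linear regression are given below, on the assumption that they match what Minitab reports. The symbols are introduced here for illustration: $x^*$ is the chosen value of the explanatory variable, $\hat{y} = b_0 + b_1 x^*$ is the fit, $s$ is the residual standard deviation, $n$ is the number of observations, and $t^*$ is the critical value of the t-distribution with $n - 2$ degrees of freedom.

```latex
\text{95\% CI for the mean response:}\quad
\hat{y} \pm t^* \, s \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}

\text{95\% PI for an individual response:}\quad
\hat{y} \pm t^* \, s \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}
```

The extra 1 under the square root reflects the variability of an individual child around the mean response, so the prediction interval is always wider than the confidence interval at the same $x^*$.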

### Predicting children's heights
1. Begin by using linear regression to predict a child's height at age 9 from the child's height at age 2. What is the equation of your regression line? Based on your scatterplot and residual plot, does linear regression seem like an appropriate way to predict heights?

2. Next, try using linear regression to predict a child's height at age 18 from the child's height at age 9. What is the equation of your regression line? Does linear regression seem appropriate for these data?

Next, investigate graphically whether boys and girls exhibit different growth patterns between age 2 and age 9. Go to Graph --> Scatterplot, click on "With regression and groups", and then click "OK". Choose HT9 as the Y variable and HT2 as the X variable, then put Gender in the box "Categorical variables for grouping" and click OK. You will get a scatterplot with height at age 9 on the y-axis and height at age 2 on the x-axis. The points for boys and girls will be in different colors. You will also see two separate regression lines drawn on the graph, one for the boys and one for the girls. Of course, you can then make a similar plot with height at age 18 on the y-axis and height at age 9 on the x-axis.
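Drawing two separate regression lines amounts to fitting one least-squares line per gender. The following Python sketch illustrates that idea outside Minitab. The helper names `fit_line` and `fit_by_group` and the six rows of data are hypothetical and are not taken from the GROWTH data set.

```python
from collections import defaultdict

def fit_line(points):
    """Return (intercept, slope) of the least-squares line through (x, y) points."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def fit_by_group(rows):
    """Fit a separate line for each value of the grouping variable."""
    groups = defaultdict(list)
    for gender, x, y in rows:
        groups[gender].append((x, y))
    return {gender: fit_line(points) for gender, points in groups.items()}

# Hypothetical (Gender, HT2, HT9) rows; NOT values from the GROWTH data set.
rows = [
    ("F", 84.0, 130.0), ("F", 87.5, 134.5), ("F", 90.0, 139.0),
    ("M", 85.0, 131.0), ("M", 88.0, 135.0), ("M", 91.5, 140.5),
]

for gender, (intercept, slope) in sorted(fit_by_group(rows).items()):
    print(f"{gender}: HT9 = {intercept:.2f} + {slope:.3f} * HT2")
```

If the two fitted lines are close to each other over the observed range of heights, a single line for all children may be adequate; if they differ clearly, the groups are better modeled separately.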

In Minitab Express, you can go to Graphs --> Scatterplot and select "With Groups". Choose HT9 as the Y variable and HT2 as the X variable, and put Gender in the box for "Group Variable". After you make the scatterplot, you can put the two regression lines on the graph by clicking inside the graph, clicking on the plus sign, and checking the box for "Regression Fit".

3. Is there a big difference between how much boys and girls grow in height between age 2 and age 9, or does the regression line you found in question 1 appear to work pretty well for both boys and girls?

4. Now consider the period between age 9 and age 18. Is there a big difference between the growth patterns of boys and girls in height during this period, or does the regression line you found in question 2 work well for both boys and girls?

For the rest of this section (questions 5-11), you will consider only the girls. Go to Data --> Subset Worksheet, give a name to your new worksheet, then click on "Condition" and in the box labeled "Condition", type in 'Gender' = "F" (including the quotes), then click "OK" twice. You should get a Minitab worksheet that includes only the 70 girls in the original data set.
[Note: if you downloaded Minitab onto your personal Windows computer, this works slightly differently than it does in the AP&M labs. First go to Data --> Subset Worksheet as above. Under "How do you want to create a subset?", select "Use rows that match a condition". In the Column box, select the Gender variable, and then check the F option. Click "OK" and you should get a Minitab worksheet that includes only the girls.]
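The subsetting step can also be illustrated outside Minitab. The Python sketch below keeps only the rows whose Gender column equals "F". The column names come from the variable table, but the inline CSV text, its values, and the helper name `subset_rows` are hypothetical stand-ins, since the lab does not say in what file format GROWTH is distributed.

```python
import csv
import io

# Hypothetical rows in CSV form (NOT values from the GROWTH data set).
SAMPLE_CSV = """Gender,HT9,HT18
F,135.0,166.0
M,137.5,179.0
F,131.2,162.5
M,133.0,175.5
"""

def subset_rows(csv_text, column, value):
    """Return the rows (as dictionaries) whose `column` equals `value`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row[column] == value]

girls = subset_rows(SAMPLE_CSV, "Gender", "F")
print(len(girls), "rows kept")
for row in girls:
    print(row["Gender"], row["HT9"], row["HT18"])
```

As in Minitab, the original data are left unchanged and the subset is a new object, so the full data set remains available when you later need the boys.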

If you are using Minitab Express, you can create the new worksheet by going to File --> New and copying and pasting only the girls into the new worksheet.

5. Find the equation of a regression line that can be used to predict a girl's height at age 18 from the girl's height at age 9.

6. What percentage of the variation in the girls' heights at age 18 is explained by this regression?

7. Are the assumptions required for statistical inference satisfied? Explain how you arrive at your conclusions and provide supporting plots.

8. Can you conclude that there is an association between girls' heights at age 18 and their heights at age 9? Make sure to state your null and alternative hypotheses and give the t-statistic and p-value for your test. Use significance level 0.05.

9. Find a 95 percent confidence interval for the slope of your regression line. Explain carefully in a sentence or two what this confidence interval means. (Hint: if you want to find the critical value for the t-distribution in Minitab, go to Calc --> Probability Distributions --> t, click the "Inverse cumulative probability" bubble and type in the appropriate number of degrees of freedom. Then click the "Input Constant" bubble, and in the box type in the amount of area that will be to the left of the critical value you are looking for, which is 0.975 for a 95 percent confidence interval.)

If you are using Minitab Express, you get this critical value by going to Statistics --> Probability Distributions --> Inverse Cumulative Distribution Function. In the "Value" box, enter the amount of area that will be to the left of the critical value you are looking for, which is 0.975 for a 95 percent confidence interval. Change the distribution to "t", and enter the appropriate number of degrees of freedom. Then click "OK", and the critical value will appear under "x" in the table of output.

10. If a girl is 135 centimeters tall at age 9, find an interval that you are 95 percent confident will contain the girl's height at age 18. (Hint: Minitab can provide this interval. See the instructions at the beginning of the lab.)

11. Find an interval that you are 95 percent confident will contain the average height at age 18 of all girls who are 135 centimeters tall at age 9.
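The questions in this section ask for several related quantities: the percentage of variation explained, a confidence interval for the slope, a prediction interval for one girl, and a confidence interval for the mean response. The Python sketch below shows how they are computed with the standard textbook formulas, assuming these match what Minitab reports. The function name `regression_intervals` and the data are hypothetical (not from the GROWTH data set), and the critical value `t_crit` is passed in by hand, just as you would look it up in Minitab.

```python
import math

def regression_intervals(x, y, x_new, t_crit):
    """Slope CI, R-squared, and CI/PI for the response at x_new.

    t_crit is the t critical value with n - 2 degrees of freedom,
    supplied by the caller (for example, looked up in Minitab).
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))                      # residual standard deviation
    se_b1 = s / math.sqrt(sxx)                        # standard error of the slope
    fit = b0 + b1 * x_new                             # predicted response at x_new
    se_mean = s * math.sqrt(1 / n + (x_new - x_bar) ** 2 / sxx)
    se_pred = s * math.sqrt(1 + 1 / n + (x_new - x_bar) ** 2 / sxx)
    return {
        "r_squared": 1 - sse / syy,
        "slope_ci": (b1 - t_crit * se_b1, b1 + t_crit * se_b1),
        "fit": fit,
        "mean_ci": (fit - t_crit * se_mean, fit + t_crit * se_mean),
        "pred_pi": (fit - t_crit * se_pred, fit + t_crit * se_pred),
    }

# Hypothetical girls' heights in centimeters (NOT values from the GROWTH data set).
ht9 = [128.0, 131.0, 133.5, 136.0, 138.5, 142.0]
ht18 = [160.5, 162.0, 166.5, 165.0, 170.0, 171.5]

# 2.776 is the 0.975 quantile of the t-distribution with 6 - 2 = 4 degrees of freedom.
result = regression_intervals(ht9, ht18, x_new=135.0, t_crit=2.776)
for name, value in result.items():
    print(name, value)
```

With the real data the degrees of freedom, and therefore the critical value, will differ from this six-point example, so `t_crit` must be looked up again.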

### Predicting children's weights

For the rest of this lab, you will consider only the boys. You will need to create a separate worksheet for the boys just as you did above for the girls.

12. Find the equation of a regression line that can be used to predict a boy's weight at age 18 from the boy's weight at age 9. Comment on what you see in the scatterplot and the residual plot.

13. You should have noticed that the data set contains some outliers, including one rather extreme outlier that represents a boy who weighed nearly 67 kilograms at age 9. Try removing this outlier. (The easiest way to do this is to scroll down the list and find the outlier, move your cursor over the cells that you want to delete, and hit the delete key to replace the value with an asterisk. Remember the value in case you want to put it back later.) Then do the linear regression again. This time, do the assumptions for inference appear to be satisfied?

14. How much effect was the outlier having on the slope of the regression line? Would you say that this outlier is an influential point? Is it a high leverage point?

15. Find an interval that you are 95 percent confident will contain the weight at age 18 of a boy who weighs 29 kilograms at age 9. Use whichever model you think is most appropriate for answering this question. Make sure to indicate which model you chose.
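The ideas of influence and leverage used in this section can be made concrete with a small Python sketch. It compares the fitted slope with and without one extreme point and computes each point's leverage using the standard simple-regression formula. The function names and all weights below are hypothetical and are not taken from the GROWTH data set, so the output says nothing about the answer for the real data.

```python
def slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sxx

def leverages(x):
    """Leverage of each point in simple regression: 1/n + (x_i - x_bar)^2 / Sxx."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

# Hypothetical weights in kilograms (NOT values from the GROWTH data set);
# the last point has an unusually large x value.
wt9 = [25.0, 27.0, 28.5, 30.0, 31.0, 33.0, 66.0]
wt18 = [60.0, 63.5, 62.0, 68.0, 70.5, 72.0, 90.0]

print("slope with all points:   ", slope(wt9, wt18))
print("slope without last point:", slope(wt9[:-1], wt18[:-1]))
print("leverages:", [round(h, 3) for h in leverages(wt9)])
```

A point has high leverage when its x value is far from the mean of the x values, and it is influential when removing it changes the fitted line substantially. In simple regression the leverages always add up to 2, so a value far above the average of 2/n stands out.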