## Lab 2: Comparing the 50 states (Summary statistics and regression)

In this lab, you will learn how to use Minitab to obtain numerical summaries of data. You will also learn how to compare data across groups and to explore relationships between variables.

The Data

First, open the data set STATES, which is available in TritonEd.

We have data on the 50 states, with an emphasis on economic, demographic, and health variables. The data set contains 50 rows (one for each state) and the following columns:

| Variable Name | Description |
| --- | --- |
| State | The name of the state |
| Region | The region of the country (West, South, Midwest, or Northeast) in which the state is located |
| Poverty | The percentage of individuals in the state below the poverty line in 2014 |
| Income | Median annual household income in the state in 2014 |
| InfMort | Infant deaths per 1000 births, 2011-2013 |
| LowBwt | Percentage of live births under 2500 grams in 2013 |
| Smoke | Percentage of people over age 18 who smoked in 2014 |
| LifeExp | Life expectancy in the state in 2010 |
| Diabetes | Percentage of adults with diabetes in 2014 |
| Asthma | Percentage of adults with asthma in 2014 |
| AfAm | Percentage of people in the state in 2014 who were African-American |
| AsAm | Percentage of people in the state in 2014 who were Asian-American |
| Hispanic | Percentage of people in the state in 2014 of Hispanic or Latino origin |

The variables AfAm, AsAm, and Hispanic came from the U.S. Census Bureau.
The variable Poverty came from WorldAtlas.
The other variables came from the State Health Facts web site provided by the Kaiser Family Foundation.

Summarizing individual variables

In Lab 1, you learned how to summarize data graphically. In this lab, you will learn how to use Minitab to obtain numerical as well as graphical summaries of data. First, practice with the Poverty variable. Go to Stat --> Basic Statistics --> Display Descriptive Statistics. Double click on the Poverty variable on the left, then click "OK". You should see, for example, that the mean poverty rate for the 50 states is 13.966 percent, while the standard deviation is 3.879 percent. Also, the lower and upper quartiles are 10.875 and 16.85 percent, respectively. Take a look at a histogram of the poverty rates (remember you can go to Graph --> Histogram), and you should see that about half of the states' poverty rates fall between these two quartiles.

In Minitab Express, you go to Statistics --> Summary Statistics --> Descriptive Statistics. Then double click on "Poverty" to choose it as the Variable, and leave the Group variable box blank. Then click "OK".

If you want to look at a different set of summary statistics than the default ones, click on "Statistics" before clicking "OK". Then you can check whichever statistics you want to see. For example, you could check the interquartile range if you don't want to figure it out by subtracting the lower quartile from the upper quartile, and you could uncheck things like the SE of the mean, which we haven't learned about yet, or the number of observations, which we know is 50. Now answer the following questions.
1. Examine (and present) a histogram of the variable "AfAm". Remember you can change the number of bars by double clicking inside one of the bars and selecting the "Binning" tab. As always, it is a good idea to select the "Cutpoint" interval type. This keeps the lowest bar from extending below zero, as negative values for this variable do not make sense. Also make sure, as always, that your graph is appropriately labeled. Would you describe the distribution of the percentage of African-American residents in the 50 states as symmetric or skewed? Find the mean, median, lower quartile, and upper quartile. Is the mean greater than, less than, or about the same as the median? Is the lower quartile closer to the median, farther from the median, or about the same distance from the median as the upper quartile?

2. Next examine (and present) a histogram of the variable "Income". Would you describe the distribution of the median incomes of the 50 states as symmetric or skewed? Answer the same questions that you answered for the "AfAm" variable.

3. What do you conclude about how skewness affects the mean and median? What do you conclude about how skewness affects the distance between the median and the lower and upper quartiles?

4. Examine (and present) a histogram of the "AsAm" variable. Is the distribution symmetric or skewed? Are there any outliers? Find the mean, median, standard deviation, and interquartile range.

5. What state has the largest percentage of Asian-Americans, and what percentage of people in that state are Asian-Americans? You can find this by scrolling down the data file. Remove this state from consideration, and calculate the mean, median, standard deviation, and interquartile range for the other 49 states. (There are several ways to remove a state, but probably the easiest is to put the cursor over the cell in the data file that you want to remove and hit the delete key, then remember the value and put it back when you are finished with this question. Do not go to Edit --> Delete Cells, as this will alter the alignment of the data.) Which of these four summary statistics are strongly affected by the outlier, and which ones are not?
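
If you would like to check by hand the kinds of numbers that Display Descriptive Statistics reports, the following Python sketch computes them. It is not part of the Minitab workflow for this lab. The function name `summarize` and the toy values are made up for illustration and are not taken from the STATES data. The sketch assumes the `method="exclusive"` quartile rule in Python's `statistics` module, which places the quartiles at positions based on n + 1. We believe this matches Minitab's convention, but treat that as an assumption and expect small differences if it does not.

```python
import statistics


def summarize(values):
    """Return common numerical summaries for a list of numbers."""
    # Quartiles located at positions based on (n + 1); assumed to match Minitab.
    q1, median, q3 = statistics.quantiles(values, n=4, method="exclusive")
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),  # sample SD (divides by n - 1)
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": q3 - q1,  # upper quartile minus lower quartile
    }


# Toy percentages for illustration only (NOT from the STATES data set).
toy_values = [1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 5.0]

summary = summarize(toy_values)
for name, value in summary.items():
    print(f"{name:>6}: {value:.3f}")
```

To try this on a real column, you would replace `toy_values` with the values you read from the data file.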

Comparing variables across regions

Boxplots are a useful graphical tool for comparing different regions of the country. Again, start with the "Poverty" variable. Recall that you can make a boxplot of the poverty rates by going to Graph --> Boxplot, clicking "OK", then selecting the "Poverty" variable and clicking "OK" again. To compare poverty rates in the four regions of the country, we want to put four boxplots side by side. To do this, go to Graph --> Boxplot again, but this time click on "With Groups" (under "One Y") before clicking "OK". Select "Poverty" as the graph variable. Then click in the window labeled "categorical variables for grouping" and select the "Region" variable. Then click "OK", and you should get four side-by-side boxplots. You can also get separate summary statistics for the four regions of the country by going to Stat --> Basic Statistics --> Display Descriptive Statistics, choosing "Poverty", then clicking in the "By variables" box and selecting "Region".

In Minitab Express, you make the boxplots by going to Graphs --> Boxplot. To get a boxplot of the poverty rates of the 50 states, select "Simple" under "Single Y variable", then choose the Poverty variable and click "OK". To get the side-by-side boxplots for the four regions of the country, choose instead "With Groups" under "Single Y variable", then select Poverty as the Y variable and put Region in the box for Group Variables.

Now answer the following questions in a few sentences each, providing some plots to support your answers.
1. What differences, if any, do you see in the levels of poverty in the four regions of the country?

2. What differences, if any, do you see in the percentages of African-Americans and the percentages of people of Hispanic or Latino origin across the four regions of the country?

3. Now consider the health-related variables. Are there differences in the infant mortality rates in the four regions of the country? What about the life expectancies?
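
Each side-by-side boxplot is drawn from a group's minimum, lower quartile, median, upper quartile, and maximum (with unusually extreme points usually marked separately). The Python sketch below shows one way to compute those five numbers separately for each group, which is roughly what the "By variables" option does for summary statistics. The function name `five_number_by_group` and the toy rows are hypothetical and are not taken from the STATES data, and the quartile rule is again an assumption about Minitab's convention.

```python
import statistics
from collections import defaultdict


def five_number_by_group(pairs):
    """Map each group label to (min, Q1, median, Q3, max) of its values."""
    groups = defaultdict(list)
    for group, value in pairs:
        groups[group].append(value)

    result = {}
    for group, values in groups.items():
        # Quartile positions based on (n + 1); assumed to match Minitab.
        q1, median, q3 = statistics.quantiles(values, n=4, method="exclusive")
        result[group] = (min(values), q1, median, q3, max(values))
    return result


# Toy (group, value) rows for illustration only (NOT from the STATES data set).
toy_rows = [
    ("West", 10.0), ("West", 12.0), ("West", 14.0), ("West", 16.0),
    ("South", 15.0), ("South", 17.0), ("South", 19.0), ("South", 21.0),
]

by_group = five_number_by_group(toy_rows)
for group, numbers in by_group.items():
    print(group, numbers)
```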

Exploring relationships between variables

Scatterplots are the standard way of displaying relationships between two quantitative variables. To make a scatterplot, go to Graph --> Scatterplot, click "OK", then select a Y-variable and an X-variable, then click "OK" again. Labels can be edited just as they can for histograms. You can also find the correlation between two variables by going to Stat --> Basic Statistics --> Correlation, then selecting two variables and clicking "OK". The "Pearson correlation" is the correlation that we have learned about.

In Minitab Express, you make a scatterplot by going to Graphs --> Scatterplot. Then select "Simple" under "Single Y variable", select a Y-variable and an X-variable, and click "OK". To compute the correlation, go to Statistics --> Regression --> Correlation. Then select two variables and click "OK".

Now answer the following questions in a few sentences each, providing plots when necessary to support your answers.
1. Investigate the relationship between poverty and smoking. From a scatterplot, do you see a strong relationship between the poverty rate in a state and the percentage of adults who smoke? What is the correlation between the poverty rate and the percentage of adults who smoke?

2. Now investigate how poverty is associated with specific health conditions. Describe the relationship between the poverty rate and the percentage of people with diabetes. Describe the relationship between the poverty rate and the percentage of people with asthma.

3. What kind of relationship, if any, do you see between the percentage of people who smoke and the life expectancy in the state?

4. Is the relationship that you observed in response to the previous question sufficient to prove that smoking causes lower life expectancies?

5. Is it possible to use the correlation to summarize the relationship between "Region" and life expectancy? Explain your answer.
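
For reference, the Pearson correlation reported by Minitab can be computed from the deviations of each variable from its mean. The Python sketch below is a minimal version of that calculation, written out so you can see each step. The function name `pearson_r` and the toy values are hypothetical and are not taken from the STATES data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squared deviations and of cross-products of deviations.
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


# Toy values for illustration only (NOT from the STATES data set).
x_toy = [1.0, 2.0, 3.0, 4.0, 5.0]
y_toy = [2.0, 4.0, 5.0, 4.0, 5.0]

r = pearson_r(x_toy, y_toy)
print(f"r = {r:.4f}")
```

Notice that the calculation subtracts a mean from every value, so both inputs must be numbers.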

Linear regression

Your next goal will be to predict the infant mortality rate from the percentage of babies with low birthweight, using linear regression. Go to Stat --> Regression --> Regression --> Fit Regression Model. Select "InfMort" as the response variable and "LowBwt" as the continuous predictor. Leave the box for "categorical predictors" blank. Minitab gives you a lot of output, much of which we are not ready to use. For now, just focus on the regression equation and the R-squared value (labeled "R-sq") given near the beginning of the output. Ignore the "Adjusted R-squared" and "Predicted R-squared", which are not relevant when there is just one explanatory variable.

In Minitab Express, you go to Statistics --> Regression --> Simple Regression. Select "InfMort" as the Response (Y) and "LowBwt" as the Predictor (X). Then click "OK".

Minitab's regression output does not give you a scatterplot automatically, but you can get a scatterplot with a regression line drawn in by going to Graph --> Scatterplot and clicking on "With Regression", then proceeding to make the scatterplot as before. Also, to get a residual plot along with your regression output, go to Stat --> Regression --> Regression --> Fit Regression Model and click "Graphs". Click in the box "Residuals versus the variables" and then select your predictor variable, which is "LowBwt". This will give you a plot of the residuals against the X-variable as part of your regression output. (You could also check the box "residuals vs fits" to get a plot of the residuals against the fitted values.)

Minitab Express does provide a scatterplot automatically with the regression output. To get a residual plot, select "Graphs" before clicking "OK", click in the box "Residuals versus the variables" and then select your predictor variable.

Now answer the following questions.
1. Give the equation of your regression line.

2. What percentage of the variation in infant mortality rates can be explained by the percentages of low birthweight babies?

3. From your scatterplot and residual plot, does it appear that linear regression is appropriate for these data? Show the scatterplot and residual plot, and write a few sentences explaining your answer.

4. What would the regression predict to be the infant mortality rate in California? How does this compare to the actual infant mortality rate in California?
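
The quantities used in this section (the regression equation, R-squared, the residuals, and a prediction) all come from a short least-squares calculation. The Python sketch below works through it on toy numbers so you can see where each piece of Minitab's output comes from. The function name `fit_line` and the toy values are hypothetical and are not taken from the STATES data, so the numbers it prints are not answers to the questions above.

```python
def fit_line(x, y):
    """Least-squares line y = intercept + slope * x, with R-squared and residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual = observed response minus the value predicted by the line.
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    sse = sum(e ** 2 for e in residuals)     # variation left unexplained
    sst = sum((b - mean_y) ** 2 for b in y)  # total variation in the response
    r_squared = 1 - sse / sst
    return slope, intercept, r_squared, residuals


# Toy values for illustration only (NOT from the STATES data set).
x_toy = [1.0, 2.0, 3.0, 4.0, 5.0]
y_toy = [2.0, 4.0, 5.0, 4.0, 5.0]

slope, intercept, r_squared, residuals = fit_line(x_toy, y_toy)
predicted = intercept + slope * 6.0  # prediction at a hypothetical x of 6

print(f"y = {intercept:.3f} + {slope:.3f} x")
print(f"R-sq = {100 * r_squared:.1f}%")
print(f"prediction at x = 6: {predicted:.3f}")
```

Plotting `residuals` against `x_toy` would give the same kind of picture as Minitab's "Residuals versus the variables" plot.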