Lab 2: Comparing the 50 states (Summary statistics and regression)

In this lab, you will learn how to use Minitab to obtain numerical summaries of data. You will also learn how to compare data across groups and to explore relationships between variables.

The Data

First, open the data set STATES, which is available in TED.

We have data on the 50 states, with an emphasis on economic, demographic, and health variables. The data set contains 50 rows (one for each state) and the following 11 columns:

Variable Name   Description
State           The name of the state
Region          The region of the country (West, South, Midwest, or Northeast) in which the state is located
Poverty         The percentage of individuals in the state below the poverty line in 2007
Smoke           Percentage of people over age 18 who smoked in 2007
Death           Age adjusted death rate per 100,000 people in 2006
Heart           Age adjusted death rate per 100,000 people from heart disease in 2006
Cancer          Age adjusted death rate per 100,000 people from cancer in 2006
Stroke          Age adjusted death rate per 100,000 people from cerebrovascular disease in 2006
AfAm            Percentage of people in the state in 2008 who were African-American
AsAm            Percentage of people in the state in 2008 who were Asian-American
Hispanic        Percentage of people in the state in 2008 of Hispanic or Latino origin

The variable Region came from the website FedStats. The other variables came from [U.S. Census Bureau, Statistical Abstract of the United States: 2010 (129th Edition) Washington, DC, 2009], which is available here.

Summarizing individual variables

In Lab 1, you learned how to summarize data graphically. In this lab, you will learn how to use Minitab to obtain numerical as well as graphical summaries of data. First, practice with the Poverty variable. Go to Stat --> Basic Statistics --> Display Descriptive Statistics. Choose the Poverty variable on the left and click "Select", then "OK". You should see, for example, that the mean poverty rate for the 50 states is 12.6 percent, while the standard deviation is 3.1 percent. Also, the lower and upper quartiles are 10.4 and 14.3 percent, respectively. Take a look at a histogram of the poverty rates (remember that you can go to Graph --> Histogram), and you should see that about half of the states' poverty rates fall between the lower and upper quartiles.
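
For readers who want to see what these summaries compute, here is a minimal Python sketch that mirrors the Display Descriptive Statistics output. The eight poverty rates are made up for illustration and are not the STATES data. The quartiles are placed at positions (n + 1)/4 and 3(n + 1)/4 in the ordered data; quartile conventions differ between software packages, so small differences from Minitab's output are possible.

```python
import statistics

# Hypothetical poverty rates (percent) for eight made-up states; not the STATES data.
poverty = [9.8, 10.4, 11.2, 12.0, 12.9, 14.3, 15.1, 18.6]

mean = statistics.mean(poverty)
stdev = statistics.stdev(poverty)    # sample standard deviation (divides by n - 1)
median = statistics.median(poverty)

# method="exclusive" puts the quartiles at positions (n + 1)/4, 2(n + 1)/4, 3(n + 1)/4
# in the sorted data, interpolating linearly between neighboring values.
q1, q2, q3 = statistics.quantiles(poverty, n=4, method="exclusive")
iqr = q3 - q1                        # interquartile range

print(f"mean={mean:.2f}  stdev={stdev:.2f}  median={median:.2f}")
print(f"Q1={q1:.2f}  Q3={q3:.2f}  IQR={iqr:.2f}")
```

By construction, about half of the values lie between Q1 and Q3, which is the same pattern the histogram of the real data should show.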

If you want to look at a different set of summary statistics than the default ones, click on "Statistics" before clicking "OK". Then you can check whichever statistics you want to see. For example, you could check the interquartile range if you don't want to figure it out by subtracting the lower quartile from the upper quartile, and you could uncheck things like SE of mean, which we haven't learned about yet, or the number of observations, which we know is 50. Now answer the following questions.
  1. Examine (and present) a histogram of the variable "AfAm". Remember you can change the number of bars by double clicking inside one of the bars and selecting the "Binning" tab. It is a good idea here to select the "Cutpoint" interval type to keep the lowest bar from extending below zero, as negative values for this variable do not make sense. Also make sure, as always, that your graph is appropriately labeled. Would you describe the distribution of the percentage of African-American residents in the 50 states as symmetric or skewed? Find the mean, median, lower quartile, and upper quartile. Is the mean greater than, less than, or about the same as the median? Is the lower quartile closer to the median, farther from the median, or about the same distance from the median as the upper quartile?

  2. Next examine (and present) a histogram of the variable "Smoke". Would you describe the distribution of the percentages of people in the 50 states who smoke as symmetric or skewed? Answer the same questions that you answered for the "AfAm" variable.

  3. What do you conclude about how skewness affects the mean and median? What do you conclude about how skewness affects the distance between the median and the lower and upper quartiles?

  4. Examine (and present) a histogram of the "AsAm" variable. Is the distribution symmetric or skewed? Are there any outliers? Find the mean, median, standard deviation, and interquartile range.

  5. What state has the largest percentage of Asian-Americans, and what percentage of people in that state are Asian-Americans? You can find this by scrolling down the data file. Remove this state from consideration, and calculate the mean, median, standard deviation, and interquartile range for the other 49 states. [There are several ways to remove a state, but probably the easiest is to note the value in the cell you want to remove, put the cursor over that cell in the data file, and hit the Delete key. When you are finished with this question, type the value back in. (Do not go to Edit --> Delete Cells, as this will alter the alignment of the data.) An alternative is to go to Data --> Subset Worksheet and type in a condition that excludes just the one state. This will give you a new worksheet with one state missing.] Which of these four summary statistics are strongly affected by the outlier, and which ones are not?
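
The "remove one state and recompute" step in question 5 can also be sketched outside Minitab. The Python example below is only an illustration under stated assumptions: the state labels and percentages are invented (they are not the STATES data), and `summarize` is a hypothetical helper name. It finds the largest value, builds a subset without it (the analogue of Data --> Subset Worksheet), and prints the four statistics side by side.

```python
import statistics

def summarize(values):
    """Return the four summary statistics asked for in this question."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="exclusive")
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),   # sample standard deviation
        "iqr": q3 - q1,
    }

# Hypothetical percentages keyed by made-up state labels; not the STATES data.
pct = {"A": 0.9, "B": 1.4, "C": 2.1, "D": 2.8, "E": 3.5, "F": 4.9, "G": 38.0}

# Identify the state with the largest percentage, then exclude it.
largest = max(pct, key=pct.get)
without_largest = [v for k, v in pct.items() if k != largest]

full = summarize(list(pct.values()))
reduced = summarize(without_largest)
for name in full:
    print(f"{name:>6}: all={full[name]:.2f}  without {largest}={reduced[name]:.2f}")
```

Building a separate subset leaves the original values untouched, so nothing has to be typed back in afterward.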

Comparing variables across regions

Boxplots are a useful graphical tool for comparing different regions of the country. Again, start with the "Poverty" variable. Recall that you can make a boxplot of the poverty rates by going to Graph --> Boxplot, clicking "OK", then selecting the "Poverty" variable and clicking "OK" again. To compare poverty rates in the four regions of the country, we want to put four boxplots side by side. To do this, go to Graph --> Boxplot again, but this time click on "With Groups" before clicking "OK". Select Poverty as the graph variable. Then click in the window labeled "categorical variables for grouping" and select the "Region" variable. Then click "OK", and you should get four side-by-side boxplots. You can also get separate summary statistics for the four regions of the country by going to Stat --> Basic Statistics --> Display Descriptive Statistics, choosing "Poverty", then clicking in the "By variables" box and selecting "Region". Now answer the following questions in a few sentences each, providing some plots to support your answers.
  1. What differences, if any, do you see in the levels of poverty in the four regions of the country?

  2. What differences, if any, do you see in the percentages of African-Americans, the percentages of Asian-Americans, and the percentages of people of Hispanic or Latino origin across the four regions of the country?

  3. Now consider the health-related variables. Are there differences in smoking rates in the four regions of the country? What about differences in the death rates from heart disease, cancer, and strokes?
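
The "By variables" option computes the same summaries separately within each group. As a rough Python sketch of that idea, the example below groups values by region and summarizes each group. The numbers are hypothetical and deliberately arbitrary; they are not the STATES data and say nothing about the real regional differences.

```python
import statistics
from collections import defaultdict

# Hypothetical (region, value) pairs; arbitrary numbers, not the STATES data.
rows = [
    ("West", 10.0), ("West", 12.0), ("West", 14.0),
    ("South", 11.0), ("South", 13.0), ("South", 12.0),
    ("Midwest", 9.0), ("Midwest", 15.0), ("Midwest", 12.0),
    ("Northeast", 12.0), ("Northeast", 10.0), ("Northeast", 14.0),
]

# Collect the values for each region, as the "By variables" box does.
by_region = defaultdict(list)
for region, value in rows:
    by_region[region].append(value)

# One (mean, median, standard deviation) triple per region.
summary = {
    region: (statistics.mean(v), statistics.median(v), statistics.stdev(v))
    for region, v in by_region.items()
}

for region, (mean, median, sd) in summary.items():
    print(f"{region:<10} mean={mean:.2f}  median={median:.2f}  sd={sd:.2f}")
```

In this made-up example the group means coincide while the spreads differ, which is the kind of contrast side-by-side boxplots make visible at a glance.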

Exploring relationships between variables

Scatterplots are the standard way of depicting relationships between two quantitative variables graphically. To make a scatterplot, go to Graph --> Scatterplot, click "OK", then select a Y-variable and an X-variable, then click "OK" again. Labels can be edited just as they can for histograms. You can also find the correlation between two variables by going to Stat --> Basic Statistics --> Correlation, then selecting two variables and clicking "OK". The "Pearson correlation" is the correlation that we have learned about. Now answer the following questions in a few sentences each, providing plots if necessary to support your answers.
  1. Investigate the relationship between poverty and smoking. From a scatterplot, do you see a strong relationship between the poverty rate and the percentage of people who smoke? What is the correlation between the poverty rate and the percentage of people who smoke?

  2. Now investigate how smoking is related to death rates from specific diseases. Describe the relationship between smoking and the age-adjusted death rate. What about the relationship between smoking and the death rates from heart disease, cancer, and strokes? Does smoking seem to have a stronger relationship with some causes of death than others?

  3. Are the relationships that you observed in response to the previous question sufficient to prove that smoking causes higher death rates?
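
To make the Pearson correlation concrete, here is a short Python sketch of the calculation. It is an illustration only: `pearson_r` is a hypothetical helper name, and the paired values are invented rather than taken from the STATES data.

```python
import math

def pearson_r(x, y):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of products and squares of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired values; not the STATES data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

r = pearson_r(x, y)
print(f"r = {r:.3f}")
```

The result always lies between -1 and 1, and it measures only the strength of the linear association between the two variables.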

Linear regression

Your next goal will be to use linear regression to predict the age-adjusted death rate from cancer based on the percentage of people who smoke. Go to Stat --> Regression --> Regression. Select "Cancer" as the response variable and "Smoke" as the predictor. Minitab gives you a lot of output, much of which we are not ready to use. For now, just focus on the regression equation and the R-squared value (labeled "R-sq") given near the beginning of the output. Ignore the "Adjusted R-squared", which is not relevant when there is just one explanatory variable.

Minitab's regression output does not give you a scatterplot automatically, but you can get a scatterplot with a regression line drawn in by going to Graph --> Scatterplot and clicking on "With Regression", then proceeding to make the scatterplot as before. Also, to get a residual plot along with your regression output, go to Stat --> Regression --> Regression and click "Graphs". Click in the box "Residuals versus the variables" and then select your predictor variable, which is "Smoke". This will give you a plot of the residuals against the X-variable as part of your regression output. (You could also check the box "Residuals versus fits" to get a plot of the residuals against the fitted values.) Now answer the following questions.
  1. Give the equation of your regression line.

  2. What percentage of the variation in death rates from cancer can be explained by differences in smoking rates across states?

  3. From your scatterplot and residual plot, does it appear that linear regression is appropriate for these data? Show the scatterplot and residual plot, and write a few sentences explaining your answer.

  4. What would the regression predict to be the age-adjusted death rate from cancer in California? How does this compare to the actual age-adjusted death rate from cancer in California?
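
As a reminder of how a prediction and its residual are computed from a fitted line, here is a tiny Python sketch. Every number in it is hypothetical and chosen only for easy arithmetic: the coefficients are not the ones Minitab will report, and the smoking and death-rate values are not California's.

```python
def predict(intercept, slope, x):
    """Predicted response from a fitted line: intercept + slope * x."""
    return intercept + slope * x

# Hypothetical coefficients and values, chosen only for illustration.
intercept, slope = 100.0, 4.0
smoke_pct = 15.0      # assumed smoking percentage for some state
observed = 155.0      # assumed actual death rate for that state

predicted = predict(intercept, slope, smoke_pct)
residual = observed - predicted   # residual = observed minus predicted

print(f"predicted={predicted:.1f}  observed={observed:.1f}  residual={residual:.1f}")
```

A negative residual means the actual value falls below the regression line; a positive residual means it falls above.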