Lab 2: Comparing the 50 states (Summary statistics and regression)

In this lab, you will learn how to use MINITAB to obtain numerical summaries of data. You will also learn how to compare data across groups and to explore relationships between variables.

The Data

To load the data set STATES, click here. We have data on the 50 states, with an emphasis on economic, educational, and demographic variables. The data set contains 50 rows (one for each state) and the following 13 columns:

Variable Name   Description
State           The name of the state
Region          The region of the country (West, South, Midwest, or Northeast) in which the state is located
Over75          Percentage of the state's residents age 75 or older
Income          The state's personal per capita income in 2006
Unemployment    The state's unemployment rate in 2005
Poverty         The percentage of individuals in the state below the poverty line in 2005
TeacherPay      Average salary of public school teachers in the state in 2005
S/T Ratio       The number of public school students in the state for each public school teacher in 2005
PSSpend         Public expenditures on K-12 education per student in the state in 2005
Math            Scale score for the state in 4th grade Math in 2005, from the National Assessment of Educational Progress
AfAm            Percentage of people in the state in 2006 who were African-American
AsAm            Percentage of people in the state in 2006 who were Asian-American
Hispanic        Percentage of people in the state in 2006 of Hispanic or Latino origin

The variable Region came from the website FedStats at http://www.fedstats.gov/.
The variables S/T Ratio and PSSpend came from the National Center for Educational Statistics and can be found here.
All remaining variables came from [U.S. Census Bureau, Statistical Abstract of the United States: 2008 (127th Edition) Washington, DC, 2007], which is available here.

Summarizing individual variables

In Lab 1, you learned how to summarize data graphically. In this lab, you will learn how to use MINITAB to obtain numerical as well as graphical summaries of data. First, practice with the Unemployment variable. Go to Stat --> Basic Statistics --> Display Descriptive Statistics. Choose the Unemployment variable on the left and click "Select", then "OK". You should see, for example, that the mean unemployment rate for the 50 states is 4.884 percent, while the standard deviation is 1.059 percent. Also, the lower and upper quartiles are 4.0 and 5.4 percent, respectively. Take a look at a histogram of the unemployment rates (remember you can go to Graph --> Histogram), and you should see that about half of the states' unemployment rates fall within this range.
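For readers who want to see what these summary statistics are computing, here is a minimal sketch in Python rather than MINITAB. The eight unemployment rates are made up for illustration and are not taken from the STATES data set. Software packages use slightly different conventions for quartiles, so the quartiles from this sketch may not match MINITAB's exactly on the same data.

```python
import statistics

# Hypothetical unemployment rates (percent) for eight made-up states.
# These are NOT values from the STATES data set.
rates = [3.2, 4.0, 4.4, 4.8, 5.0, 5.4, 6.1, 7.1]

mean = statistics.mean(rates)
sd = statistics.stdev(rates)        # sample standard deviation (divides by n - 1)
median = statistics.median(rates)

# quantiles(..., n=4) returns the three cut points: Q1, median, Q3.
q1, _, q3 = statistics.quantiles(rates, n=4)
iqr = q3 - q1                       # interquartile range = Q3 - Q1

print(f"mean = {mean:.3f}, sd = {sd:.3f}, median = {median:.2f}")
print(f"Q1 = {q1:.3f}, Q3 = {q3:.3f}, IQR = {iqr:.3f}")
```

Roughly half of the observations lie between Q1 and Q3, which is what the histogram check described above is meant to confirm.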

If you want to look at a different set of summary statistics than the default ones, click on "Statistics" before clicking "OK". Then you can check whichever statistics you want to see. For example, you could check interquartile range if you don't want to figure it out by subtracting the lower quartile from the upper quartile, and you could uncheck things like SE of mean, which we haven't learned about yet, or the number of observations, which we know is 50. Now answer the following questions.
  1. Examine (and present) a histogram of the variable "AfAm". Remember you can change the number of bars by double clicking inside one of the bars and selecting the "Binning" tab. It is a good idea here to select the "Cutpoint" interval type to keep the lowest bar from extending below zero, as negative values for this variable do not make sense. Also make sure, as always, that your graph is appropriately labeled. Would you describe the distribution of the percentage of African-American residents in the 50 states as symmetric or skewed? Find the mean, median, lower quartile, and upper quartile. Is the mean greater than, less than, or about the same as the median? Is the lower quartile closer to the median, farther from the median, or about the same distance from the median as the upper quartile?

  2. Next, examine a histogram of the variable "Over75". Would you describe the distribution of the percentages of people over age 75 in the 50 states as symmetric or skewed? Answer the same questions that you answered for the "AfAm" variable.

  3. What do you conclude about how skewness affects the mean and median? What do you conclude about how skewness affects the distance between the median and the lower and upper quartiles?

  4. Examine a histogram of the "AsAm" variable. Is the distribution symmetric or skewed? Are there any outliers? Find the mean, median, standard deviation, and interquartile range.

  5. What state has the largest percentage of Asian-Americans, and what percentage of people in that state are Asian-Americans? You can find this by scrolling down the data file. Remove this state from consideration, and calculate the mean, median, standard deviation, and interquartile range for the other 49 states. [There are several ways to remove a state, but probably the easiest is to put the cursor over the cell in the data file that you want to remove and hit the delete key, then remember the value and put it back when you are finished with this question. (Do not go to Edit --> Delete Cells, as this will alter the alignment of the data.) An alternative is to go to Data --> Subset Worksheet and type in a condition that excludes just the one state. This will give you a new worksheet with one state missing.] Which of these four summary statistics are strongly affected by the outlier?
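
The "exclude one row, then recompute" step in question 5 can also be sketched outside MINITAB. The following Python sketch uses made-up state labels and percentages (none of them come from the STATES data set), and the helper name `summarize` is hypothetical. Like Data --> Subset Worksheet, it builds a new list without the largest value and leaves the original data untouched.

```python
import statistics

def summarize(values):
    """Return the mean, median, standard deviation, and IQR of a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
        "iqr": q3 - q1,
    }

# Hypothetical (state label, percentage) pairs; labels and values are made up.
rows = [("A", 1.0), ("B", 1.5), ("C", 2.0), ("D", 2.5),
        ("E", 3.0), ("F", 3.5), ("G", 4.0), ("H", 30.0)]

# Find the row with the largest percentage (the analogue of scrolling down the worksheet).
top_name, top_value = max(rows, key=lambda row: row[1])

# Build a subset that excludes only that row; `rows` itself is not modified.
all_values = [value for _, value in rows]
rest = [value for name, value in rows if name != top_name]

with_top = summarize(all_values)
without_top = summarize(rest)

print(f"largest: {top_name} ({top_value})")
for stat in with_top:
    print(f"{stat:>6}: all rows = {with_top[stat]:.3f}, without {top_name} = {without_top[stat]:.3f}")
```

Comparing the two columns of output side by side is the same comparison the question asks you to make with the MINITAB output.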

Comparing variables across regions

Boxplots are a useful graphical tool for comparing different regions of the country. Again, start with the unemployment variable. Recall that you can make a boxplot of the unemployment rates by going to Graph --> Boxplot, clicking "OK", then selecting the "Unemployment" variable and clicking "OK" again. To compare unemployment in the four regions of the country, we want to put four boxplots side by side. To do this, go to Graph --> Boxplot again, but this time click on "With Groups" before clicking "OK". Select Unemployment as the graph variable. Then click in the window labeled "categorical variables for grouping" and select the "Region" variable. Then click "OK", and you should get four side-by-side boxplots. You can also get separate summary statistics for the four regions of the country by going to Stat --> Basic Statistics --> Display Descriptive Statistics, choosing "Unemployment", then clicking in the "By variables" box and selecting "Region". Now answer the following questions in a few sentences each, providing some plots to support your answers.
  1. What differences, if any, do you see between the four regions of the country in the economic variables (unemployment, income, and poverty)?

  2. What differences, if any, do you see between the four regions of the country in the performance of fourth-graders in Math?

  3. What differences, if any, do you see between the percentages of African-Americans, the percentages of Asian-Americans, and the percentages of people of Hispanic or Latino origin across the four regions of the country?
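
The "By variables" summary used in this section amounts to splitting the observations by region and summarizing each group separately. Here is a minimal Python sketch of that idea. The region labels match the Region variable, but the rates are made up for illustration and do not come from the STATES data set.

```python
import statistics
from collections import defaultdict

# Hypothetical (region, unemployment rate) pairs; the rates are made up.
rows = [
    ("West", 5.0), ("West", 6.0),
    ("South", 4.0), ("South", 5.5), ("South", 6.5),
    ("Midwest", 4.5), ("Midwest", 5.5),
    ("Northeast", 4.0), ("Northeast", 5.0),
]

# Collect the rates for each region into its own list.
by_region = defaultdict(list)
for region, rate in rows:
    by_region[region].append(rate)

# Summarize each group separately, as the "By variables" box does.
summary = {
    region: {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }
    for region, values in by_region.items()
}

for region, stats in sorted(summary.items()):
    print(f"{region:<10} n = {stats['n']}, mean = {stats['mean']:.2f}, median = {stats['median']:.2f}")
```

Side-by-side boxplots show the same per-group comparison graphically, with each box drawn from one region's list of values.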

Exploring relationships between variables

Scatterplots are the standard way of depicting relationships between two quantitative variables graphically. To make a scatterplot, go to Graph --> Scatterplot, click "OK", then select a Y-variable and an X-variable, then click "OK" again. Labels can be edited just as they can for histograms. You can also find the correlation between two variables by going to Stat --> Basic Statistics --> Correlation, then selecting two variables and clicking "OK". The "Pearson correlation" is the correlation that we have learned about. Now answer the following questions in a few sentences each, providing plots if necessary to support your answers.
  1. From a scatterplot, do you see a strong relationship between the performance of fourth graders in Math and the amount of educational spending per student? What is the correlation between these two variables?

  2. What kind of relationship, if any, do you see between the performance of fourth graders in Math and teachers' salaries? Is there a relationship between the performance of fourth graders in Math and the ratio of students to teachers?

  3. What kind of relationship, if any, do you see between the performance of fourth graders in Math and economic variables such as per capita income, unemployment, and poverty?
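
As a reminder of what the Pearson correlation measures, here is a small Python sketch that computes it directly from its definition. The function name `pearson_r` and both lists of numbers are hypothetical; the numbers are not from the STATES data set.

```python
import math
import statistics

def pearson_r(xs, ys):
    """Pearson correlation: the sum of cross-deviations divided by
    the square root of the product of the sums of squared deviations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: spending per student (thousands of dollars) and a test score.
spending = [6.0, 7.0, 8.0, 9.0, 10.0]
scores = [236.0, 240.0, 237.0, 243.0, 241.0]

r = pearson_r(spending, scores)
print(f"r = {r:.3f}")
```

The result is always between -1 and 1. Values near 0 indicate little linear association, and the sign gives the direction of the trend seen in the scatterplot.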

Linear regression

Your next goal will be to predict fourth-grade math performance from the poverty rate, using linear regression. Go to Stat --> Regression --> Regression. Select "Math" as the response variable and "Poverty" as the predictor. MINITAB gives you a lot of output, much of which we are not ready to use. For now, just focus on the regression equation and the R-squared value (labeled "R-sq") given near the beginning of the output.
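
To make the regression equation and R-squared less of a black box, here is a minimal Python sketch of a least-squares fit with one predictor. The function name `fit_line` and the data are hypothetical (the numbers are not from the STATES data set), and the sketch is not a description of how MINITAB performs the calculation internally.

```python
import statistics

def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x.
    Returns (intercept, slope, r_squared, residuals)."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual = observed value minus fitted value.
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    ss_res = sum(e ** 2 for e in residuals)        # unexplained variation
    ss_tot = sum((y - mean_y) ** 2 for y in ys)    # total variation
    r_squared = 1 - ss_res / ss_tot
    return intercept, slope, r_squared, residuals

# Hypothetical data: poverty rate (percent) and a math scale score.
poverty = [8.0, 10.0, 12.0, 14.0, 16.0]
math_scores = [244.0, 241.0, 239.0, 236.0, 232.0]

intercept, slope, r_squared, residuals = fit_line(poverty, math_scores)
print(f"Math = {intercept:.2f} + ({slope:.3f}) * Poverty")
print(f"R-sq = {100 * r_squared:.1f}%")

# Predicted score for a hypothetical state with an 11 percent poverty rate.
predicted = intercept + slope * 11.0
print(f"predicted score at Poverty = 11: {predicted:.2f}")
```

R-squared is the fraction of the variation in the response that the fitted line accounts for. The residuals returned here are the quantities that a residual plot displays against the predictor.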

MINITAB's regression output does not give you a scatterplot automatically, but you can get a scatterplot with a regression line drawn in by going to Graph --> Scatterplot and clicking on "With Regression", then proceeding to make the scatterplot as before. Also, to get a residual plot along with your regression output, go to Stat --> Regression --> Regression and click "Graphs". Click in the box "Residuals versus the variables" and then select your predictor variable, which is "Poverty". This will give you a plot of the residuals against the X-variable as part of your regression output. (You could also check the box "residuals vs fits" to get a plot of the residuals against the fitted values.) Now answer the following questions.
  1. Give the equation of your regression line.

  2. What percentage of the variation in fourth-grade Math performance among the 50 states can be explained by differences in the poverty rate among the 50 states?

  3. From your scatterplot and residual plot, does it appear that linear regression is appropriate for these data? Make sure to show the plots.

  4. What would the regression predict to be the scale score for fourth grade Math in California? How does this prediction compare to the actual score?

  5. Do your results prove that poverty causes poor mathematical performance among fourth graders?

Remember that if you discussed this assignment with anyone other than your instructor or TA, then you should add a section called "Acknowledgments" at the end of your report, indicating from whom you received help.