Data

Today we are going to study linear regression using one of the available datasets in R. The dataset cars is taken from R package datasets. The dataframe is very straightforward. One column gives the speed of cars, and the other column reports the distances taken for the cars to stop.

library(datasets)
cars

There are 50 observations. Let’s take a look at the scatter plot of the data. Notice that the simple command plot will plot the dataframe automatically for you. It also assumes that the second variable is assigned to the y-axis and the first variable is assigned to the x-axis. In this case, it happens that the distance needed for a car to stop is our variable of interest. Having dist on the y-axis as the response seems to be a good choice.

plot(cars)

Clearly, there is a positive correlation between dist and speed. Intuitively, this also makes sense, as the faster one drives, the longer the distances required to bring the car to a full stop.

Linear Regression

Here we are going to consider a least squares fit of the data. Please keep in mind that least squares is not the only option for regression. Least squares solution minimizes the sum of squared distances from the data points to the linear fit, whereas there is also the option to minimzes the sum of absolute distances for example.

Rather than hand-computing the least squares solution, we will utilize one of the most popular functions in R, lm() function. The main argument of the function is the first argument formula. It should be provided in the form y ~ x, where y is the response and x is the explanatory variable.

fit <- lm(formula=dist~speed, data=cars)
fit

Call:
lm(formula = dist ~ speed, data = cars)

Coefficients:
(Intercept)        speed  
    -17.579        3.932  

As we see from the output of the linear model fit, the coefficients, intercept and slope, are both calculated already. Let’s take a look at the scatter plot again with the least squares fit.

plot(cars)
abline(fit, col="red")

There are some obvious issues that one might see from this fit. For example, the intercept is negative. The logical intercept is at zero, since a car with speed zero should require zero distance to stop. (I believe the speed and distance here follow the common understanding people share in daily conversations, not the rigorous ones presented in a physics class.) However, data is not perfect, there are always errors that come into play.

Diagnostics

Three characteristics have been mentioned in the lecture slides, linearity, residual normality, and constant variability. We first check linearity. For linearity, we usually check for the scatter plot and the residual plot. As we have seen the scatter plot for a couple of times already, we will take a look at the residual plot.

plot(fit$residuals)
abline(0, 0, col="red")

As linearity has been checked, we move on to residual normality. Let’s take a look at the histogram and the Q-Q plots of the residuals.

hist(fit$residuals)

qqnorm(fit$residuals)
qqline(fit$residuals, col="red")

The Q-Q plot does indicate that the data points with large deviations away from the theoretical quantile line may be outliers. But this may also be due to the small sample size we have, as the dataset only consists of 50 observations.

Last but not the least, we will check for the constant variability. Usually, this is verified by examining the residual plot. A fan shape in the residual plot indicates heteroscedacity (unequal variance). This is not the case with the current data we have.

plot(fit$residuals)
abline(0, 0, col="red")

In addition, one may want to identify outliers within the dataset. For instance, we had doubt about the data points with fit residuals largely deviating from the theoretical Q-Q line. Let’s figure out which data points these residuals correspond to.

res.rank <- sort(fit$residuals)
suspect <- which(fit$residuals %in% res.rank[47:50])

Let’s see how different the linear fit is, if we remove these points. The red line is the original fit, whereas the blue line is the fit with the two points removed.

plot(cars)
abline(fit, col="red")
fit.out <- lm(formula=dist[-suspect]~speed[-suspect], data=cars)
abline(fit.out, col="blue")

We do notice that the blue line has been tilted downward from the red line, and we do have a slightly better intercept. One may consider the two points as influential points, as they influence the slope and can be understood as outliers in response.

There are more rigorous procedures, such as various testing methods, to detect outliers in both response and the exploratory variables. However, they are out the scope of this class. We might talk about them later on.

More Details on Linear Regression

One of the most important measure in evaluating the linear fit of the data is through \(R^2\), which is the squared correlation coefficient. It is formally defined as \[R^2 = \frac{SSE_{reg}}{SSE_{total}}, \] where \(SSE_{reg} = \sum_{i=1}^n (\hat y_i - y_i)^2\) and \(SSE_{total} = \sum_{i=1}^n (y_i - \bar y)^2\), \(\hat y_i\) being the fitted values using the fitted linear model and \(\bar y\) being the mean of the response. This is understood as the fraction of variability explained by the regression model.

Let’s try to calculate this using the linear model we fitted earlier.

R.square <- sum((fit$fitted.values-mean(cars$dist))^2) / (sum((cars$dist - mean(cars$dist))^2))
R.square
[1] 0.6510794

As this number is used so often, one does not have to calculate this every time by hand. \(R^2\) information can be found using the summary() function for lm() fits.

summary(fit)

Call:
lm(formula = dist ~ speed, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-29.069  -9.525  -2.272   9.215  43.201 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -17.5791     6.7584  -2.601   0.0123 *  
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared:  0.6511,    Adjusted R-squared:  0.6438 
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Usually, a larger \(R^2\) value means a larger proportion of the variation has been explained by the regression model, which seems very promising. However, a not of warning, \(R^2\) must be compared on the same level of model complexity. For example, one can fit the response in a linear model perfectly, if one can have arbitrary number of explanatory variables. In such cases, \(R^2\) value will be at its maximum possible value, which is 1. However, such a fitted model is not helpful, as it will not be able to predict unseen data well.

The summary() function actually provides much more information than merely \(R^2\). In fact, it contains the testing information under the coefficients.

Denote the intercept coefficient as \(\beta_0\) and the speed coefficient as \(\beta_1\). The two lines regarding testing in the summary refer to the following two tests. The first line tests \(H_0: \beta_0 = 0\) against \(H_1: \beta_0 \neq 0\). The second line tests \(H_0: \beta_1 = 0\) against \(H_1: \beta_1 \neq 0\).

If we further denote the estimates as \(\hat \beta_0\) and \(\hat \beta_1\), and the standard errors as \(\sigma_0\) and \(\sigma_1\). The test statistics respectively are \[\frac{\hat \beta_i - 0}{\sigma_i} \sim T_{n - 2},\] where \(i = 0 \text{ or } 1\).

---
title: "Math 189/289 Lab 7"
author: "Jiaqi Guo"
output:
  html_notebook: default
  pdf_document: default
---

# Data
Today we are going to study linear regression using one of the available datasets in R. The dataset *cars* is taken from R package *datasets*. The dataframe is very straightforward. One column gives the speed of cars, and the other column reports the distances taken for the cars to stop.
```{r}
library(datasets)
cars
```

There are 50 observations. Let's take a look at the scatter plot of the data. Notice that the simple command plot will plot the dataframe automatically for you. It also assumes that the second variable is assigned to the y-axis and the first variable is assigned to the x-axis. In this case, it happens that the distance needed for a car to stop is our variable of interest. Having *dist* on the y-axis as the response seems to be a good choice.
```{r}
plot(cars)
```

Clearly, there is a positive correlation between *dist* and *speed*. Intuitively, this also makes sense, as the faster one drives, the longer the distances required to bring the car to a full stop.

# Linear Regression
Here we are going to consider a least squares fit of the data. Please keep in mind that least squares is not the only option for regression. Least squares solution minimizes the sum of squared distances from the data points to the linear fit, whereas there is also the option to minimzes the sum of absolute distances for example.

Rather than hand-computing the least squares solution, we will utilize one of the most popular functions in R, *lm()* function. The main argument of the function is the first argument *formula*. It should be provided in the form y ~ x, where y is the response and x is the explanatory variable.
```{r}
fit <- lm(formula=dist~speed, data=cars)
fit
```

As we see from the output of the linear model fit, the coefficients, intercept and slope, are both calculated already. Let's take a look at the scatter plot again with the least squares fit.
```{r}
plot(cars)
abline(fit, col="red")
```

There are some obvious issues that one might see from this fit. For example, the intercept is negative. The logical intercept is at zero, since a car with speed zero should require zero distance to stop. (I believe the speed and distance here follow the common understanding people share in daily conversations, not the rigorous ones presented in a physics class.) However, data is not perfect, there are always errors that come into play. 

# Diagnostics
Three characteristics have been mentioned in the lecture slides, linearity, residual normality, and constant variability. We first check linearity. For linearity, we usually check for the scatter plot and the residual plot. As we have seen the scatter plot for a couple of times already, we will take a look at the residual plot.
```{r}
plot(fit$residuals)
abline(0, 0, col="red")
```

As linearity has been checked, we move on to residual normality. Let's take a look at the histogram and the Q-Q plots of the residuals.
```{r}
hist(fit$residuals)
qqnorm(fit$residuals)
qqline(fit$residuals, col="red")
```
The Q-Q plot does indicate that the data points with large deviations away from the theoretical quantile line may be outliers. But this may also be due to the small sample size we have, as the dataset only consists of 50 observations.

Last but not the least, we will check for the constant variability. Usually, this is verified by examining the residual plot. A fan shape in the residual plot indicates heteroscedacity (unequal variance). This is not the case with the current data we have.
```{r}
plot(fit$residuals)
abline(0, 0, col="red")
```

In addition, one may want to identify outliers within the dataset. For instance, we had doubt about the data points with fit residuals largely deviating from the theoretical Q-Q line. Let's figure out which data points these residuals correspond to.
```{r}
res.rank <- sort(fit$residuals)
suspect <- which(fit$residuals %in% res.rank[47:50])
```

Let's see how different the linear fit is, if we remove these points. The red line is the original fit, whereas the blue line is the fit with the two points removed.
```{r}
plot(cars)
abline(fit, col="red")
fit.out <- lm(formula=dist[-suspect]~speed[-suspect], data=cars)
abline(fit.out, col="blue")
```

We do notice that the blue line has been tilted downward from the red line, and we do have a slightly better intercept. One may consider the two points as influential points, as they influence the slope and can be understood as outliers in response. 

There are more rigorous procedures, such as various testing methods, to detect outliers in both response and the exploratory variables. However, they are out the scope of this class. We might talk about them later on.

# More Details on Linear Regression
One of the most important measure in evaluating the linear fit of the data is through $R^2$, which is the squared correlation coefficient. It is formally defined as 
$$R^2 = \frac{SSE_{reg}}{SSE_{total}}, $$
where $SSE_{reg} = \sum_{i=1}^n (\hat y_i - y_i)^2$ and $SSE_{total} = \sum_{i=1}^n (y_i - \bar y)^2$, $\hat y_i$ being the fitted values using the fitted linear model and $\bar y$ being the mean of the response. This is understood as the fraction of variability explained by the regression model.

Let's try to calculate this using the linear model we fitted earlier. 
```{r}
R.square <- sum((fit$fitted.values-mean(cars$dist))^2) / (sum((cars$dist - mean(cars$dist))^2))
R.square
```

As this number is used so often, one does not have to calculate this every time by hand. $R^2$ information can be found using the *summary()* function for *lm()* fits.
```{r}
summary(fit)
```

Usually, a larger $R^2$ value means a larger proportion of the variation has been explained by the regression model, which seems very promising. However, a not of warning, $R^2$ must be compared on the same level of model complexity. For example, one can fit the response in a linear model perfectly, if one can have arbitrary number of explanatory variables. In such cases, $R^2$ value will be at its maximum possible value, which is 1. However, such a fitted model is not helpful, as it will not be able to predict unseen data well.

The *summary()* function actually provides much more information than merely $R^2$. In fact, it contains the testing information under the coefficients.

Denote the intercept coefficient as $\beta_0$ and the speed coefficient as $\beta_1$. The two lines regarding testing in the summary refer to the following two tests. The first line tests $H_0: \beta_0 = 0$ against $H_1: \beta_0 \neq 0$. The second line tests $H_0: \beta_1 = 0$ against $H_1: \beta_1 \neq 0$.

If we further denote the estimates as $\hat \beta_0$ and $\hat \beta_1$, and the standard errors as $\sigma_0$ and $\sigma_1$. The test statistics respectively are 
$$\frac{\hat \beta_i - 0}{\sigma_i} \sim T_{n - 2},$$
where $i = 0 \text{ or } 1$.
