We are going to talk about ridge regression and lasso regression today.
One way to control the complexity of a model is to penalize the magnitude of its coefficients. For example, in the linear regression problem \[ \min_{\beta \in \mathbb{R}^p} \sum_{i=1}^n (y_i - x_i^\top \beta)^2, \] we can control the magnitude of the coefficient vector \(\beta\). Of course, the magnitude of \(\beta\) can be measured in different ways, e.g. by the two-norm \(\|\beta\|_2\), the one-norm \(\|\beta\|_1\), or the infinity-norm \(\|\beta\|_{\infty}\). Ridge regression uses the squared two-norm penalty: \[ \min_{\beta \in \mathbb{R}^p} \sum_{i=1}^n (y_i - x_i^\top \beta)^2 + \lambda \|\beta\|_2^2, \] where \(\lambda \ge 0\) is a tuning parameter controlling the level of regularization. Let \(X\) denote the \(n \times p\) design matrix whose rows are the \(x_i^\top\), and let \(Y\) be the \(n \times 1\) vector of the \(y_i\)'s. Since \(X^\top X + \lambda I\) is invertible whenever \(\lambda > 0\), ridge regression has the explicit solution \[ \hat \beta_{ridge} = (X^\top X + \lambda I)^{-1}X^\top Y. \] Recall that the solution to ordinary least squares regression is (assuming \(X^\top X\) is invertible) \[ \hat \beta_{ols} = (X^\top X)^{-1}X^\top Y. \] Two facts: as \(\lambda \to 0\), \(\hat \beta_{ridge} \to \hat \beta_{ols}\); as \(\lambda \to \infty\), \(\hat \beta_{ridge} \to 0\).
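As a quick sanity check of the closed-form solution, here is a minimal sketch on simulated data (the variable names, the choice \(\lambda = 1\), and the no-intercept, unstandardized setup are all just for illustration):

lambda <- 1
X <- matrix(rnorm(100 * 5), nrow = 100, ncol = 5)   # simulated design matrix
beta_true <- c(2, -1, 0, 0, 1)
Y <- X %*% beta_true + rnorm(100)
beta_ridge <- solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% Y)
beta_ols <- solve(t(X) %*% X, t(X) %*% Y)
cbind(beta_ols, beta_ridge)   # the ridge coefficients are shrunk toward zero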
In the special case when \(X\) is orthonormal (i.e. the columns of \(X\) are orthonormal, so \(X^\top X = I\)), we have \[ \hat \beta_{ridge} = \frac{\hat \beta_{ols}}{1 + \lambda}. \] So the ridge estimator carries an extra shrinkage factor of \(1/(1 + \lambda)\); in particular, the ridge estimator is biased.
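A minimal sketch verifying this shrinkage factor on a randomly generated orthonormal design (qr.Q is used only to manufacture orthonormal columns; the names and \(\lambda = 2\) are arbitrary):

X <- qr.Q(qr(matrix(rnorm(100 * 3), nrow = 100)))   # 100 x 3 matrix with orthonormal columns
Y <- rnorm(100)
lambda <- 2
beta_ols <- solve(t(X) %*% X, t(X) %*% Y)
beta_ridge <- solve(t(X) %*% X + lambda * diag(3), t(X) %*% Y)
all.equal(as.vector(beta_ridge), as.vector(beta_ols / (1 + lambda)))   # TRUE, up to numerical error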
First, let's look at the MASS package for ridge regression.
library(MASS)
# Generating the data
set.seed(20170315)
N <- 100 # Sample size
x1 <- runif(n=N)
x2 <- runif(n=N)
x3 <- runif(n=N)
x3c <- 10*x1 + x3 # new variable, highly correlated with x1
ep <- rnorm(n=N)
y <- x1 + x2 + ep # the true model depends only on x1 and x2
data <- cbind.data.frame(x1,x2,x3,x3c,y)
# OLS fit of 3-variable model using independent x3
ols <- lm(y~ x1 + x2 + x3, data = data)
summary(ols)
Call:
lm(formula = y ~ x1 + x2 + x3, data = data)
Residuals:
Min 1Q Median 3Q Max
-3.4405 -0.5798 -0.0383 0.6994 2.7645
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.05327 0.32500 0.164 0.87014
x1 0.96184 0.37850 2.541 0.01265 *
x2 1.00767 0.36829 2.736 0.00741 **
x3 -0.16842 0.35969 -0.468 0.64067
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.045 on 96 degrees of freedom
Multiple R-squared: 0.1439, Adjusted R-squared: 0.1172
F-statistic: 5.38 on 3 and 96 DF, p-value: 0.001822
# OLS fit of 3-variable model using correlated x3.
olsc <- lm(y~ x1 + x2 + x3c, data = data)
summary(olsc)
Call:
lm(formula = y ~ x1 + x2 + x3c, data = data)
Residuals:
Min 1Q Median 3Q Max
-3.4405 -0.5798 -0.0383 0.6994 2.7645
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.05327 0.32500 0.164 0.87014
x1 2.64608 3.63300 0.728 0.46817
x2 1.00767 0.36829 2.736 0.00741 **
x3c -0.16842 0.35969 -0.468 0.64067
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.045 on 96 degrees of freedom
Multiple R-squared: 0.1439, Adjusted R-squared: 0.1172
F-statistic: 5.38 on 3 and 96 DF, p-value: 0.001822
# Ridge regression using independent variables
ridge <- lm.ridge(y ~ x1 + x2 + x3, data = data, lambda = seq(0, 50, .001))
plot(ridge)
select(ridge)
modified HKB estimator is 6.935947
modified L-W estimator is 6.196042
smallest value of GCV at 27.498
# Ridge regression using correlated variables
ridgec <- lm.ridge(y ~ x1 + x2 + x3c, data = data, lambda = seq(0, 50, .001))
plot(ridgec)
select(ridgec)
modified HKB estimator is 1.281794
modified L-W estimator is 6.196042
smallest value of GCV at 22.527
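select() reports three suggested values of \(\lambda\): the modified HKB and L-W plug-in estimates, and the \(\lambda\) minimizing generalized cross-validation (GCV). To inspect the fit at the GCV-selected \(\lambda\), one possibility is to index into the coefficient path (a sketch, relying on coef() returning one row per \(\lambda\) when a \(\lambda\) sequence is supplied; the name best is just for illustration):

best <- which.min(ridgec$GCV)   # index of the lambda minimizing GCV
ridgec$lambda[best]             # the selected lambda (about 22.5 here)
coef(ridgec)[best, ]            # ridge coefficients at that lambda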
A better package is glmnet.
#install.packages("ISLR")
library(ISLR)
Hitters=na.omit(Hitters) # remove rows with missing values
library(glmnet)
x=model.matrix(Salary~.-1,data=Hitters) # predictor matrix; factors are expanded to dummy variables
y=Hitters$Salary
The ridge regression model is fitted by calling the glmnet function with \(\alpha = 0\). The plot below shows the coefficient paths as a function of \(\log(\lambda)\).
fit.ridge=glmnet(x,y,alpha=0)
plot(fit.ridge,xvar="lambda",label=TRUE)
We will do cross-validation to select the best \(\lambda\).
cv.ridge=cv.glmnet(x,y,alpha=0)
plot(cv.ridge)
The left vertical line marks the \(\lambda\) with the minimum cross-validated MSE. The right vertical line marks the largest \(\lambda\) whose error is within one standard error of the minimum; it gives a slightly more restricted model that does almost as well as the minimum. We will go with the second one to guard against overfitting.
At the top of the plot you can see how many nonzero coefficients are in the model at each \(\lambda\). In this case all 20 predictors are included at every \(\lambda\) and no coefficient is exactly zero: ridge regression shrinks coefficients toward zero but does not set them to zero.
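To pull out these two values of \(\lambda\) and the corresponding ridge coefficients (the same pattern works for the lasso below):

cv.ridge$lambda.min              # lambda with the smallest cross-validated error
cv.ridge$lambda.1se              # largest lambda within one standard error of the minimum
coef(cv.ridge, s = "lambda.1se") # ridge coefficients at the one-standard-error choice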
Instead of \(L_2\) regularization, the LASSO uses an \(L_1\) penalty, that is, \[ \min_{\beta \in \mathbb{R}^p} \sum_{i=1}^n (y_i - x_i^\top \beta)^2 + \lambda \|\beta\|_1. \] Because of the geometry of the \(L_1\) norm, the LASSO tends to give sparser solutions than ridge regression. This is particularly useful in high-dimensional settings where the true model depends on only a few of the predictors.
To fit the lasso model, you can specify \(\alpha = 1\) to the fitting function.
fit.lasso=glmnet(x,y,alpha=1)
plot(fit.lasso,xvar="lambda",label=TRUE)
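To see the sparsity difference concretely, we can compare the number of nonzero coefficients along the two paths; glmnet stores this count in the df component of the fitted object:

range(fit.ridge$df)   # ridge keeps all 20 predictors at every lambda (nothing is exactly zero)
range(fit.lasso$df)   # lasso moves from 0 predictors at large lambda toward the full model at small lambda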
cv.lasso=cv.glmnet(x,y)
plot(cv.lasso)
cv.lasso$lambda.min
[1] 2.220313
cv.lasso$lambda.1se
[1] 69.40069
We can extract the coefficients (by default, coef on a cv.glmnet object uses lambda.1se).
coef(cv.lasso)
21 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 127.95694754
AtBat .
Hits 1.42342566
HmRun .
Runs .
RBI .
Walks 1.58214111
Years .
CAtBat .
CHits .
CHmRun .
CRuns 0.16027975
CRBI 0.33667715
CWalks .
LeagueA .
LeagueN .
DivisionW -8.06171262
PutOuts 0.08393604
Assists .
Errors .
NewLeagueN .
n <- nrow(x)
train <- sample(1:n, size = floor(0.6*n), replace = FALSE) # use 60% of the data for training
lasso.tr=glmnet(x[train,],y[train], alpha = 1)
lasso.tr
Call: glmnet(x = x[train, ], y = y[train], alpha = 1)
Df %Dev Lambda
[1,] 0 0.00000 236.3000
[2,] 1 0.05351 215.3000
[3,] 3 0.10900 196.2000
[4,] 3 0.16950 178.8000
[5,] 3 0.21970 162.9000
[6,] 5 0.26190 148.4000
[7,] 5 0.29750 135.2000
[8,] 5 0.32710 123.2000
[9,] 5 0.35160 112.3000
[10,] 6 0.37230 102.3000
[11,] 6 0.38990 93.2200
[12,] 7 0.40550 84.9400
[13,] 7 0.42130 77.3900
[14,] 8 0.43450 70.5200
[15,] 8 0.44580 64.2500
[16,] 8 0.45510 58.5400
[17,] 8 0.46280 53.3400
[18,] 9 0.47030 48.6000
[19,] 10 0.47690 44.2900
[20,] 10 0.48300 40.3500
[21,] 10 0.48800 36.7700
[22,] 10 0.49210 33.5000
[23,] 9 0.49530 30.5200
[24,] 9 0.49800 27.8100
[25,] 9 0.50020 25.3400
[26,] 11 0.50250 23.0900
[27,] 11 0.50490 21.0400
[28,] 11 0.50690 19.1700
[29,] 11 0.50850 17.4700
[30,] 11 0.50990 15.9200
[31,] 11 0.51110 14.5000
[32,] 11 0.51200 13.2100
[33,] 11 0.51280 12.0400
[34,] 11 0.51340 10.9700
[35,] 11 0.51400 9.9950
[36,] 11 0.51440 9.1070
[37,] 11 0.51480 8.2980
[38,] 11 0.51510 7.5610
[39,] 11 0.51540 6.8890
[40,] 11 0.51760 6.2770
[41,] 10 0.52060 5.7200
[42,] 11 0.52340 5.2120
[43,] 11 0.52590 4.7490
[44,] 11 0.52800 4.3270
[45,] 12 0.52980 3.9420
[46,] 12 0.53120 3.5920
[47,] 13 0.53300 3.2730
[48,] 13 0.53490 2.9820
[49,] 13 0.53650 2.7170
[50,] 13 0.53780 2.4760
[51,] 13 0.53890 2.2560
[52,] 13 0.53990 2.0560
[53,] 13 0.54060 1.8730
[54,] 14 0.54130 1.7070
[55,] 14 0.54180 1.5550
[56,] 14 0.54220 1.4170
[57,] 15 0.54260 1.2910
[58,] 15 0.54300 1.1760
[59,] 15 0.54340 1.0720
[60,] 16 0.54370 0.9766
[61,] 16 0.54390 0.8898
[62,] 16 0.54410 0.8108
[63,] 16 0.54430 0.7387
[64,] 16 0.54450 0.6731
[65,] 16 0.54480 0.6133
[66,] 16 0.54490 0.5588
[67,] 16 0.54510 0.5092
[68,] 16 0.54520 0.4639
[69,] 16 0.54530 0.4227
[70,] 17 0.54550 0.3852
[71,] 18 0.54560 0.3510
[72,] 18 0.54580 0.3198
[73,] 18 0.54600 0.2914
[74,] 18 0.54620 0.2655
[75,] 18 0.54620 0.2419
[76,] 18 0.54640 0.2204
[77,] 18 0.54650 0.2008
[78,] 18 0.54660 0.1830
[79,] 18 0.54660 0.1667
[80,] 18 0.54670 0.1519
[81,] 18 0.54670 0.1384
[82,] 18 0.54680 0.1261
[83,] 18 0.54690 0.1149
[84,] 19 0.54690 0.1047
pred=predict(lasso.tr,x[-train,])
# calculate the validation RMSE for each lambda on the path
rmse=sqrt(apply((y[-train]-pred)^2,2,mean))
plot(log(lasso.tr$lambda),rmse,type="b",xlab="Log(lambda)")
lam.best=lasso.tr$lambda[order(rmse)[1]]
lam.best
[1] 0.1047125
coef(lasso.tr,s=lam.best)
21 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 7.264584e+01
AtBat -1.995523e+00
Hits 7.852366e+00
HmRun 7.903955e+00
Runs -1.108838e+00
RBI -1.183504e+00
Walks 4.858045e+00
Years 4.486518e+00
CAtBat 1.095725e-06
CHits -2.783669e-01
CHmRun 2.405498e-01
CRuns 1.141320e+00
CRBI 2.478752e-01
CWalks -5.165415e-01
LeagueA -5.968549e+01
LeagueN .
DivisionW -9.931479e+01
PutOuts 2.733559e-01
Assists 9.949583e-02
Errors 3.160942e+00
NewLeagueN -5.628500e+00
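If we wanted to use this \(\lambda\) for a final model, a natural last step is to refit the lasso on the full data set and read off the coefficients at lam.best (a sketch; fit.final is just an illustrative name):

fit.final=glmnet(x,y,alpha=1)
coef(fit.final,s=lam.best)   # lasso coefficients, fit on all the data, at the validation-selected lambda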