LASSO regression is a great tool to have in your data science arsenal if you’re working with big data. It’s computationally efficient and performs variable selection and regression simultaneously. How powerful is that?! In this article, we’ll talk about when you’d want to use this tool instead of multiple linear regression.
1. You want a sparse model.
The first case for using LASSO over multiple linear regression is when you want a sparse model. In practice, a sparse model can take a number of forms. The most ‘classic’ case is that you have a large set of variables, but only a small number of them are truly important. But this isn’t always the case. For example, all of your variables might be important, but within a local region, only a few are necessary. As an example, let’s say you have hyperspectral imaging data where you have wavelengths that are highly correlated with each other. Since they’re so highly correlated, you really only need to select one variable within that highly correlated set. Basically, this one variable acts as a delegate for all the variables it’s highly correlated with. Finally, it could also be that all your variables are important, but only a small number of variables explain the majority of the variation.
LASSO regression works well for sparse models since it’s built around the “bet on sparsity” principle. Essentially, this principle suggests that the “truth” must be sparse if we want to efficiently estimate our parameters.
“Use a procedure that does well in sparse problems, since no procedure does well in dense problems.” Hastie, Tibshirani, & Friedman (2015)
However, some people believe that the ‘truth’ is inherently dense and our models should account for this. It’s definitely a topic for philosophical debate!
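To make the “classic” sparse case concrete, here’s a minimal sketch in Python. Everything in it is an assumption for illustration: the simulated data, the penalty value lam = 0.3, and the helper names soft_threshold and lasso_cd, which are a small hand-rolled coordinate descent solver rather than any particular library’s implementation. It fits ordinary least squares and LASSO to data where only 3 of 20 variables truly matter and counts how many coefficients each method leaves nonzero.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; returns exactly 0 when |z| <= t."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n   # (1/n) * x_j' x_j for each column
    resid = y.copy()                    # residual y - X b, with b = 0 at the start
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of column j with the residual, adding back column j's own effect
            rho = X[:, j] @ resid / n + col_sq[j] * b[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            resid += X[:, j] * (b[j] - new_bj)  # keep the residual in sync with b
            b[j] = new_bj
    return b

# Simulated data (assumed for illustration): 20 predictors, only the first 3 matter
rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.normal(size=(n, p))
true_beta = np.zeros(p)
true_beta[:3] = [3.0, -2.0, 1.5]
y = X @ true_beta + 0.5 * rng.normal(size=n)

b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
b_lasso = lasso_cd(X, y, lam=0.3)       # lam is an arbitrary choice for this sketch

print("Nonzero OLS coefficients:  ", np.count_nonzero(b_ols))
print("Nonzero LASSO coefficients:", np.count_nonzero(b_lasso))
print("Variables kept by LASSO:   ", np.flatnonzero(b_lasso))
```

Least squares gives every variable a nonzero coefficient, while the soft-thresholding step in LASSO sets small ones to exactly zero. In practice you’d pick the penalty by cross-validation rather than by hand.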
2. n << p
When the number of predictor variables (p) is much larger than the number of observations (n), you’ll want to choose LASSO regression over multiple linear regression. This is known as the large p, small n problem, and it’s very typical of genomic data, where each individual has tens of thousands of genes. That means, just to get n equal to p, you’d have to collect tens of thousands of samples. This doesn’t generally happen because it’s expensive and takes a lot of work, so you’re often left with a large p, small n problem.
What’s wrong with n < p? Essentially, if the true model isn’t sparse, we don’t have enough observations to estimate our parameters accurately. If n < p, least squares breaks down: infinitely many coefficient vectors fit the data equally well, so we won’t get unique estimates (Hastie, Tibshirani, & Wainwright, 2015). Now, if we assume sparsity, that is, that only a small subset of variables is important, we can use LASSO to shrink many of the coefficients to zero, leaving only the important ones in the large p, small n scenario.
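Here’s a small sketch of why least squares can’t give unique estimates when n < p. The dimensions (5 observations, 8 predictors) and the random data are assumptions chosen only to keep the example tiny. The idea is that X has a null space when n < p, so you can add any null-space direction to one solution and get a different coefficient vector with exactly the same fitted values.

```python
import numpy as np

# Simulated data (assumed for illustration): fewer observations than predictors
rng = np.random.default_rng(0)
n, p = 5, 8
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# np.linalg.lstsq returns one particular solution (the minimum-norm one)
beta_a, *_ = np.linalg.lstsq(X, y, rcond=None)

# With full_matrices=True (the default), Vt is p x p. The rows of Vt past the
# rank of X span its null space, so X @ null_vec is (numerically) zero.
_, _, Vt = np.linalg.svd(X)
rank = np.linalg.matrix_rank(X)
null_vec = Vt[-1]

# Adding a null-space direction changes the coefficients but not the fit
beta_b = beta_a + 10.0 * null_vec

print("Rank of X:", rank, "with p =", p)
print("Same fitted values:", np.allclose(X @ beta_a, X @ beta_b))
print("Same coefficients: ", np.allclose(beta_a, beta_b))
```

Both coefficient vectors reproduce y perfectly, so the data alone can’t tell them apart. That’s the gap the sparsity assumption fills.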
3. You have some multicollinearity.
Finally, LASSO regression is useful when you have some multicollinearity in your model. Multicollinearity means that the predictor variables, also known as independent variables, aren’t so independent. With multiple linear regression, this can cause your coefficient estimates to vary dramatically and throw off the interpretability of your model. Luckily, because of LASSO’s built-in variable selection, it can handle some multicollinearity without sacrificing interpretability. If the collinearity is too high, however, LASSO’s variable selection performance will start to suffer: when predictors are highly correlated or collinear, it tends to select only one of them. You’ll know your collinearity is too high if you get a different set of predictors each time you run LASSO, for example on different resamples of your data. If you do find that your data has a lot of multicollinearity, try using an elastic net. It’s a hybrid of ridge regression and LASSO regression that works well when multicollinearity is high. Alternatively, you can hack it by simply running LASSO multiple times, keeping track of which predictors are selected in each run.
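To see how multicollinearity makes multiple linear regression coefficients “vary dramatically,” here’s a rough sketch. The simulated data, the bootstrap setup, and the helper name bootstrap_coef_sd are all assumptions for illustration. It refits least squares on bootstrap resamples and compares how much the coefficients move when the second predictor is independent of the first versus nearly a copy of it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

def bootstrap_coef_sd(X, y, n_boot=200):
    """Standard deviation of least-squares coefficients across bootstrap resamples."""
    coefs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), size=len(y))  # resample rows with replacement
        beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        coefs.append(beta)
    return np.std(np.array(coefs), axis=0)

# Simulated predictors (assumed for illustration)
x1 = rng.normal(size=n)
x2_indep = rng.normal(size=n)                    # unrelated to x1
x2_collinear = x1 + 0.05 * rng.normal(size=n)    # nearly a copy of x1
noise = rng.normal(size=n)

true_beta = np.array([1.0, 1.0])
X_indep = np.column_stack([x1, x2_indep])
X_coll = np.column_stack([x1, x2_collinear])
y_indep = X_indep @ true_beta + noise
y_coll = X_coll @ true_beta + noise

sd_indep = bootstrap_coef_sd(X_indep, y_indep)
sd_coll = bootstrap_coef_sd(X_coll, y_coll)

print("Coefficient SD, independent predictors:", sd_indep)
print("Coefficient SD, collinear predictors:  ", sd_coll)
```

The same resampling loop is the skeleton of the “run LASSO multiple times” hack: swap the least-squares fit for a LASSO fit and count how often each predictor ends up with a nonzero coefficient.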
Multiple linear regression is a great tool for modeling a wide range of data, but it does have its limitations. Fortunately, LASSO regression is an excellent alternative for handling sparse models and big data. I hope you get a chance to try out some LASSO-ing yourself! Happy modeling!
Hastie, T.J., Tibshirani, R.J., and Friedman, J.H. (2001). The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, New York.
Hastie, T.J., Tibshirani, R.J., and Wainwright, M. (2015). Statistical Learning with Sparsity. CRC Press, Boca Raton, FL.