Life is usually simple when you know only one or two techniques. One of the training institutes I know of tells its students: if the outcome is continuous, apply linear regression; if it is binary, use logistic regression! However, the more options we have at our disposal, the harder it becomes to choose the right one. The same thing happens with regression models.
Among the many types of regression models, it is important to choose the technique best suited to the types of independent and dependent variables, the dimensionality of the data and other essential characteristics of the data. Below are the key factors to consider when selecting the right regression model:
- Data exploration is an essential part of building a predictive model. It should be your first step before selecting a model: use it to identify the relationships between variables and their impact.
- To compare the goodness of fit of different models, we can analyse metrics such as the statistical significance of parameters, R-squared, adjusted R-squared, AIC, BIC and the error term. Another one is the Mallows’ Cp criterion (see below). It essentially checks for possible bias in your model by comparing it with all possible submodels (or a careful selection of them).
- Cross-validation is one of the best ways to evaluate models used for prediction. In its simplest form, you divide your data set into two groups (train and validate). The mean squared difference between the observed and predicted values on the validation group gives you a measure of prediction accuracy.
- If your data set has multiple confounding variables, you should not choose an automatic model selection method, because you do not want to put these variables in a model at the same time.
- It will also depend on your objective. Sometimes a less powerful model is preferable because it is easier to implement than a highly statistically significant one.
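The train/validate comparison from the list above can be sketched in a few lines. This is a minimal illustration, not a prescribed workflow: the data are synthetic, the 150/50 split is arbitrary, and the three candidate models (polynomials of degree 0, 1 and 5) are assumptions chosen only to show the mechanics. NumPy is assumed to be available.

```python
import numpy as np

# Synthetic data (an assumption for illustration): a truly linear signal plus noise.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=200)
y = 3.0 + 2.0 * x + rng.normal(scale=1.0, size=200)

# Shuffle the row indices and split them into a training and a validation group.
idx = rng.permutation(200)
train, valid = idx[:150], idx[150:]

def validation_mse(degree):
    """Fit a polynomial of the given degree on the training rows and
    return the mean squared error on the validation rows."""
    coefs = np.polyfit(x[train], y[train], deg=degree)
    pred = np.polyval(coefs, x[valid])
    return float(np.mean((y[valid] - pred) ** 2))

# Degree 0 is a mean-only baseline, degree 1 is simple linear regression,
# degree 5 is a deliberately flexible model.
mse_by_degree = {d: validation_mse(d) for d in (0, 1, 5)}

for degree, mse in mse_by_degree.items():
    print(f"degree {degree}: validation MSE = {mse:.3f}")
```

The model with the lowest validation MSE is the one that predicts unseen rows best in this particular split. Repeating the split several times (k-fold cross-validation) gives a more stable estimate than a single split.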
Regression regularization methods (Lasso, Ridge and ElasticNet) work well in cases of high dimensionality and multicollinearity among the variables in the data set.
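To see why regularization helps with multicollinearity, here is a small sketch using the closed-form ridge solution on two nearly identical predictors. It covers Ridge only, since Lasso and ElasticNet have no closed form and need an iterative solver. The data, the helper name `ridge_fit` and the penalty value `lam=1.0` are all assumptions made for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients on centered data.
    lam=0 reduces to ordinary least squares; the intercept is not penalized
    because centering removes it from the problem."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    k = X.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(k), Xc.T @ yc)

# Synthetic data (an assumption): x2 is almost an exact copy of x1.
rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)
X = np.column_stack([x1, x2])
y = 1.0 + 3.0 * x1 + rng.normal(scale=1.0, size=n)

beta_ols = ridge_fit(X, y, lam=0.0)    # unpenalized fit
beta_ridge = ridge_fit(X, y, lam=1.0)  # penalized fit

print("OLS coefficients:  ", beta_ols)
print("Ridge coefficients:", beta_ridge)
```

With highly correlated predictors, the unpenalized coefficients tend to be large and unstable, because the data cannot tell the two columns apart. The ridge penalty shrinks the coefficient vector and spreads the effect across the correlated columns.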
What is Mallows’ Cp?
Use Mallows’ Cp to help you choose between multiple regression models. It helps you strike an important balance with the number of predictors in the model. Mallows’ Cp compares the precision and bias of the full model to models with a subset of the predictors.
Usually, you should look for models where Mallows’ Cp is small and close to the number of predictors in the model plus the constant (p). A small Mallows’ Cp value indicates that the model is relatively precise (has small variance) in estimating the true regression coefficients and predicting future responses. A Mallows’ Cp value that is close to the number of predictors plus the constant indicates that the model is relatively unbiased in estimating the true regression coefficients and predicting future responses. Models with lack-of-fit and bias have values of Mallows’ Cp larger than p.
Using Mallows’ Cp to compare regression models is valid only when you start with the same complete set of predictors.
If any predictor is highly correlated with another predictor, Mallows’ Cp is not displayed in the output.
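As a rough sketch of how the criterion is computed, the commonly used formula is Cp = SSE_p / MSE_full - n + 2p, where SSE_p is the residual sum of squares of the submodel, MSE_full is the mean squared error of the full model, n is the number of observations and p is the number of predictors in the submodel plus the constant. The code below applies it to every subset of three predictors. The data are synthetic and the helper names `sse` and `mallows_cp` are hypothetical, not taken from any statistics package.

```python
import itertools
import numpy as np

def sse(X, y):
    """Residual sum of squares of a least-squares fit of y on X
    (X must already contain the intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def mallows_cp(X_full, y, cols):
    """Mallows' Cp for the submodel that uses predictor columns `cols`
    of X_full; an intercept is always included."""
    n, k = X_full.shape
    intercept = np.ones((n, 1))
    full = np.hstack([intercept, X_full])
    mse_full = sse(full, y) / (n - (k + 1))      # error variance estimate from the full model
    sub = np.hstack([intercept, X_full[:, list(cols)]])
    p = sub.shape[1]                             # predictors in the submodel plus the constant
    return sse(sub, y) / mse_full - n + 2 * p

# Synthetic data (an assumption): the third predictor has no real effect.
rng = np.random.default_rng(0)
n = 60
X = rng.normal(size=(n, 3))
y = 2.0 + 1.5 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

results = {}
for size in range(1, 4):
    for cols in itertools.combinations(range(3), size):
        results[cols] = mallows_cp(X, y, cols)

for cols, cp in results.items():
    print(f"predictors {cols}: p = {len(cols) + 1}, Cp = {cp:.2f}")
```

By construction, the full model always has Cp equal to its own p, which is why the comparison is only meaningful between submodels drawn from the same complete set of predictors.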
Example of using Mallows’ Cp to evaluate a model
Suppose you work for a potato chip company that examines the factors that affect the percentage of crumbled potato chips per container. You include the percentage of potato relative to other ingredients, cooling rate, and cooking temperature as predictors in the regression model.
| Step | %Potato | Cooling rate | Cooking temp | Mallows’ Cp |
The results indicate that the model with the two terms “%Potato” and “Cooling rate” is relatively precise and unbiased because its Mallows’ Cp (2.9) is closest to the number of predictors plus the constant (3). You should examine Mallows’ Cp in conjunction with other statistics included in the results, such as R², adjusted R², and S.