Breiman’s Quiet Scandal: Stepwise Logistic Regression and RELR
Daniel M. Rice
Rice Analytics, St. Louis MO
August 9, 2011
Copyright © 2009-2011 Rice Analytics. All Rights Reserved. The introduction to a previous version of this article, with a link to the full article, was originally published in the analytics industry newsletter KDnuggets.com on August 27, 2009 (issue 09:n16). This article has since been updated to present new results.
Introduction
Leo Breiman, one of the most influential statisticians of recent memory, referred to the model selection problem that is apparent in stepwise logistic regression as the “quiet scandal” of statistics (Breiman, 1992). One problem is that arbitrary criteria are used to arrive at the stepwise model, such as an arbitrary cutoff involving the statistical significance of a variable’s regression coefficients. Additionally, there is no attempt to model and reduce error in regression coefficients, so regression coefficients and their statistical significance can be quite unreliable across independent samples unless the sample size is very large. With arbitrary and unreliable selection criteria, entirely different variable sets will be selected by different modelers and by different samples. Also, the processing time in stepwise logistic regression makes it infeasible to model interactions and any large number of variables. Hence, as hinted in Breiman’s famous chiding remark, stepwise logistic regression is notorious for giving arbitrary and unreliable models that may completely miss important variables. Unfortunately, there has been no better alternative that overcomes these problems and still gives a parsimonious model. Thus, most businesses still use stepwise logistic regression to model probability or risk in applications such as credit scoring, insurance risk, pharmaceutical treatment outcomes, consumer attitudes, marketing response, and customer satisfaction where there is a desire to have a transparent model with few variables.
Recent evidence suggests that Reduced Error Logistic Regression (RELR) represents a much better alternative. RELR is a very general regression modeling algorithm that is useful in all conventional ways that logistic regression is used, but may also be used for predictive applications traditionally performed by Survival Analysis and Least Squares Regression, such as Survival Time prediction and Forecasting. RELR models and reduces error as part of the maximum likelihood solution, so its regression coefficients are very stable across independent samples. Also, there are no arbitrary criteria involved in the Parsed RELR variable selection that returns the parsimonious solution that is the super maximum likelihood solution across variable sets, so different modelers will generate the identical model given the identical training data. Because RELR's parsimonious variable selection is the super maximum likelihood solution, it is readily interpretable as the most probable solution. Additionally, RELR allows the modeling of interactions and a very large number of variables. For these reasons, RELR is much less susceptible to the reliability and interpretive validity problems surrounding stepwise logistic regression.
This may be especially important in the United States in the increasingly regulated financial, insurance, health, pharmaceutical and automobile industries. In these industries, logistic regression models of probability and risk ultimately determine the nature of the product or service offered and who may purchase it. The large failure of probability and risk modeling in many of these same industries is now viewed as at least a contributing factor to the financial risk modeling problem that resulted in the 2008-2009 recession. Hence, arbitrary and unreliable methods like stepwise logistic regression will now be even more difficult to defend. Thus, business managers and statisticians will need to consider any better alternative, and RELR is such an alternative.
The earliest research prior to 2009 suggested that RELR and stepwise logistic regression may have comparable classification accuracy in easier problems that have very large sample sizes and relatively few input variables without important interaction or nonlinear effects. Yet, the standard implementation of the RELR algorithm was changed slightly in 2009, so that intercepts are computed directly. Since that time, RELR can outperform in classification accuracy in "tall problems" with relatively few variables in relation to a large sample size. For example, RELR's parsimonious variable selection algorithm called Parsed RELR has now been observed to show better classification accuracy in such "tall problems" not only in comparison to stepwise, but also in comparison to other newer regression algorithms such as LARS, LASSO and Random Forests Logistic Regression (Ball, 2011). Yet, RELR's biggest accuracy advantage will always be most apparent in validation sample measures of model error and classification accuracy in more difficult high dimensional "wide problems" involving large numbers of input variables, especially when important interaction effects and/or nonlinear effects are present.
Intuitively, one would think that the more information one has, the better would be the prediction. For example, if I know 100 different things about a group of people like their state, city, county, religion, brand preferences etc., then I should be able to get a better prediction of their vote than if I only knew the state in which they resided. Even if only a few of these 100 variables were important in the end, it should be better to have put all 100 variables in the model, so we can at least select the most important variables from this pool.
Unfortunately, statistics is not this intuitive, as predictive models can get much worse as you add more variables. This problem is especially apparent in datasets with small sample sizes, large numbers of independent variables, correlated independent variables, nonlinear variables, and highly unbalanced target variables. With too many variables and too small a sample, building a reliable predictive model is difficult because the separate effects of correlated predictor variables blur together. This “multicollinearity error” is directly a function of too many correlated variables. In general, the likelihood of correlated independent variables rises with the number of variables, so multicollinearity is almost always a problem with high dimensional data. With multicollinearity, the model can have severe overfitting problems because the obscured variable importance measurement forces too many variables into the model even after variable selection. Breiman (1992) suggested that the reason for this was that it is rare that all truly important variables are measured in data that go into regression models. Hence, the regression tends to be biased and to overfit the selected variables beyond their true contribution, with the added cost of sometimes having regression coefficients with the wrong sign. The end result is that the predictive model’s “out of sample” validation performance can be quite poor.
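The instability described above can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions: it uses ordinary least squares with two predictors rather than logistic regression (to keep the arithmetic short), and the function names, sample size, correlation values, and true coefficients are all hypothetical choices made for illustration. It draws many independent small samples and records how much the first coefficient varies, and how often it comes out with the wrong sign, when the two predictors are uncorrelated versus highly correlated.

```python
import random
import statistics


def fit_two_predictors(x1, x2, y):
    """Least squares coefficients (no intercept) from the 2x2 normal equations."""
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


def simulate(rho, n=30, n_samples=500, seed=0):
    """Return (spread of b1 across samples, share of samples where b1 < 0).

    Assumed setup: both true coefficients are +0.5, noise has unit variance,
    and rho is the population correlation between the two predictors.
    """
    rng = random.Random(seed)
    b1_estimates = []
    for _ in range(n_samples):
        x1 = [rng.gauss(0, 1) for _ in range(n)]
        x2 = [rho * a + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1) for a in x1]
        y = [0.5 * a + 0.5 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]
        b1, _ = fit_two_predictors(x1, x2, y)
        b1_estimates.append(b1)
    spread = statistics.stdev(b1_estimates)
    wrong_sign = sum(b < 0 for b in b1_estimates) / n_samples
    return spread, wrong_sign


for rho in (0.0, 0.95):
    spread, wrong_sign = simulate(rho)
    print(f"rho={rho}: sd of b1 = {spread:.2f}, wrong-sign share = {wrong_sign:.1%}")
```

Under these assumptions, the highly correlated case should show a visibly larger spread in the estimated coefficient and more frequent sign reversals, even though the true coefficient is the same positive value in both cases.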
We and our users have presented a number of public results over the past few years that show that RELR appears to avoid multicollinearity problems (Rice, 2006; Rice, 2007; Rice, 2008; Rice, 2009; Pruitt, 2009; Ball, 2011). Taken together, these results show that RELR’s regression coefficients do not exhibit inflated magnitudes and do not have the wrong signs, and that RELR performs very well with high dimensional data and correlated variables. Because we do not need to worry about multicollinearity problems, we can build highly accurate predictive models rapidly based upon tens of thousands and even potentially millions of variables. RELR can handle this number of variables rapidly because it knows the most important variables in a model prior to running the model, so RELR builds models based upon a shortlist of the most important variables with no loss in accuracy. RELR ultimately selects the very small number of most important and meaningful variables in a final production-level explanatory model using the Parsed RELR variable selection method, as Parsed RELR models often may have fewer than 10 variables. We will review RELR at a very high executive level in this article. A technical article (Rice, 2008) is also available that goes into detail about the RELR algorithm.
Multicollinearity is Breiman's Quiet Scandal Monster
Multicollinearity has been the 1000 pound Monster in statistical modeling. Problems related to multicollinearity error are seen in all predictive modeling approaches and not just in logistic regression. The overfitting error that is associated with multicollinearity can be very costly in business and science applications. Yet, taming this monster has proven to be one of the great challenges of statistical modeling research.
Variable selection or reduction such as stepwise selection has been the most common approach to avoid multicollinearity, but optimal variable reduction and selection requires accurate assessment of relative variable importance. Unfortunately, this assessment of variable importance is itself corrupted by multicollinearity. Hence, multicollinearity makes optimal variable reduction and selection very difficult in standard regression modeling; this is the problem with stepwise logistic regression.
One may average correlated variables together as in principal component factor analysis to decrease the effects of multicollinearity, but the averaged factors are usually difficult to interpret and the accuracy of the predictive model is often compromised. Somewhat related to the smoothing or averaging of variables that we see in principal component analysis are “regularization” approaches such as Ridge penalized and LASSO logistic regression. In fact, principal component analysis, along with other forms of factor analysis, has been suggested to be just a form of regularization (Ramsey, 2005). In Ridge penalized and LASSO logistic regression, the maximum likelihood solution is computed using a penalty term that shrinks the magnitude of the regression coefficients and thus avoids the large magnitude regression coefficients seen with multicollinearity. These methods have significant problems though. One problem is that the solutions are often difficult to interpret because the regularization is arbitrary. Because of this, the reliability of regression coefficients may be quite poor across independent samples of observations, as we have shown with Ridge penalized logistic regression (Rice, 2008). Another problem is that one needs to observe the validation sample in order to optimize smoothing, or else the Ridge or LASSO regularization is likely to be inaccurate. Even when one observes the validation sample, the measure of accuracy that determines the regularization can have a marked effect on the nature of the model: whether one uses the ROC AUC, average squared error, or classification accuracy will determine the form of the regularized model. Which measure should one use to validate the regularization, and should one use Ridge or LASSO regularization or some combination, such as the new Elastic Net algorithm? Nobody ever knows because, unlike RELR, these regularization methods are all entirely arbitrary and have no basis in probability theory.
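For readers who want to see where the tuning choice enters, the penalized solutions discussed above can be written in their usual textbook form. The notation here is an assumption introduced for illustration and does not come from the cited articles: \ell(\beta) denotes the logistic log-likelihood, p the number of variables, and \lambda, \lambda_1, \lambda_2 \ge 0 the penalty weights that the modeler must choose, typically by checking a validation sample.

```latex
\hat{\beta}_{\text{Ridge}} = \arg\max_{\beta}\Big[\ell(\beta) - \lambda \sum_{j=1}^{p} \beta_j^{2}\Big]
\qquad
\hat{\beta}_{\text{LASSO}} = \arg\max_{\beta}\Big[\ell(\beta) - \lambda \sum_{j=1}^{p} \lvert\beta_j\rvert\Big]

\hat{\beta}_{\text{Elastic Net}} = \arg\max_{\beta}\Big[\ell(\beta) - \lambda_1 \sum_{j=1}^{p} \lvert\beta_j\rvert - \lambda_2 \sum_{j=1}^{p} \beta_j^{2}\Big]
```

The likelihood itself does not determine the penalty weights or the form of the penalty; both are supplied from outside, which is the arbitrariness at issue in this section.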
Another effective traditional approach to deal with the arbitrary and unreliable variable selection in stepwise logistic regression is to employ model averaging such as Bayesian Model Averaging or other ensemble modeling approaches. This method simply averages together all of the different models that were produced by different modelers, by different selection criteria in stepwise logistic regression, or by completely different algorithms. The end result is an average model that overcomes the reliability problems in stepwise logistic regression. The validation sample accuracy often seems to be much better than that of stepwise logistic regression, as overfitting is avoided. However, the problem with model averaging is that the model is no longer parsimonious, but instead can involve a much larger number of variables than any individual stepwise model, so there are major interpretive difficulties just to understand how these models work. These “ensemble” approaches are often quite accurate, as evidenced by their success in the Netflix and Jeopardy competitions, but they are effectively black box solutions.
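The loss of parsimony under model averaging can be seen in a toy sketch. Everything in it is hypothetical: the three "stepwise" models, their variable names, and their coefficients are invented for illustration, and the models are combined with equal weights, whereas Bayesian Model Averaging would weight each model by its posterior probability.

```python
import math


def logistic(z):
    """Map a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Three hypothetical stepwise models; each maps a variable name to a coefficient.
models = [
    {"intercept": -1.0, "income": 0.8, "age": 0.3},
    {"intercept": -0.8, "income": 0.6, "tenure": 0.5},
    {"intercept": -1.2, "age": 0.4, "region": 0.7},
]


def predict(model, row):
    """Predicted probability from one logistic model for one observation."""
    score = model["intercept"] + sum(
        coef * row[name] for name, coef in model.items() if name != "intercept"
    )
    return logistic(score)


def ensemble_predict(models, row):
    """Equal-weight average of the individual models' predicted probabilities."""
    return sum(predict(m, row) for m in models) / len(models)


def variables_used(models):
    """Every variable needed to score the averaged model."""
    return sorted({name for m in models for name in m if name != "intercept"})


row = {"income": 1.0, "age": 0.5, "tenure": 2.0, "region": 0.0}
print(ensemble_predict(models, row))
print(variables_used(models))
```

Each individual model in this sketch uses two variables, but scoring the averaged model requires all four, and no single set of coefficients explains its prediction.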
A final traditional approach to avoid multicollinearity error is simply to increase the sample size. This will always work. With a large enough sample size, we can guarantee that we will have greatly diminished problems with multicollinearity error. This is a very expensive solution to multicollinearity, so it is rarely if ever a viable solution. Yet, it is a very important clue to how we might fix the problem.
Given this connection to small sample sizes, it would seem reasonable to suspect that multicollinearity error is a function of the higher margin of error of correlated predictor variables in smaller sample sizes. Our research does indeed support this view. More importantly, this research suggests that appropriate constraints can be embedded into the computation of predictive models to reduce modeling error such as the sampling error associated with regression coefficients, along with classification error. Hence, reliable and valid predictive models can be built based upon either relatively small sample sizes and/or a relatively large number of potentially correlated predictor variables, although RELR's accuracy advantage can still be observed in large sample sizes and with relatively few variables.
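The connection between sample size, predictor correlation, and coefficient error can be made concrete with a standard textbook result. It is stated here for ordinary least squares, not for logistic regression or for RELR, and the notation is an assumption introduced for illustration: \sigma^2 is the error variance, n the sample size, s_j^2 the sample variance of predictor j, and R_j^2 the squared multiple correlation of predictor j with the other predictors.

```latex
\operatorname{Var}\big(\hat{\beta}_j\big) = \frac{\sigma^{2}}{(n-1)\, s_j^{2}} \cdot \frac{1}{1 - R_j^{2}}
```

The second factor is the variance inflation factor, which grows without bound as a predictor becomes more correlated with the others, while the first factor shrinks as n grows. In this simplified setting, a larger sample can offset any fixed degree of correlation short of perfect collinearity, which is consistent with the observation that increasing the sample size always helps.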