Breiman’s Quiet Scandal: Stepwise Logistic Regression and RELR
Daniel M. Rice
Rice Analytics, St. Louis MO
July 8, 2010
Copyright © 2009-2010 Rice Analytics. All Rights Reserved. The introduction to a previous version of this article with a link to the full article was originally published in the analytics industry newsletter KDnuggets.com on August 27, 2009 (issue 09:n16).
Introduction
Leo Breiman, one of the most influential statisticians of recent memory, referred to the model selection problem that is apparent in stepwise logistic regression as the “quiet scandal” of statistics (Breiman, 1992). One problem is that arbitrary criteria are used to arrive at the stepwise model, such as an arbitrary cutoff involving the statistical significance of a variable’s regression coefficients. Additionally, there is no attempt to model and reduce error in regression coefficients, so regression coefficients and their statistical significance can be quite unreliable across independent samples unless the sample size is very large. With arbitrary and unreliable selection criteria, entirely different variable sets will be selected by different modelers and by different samples. Also, the processing time in stepwise logistic regression makes it infeasible to model interactions and any large number of variables. Hence, as hinted in Breiman’s famous chiding remark, stepwise logistic regression is notorious for giving arbitrary and unreliable models that may completely miss important variables. Unfortunately, there has been no better alternative that overcomes these problems and still gives a parsimonious model. Thus, most businesses still use stepwise logistic regression to model probability or risk in applications such as credit scoring, insurance risk, pharmaceutical treatment outcomes, consumer attitudes, marketing response, and customer satisfaction where there is a desire to have a transparent model with few variables.
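The dependence of a stepwise model on its entry criterion can be illustrated with a small sketch. It rests on several assumptions that are not part of the original study: the data are simulated, ordinary least squares stands in for logistic regression to keep the example short, and the function name `forward_stepwise` and the two cutoffs are hypothetical choices made only for illustration.

```python
import numpy as np

def forward_stepwise(X, y, t_cutoff):
    """Greedy forward selection (illustrative stand-in, least squares only).

    At each step, add the candidate whose coefficient has the largest
    absolute t-statistic, and stop when that statistic falls below t_cutoff.
    """
    n, p = X.shape
    selected = []
    while len(selected) < p:
        best_j, best_t = None, 0.0
        for j in range(p):
            if j in selected:
                continue
            cols = selected + [j]
            design = np.column_stack([np.ones(n), X[:, cols]])
            beta, *_ = np.linalg.lstsq(design, y, rcond=None)
            resid = y - design @ beta
            dof = n - design.shape[1]
            sigma2 = resid @ resid / dof
            cov = sigma2 * np.linalg.inv(design.T @ design)
            # t-statistic of the candidate, which is the last column
            t = abs(beta[-1]) / np.sqrt(cov[-1, -1])
            if t > best_t:
                best_j, best_t = j, t
        if best_j is None or best_t < t_cutoff:
            break
        selected.append(best_j)
    return selected

# Simulated data (an assumption): 8 correlated predictors, 3 with real effects.
rng = np.random.default_rng(0)
n, p = 100, 8
shared = rng.normal(size=(n, 1))
X = 0.7 * shared + rng.normal(size=(n, p))
y = X[:, 0] + 0.5 * X[:, 1] + 0.25 * X[:, 2] + rng.normal(size=n)

# Cutoffs roughly matching two-sided 10% and 1% significance levels.
lenient = forward_stepwise(X, y, t_cutoff=1.64)
strict = forward_stepwise(X, y, t_cutoff=2.58)
print("lenient cutoff selects:", lenient)
print("strict cutoff selects: ", strict)
```

In this sketch the stricter cutoff can only stop earlier along the same greedy path, so the size of the final model is set by whichever cutoff the modeler happens to choose. Rerunning with a different seed changes the sample and may change the selected variables as well.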
Recent evidence suggests that Reduced Error Logistic Regression (RELR) represents a much better alternative. RELR is a very general regression modeling algorithm that is useful in all conventional ways that logistic regression is used, but may also be used for predictive applications traditionally performed by Survival Analysis and Least Squares Regression, such as Survival Time prediction and Forecasting. RELR models and reduces error as part of the maximum likelihood solution, so its regression coefficients are very stable across independent samples. Also, there are no arbitrary criteria involved in the Parsed RELR variable selection that returns the parsimonious solution that is the super maximum likelihood solution across variable sets, so different modelers will generate the identical model given the identical training data. Because RELR's parsimonious variable selection is the super maximum likelihood solution, it is readily interpretable as the most probable solution. Additionally, RELR allows the modeling of interactions and a very large number of variables. For these reasons, RELR is much less susceptible to the reliability and interpretive validity problems surrounding stepwise logistic regression.
This may be especially important in the United States in the increasingly regulated financial, insurance, health, pharmaceutical and automobile industries. In these industries, logistic regression models of probability and risk ultimately determine the nature of the product or service offered and who may purchase it. The large failure of probability and risk modeling in many of these same industries is now viewed as at least a contributing factor to the 2008-2009 recession. Hence, arbitrary and unreliable methods like stepwise logistic regression will now be even more difficult to defend. Thus, business managers and statisticians will need to consider any better alternative, and RELR is such an alternative.
RELR and stepwise logistic regression may have comparable accuracy in easier problems that have very large sample sizes, relatively few variables, and no important interaction variables. Yet, RELR can clearly outperform stepwise logistic regression in validation sample measures of model fit and accuracy in more difficult “high dimensional” problems involving large numbers of input variables, especially with important interactions. Intuitively, one would think that the more information one has, the better the prediction would be. For example, if I know 100 different things about a group of people, such as their state, city, county, religion, and brand preferences, then I should be able to get a better prediction of their vote than if I only knew the state in which they resided. Even if only a few of these 100 variables were important in the end, it should be better to have put all 100 variables in the model, so we can at least select the most important variables from this pool.
Unfortunately, statistics is not this intuitive, as predictive models can get much worse as you add more variables. This problem is especially apparent in datasets with small sample sizes, large numbers of independent variables, correlated independent variables, nonlinear variables, and highly unbalanced target variables. With too many variables and too small a sample, any attempt to build a reliable predictive model can be a problem because the separate effects of correlated predictor variables blur together. This “multicollinearity error” is directly a function of too many correlated variables. In general, the more variables there are, the higher the likelihood that some of them are correlated, so multicollinearity almost always seems to be a problem with high dimensional data. With multicollinearity, the model can have severe overfitting problems because the obscured variable importance measurement forces too many variables into the model even after variable selection. Breiman (1992) suggested that the reason for this was that it is rare that all truly important variables are measured in data that go into regression models. Hence, the regression tends to be biased and to overfit the selected variables beyond their true contribution, with the added cost of sometimes having regression coefficients with the wrong sign. The end result is that the predictive model’s “out of sample” validation performance can be quite poor.
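The effect of correlated predictors on coefficient reliability can be seen in a small simulation. This is only a sketch under stated assumptions: the data are simulated, ordinary least squares is used in place of logistic regression for brevity, and the function name `coefficient_spread`, the sample sizes, and the correlation values are hypothetical choices made for illustration.

```python
import numpy as np

def coefficient_spread(rho, n_samples=200, n_obs=50, seed=0):
    """Fit least squares on many independent small samples.

    Returns the standard deviation of the first slope estimate across
    samples and the share of samples in which its sign comes out wrong.
    """
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])  # predictor correlation = rho
    true_beta = np.array([0.5, 0.5])          # both true effects are positive
    slopes = np.empty(n_samples)
    for i in range(n_samples):
        X = rng.multivariate_normal(np.zeros(2), cov, size=n_obs)
        y = X @ true_beta + rng.normal(size=n_obs)
        design = np.column_stack([np.ones(n_obs), X])
        beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
        slopes[i] = beta_hat[1]
    return slopes.std(), np.mean(slopes < 0)

sd_low, wrong_low = coefficient_spread(rho=0.0)
sd_high, wrong_high = coefficient_spread(rho=0.95)
print(f"uncorrelated predictors: sd={sd_low:.3f}, wrong sign share={wrong_low:.3f}")
print(f"correlated predictors:   sd={sd_high:.3f}, wrong sign share={wrong_high:.3f}")
```

With the highly correlated predictors, the slope estimates vary far more from sample to sample and a noticeable share of them take the wrong sign, even though the true effect is positive in every sample.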
We have presented several studies over the past few years to show that RELR appears to avoid multicollinearity problems completely (Rice, 2006; Rice, 2007; Rice, 2008; Rice, 2009), as RELR’s regression coefficients do not exhibit inflated magnitudes and do not have the wrong signs. Because we do not need to worry about multicollinearity, we can build highly accurate predictive models rapidly based upon tens of thousands and even potentially millions of variables. RELR can handle this number of variables rapidly because it identifies the most important variables in a model prior to running the model, so RELR builds models based upon the shortlist of most important variables with no loss in accuracy. RELR ultimately selects the very small number of most important and meaningful variables in a final production-level explanatory model using the Parsed RELR variable selection method, as Parsed RELR models often may have fewer than 10 variables. We will review RELR at a very high executive level in this article. A recently published technical article (Rice, 2008) is also available.