Rice Analytics

Automated Reduced Error Predictive Analytics

Case Studies

Some of the unique aspects of what is now called Reduced Error Logistic Regression (RELR) have been employed with real business data in research and consulting applications for over ten years. However, we have refined and improved the method in important ways in the past few years, so it is now a generalized machine learning method with an extremely sophisticated variable selection procedure called ParsedRELR. The cases below are just a sampling of case studies from the most recent few years. Many of these case studies have been presented by us or our users at conferences. We are always interested in supporting anyone who may wish to present a RELR case study at a conference. We also recognize, however, that most of our customers cannot publicly disclose the strategic advantages that RELR affords their business.

 

Using RELR for Media Research: A very well-known major supplier of syndicated media research and customized analytics decided to test RELR to see whether it could solve a basic problem: their models were unstable across independent samples and therefore not interpretable. After a month of testing RELR, they told us that they were very impressed with the stability and interpretability of the ParsedRELR models compared to all other methods that they had used, including Stepwise and Penalized Logistic Regression. They also told us that they really liked the fact that it was completely automated, as this was a major labor cost savings for them. On that basis, they decided to move immediately to long-term licensing of RELR for their advanced analytics.
    
Using RELR for Credit Scoring: A number of different banks and financial firms have applied RELR to credit scoring applications. The most impressive result to date is that a user reports that RELR lifted the KS Statistic from roughly 40 to 65 compared to other methods. This was possible because RELR was able to screen roughly 80,000 candidate variables and run accurately with a small sample of about 3,000 observations. The other methods either could not handle that number of variables or could not reach an accurate solution with that number of variables at any sample size. This user commented that this kind of lift in performance was definitely an extraordinary result in their modeling practice. Other noteworthy comments from these credit scoring applications include that RELR's variable selection seems to select meaningful variables, that RELR's variables were definitely more statistically significant compared to stepwise logistic regression, and that RELR's automatic abilities were an advantage. One of these credit scoring RELR users was from Premier BankCard. These results were presented at the SDSUG conference sponsored by SAS in April 2009.
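For readers unfamiliar with the KS Statistic quoted above, the following is a minimal sketch of how it is commonly computed in credit scoring: the largest gap between the cumulative score distributions of the two outcome classes. The 0 to 100 scale is an assumption chosen to match the "40 to 65" figures; the function name and the toy scores are hypothetical and do not come from the user's data.

```python
from bisect import bisect_right


def ks_statistic(scores, labels):
    """Two-sample KS statistic between the score distributions of the
    two outcome classes (labels 1 and 0), reported on a 0-100 scale
    (the scale is an assumption made for this sketch)."""
    pos = sorted(s for s, y in zip(scores, labels) if y == 1)
    neg = sorted(s for s, y in zip(scores, labels) if y == 0)
    if not pos or not neg:
        raise ValueError("need at least one observation in each class")

    max_gap = 0.0
    # Evaluate both empirical CDFs at every distinct observed score.
    for t in sorted(set(scores)):
        cdf_pos = bisect_right(pos, t) / len(pos)
        cdf_neg = bisect_right(neg, t) / len(neg)
        max_gap = max(max_gap, abs(cdf_pos - cdf_neg))
    return 100.0 * max_gap


# Toy example with made-up scores (not from any case study).
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
toy_labels = [0, 0, 0, 1, 0, 1, 1, 1]
print(ks_statistic(toy_scores, toy_labels))
```

A higher value means the model's scores separate the two classes more cleanly; a value of 0 means the two score distributions are indistinguishable.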
  
Average Squared Error of Reduced Error Logistic Regression and Other Comparisons to Standard Methods: Validation sample average squared error in Reduced Error Logistic Regression was compared to Penalized Logistic Regression (PLR) and five other standard methods in models developed from a Pew 2004 U.S. Presidential Election Weekend Poll dataset. RELR had significantly lower validation sample average squared error than all other methods. In addition, RELR's model stability was compared to PLR. The reliability of the regression coefficients between models built from two independent samples of roughly 1000 observations was .95 with RELR, whereas it was only .26 with PLR. This extremely high reliability with RELR suggests that the regression coefficients are not simply an artefact of the sample chosen, but are instead very reliable. This study also employed Parsed RELR - a sophisticated parameter reduction method - and showed that the identical 9 variables with similar regression coefficients were selected in two independent samples, with Parsed RELR models whose accuracy was similar to that of the Full RELR models. These Parsed RELR models had a large amount of face validity, as there is substantial agreement on the most important variables in recent U.S. presidential election outcomes. Some of these findings were presented at the 2008 Joint Statistical Meetings in Denver, Colorado on August 6, 2008. While a copy of this paper can be downloaded from the Papers and Presentations page of this website, its full reference is: Rice, D.M. (2008). Generalized Reduced Error Logistic Regression Machine - Section on Statistical Computing: JSM Proceedings 2008, pp. 3855-3862.
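The two measures in this comparison can be sketched as follows. Average squared error is taken here as the mean squared difference between predicted probabilities and 0/1 outcomes. The paragraph does not state how coefficient reliability was computed, so the Pearson correlation between the two coefficient vectors is used purely as an assumption. All function names and numbers below are hypothetical illustrations.

```python
from math import sqrt


def average_squared_error(probs, outcomes):
    """Mean squared difference between predicted probabilities
    and observed 0/1 outcomes in a validation sample."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def coefficient_reliability(coefs_a, coefs_b):
    """Pearson correlation between the coefficients of two models
    fitted on independent samples. Assumption: the source does not
    say which reliability measure was actually used."""
    if len(coefs_a) != len(coefs_b) or len(coefs_a) < 2:
        raise ValueError("need two equal-length vectors with 2+ entries")
    n = len(coefs_a)
    mean_a = sum(coefs_a) / n
    mean_b = sum(coefs_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(coefs_a, coefs_b))
    var_a = sum((a - mean_a) ** 2 for a in coefs_a)
    var_b = sum((b - mean_b) ** 2 for b in coefs_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for constant vectors")
    return cov / sqrt(var_a * var_b)


# Made-up validation predictions and coefficients, for illustration only.
print(average_squared_error([0.9, 0.2, 0.6, 0.4], [1, 0, 1, 0]))
print(coefficient_reliability([0.8, -0.5, 0.3, 1.1], [0.7, -0.6, 0.4, 1.0]))
```

Under this reading, a reliability near 1 means the two independent samples produced nearly the same ordering and relative size of coefficients, which is the sense in which the .95 versus .26 contrast is described above.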
 
Reduction of Sample Size in a Customer Satisfaction Survey: A typical marketing research problem is to determine the relative importance of a large set of customer satisfaction attributes that determine overall customer satisfaction. Reduced Error Logistic Regression was employed to this end with a survey that consisted of 1,000 online respondents who rated 23 highly correlated attributes of their financial advisor, such as trustworthiness, proactive financial planning, and useful advice. In addition, these respondents rated their overall satisfaction with their financial advisor. RELR was able to build a reliable and valid model that predicted overall satisfaction based upon these attributes using only a sub-sample of 100 respondents. The reliability and validity of this model could be empirically verified with independent samples of 100 taken from the original 1,000. The original sample size of 1,000 was employed because this is about the number of observations required with this many variables in standard regression-based modeling. By needing only 100 of the 1,000 respondents, RELR reduced this cost by 90%. These results were presented at the 2006 Psychometric Society Conference in Montreal, Canada.
 
Linkage of Survey Measures to Spending Behavior in Las Vegas Shoppers: A typical marketing research problem is to link measures of customer satisfaction to business outcomes related to loyalty and spending. RELR was employed to this end with a loyalty and spending survey funded by Shop America and Fashion Outlets. The respondents were tourists in Las Vegas who took a shopping tour at the Fashion Outlets-Las Vegas shopping center. 290 people participated. The surveys were administered during the return bus trip from the shopping center back to the Las Vegas strip. Respondents were asked about 49 relatively correlated attributes related to their satisfaction, whether they spent as much money as planned, and whether they would recommend this tour to a friend. RELR was able to build a reliable and valid predictive model that uncovered attributes related to the importance of the bus driver and how he/she promotes the shopping center. In addition, the time that the shoppers were allowed to be at the mall turned out to be very important. Based upon a well-known 10:1 rule that says that “for every 10 target category responses you can include one variable” in logistic regression, a standard logistic regression model would have required at least 10 times as many respondents as this survey needed for a reliable Reduced Error Logistic Regression predictive model. These results were presented at the 10th Annual Shop America Conference in Las Vegas in 2007.
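The 10:1 rule quoted above is simple arithmetic, sketched below with the survey's own figures (290 respondents, 49 attributes). The function names are hypothetical. The first calculation is an upper bound that assumes every respondent counts as a target category response, which overstates what a real survey would allow. The sketch does not try to reproduce the "10 times" figure, which may depend on additional terms the paragraph does not enumerate.

```python
RESPONSES_PER_VARIABLE = 10  # the 10:1 rule as quoted in the text


def max_variables(target_responses, per_variable=RESPONSES_PER_VARIABLE):
    """Largest number of variables the rule allows for a given
    number of target category responses."""
    return target_responses // per_variable


def required_target_responses(n_variables, per_variable=RESPONSES_PER_VARIABLE):
    """Smallest number of target category responses the rule
    requires for a given number of variables."""
    return n_variables * per_variable


respondents = 290  # survey participants reported in the text
attributes = 49    # rated attributes reported in the text

# Upper bound: treats all 290 respondents as target category responses.
print(max_variables(respondents))             # 29
# Target category responses needed to include all 49 attributes.
print(required_target_responses(attributes))  # 490
```

Even under the generous upper bound, the rule would allow at most 29 of the 49 attributes in a standard logistic regression on this sample.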
 
Risk Management of Mutual Fund Flows: A fundamental problem in the mutual fund industry is to understand the most important drivers of fund flows. RELR was employed to this end by a risk management firm for one of its major Fortune 500 Mutual Fund client companies. 84 months of data were available going back to the Year 2000 for this fund. A large number of possible drivers were used as variables that included seasonal factors, overall corporate fund flows, NAV data reflecting investor returns, fund volatility, and media measures of corporate and fund reputation that this firm sells to its client base. There were several hundred input variables that included nonlinear and interaction terms derived from these variables. The “best model” as determined by RELR’s automatic variable selection methods only involved 4 linear variables. This model was a succinct explanation of how a fund’s media reputation could interact with investment performance history to determine fund flows. More importantly, it uncovered a very simple and potentially causal description of how investors choose a mutual fund based upon a small set of criteria. Because RELR is not a “black box” technique, the interaction variables in this choice model were easy to understand and can be manipulated in future media planning involving this mutual fund.   A standard logistic regression model based upon these several hundred multicollinear independent variables would have required at least 6,000 months of data for reliable variable importance measurement and would not have been possible.   
 
Machine Learning of Parameters Necessary to Automate a Psycholinguistic Scoring Task: A media measurement firm employed Reduced Error Logistic Regression to help them move from human manual scoring of media text to an automated system.  RELR reduced the number of important variables that need to be considered to a very small number and therefore reduced the complexity of an automated solution significantly.  
    
Comparison to Five Standard Predictive Modeling Methods: The classification accuracy of Reduced Error Logistic Regression was compared to Support Vector Machines, Partial Least Squares, Decision Trees, Neural Networks, and Standard Stepwise Logistic Regression in two independent datasets. One dataset was a Pew Research 2004 Election Weekend dataset; the other dataset was a technical trading dataset for Patterson Energy (Nasdaq: PTEN). The binary dependent variable responses in both of these datasets were completely balanced, so Misclassification Rate was taken to be a valid and easily interpretable measure of model accuracy. In the Pew dataset, RELR had a significantly lower Validation Sample Misclassification Rate than all other methods. In the Patterson Energy dataset, there were no significant differences between the methods. Because the Patterson Energy model had very few important variables that were easy to find, and the Pew model had many potentially important and correlated variables where it was difficult to find the most important variables, these findings suggest that RELR's advantage might be specific to predictive models with many potentially important candidate variables where it is difficult for a standard method to find the most important variables. These findings were presented at the SAS M2007 Conference at Caesars Palace in Las Vegas in an invited session on Reduced Error Logistic Regression, and the classification results concerning the Pew dataset were also published in the JSM 2008 paper referenced above.
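The reason balanced responses make Misclassification Rate easy to interpret can be sketched as follows: with a 50/50 split, always guessing one class is wrong half the time, so any rate below 0.5 reflects real predictive signal. The 0.5 probability cutoff, the function names, and the toy data are assumptions for illustration; the paragraph does not state the cutoff that was actually used.

```python
def misclassification_rate(probs, outcomes, cutoff=0.5):
    """Share of observations whose predicted class (probability at or
    above the cutoff means class 1) differs from the observed 0/1
    outcome. The 0.5 default cutoff is an assumption."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("inputs must be non-empty and of equal length")
    errors = sum((1 if p >= cutoff else 0) != y for p, y in zip(probs, outcomes))
    return errors / len(probs)


def majority_baseline_error(outcomes):
    """Error rate of always predicting the more frequent class."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    share_ones = sum(outcomes) / len(outcomes)
    return min(share_ones, 1 - share_ones)


# Made-up balanced validation sample: two 1s and two 0s.
balanced = [1, 1, 0, 0]
print(majority_baseline_error(balanced))                           # 0.5
print(misclassification_rate([0.8, 0.6, 0.3, 0.7], balanced))      # 0.25

# Made-up imbalanced sample: the trivial baseline already looks good.
imbalanced = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(majority_baseline_error(imbalanced))                         # 0.1
```

In the imbalanced case a model that never predicts the rare class scores a 10% error rate without learning anything, which is why the measure is most informative when the classes are balanced as described above.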
     



