Classification scorecards are a great way to predict outcomes because the techniques behind them, developed in the banking industry, emphasize interpretability, predictive power, and ease of deployment. The banking industry has long used credit scoring to determine credit risk: the likelihood that a particular loan will be paid back. A scorecard is a common way of displaying the patterns found in a classification model, typically a logistic regression model. To be useful, however, the results of the scorecard must be easy to interpret. The main goal of a credit score and scorecard is to provide a clear and intuitive way of presenting regression model results. This article briefly discusses what scorecard analysis is and how it can be applied to score almost anything.
Scorecards are extremely successful in the consumer credit world because they:
- Are very generalizable – anyone in an organization can understand and use them
- Are accepted by regulatory agencies as a standard method for presenting credit risk
- Are straightforward to implement and monitor over time
- Can be quickly programmed and deployed at scale
Strict credit industry regulations protect consumers from loan rejections based on uninterpretable “black box” models. There are often also laws against using particular variables, such as race or zip code, in the credit decision. This has driven the banking industry to develop models with results that can be easily interpreted. However, the goal of interpretable model results goes well beyond just banking.
Let’s look at a credit scorecard model example that employs three variables: age, income, and home ownership. Each variable has a value, or level, that contributes scorecard points, as shown in Figure 1. The points are summed, and if the total meets or exceeds the threshold the applicant is approved for a loan:
Credit Approval Threshold ≥ 500
Example 1
- AGE = 32 => 120 score
- OWNERSHIP = OWN => 225 score
- INCOME = $30,000 => 180 score
- Credit Score = 120+225+180 = 525 => Loan Approved
Example 2
- AGE = 22 => 100 score
- OWNERSHIP = OWN => 225 score
- INCOME = $8,000 => 120 score
- Credit Score = 100+225+120 = 445 => Loan Rejected
These simple variables provide clear guidelines for decision makers, making loan approval decisions transparent and easy to interpret and discuss or defend.
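The two examples above can be reproduced with a few lookup tables. This is a minimal sketch, not the scorecard from Figure 1: the bin boundaries, the points for bins that the examples do not touch, and the RENT level are all hypothetical, chosen only so that the two worked examples come out as stated.

```python
from bisect import bisect_right

# Hypothetical bins: cut points are lower bounds of the next bin.
# Only the points used in the two examples (100, 120, 180, 225) come from the text.
AGE_CUTS, AGE_POINTS = [25, 35, 50], [100, 120, 185, 200]
INCOME_CUTS, INCOME_POINTS = [10_000, 40_000, 80_000], [120, 180, 225, 250]
OWNERSHIP_POINTS = {"OWN": 225, "RENT": 110}  # RENT value is made up
THRESHOLD = 500


def binned_points(value, cuts, points):
    """Return the points of the bin that `value` falls into."""
    return points[bisect_right(cuts, value)]


def credit_score(age, ownership, income):
    """Sum the points contributed by each characteristic."""
    return (
        binned_points(age, AGE_CUTS, AGE_POINTS)
        + OWNERSHIP_POINTS[ownership]
        + binned_points(income, INCOME_CUTS, INCOME_POINTS)
    )


def decide(score):
    """Approve when the score meets or exceeds the threshold."""
    return "Loan Approved" if score >= THRESHOLD else "Loan Rejected"


score_1 = credit_score(32, "OWN", 30_000)  # 120 + 225 + 180
score_2 = credit_score(22, "OWN", 8_000)   # 100 + 225 + 120
print(score_1, decide(score_1))
print(score_2, decide(score_2))
```

Because every characteristic is a plain lookup, the reason for an approval or rejection can be read straight off the tables.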
Data scientists often build models where the client must interpret the results or understand the factors driving them. In these cases scorecard modeling provides an easy solution. With scorecards, there are no continuous variables – every variable is categorized. This is the most important step. The key to scorecard models is to categorize or “bin” the variables in a way that summarizes as much information as possible, and then build models that eliminate weak variables or those that do not conform to good business logic.
Binning simplifies many analysis issues that are complex for linear models, since:
- There is a direct relationship between group membership and points, instead of an indirect relationship between model coefficients and variable values.
- Groups can reflect nonlinear relationships; there is no need to worry about linearity assumptions – one of the biggest problems with logistic regression.
- Grouping handles outliers well because the outliers can be contained in the smallest or largest group, whereas in linear models the outliers affect the estimates everywhere.
- Missing values are easily handled by being assigned to their own group.
As a starting point for binning, select a continuous variable and divide it into groups that are most alike. This can be done with decision trees, Gini statistics, Chi-square tests, random statistical buzzwords, etc. – whatever method you’d like to use to categorize individual continuous variables. Imagine starting with many bins for a continuous variable, then looking to see which bins can be combined statistically. Treating a variable as continuous assumes that every level is different when predicting the target variable. Most of the time this is not true. Is there a difference between someone with an income of $38,000 and someone with $39,000? Most likely not, but treating income as a continuous variable makes this assumption. By categorizing, we can let the computer decide whether there is a statistical difference between two groups; if there isn’t, the groups can be combined into the same category.
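One way to make "combine the bins that are not statistically different" concrete is a bottom-up chi-square merge. The sketch below is only one of the options mentioned above, not necessarily the method used for the figures in this article. It starts from fine bins, repeatedly merges the most similar pair of adjacent bins, and stops once every adjacent pair differs at the 5% level. The income bins and counts are invented for illustration.

```python
CHI2_CRITICAL = 3.841  # chi-square critical value, 1 degree of freedom, alpha = 0.05


def chi_square(a, b):
    """Pearson chi-square statistic for two bins given as (goods, bads)."""
    total = sum(a) + sum(b)
    stat = 0.0
    for row in (a, b):
        for col in (0, 1):
            expected = sum(row) * (a[col] + b[col]) / total
            if expected > 0:
                stat += (row[col] - expected) ** 2 / expected
    return stat


def merge_bins(bins):
    """bins: list of (label, goods, bads), ordered by the variable.

    Merge the most similar adjacent pair until all adjacent pairs
    differ significantly.
    """
    bins = list(bins)
    while len(bins) > 1:
        stats = [chi_square(bins[i][1:], bins[i + 1][1:]) for i in range(len(bins) - 1)]
        i = min(range(len(stats)), key=stats.__getitem__)
        if stats[i] >= CHI2_CRITICAL:
            break  # every remaining neighbor pair is statistically different
        (l1, g1, b1), (l2, g2, b2) = bins[i], bins[i + 1]
        bins[i:i + 2] = [(f"{l1} + {l2}", g1 + g2, b1 + b2)]
    return bins


# Hypothetical income bins: (label, good accounts, bad accounts)
fine_bins = [
    ("<20K", 300, 100),
    ("20K-38K", 450, 50),
    ("38K-39K", 90, 10),
    ("39K-60K", 445, 55),
    (">=60K", 485, 15),
]
merged = merge_bins(fine_bins)
for label, goods, bads in merged:
    print(label, goods, bads)
```

With these made-up counts the three middle bins have nearly the same bad rate, so they collapse into one category, while the lowest and highest income bins stay separate.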
Once everything is binned there may be a large number of categories – probably too many for modeling. To resolve this we calculate and examine the key assessment metrics using three tools:
- Weight of Evidence (WOE): Determines how well attributes discriminate for each given characteristic
- Information Value (IV): Evaluates a characteristic’s overall predictive power
- Gini Statistic: An alternative to Information Value (IV) for selecting characteristics for the final model.
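The first two metrics can be computed directly from the good and bad counts in each bin. The sketch below uses the common definitions, WOE = ln(share of goods / share of bads) and IV = sum over bins of (share of goods - share of bads) x WOE. The counts are hypothetical, and the function assumes every bin contains at least one good and one bad account. The Gini statistic is not shown.

```python
import math


def woe_iv(bins):
    """bins: dict mapping bin label -> (goods, bads).

    Returns ({label: woe}, information_value).
    Assumes no bin has zero goods or zero bads.
    """
    total_good = sum(g for g, _ in bins.values())
    total_bad = sum(b for _, b in bins.values())
    woe, iv = {}, 0.0
    for label, (goods, bads) in bins.items():
        dist_good = goods / total_good  # this bin's share of all goods
        dist_bad = bads / total_bad     # this bin's share of all bads
        woe[label] = math.log(dist_good / dist_bad)
        iv += (dist_good - dist_bad) * woe[label]
    return woe, iv


# Hypothetical counts for one binned characteristic
example_bins = {"low": (400, 100), "mid": (900, 60), "high": (700, 40)}
woe, iv = woe_iv(example_bins)
for label, value in woe.items():
    print(f"{label}: WOE = {value:+.3f}")
print(f"IV = {iv:.3f}")
```

WOE is reported per bin, while IV collapses the whole characteristic into a single number, which is what makes it useful for deciding which characteristics to keep.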
Weight of Evidence is key in scorecard modeling projects because it helps to determine how well particular attributes separate good and bad accounts – our binary target variable. WOE measures this on a category-by-category basis, and compares the proportion of good accounts to bad accounts at each attribute level of the predictor variable.
Consider an example of a FICO score. Using decision trees, the FICO score is divided into 10 groups that are “optimized” based on their ability to predict loan defaults, in the same way as the previous examples. As shown in Figure 3, there are people who did and did not default in each group. WOE is a simple way of comparing these two groups: it is the log of the ratio between a group’s share of all good loans and its share of all bad loans. In the example histogram (Figure 4), the red bars represent each group’s share of all bad loans and the blue bars its share of all good loans. In this example, 14% of all people who defaulted on a loan (red bar) had a FICO score of less than 610. In contrast, only 4% of all people who did not default on a loan (blue bar) had a FICO score of less than 610.
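As a quick worked calculation for the lowest FICO group, read the two percentages as the group's share of all non-defaulters (4%) and its share of all defaulters (14%). That reading is an assumption about what the bars in Figure 4 show.

```python
import math

share_of_goods = 0.04  # share of all non-defaulters with FICO < 610
share_of_bads = 0.14   # share of all defaulters with FICO < 610

# WOE = ln(share of goods / share of bads)
woe_below_610 = math.log(share_of_goods / share_of_bads)
print(round(woe_below_610, 3))  # -1.253
```

The result is negative because this group holds a larger share of the bad loans than of the good ones.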
With a WOE calculation, negative numbers demonstrate that bads outweigh goods, while positive numbers demonstrate that goods outweigh bads. The further the WOE is from zero, the more strongly that category separates the two outcomes: a large positive WOE points toward good loans and a large negative WOE toward bad ones. This technique is also great for handling missing data, for example, if someone didn’t report their FICO score. With scorecards those individuals get their own category. In fact, as shown in Figure 5, we can say that people who did not report their credit score look a lot like those who have a FICO score around 630-653.
The scorecard is typically built on a logistic regression model where the original inputs are replaced by the WOE calculation. Essentially, the WOE column in Figure 3 now becomes the input to your model. This is done for every variable, so your model is full of WOE representations of variables that are treated as continuous. You are converting a continuous variable into a categorical variable, assigning a value for each category, and then treating it as a numeric variable again. But that numeric variable is no longer “age is 25” or “income is $20K”; rather, it is the propensity to default, represented by each individual category.
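The replacement step amounts to a lookup from raw value to bin to WOE. Here is a minimal sketch with four hypothetical FICO bins (the article's example uses ten) and made-up WOE values, including a separate category for missing scores.

```python
from bisect import bisect_right

# Hypothetical cut points and WOE values, for illustration only
FICO_CUTS = [610, 660, 720]           # bins: <610, 610-659, 660-719, >=720
FICO_WOE = [-1.25, -0.40, 0.35, 1.10]
MISSING_WOE = -0.55                   # missing scores get their own category


def fico_to_woe(fico):
    """Replace a raw FICO score with the WOE of the bin it falls into."""
    if fico is None:
        return MISSING_WOE
    return FICO_WOE[bisect_right(FICO_CUTS, fico)]


raw_scores = [595, 757, None, 640]
encoded = [fico_to_woe(score) for score in raw_scores]
print(encoded)  # [-1.25, 1.1, -0.55, -0.4]
```

The `encoded` column, not the raw score, is what would be passed to the logistic regression.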
So why go through all of this math if we are still going to use a logistic regression? Because all of the variables are now on the same scale. Therefore, the coefficients from the logistic regression model can be used to directly compare which variables have more “influence” on the outcome. The further from zero, the more important the variable is to the model outcome.
The final step in our example is to determine the ranking system for each FICO group. Using the coefficients from the logistic regression plus the WOE, we can calculate the scores (points) for each category. The points for each category of a variable are calculated by multiplying the variable’s coefficient by the WOE value for that category, then applying a couple of adjustment factors that scale the result. These scaling factors define the range of possible values for the scorecard. For example, a person with a FICO score of 757 would score 123 points, as shown in Figure 6. This point value is added to the points for the other characteristics that make up the overall credit score to determine the applicant’s loan eligibility.
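The article does not spell out its adjustment factors, so the sketch below uses one widely used convention as an assumption: a "points to double the odds" scaling, where score = offset + factor x ln(odds). It also assumes the regression predicts the log-odds of a good loan, so a positive WOE adds points. Every number here (PDO, base score, base odds, intercept, coefficients, WOE values) is hypothetical, and the output is not meant to reproduce the 123 points in Figure 6.

```python
import math

PDO = 20          # points needed to double the good:bad odds (hypothetical)
BASE_SCORE = 600  # score assigned at BASE_ODDS (hypothetical)
BASE_ODDS = 50    # good:bad odds at BASE_SCORE (hypothetical)

FACTOR = PDO / math.log(2)
OFFSET = BASE_SCORE - FACTOR * math.log(BASE_ODDS)


def attribute_points(woe, coef, intercept, n_characteristics):
    """Points for one category: coefficient x WOE, plus an equal share of
    the intercept and offset, scaled by the factor."""
    return (coef * woe + intercept / n_characteristics) * FACTOR + OFFSET / n_characteristics


# Hypothetical fitted model and one applicant's WOE-encoded inputs
INTERCEPT = 2.0
COEFS = {"fico": 0.9, "income": 0.7, "age": 0.5}
applicant_woe = {"fico": 1.10, "income": 0.20, "age": -0.30}

raw_points = {
    name: attribute_points(applicant_woe[name], COEFS[name], INTERCEPT, len(COEFS))
    for name in COEFS
}
points = {name: round(value) for name, value in raw_points.items()}
total_score = sum(points.values())

# Sanity check: before rounding, the points sum to offset + factor * log-odds
log_odds = INTERCEPT + sum(COEFS[name] * applicant_woe[name] for name in COEFS)
print(points, total_score, round(OFFSET + FACTOR * log_odds, 1))
```

Spreading the intercept and offset evenly across characteristics is a design choice that keeps each characteristic's points self-contained, so the scorecard can be read as a simple table of additions.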
Although this process includes a lot of technical detail, the result is a model outcome that can be easily interpreted – the gold that we all search for in modeling. Empirical studies have shown that logistic regressions and scorecard models that use logistic regression have the same predictive power. You get improved interpretability, the same predictive power, and can easily handle outliers and missing values. Now that scores high in my book! Or model. Whichever you prefer.