Data Scientists frequently build Machine Learning models to discover interesting (rare) events in data. These events can be valuable (e.g., customer purchases), costly (e.g., fraud), or even dangerous (e.g., threat). Finding them is a “needle-in-a-haystack” challenge: the events are rare and hard to distinguish from the huge mass of overwhelmingly uninteresting cases recorded. To differentiate rare from normal events it helps to have a good understanding of normal behavior. But, how well do you actually know the haystack?
The Basics of Binary Classification
Most practical classification problems involve two outcomes, for example:
- Is this transaction fraudulent or not?
- Will this customer buy or not?
- Are the symptoms observed indicative of disease or not?
Data Scientists expend significant time and effort exploring and preparing data, engineering features, and tuning models to estimate probabilities of such rare events. We label such cases “1” or “TRUE” in the output variable. A common mistake we encounter is clients providing a data sample that contains only these types of cases — not knowing that it is vital to also include observations of “normal” events. In other words, they forget the zeroes. But the model needs both types to learn to distinguish between them.
Defining the Zeroes: Quantifying Normal Behavior
In the science-fiction movie Arrival, humans encounter an alien intelligence that communicates using a visual language based on patterns and shapes. SPOILER ALERT: A key turn in the drama happens when the scientists realize that they were focusing on the shapes the aliens were using to communicate, but it was the empty space between the shapes that followed a mathematical sequence. In other words, the negative space in the image was as important as the signal (if not more so) in understanding the aliens’ communication.
This concept of negative space is important in graphic design. Artists make intentional use of whitespace to communicate their message. Separation between images can create tension or visual flow. The arrangement of features can draw the eye from one element to the next. In data science, the zeroes in our data are the whitespace. But how well do they represent what we would intuitively (or qualitatively) understand as “normal” given what we expect our model to see in reality?
Fundamentally, a successful model identifies events of interest and behaves as expected on “normal” cases. In other words, it has multiple specifications:
- Ideal: how the model will behave under perfect circumstances
- Design: how the model will behave, given real-world constraints and technical capabilities
- Revealed: how the model actually behaves once it is built and put into production
In practice the revealed specification is arguably the most important. Regardless of the outcome we expect from the model, the revealed specification is how it scores real events; that governs decisions that are made and actions that are taken. Data Scientists have several options available to estimate model performance in terms of the revealed specification before launching the model into production. For example, profiling the data and getting descriptive statistics for key fields can help gauge how representative the distributions of values are relative to expectation. Robust cross validation can establish the expected variation in model outcomes when exposed to variation in input data.
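As a rough illustration of how cross validation exposes the variation in a model's treatment of normal cases, the sketch below runs a stratified k-fold loop and reports the spread of held-out specificity. Everything in it is an assumption made for illustration: the data are synthetic, the "model" is a toy threshold rule, and the names fit_threshold, specificity, and stratified_cv are hypothetical.

```python
import random
import statistics

def fit_threshold(train):
    """Toy stand-in for a model: midpoint between the two class means."""
    mean_zeros = statistics.mean([x for x, y in train if y == 0])
    mean_ones = statistics.mean([x for x, y in train if y == 1])
    return (mean_zeros + mean_ones) / 2

def specificity(threshold, rows):
    """Share of true 0's scored below the threshold (left alone by the rule)."""
    zeros = [x for x, y in rows if y == 0]
    return sum(x < threshold for x in zeros) / len(zeros)

def stratified_cv(rows, k=5):
    """Return (mean, standard deviation) of held-out specificity over k folds."""
    zeros = [row for row in rows if row[1] == 0]
    ones = [row for row in rows if row[1] == 1]
    # Deal each class round-robin so every fold keeps some of the rare 1's.
    folds = [zeros[i::k] + ones[i::k] for i in range(k)]
    scores = []
    for i, held_out in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        scores.append(specificity(fit_threshold(train), held_out))
    return statistics.mean(scores), statistics.stdev(scores)

# Synthetic data (an assumption): 500 normal cases near 0, 25 rare events near 3.
rng = random.Random(42)
data = [(rng.gauss(0.0, 1.0), 0) for _ in range(500)]
data += [(rng.gauss(3.0, 1.0), 1) for _ in range(25)]

mean_spec, sd_spec = stratified_cv(data)
print(f"specificity: {mean_spec:.3f} +/- {sd_spec:.3f}")
```

A wide spread across folds would suggest that the model's behavior on the zeroes is sensitive to which normal cases it happened to see during training.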
It is especially important during model design to establish a baseline understanding about missing data, since gaps in data are often unexpected (but frequently occur).
- How many fields have missing data?
- How are the data missing? Are there long gaps? Or sporadic outages?
- Most important, why are the data missing (especially for “normal” cases)? In particular, do they appear to be missing at random, or is there a hierarchical relationship in the missing data?
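The questions above can be turned into a quick profile. The following is a minimal sketch, assuming records arrive as a time-ordered list of dictionaries in which None marks a missing value. The function name profile_missing, the field names, and the sample records are all hypothetical.

```python
from collections import defaultdict

def profile_missing(records, fields, label_key="label"):
    """Summarize missingness per field: count, longest gap, and rate by class."""
    summary = {}
    for field in fields:
        missing_flags = [rec.get(field) is None for rec in records]
        # Longest run of consecutive missing values (long gap vs. sporadic outage).
        longest = current = 0
        for flag in missing_flags:
            current = current + 1 if flag else 0
            longest = max(longest, current)
        # Missing rate per class, to see whether the zeroes are affected differently.
        totals, misses = defaultdict(int), defaultdict(int)
        for rec, flag in zip(records, missing_flags):
            totals[rec[label_key]] += 1
            misses[rec[label_key]] += flag
        summary[field] = {
            "n_missing": sum(missing_flags),
            "longest_gap": longest,
            "missing_rate_by_label": {k: misses[k] / totals[k] for k in totals},
        }
    return summary

# Illustrative records only; real data would come from the modeling dataset.
records = [
    {"amount": 12.0, "region": "east", "label": 0},
    {"amount": None, "region": "east", "label": 0},
    {"amount": None, "region": None, "label": 0},
    {"amount": 40.0, "region": "west", "label": 1},
    {"amount": None, "region": "west", "label": 0},
]
report = profile_missing(records, ["amount", "region"])
print(report["amount"])
```

A missing rate that differs sharply between the 0's and the 1's is a hint that the data may not be missing at random, and that the gap itself could leak into the model as a signal.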
The process of profiling or exploring normal cases is likely to encounter confirmation bias, especially among subject matter experts. Either the data will confirm the prevailing understanding, or a case might be rejected as erroneous because it runs counter to common knowledge (or assumptions) about the norm.
Consider an example of software user-behavior analysis to inform future development. In this context, profiling to understand the zeroes can be very helpful to confirm (or disprove) commonly held assumptions about how users interact with the software tool. All too often, product managers or developers will be anchored on past conversations with vocal customers or superusers who are not representative of the general population of users. While those opinions are important, they are potentially given too much weight just because they are memorable (and not necessarily indicative of a norm).
By gaining a better understanding of the common user through data analysis, it is possible to drive initiatives and decisions around the most common cases (or confirm activity that is already in progress). Once normal behavior is defined, it is easier to identify an event of interest that is truly anomalous.
Hits and Misses: the Cost of a False Positive
Another reason to quantify the behavior of the zeroes in the modeling dataset is that models will see many more 0’s in production than 1’s. It is vital to understand the cost of a mistake on those cases.
The costs of different errors (e.g., False Positives and False Negatives) are almost never equal. But humans are very poor estimators of the consequences of different decisions, especially when rare events are involved. In Thinking, Fast and Slow, Daniel Kahneman relates an example in which experienced forensic psychologists and psychiatrists produced startlingly different recommendations depending on which of two presentations of the same rare event they were given:
- Patients similar to Mr. Jones are estimated to have a 10% probability of committing an act of violence against others during the first several months after discharge.
- Of every 100 patients similar to Mr. Jones, 10 are estimated to commit an act of violence against others during the first several months after discharge.
When experienced mental health professionals were given the second presentation, they were almost twice as likely to deny discharge to the patient. If this is the same information, why was there such a difference? Kahneman offers a theory: “The probability of a rare event is most likely to be overestimated when the alternative is not fully specified” (emphasis ours). In other words, the second presentation of facts raises the availability of the rare event in such a way that psychiatrists pay more attention to it than when it is presented as a probability. The rare event seems more likely, although it is no more prevalent. Using the framing we have here, it focuses the experts on the 1’s, without adequately specifying the acceptable normal condition of the 0’s.
Returning to our binary classification problem, it is important to accurately estimate the business cost of a false positive, since it adversely impacts otherwise normal events. For example, consider classifying transactions as fraudulent. The cost of fraud can be very high, so a business is usually willing to accept some normal transactions being classified as fraudulent (false positives) in order to miss fewer true 1’s (false negatives). However, labeling a normal transaction as fraudulent also has a cost. Perhaps it is just negative feedback to a customer service line (cost of service). It could become lost business if the customer decides not to move forward with the transaction because of the extra effort. If the customer encounters repeated errors, they may walk away entirely, significantly reducing their lifetime value to the firm. Understanding the zeroes means, in part, understanding the cost of falsely accusing them of not being normal.
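To make the trade-off concrete, the sketch below compares two hypothetical fraud thresholds by total error cost. Every number in it is invented for illustration (the error counts, the cost per false positive, and the cost per missed fraud), and total_error_cost is a hypothetical helper name.

```python
def total_error_cost(false_positives, false_negatives, cost_fp, cost_fn):
    """Total cost of a model's mistakes, given a cost per error type."""
    return false_positives * cost_fp + false_negatives * cost_fn

# Hypothetical error counts for two thresholds on the same transactions.
strict = {"false_positives": 2000, "false_negatives": 20}   # flags aggressively
lenient = {"false_positives": 200, "false_negatives": 80}   # flags sparingly

# Scenario A (assumed): a false positive costs only a service call.
cost_a_strict = total_error_cost(**strict, cost_fp=5, cost_fn=500)
cost_a_lenient = total_error_cost(**lenient, cost_fp=5, cost_fn=500)

# Scenario B (assumed): a false positive often means lost business.
cost_b_strict = total_error_cost(**strict, cost_fp=50, cost_fn=500)
cost_b_lenient = total_error_cost(**lenient, cost_fp=50, cost_fn=500)

print(cost_a_strict, cost_a_lenient)  # 20000 41000
print(cost_b_strict, cost_b_lenient)  # 110000 50000
```

Under these assumed figures the strict threshold is cheaper when a false positive is a minor annoyance, but the lenient threshold wins once the cost of wrongly flagging a normal customer rises. The preferred threshold depends on how well the cost to the zeroes has been estimated.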
A Well-Designed Model Metric: Beyond Accuracy
It is common to assess the quality of a predictive model in terms of “accuracy.” But relying solely on accuracy has problems. For example, consider a bank interested in building a model to identify high net-worth individuals. Say these customers make up 0.5% of the total population of customers. If we created a model that always returned “Not High Net Worth,” then the accuracy of our trivial model would be 99.5%! Yet it would deliver no valuable insight to our banking client. This is why maximum-accuracy classification may not be the right optimization for this problem. Rather, we want to maximize profit (by capturing the most high net-worth individuals) while minimizing cost (from falsely rejecting them).
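The arithmetic behind the trivial model is easy to verify. This short sketch assumes a population of 10,000 customers (an arbitrary size chosen for illustration) with the 0.5% high net-worth rate from the example.

```python
# 10,000 customers (assumed size); 0.5% are high net-worth (label 1).
labels = [1] * 50 + [0] * 9950

# The trivial model always answers "Not High Net Worth" (label 0).
predictions = [0] * len(labels)

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall: share of actual high net-worth customers the model found.
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = true_positives / sum(labels)

print(f"accuracy = {accuracy:.1%}, recall = {recall:.1%}")
```

The model scores 99.5% on accuracy while finding none of the customers the bank cares about, which is why accuracy alone is a poor guide for rare-event problems.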
To counter the limitations of accuracy, Data Scientists often use precision and recall. Precision is the share of cases the model classifies as positive that are truly positive (True Positives over True Positives plus False Positives). Recall is the share of actual positives that the model identifies (True Positives over True Positives plus False Negatives). Both can be very helpful in understanding model quality. For example, if we find that a model has low recall, it may indicate that the feature set we trained the model on had insufficient signal to distinguish the 1’s from 0’s.
As often as we see model performance reports that cite Precision and Recall, we almost never see Negative Predictive Value. Specificity (i.e., True Negative Rate) is a familiar metric for Data Scientists, and quantifies true 0’s over all actual 0’s. Negative Predictive Value quantifies true 0’s over all things classified as 0 by the model. Incorporating these metrics into model performance reports would speak to the quality of a model on the 0’s, and would begin to tell the story about how a model might treat “normal” cases that it would see in production, and whether changes may be warranted before a model is deployed.
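For reference, all four metrics can be computed from the same confusion-matrix counts. The sketch below uses the standard definitions. The function name classification_report_counts and the example counts are hypothetical, and it assumes every denominator is nonzero.

```python
def classification_report_counts(tp, fp, fn, tn):
    """Compute metrics for the 1's and the 0's from confusion-matrix counts.

    Assumes each denominator is nonzero (an assumption of this sketch).
    """
    return {
        # Metrics that describe how the model treats the 1's.
        "precision": tp / (tp + fp),    # true 1's over everything classified 1
        "recall": tp / (tp + fn),       # true 1's over all actual 1's
        # Metrics that describe how the model treats the 0's.
        "specificity": tn / (tn + fp),  # true 0's over all actual 0's
        "npv": tn / (tn + fn),          # true 0's over everything classified 0
    }

# Hypothetical counts: 50 actual 1's and 9,950 actual 0's.
metrics = classification_report_counts(tp=40, fp=160, fn=10, tn=9790)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these invented counts, specificity and Negative Predictive Value both look high even though only one in five flagged cases is a true 1. Reading the metrics for the 0's alongside precision and recall shows that the 160 false positives are a small share of the normal cases, yet they dominate the cases the model flags.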
Fundamentally, to achieve the greatest business value, Data Science teams should explore a diversity of model performance metrics and make sure those metrics support the business’s key performance indicators (KPIs).
Summary
A fundamental principle of good graphic and visual design considers both the image and the whitespace (negative space) around it. Data Science and Machine Learning mirror this concept in the classification of the interesting 1’s and normal 0’s by binary classification models. The 1’s are infrequent events that are difficult to manually identify but are often very valuable (or very costly). A well-designed model identifies these important 1’s but must also behave as expected on 0’s.
Often the “understanding” of baseline (normal) behavior of a business process is built on assumptions or common wisdom that may be incorrect, insufficient, or out of date. It is important that stakeholders of analytics results validate a range of metrics to gain a better understanding of the components of model quality and ensure those metrics support business KPIs. Ultimately, a machine learning model is designed with a distinct purpose. How well it meets that intent depends upon how well it treats the negative space around the rare events it was constructed to identify.