“It depends!” That was the phrase all of my former students hated, but knew was coming when they asked whether they had a good model. I knew they wanted a “yes” or “no” answer, but I needed more information to answer that question adequately. I find that people like general rules that they can apply to problems to get simple answers. However, in data science you need perspective on the whole situation before deciding whether a model is good or bad. That is why you need a baseline against which to compare your performance!
I was never the best kid on the court when playing basketball, but I was also never the worst. In terms of basketball skill I consider myself rather average. However, what if I were to change the setting? If I were on a court with a bunch of second graders I would be an all-star! I would be running faster, jumping higher, scoring at will, and blocking every shot. Against NBA-level talent, the roles would be reversed and I would feel like the second grader. Perspective and comparison to a baseline are key to grading performance in basketball, as well as in data science solutions. Is the new solution better than the old solution? What is the baseline comparison?
Model Evaluation or Comparison
The problem lies in how people have been told to evaluate models and data science solutions. There are a variety of metrics for scoring whether a model is “good” or “bad,” such as R², percentage accuracy, mean absolute percentage error (MAPE), and many more. Each of these has advantages and disadvantages, but they all share one common trait: they are designed to compare, not to evaluate performance in a vacuum. An accuracy of 70% falls short if compared to 90% accuracy, but looks great compared to a current standard of 30%.
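To make the comparison concrete, here is a minimal sketch of scoring a model against a naive baseline that always predicts the most common label. The labels and predictions are made-up toy values, and the helper names `accuracy` and `majority_baseline` are hypothetical, chosen only for illustration.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(y_true):
    """Naive baseline: always predict the most frequent label."""
    most_common_label, _ = Counter(y_true).most_common(1)[0]
    return [most_common_label] * len(y_true)


# Hypothetical toy data: 7 negatives and 3 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_model = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]  # made-up model predictions

model_acc = accuracy(y_true, y_model)
baseline_acc = accuracy(y_true, majority_baseline(y_true))

print(f"model accuracy:    {model_acc:.0%}")
print(f"baseline accuracy: {baseline_acc:.0%}")
print(f"improvement:       {model_acc - baseline_acc:+.0%}")
```

On its own, the model's accuracy says little. It becomes meaningful only next to what the naive baseline already achieves on the same data.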
In a recent project, the government client was unimpressed by our initial model results, which predicted fraud with 40% accuracy. However, when we explored their baseline method, we discovered that they randomly sampled claims for further investigation, an approach with a success rate of only about 4.2%. This “unimpressive” model was almost 10 times better than their current approach! When presented with this information, they understood the true value that the model provided when compared to their existing method. Their mistake was a common one. “I can guess a coin flip better than 40% of the time,” they would say. However, their current process wasn’t equivalent to flipping a coin. It was more like rolling a 24-sided die.
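The arithmetic behind “almost 10 times better” and the 24-sided die can be checked in a few lines. This is a minimal sketch using the two rates quoted above; the function name `lift_over_baseline` is a hypothetical helper, not something from the project.

```python
def lift_over_baseline(model_rate: float, baseline_rate: float) -> float:
    """How many times better the model's hit rate is than the baseline's."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return model_rate / baseline_rate


model_rate = 0.40      # model's fraud hit rate, from the text
baseline_rate = 0.042  # random sampling success rate, from the text

lift = lift_over_baseline(model_rate, baseline_rate)
equivalent_sides = 1 / baseline_rate  # die with a 1-in-N chance of success

print(f"lift over baseline: {lift:.1f}x")
print(f"baseline is like a {equivalent_sides:.0f}-sided die")
```

A 4.2% success rate corresponds to roughly a 1-in-24 chance, which is why the coin-flip comparison is misleading.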
I’m very proud of a former student who was recognized by (and eventually hired by) a company at a conference because of his model results. His model could only explain about 50% of the variation in the data when predicting the company’s desired outcome. Happily for both parties, the standard in the industry in which he now works is only 30%. By that measure, the student’s model was nearly 1.7 times better than the industry standard. In another application, 50% of variation explained might not be good enough, but in his industry, it made him a star. It’s all revealed by knowing the right baseline.
Summary
Most things in life are difficult to evaluate in a vacuum: darkness is the absence of light; hot means having a temperature higher than the norm. These all depend on context, perspective, and comparison. This is no different for data science and the results derived from predictive models. We should not hold models to a higher standard than is appropriate or necessary. The only way to know what to compare models to is to understand the baseline.
(Was this a helpful blog? “It depends!”)