In this video, senior data scientist Jericho McLeod walks us through an anomaly detection method called Isolation Forests. He demonstrates how to use the technique to quickly and accurately identify outliers by isolating data points. This method has many advantages, including its speed, ability to generalize, and low memory usage.
More on Anomaly Detection: The CADE Method
Anomaly detection tools also help data practitioners assess risk and identify potential cases of fraud across multiple industries. In this video, data scientist Garrett Pedersen demonstrates how anomaly detection methods like CADE help locate outliers in large, multidimensional data.
Transcript
Hi, I’m Jericho McLeod. I’m a senior data scientist here at Elder Research, and I’m going to use machine learning today to help me identify wines that I should pick up.
Now, my goal is to have a wine tasting and compare some specific characteristics of wine to see what I like more. In other words, I'm using data-driven analytics to select wines.
So I’ve got malic acid from the UCI Machine Learning Repository's wine data set, malic acid being what gives wines their sour taste. And I’ve got information on flavonoids. Flavonoids give wine its general mouthfeel and flavor, but too much might not be a good thing.
What are Isolation Forests?
I’m going to use a method to identify anomalies in these two characteristics. It was originally pioneered by Liu, Ting, and Zhou. That method is isolation forests. You might have heard of other machine learning methods using the word “forest,” and there are some similarities. The most common is the random forest, which uses decision trees under the hood, a way to separate data into like categories. In other words, each group ends up containing similar things that you can then throw a label on and say, you know, these are all one thing.
Unlike that, an isolation forest seeks to separate the data completely, using a structure called a binary tree. Just as a quick example, if I had two data points right here and I split at five on malic acid, those two are separated.
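To make that concrete, here's a minimal Python sketch of that single cut; the two malic acid values are made up for illustration.

```python
# A minimal sketch of one binary split, with made-up malic acid values
points = {"P1": 3.2, "P2": 5.8}  # hypothetical malic acid measurements

cut = 5.0  # split the malic acid axis at 5
left = {name for name, value in points.items() if value < cut}
right = {name for name, value in points.items() if value >= cut}
print(left, right)  # {'P1'} {'P2'} -- one cut fully separates the two points
```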
Applying the Technique
Now, if I want to actually apply this to real data, what I’d be looking for is how easy it is to separate a data point from the rest of the data.
So if I look at D, a cut anywhere between, you know, five and six separates D, so that's one cut. E, similarly, is completely separated by a single cut anywhere around 4.5. If we were randomly drawing these cuts, I have a pretty good probability of hitting that range, compared to hitting something over here that separates out a single point. Now, another example is A, in the middle here: it takes a minimum of two cuts. Similarly, for C, I have to make three cuts to completely separate it from all the data. And B, over here, takes four; I have to completely block it in to separate it.
So if we actually write these out, we have A, B, C, D, and E. A takes two, B takes four, C takes three, D only takes one, and E only takes one. But these are lower bounds for this method. In other words, there's no way to completely isolate any of these points with fewer cuts than that.
So if we build a lot of random binary trees, where we completely separate every point by randomly cutting the data, points that need fewer cuts will end up with shorter average paths. And if we actually apply the algorithm on a computer, we get scores like these: D comes out at negative 0.05, as does E, but B gets a positive 0.31, A gets 0.10, and C gets about 0.2.
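Here's a toy Python sketch of that idea: randomly cutting the data until one point is alone, then averaging the cut count over many random trees. It's a simplification of Liu, Ting, and Zhou's algorithm, and the (malic acid, flavonoid) coordinates for A through E are invented to loosely match the figure, so the exact numbers will differ.

```python
import random

def cuts_to_isolate(points, target, rng):
    """Count random axis-aligned cuts until `target` is alone in its partition."""
    remaining = [p for p in points if p != target]
    cuts = 0
    while remaining:
        dim = rng.randrange(len(target))  # pick a random feature
        values = [p[dim] for p in remaining] + [target[dim]]
        lo, hi = min(values), max(values)
        if lo == hi:
            continue  # this axis can't split anything here; pick again
        cut = rng.uniform(lo, hi)  # pick a random cut point in the current range
        side = target[dim] < cut
        # keep only the points on the same side of the cut as the target
        remaining = [p for p in remaining if (p[dim] < cut) == side]
        cuts += 1
    return cuts

# Hypothetical (malic acid, flavonoid) coordinates, loosely inspired by the figure
data = {"A": (3.0, 3.0), "B": (2.9, 2.8), "C": (3.4, 2.9),
        "D": (5.5, 3.1), "E": (1.0, 4.5)}

rng = random.Random(0)
for name, point in data.items():
    avg = sum(cuts_to_isolate(list(data.values()), point, rng)
              for _ in range(1000)) / 1000
    print(f"{name}: average cuts ~ {avg:.2f}")  # D and E should average the fewest
```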
Now, when we score these, I'm using the scoring method that's common in libraries like scikit-learn. Anything negative is going to be an outlier, and anything positive is an inlier. You can even see that the more central a point is to the data, the higher its number, and the lower the number gets, the more of an outlier the point becomes. D and E are clearly outliers in my data, so those are the wines I would pick if I wanted to compare malic acid and flavonoids directly. And then I could use something like B as, kind of, a control, middle of the pack.
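In practice you'd reach for a library implementation rather than rolling your own. Below is a short sketch using scikit-learn's IsolationForest, again with placeholder coordinates rather than the actual UCI wine data; its decision_function follows the same sign convention described above.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Placeholder (malic acid, flavonoid) values, not the actual UCI wine data
X = np.array([[3.0, 3.0],   # A
              [2.9, 2.8],   # B
              [3.4, 2.9],   # C
              [5.5, 3.1],   # D
              [1.0, 4.5]])  # E

model = IsolationForest(n_estimators=200, random_state=0).fit(X)
scores = model.decision_function(X)  # negative = outlier, positive = inlier
labels = model.predict(X)            # -1 for outliers, +1 for inliers

for name, score, label in zip("ABCDE", scores, labels):
    print(f"{name}: score {score:+.2f} -> {'outlier' if label == -1 else 'inlier'}")
```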
Advantages of Isolation Forests
Now, there are a few advantages to using isolation forests, and not just on this data set.
One, it's very fast as a method.
Two, it generalizes quite well. Because it draws a lot of random trees over and over again, it doesn't fit too precisely to this data, which means I can apply it to new data that I haven't seen before.
Three, it doesn't take a lot of memory. That makes it really good for large data sets, if you're concerned about that being an issue.
So that is, in a nutshell, how I can use isolation forests to help select a couple of wines for a tasting this weekend. Thank you.