Analytics and Machine Learning techniques are important decision-making tools. Analytics asks questions from data and gets answers, largely for predicting what is likely to happen. But, it’s easy to get confused by terminology such as supervised and unsupervised learning. What exactly is unsupervised learning? How does it differ from supervised learning, and why are these terms important?
Simply put, a model is unsupervised if there is no target or outcome variable. That is, a model to distinguish fraud from not-fraud is supervised, because it has a target variable (fraud), but an analytic project to find “what segments of customers exist” is unsupervised.
Operationally, the development of unsupervised models requires much greater input from human experts. The planners of the project need to provide adequate time and extra funds for collecting this input from the experts. Project planners should also realize that such experts will usually need to reprioritize their current tasks, so they will have time to provide their input.
The data associated with unsupervised models does not include historical outcomes as a key question of interest. Consequently, data scientists have fewer techniques at their disposal for building unsupervised models than they have for building supervised models. Since there are no known outcomes in the data when building unsupervised models, there is no way to directly compute the accuracy of these models. To develop a measure of the unsupervised model’s predictive power, human experts must manually review a sample of model outcomes and tag the outcome as useful or not.
Now that you better appreciate the importance of understanding these terms, let’s explore them in more detail using a simple illustration. Let’s assume we want to build a model that will use certain variables, or inputs, to predict what grades students will earn. In the situation depicted in Figure 1, let’s further assume that we have data that shows the grades that five students earned and the input variables relating to these students. Note that we have classified the input (independent) variables into two categories.
- Treatments are inputs that we can alter. In our example, they are the instructors, labs, and teaching assistants.
- Covariates are inputs we do not have the ability to change. In our example, they are the overall GPA, a student’s gender, and a student’s prerequisite grade.
With this historical information, we can use supervised learning to build a model to predict the target outcomes (grades) for other students for whom we know the input (independent) variables.
If we do not know the target outcomes (grades) earned by the five students (see Figure 2), we might be able to group students into categories. That is, we might first use an unsupervised clustering algorithm to look for relationships in the data that naturally group students. Then we might look more closely at the characteristics of the groups that the clustering algorithm identified, and we may even be able to label the groups in some way that is helpful for knowing how to serve them (e.g., first-generation college, out-of-state, commuter, etc.).
To collect the information needed to make these predictions, we will interview subject-matter experts within the organization (e.g., teachers and administrators), examine records of previous experience, and employ other data-collection techniques. This model-building process is called unsupervised learning.
Unsupervised learning is technically more challenging than supervised learning, but in the real world of data analytics, it is very often the only option. For example, to create enough labeled cases when building a model to detect fraud, it’s usually impractical to investigate (and thereby label) enough cases in a sample of data to see whether fraud exists. Without definitive historical information about target outcomes, the data scientist must use unsupervised learning techniques to build a model, and then look for anomalies (unusual cases) as a good place to start investigations.
To summarize, supervised learning has target or outcome variables. It uses known cases to find similar types of cases in future data. Unsupervised learning, where there is no target or outcome variable, is more technically challenging than supervised learning and requires more input from subject-matter experts.