Machine learning techniques model relationships in a training data set ("learn from known data") in order to "apply that knowledge to new data"; that is, they employ the model to predict or fill in the output values for new data, where that variable is unknown. One seeks to fit the model accurately to the training data without going too far (overfitting), as the goal is to do well out-of-sample on the new data. Sophisticated algorithms try many possible model forms, which increases the chance of finding descriptive structure in the data if it is there, but also if it isn't. That failure is called over-search: a spurious correlation is mistaken for a real one. So the great potential of machine learning, discovering a hitherto unknown relationship, is also its greatest danger, finding a fake relationship; skill and experience are needed to wield the power of data science and machine learning expertly.
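To make the overfitting risk concrete, here is a minimal sketch (not from the original text) using scikit-learn: as a polynomial model is given more flexibility, its training error keeps falling while its out-of-sample error eventually rises. The synthetic data and degree choices are illustrative assumptions.

```python
# Illustrative sketch: comparing train vs. test error exposes overfitting.
# A model that keeps improving in-sample while degrading out-of-sample has
# begun to memorize noise rather than learn real structure.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)  # noisy signal

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 3, 15):  # increasingly flexible model forms
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(f"degree {degree:2d}: "
          f"train MSE {mean_squared_error(y_train, model.predict(X_train)):.3f}, "
          f"test MSE {mean_squared_error(y_test, model.predict(X_test)):.3f}")
```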
There are two major types of machine learning:
- Unsupervised Learning – When you have no labeled output variable, you can still discover similarity among your input data, which can be used to cluster cases into groups or to find anomalous cases.
- Supervised Learning – When you have a labeled output (target) variable, you can predict or classify it using the input variables. The target (or dependent) variable can be a state or class (like win vs. lose) or an estimate (like sales volume). A brief code sketch of both types follows this list.
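As a quick illustration of both types, here is a sketch using scikit-learn on toy data; the particular algorithms (k-means for clustering, logistic regression for classification) are assumed for illustration, not prescribed above.

```python
# Illustrative sketch of both learning types on toy data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Unsupervised: no labels -- group similar cases into clusters.
X_unlabeled, _ = make_blobs(n_samples=300, centers=3, random_state=0)
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_unlabeled)
print("cluster sizes:", np.bincount(clusters))

# Supervised: a labeled target (e.g., win vs. lose) -- learn to classify it.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```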
Modeling Techniques
There are many powerful and popular machine learning algorithms that operate on structured tabular data, including linear or logistic regression, decision trees, k-nearest neighbors, and polynomial or neural networks. Refinements over the years have made them faster, more customizable, and widely available across multiple analytic platforms. When thoughtfully constructed, these models can be both accurate and explainable, and their results can improve organizational knowledge about the drivers of critical outcomes.
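As one example of an accurate-yet-explainable tabular model, the sketch below fits a shallow decision tree and prints its learned rules; the data set (scikit-learn's breast-cancer sample) and the depth limit are illustrative assumptions.

```python
# A minimal sketch of an explainable tabular model: a shallow decision tree
# whose learned rules can be printed and reviewed by domain experts.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print("held-out accuracy:", tree.score(X_test, y_test))
print(export_text(tree, feature_names=list(data.feature_names)))  # readable rules
```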
John Elder and others discovered in the early 1990s that ensemble models, built from a collection of simple models (often decision trees), usually outperform individual models on out-of-sample data without leading to overfit, an apparent paradox whose explanation Dr. Elder revealed in a Journal of Computational and Graphical Statistics (JCGS) article. Modern ensemble tools like XGBoost can produce very accurate models even before feature selection or engineering. However, such models have low transparency: it is hard to understand how the features work together to create the output. Post-processing tools such as Local Interpretable Model-agnostic Explanations (LIME) can improve interpretability by creating a linear approximation of the model's score surface at a point of interest, thereby delivering observation-level insight into the scores of any model when required.
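The ensemble-plus-LIME workflow described above might look like the following sketch, assuming the third-party xgboost and lime packages are installed (pip install xgboost lime); the data set and parameter settings are illustrative.

```python
# Sketch: an opaque boosted-tree ensemble, then a LIME local explanation.
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0)

# Ensemble of boosted trees: accurate out of the box, but hard to read.
model = XGBClassifier(n_estimators=200, max_depth=3).fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))

# LIME fits a local linear approximation of the score surface around one
# observation, yielding per-feature contributions for that prediction.
explainer = LimeTabularExplainer(
    X_train,
    feature_names=list(data.feature_names),
    class_names=list(data.target_names),
    mode="classification",
)
explanation = explainer.explain_instance(
    X_test[0], model.predict_proba, num_features=5)
print(explanation.as_list())  # top local feature weights for this case
```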
Today machine learning extends to unstructured data and complex use cases. Text mining, natural language processing, deep neural networks, image and video processing, and graph analytics have made machine learning popular and widespread as the benefits become clear.