After a first pass of training and evaluating a model, you may find you need to improve its results. Here is a checklist, adapted from Chapter 13 of the Handbook of Statistical Analysis and Data Mining Applications, of ten practical actions that I’ve found usually help:
1. Transform Real-valued Inputs to be Approximately Normal in Distribution
Regression, for instance, behaves better if the inputs are approximately Gaussian; extreme values have too much influence on squared error. For variables that are typically log-normally distributed, like income, this means transforming the variable with a logarithm or the more general Box-Cox transform.
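As a minimal sketch (the DataFrame and its "income" column are hypothetical), SciPy can apply both transforms; Box-Cox estimates the power parameter that makes the result as close to normal as it can:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical skewed, strictly positive variable (e.g., income).
df = pd.DataFrame({"income": np.random.lognormal(mean=10, sigma=1, size=1000)})

# Simple log transform: pulls in the long right tail.
df["log_income"] = np.log(df["income"])

# Box-Cox is more general; it fits the power parameter (lambda) that makes
# the transformed values as close to normal as possible. Requires positives.
df["bc_income"], fitted_lambda = stats.boxcox(df["income"])
print(f"Fitted Box-Cox lambda: {fitted_lambda:.3f}")
```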
2. Remove Outliers
Note the extremes of each variable and investigate any that are too many standard deviations from the mean. (This is iterative, as outliers have an undue effect on the calculation of the deviation itself.) Don’t necessarily assume they are errors (they could be findings!), but set them aside. Outliers in the target, Y, can affect the model everywhere, and outliers in the inputs, X, can have too much leverage, defining the slope, say, for the entire data set. Even if an outlier is genuine and not an error, it can be best to remove its undue influence. Ask, “Where (in the data space) will this model be used?” and concentrate the model’s attention on that region.
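A minimal sketch of the iterative idea (the 3-sigma cutoff is a convention, not a rule, and the column name is hypothetical); flagged rows are set aside for inspection rather than deleted outright:

```python
import numpy as np
import pandas as pd

def flag_outliers(series: pd.Series, n_sigma: float = 3.0, max_iter: int = 5) -> pd.Series:
    """Iteratively flag values more than n_sigma standard deviations from the mean.

    The mean and deviation are recomputed after each pass because extreme
    values inflate the deviation and can mask other outliers.
    """
    keep = pd.Series(True, index=series.index)
    for _ in range(max_iter):
        m, s = series[keep].mean(), series[keep].std()
        new_keep = (series - m).abs() <= n_sigma * s
        if new_keep.equals(keep):
            break
        keep = new_keep
    return ~keep  # True where the value looks like an outlier

# Usage: inspect the flagged rows before deciding whether they are errors or findings.
df = pd.DataFrame({"x": np.r_[np.random.normal(size=500), 50.0]})
suspects = df[flag_outliers(df["x"])]
```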
3. Reduce Variables
Correlation: Variables are often very similar to others, so those with high (say, 99%) correlation may be redundant. Beware, however, of outliers (again), which can hide or inflate correlation. For instance, we once had a data set where we discovered (painfully) that missing values were sometimes miscoded with a large positive number (e.g., 99999). This led to chaotic noise in the correlation checks, where the reported value depended only on whether the incorrectly coded missing cells in two variables fell on the same cases or not.
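A sketch of correlation-based pruning (the 0.99 threshold is illustrative, and the 99999 sentinel is assumed to be the miscoded missing value described above):

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.99) -> pd.DataFrame:
    """Drop one variable of each pair whose absolute correlation exceeds the threshold."""
    # Guard against miscoded missing values distorting the correlations.
    cleaned = df.replace(99999, np.nan)
    corr = cleaned.corr().abs()
    # Look only at the upper triangle so each pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)
```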
Principal Components (PC): You can often replace the original variables, X, with a much smaller set of PCs while still retaining the vast majority of the variance in the input data. The PCs are linear transformations of X designed to be mutually orthogonal and span as much of the input space as possible. For instance, one might be able to represent 90% of the space covered by cases in 150 variables using only 20 PC dimensions. Note, however, that PCs don’t consider the output variable when being defined, so they may not be the best vocabulary (data transformation) for classification. Also, all of the X variables still have to be measured to calculate the PCs.
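A sketch with scikit-learn, keeping just enough components to explain (say) 90% of the variance; the data here is random filler, and the variables are standardized first so no single scale dominates:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = np.random.rand(500, 150)  # hypothetical stand-in for 150 input variables

# Passing a fraction as n_components keeps the smallest number of components
# whose cumulative explained variance reaches that fraction.
pipeline = make_pipeline(StandardScaler(), PCA(n_components=0.90))
X_pcs = pipeline.fit_transform(X)
print(X_pcs.shape)  # (500, k) for some k far smaller than 150
```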
Follow the Results of Variable-selecting Algorithms: Many algorithms, such as neural networks or nearest neighbors, don’t select variables. But you can first run algorithms that do select, such as stepwise regression, decision trees, or polynomial networks, and follow their lead. Try using only the union of the variables they pick up. There is no guarantee this will be the best set for your algorithm, but the approach often proves useful in practice.
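One way to sketch the idea, using a random forest and an L1-regularized logistic regression as stand-ins for the variable-selecting algorithms (the data and thresholds are illustrative only):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=50, n_informative=8, random_state=0)

# Let two different selectors vote, then take the union of what they keep.
tree_sel = SelectFromModel(RandomForestClassifier(n_estimators=200, random_state=0)).fit(X, y)
l1_sel = SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear", C=0.1)).fit(X, y)

selected = np.where(tree_sel.get_support() | l1_sel.get_support())[0]
print("Candidate variables to feed the final (non-selecting) model:", selected)
```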
4. Divide and Conquer
Many simple models may be more accurate than a single complex one. You can remove from training any simple subset of the problem you can clearly define, and focus your modeling energy on the hard part that remains. For instance, if all patients with a certain symptom are to be recommended for immediate treatment, take that set out of the data as a known situation, and train on the rest. Slice the data this way enough and many aspects of the problem will change; the fresh perspective on what the true problem really is can even lead to new discoveries.
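A minimal sketch of carving out the "known" subset before training (the symptom rule and column name are hypothetical):

```python
import pandas as pd

def split_known_cases(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Separate cases handled by a fixed rule from those that need a model."""
    # Hypothetical rule: patients with this symptom always get immediate treatment.
    known_rule = df["symptom_x"] == 1
    handled_by_rule = df[known_rule]
    needs_model = df[~known_rule]
    return handled_by_rule, needs_model

# Train only on needs_model; apply the rule directly at prediction time.
```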
5. Combine Variables to Create Higher-order Features
Don’t try to “build a critter from pond-scum”; use higher-order components. For instance, on a trajectory estimation problem, calculate where the craft will land without any complex effects, such as Earth’s rotation and air resistance, and have the modeling algorithm estimate the shortfall instead of the full distance. Likewise, feeding a model every pixel value or time-series point will usually fail. Yet, higher-order features, like edges or trends, can be tried creatively; leave it to a variable-selecting modeling algorithm to pick and choose the particular feature versions to use.
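The trajectory example amounts to modeling the residual of a simple baseline rather than the raw target. A sketch under that reading (the baseline formula, columns, and regressor choice are all illustrative assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

def simple_ballistic_estimate(df: pd.DataFrame) -> pd.Series:
    """Hypothetical first-order range estimate, ignoring rotation and drag."""
    g = 9.81
    return df["v0"] ** 2 * np.sin(2 * np.radians(df["launch_angle"])) / g

def fit_residual_model(df: pd.DataFrame, features: list[str]) -> GradientBoostingRegressor:
    baseline = simple_ballistic_estimate(df)
    shortfall = df["observed_range"] - baseline  # model only the correction
    return GradientBoostingRegressor().fit(df[features], shortfall)

# Final prediction = physics baseline + learned correction.
```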
6. Impute Missing Data
The easiest way to handle missing data in training is to not allow it. But this means deleting any case or variable for which a cell is empty, so it may overly reduce your data. To get the benefit of the information in slightly “holey” cases, try filling in the data gaps with different alternatives (see the sketch after this list):
- Mean of known cases for the variable
- Last value (if cases are in order)
- Estimated value from other known input values (best, but most complex)
- If categorical, you can add the label “missing”
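A sketch of these four options with pandas and scikit-learn (the toy columns are hypothetical, and a k-nearest-neighbors imputer stands in for "estimate from the other inputs"):

```python
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "age": [34.0, None, 51.0, 29.0],
    "income": [52_000.0, 61_000.0, None, 43_000.0],
    "region": ["west", None, "east", "south"],
})

# 1. Mean of the known cases for the variable
df["age_mean"] = df["age"].fillna(df["age"].mean())

# 2. Last known value (only sensible when the cases are in a meaningful order)
df["age_ffill"] = df["age"].ffill()

# 3. Estimate from the other known inputs (here, nearest-neighbor imputation)
numeric = df[["age", "income"]]
imputed = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(numeric),
                       columns=["age_est", "income_est"], index=df.index)
df = df.join(imputed)

# 4. Categorical: treat "missing" as its own label
df["region"] = df["region"].fillna("missing")
```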
7. Explode Categorical Variables to Allow Use of Estimation Routines
A categorical variable can’t be used directly in an estimation algorithm like regression or neural networks. But you can “explode” a C-category variable into C-1 binary variables, where each holds a 1 when its category is the value for that case. (However, note suggestion #8, below.)
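A sketch with pandas; drop_first=True gives the C-1 binary columns mentioned above, with the dropped category acting as the baseline:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One binary column per category except the first, which becomes the baseline.
dummies = pd.get_dummies(df["color"], prefix="color", drop_first=True)
df = pd.concat([df.drop(columns="color"), dummies], axis=1)
print(df)
```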
8. Merge Categories if There are Too Many
An algorithm usually gives each value of a categorical variable its own parameter, so overfitting is very likely if there are too many categories. For instance, a person’s State can take over 50 values (with Washington, DC, Puerto Rico, etc.), but the important (well-populated) states in a database may be far fewer. Keep the large categories and merge the rest into “Other”, or group them by domain knowledge, such as regions. Always look for differences among categories and merge them if the differences aren’t significant enough.
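A sketch of keeping the well-populated categories and folding the rest into "Other" (the count cutoff is arbitrary and should come from your data):

```python
import pandas as pd

def merge_rare_categories(s: pd.Series, min_count: int = 100) -> pd.Series:
    """Keep categories with at least min_count cases; merge the rest into 'Other'."""
    counts = s.value_counts()
    keep = counts[counts >= min_count].index
    return s.where(s.isin(keep), other="Other")

# Usage: df["state"] = merge_rare_categories(df["state"], min_count=500)
```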
9. Merge Variables with Similar Behavior
If you have real-valued variables that are strongly correlated, you can average them to create one candidate variable. For binary variables, you can examine their expected value conditioned on the output in the training data set, and create a super-variable that is the sum of those which tend toward one of the output classes. For instance, people might have flags (bits) reflecting their interest (or lack thereof) in NASCAR racing, vegetarian cooking, Field and Stream magazine, progressive rock, etc. You could find which are correlated with voting patterns for each party over the training data and create one variable per party to hold the sum of those bits. Such a super-variable will be much less sparse than the original variables.
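A sketch of the interest-flag example: assign each binary flag to the class where its conditional mean is highest, then sum the flags per class into one super-variable (column names in the usage line are hypothetical):

```python
import pandas as pd

def build_super_variables(flags: pd.DataFrame, y: pd.Series) -> pd.DataFrame:
    """Sum binary flags according to which output class they lean toward.

    For each flag, compute its mean within each class (its expected value
    conditioned on the output), assign the flag to the class where that mean
    is highest, and sum the flags per class into one super-variable.
    """
    conditional_means = flags.groupby(y).mean()   # rows: classes, columns: flags
    leaning = conditional_means.idxmax(axis=0)    # flag -> class it leans toward
    out = pd.DataFrame(index=flags.index)
    for label in conditional_means.index:
        cols = leaning[leaning == label].index
        out[f"super_{label}"] = flags[cols].sum(axis=1)
    return out

# Usage (hypothetical columns):
# supers = build_super_variables(df[["nascar", "vegetarian", "field_and_stream"]], df["party"])
```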
10. Spherify Data
Many algorithms (e.g., nearest neighbors, clustering) need the variables to be on the same scale and to be independent. But these conditions are not natural in practice. For instance, a person’s weight might be in pounds and height in inches, so a unit step in height is much more significant than a unit step in weight. To handle scaling and offset, you can normalize the data by transforming each variable with Z = (X - m) / s, where m and s are the mean and standard deviation of X. But a more serious issue is correlation. To properly account for it, I recommend the Mahalanobis distance function, which employs the inverse of the covariance matrix of the variables (equivalently, the correlation matrix once they are standardized). (It is not implemented in commercial tools as often as it should be!) A beneficial result is that, for instance, the Mahalanobis distance from a point to the mean of the data is a good measure of that point’s atypicality.
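A sketch of both steps on filler data: z-scoring each variable, then measuring each point's Mahalanobis distance to the data mean via the inverse covariance matrix (SciPy provides the distance function):

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

X = np.random.rand(500, 4) * [1, 10, 100, 1000]  # hypothetical mixed-scale data

# Z-score each variable: subtract its mean, divide by its standard deviation.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Mahalanobis distance also accounts for correlation between variables,
# through the inverse of their covariance matrix.
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
center = X.mean(axis=0)
atypicality = np.array([mahalanobis(x, center, cov_inv) for x in X])

# Points with large distances from the mean are atypical in the joint sense,
# not merely extreme on any single variable.
```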