Since its founding more than twenty years ago, Elder Research has been involved in hundreds of data mining projects. Most of those projects employ numerical data, but for about a decade now we have increasingly been called on to extract information from unstructured or semi-structured text. Though Gartner recently classified Text Analytics as just exiting the “Trough of Disillusionment” on their famous “Hype Cycle”,[1] we have found that every text mining project we have worked on has been a success.
In each case our team found valuable new information, and the project met its goal, sometimes dramatically. But that may be because many text datasets still hold low-hanging fruit: no one has worked in that orchard before.
The relative newness of text analytics contributes, in part, to the success organizations see when first implementing a text mining project. Often an organization has been collecting text data – perhaps what customers are telling it – but has not yet looked at it analytically. If so, some simple text analytics efforts may pay off right away.
Once the low-hanging fruit is gone, however, further results require hard work. Squeezing out more juice demands sophisticated methods, and it can feel like 90% of the effort is data preparation: lining up the features so that you have nice, rich keywords to work with. We use some commercial software, but because text analysis is a relatively young field, we have also had to build some tools ourselves to get the capabilities we wanted.
One project that required both commercial and custom software to achieve our text analytics goal was a claims approval project with the Social Security Administration (SSA). President Bush charged the SSA with finding a way to automatically approve disability applicants with “obvious” cases; that is, clearly-approvable cases containing all the necessary documentation. This would create the equivalent of an express lane, leaving adjudicators fewer cases to review and improving service for everyone. Only about a third of the people who apply for disability eventually receive it, yet more than two-thirds are turned down initially, because the one-third approval rate includes those who win on appeal. For those who are turned down, five levels of appeal are available (during which their case can strengthen as their health weakens!), and it can take up to two years to move through those layers.
The point of the project was to find the cases that are obvious approvals. Identifying disability claims that meet the requirements for approval was a time-consuming and error-prone process. Not only were claims taking much too long for very ill or elderly claimants, but the caseload was so great that many were not even looked at for the first time for many months! Yet a computer could “look at”, or score, an application immediately.
Text analytics is well-suited to this application. Human adjudicators are spread across every state and, unlike a computer model, they are not consistent in their rulings. True, a computer has no common sense, judgment, or ability to reason about anything it has not seen before. But it has the advantage of “seeing” millions of cases, where a human may have seen only hundreds, and it can use statistics to average, or get the “gist” of, the overall pattern in the vast repository of previous human judgments. Thus, the computer model should do best on the types of cases that appear over and over again and where the approval decision is clear. If those cases could be automatically flagged for likely approval and given a fast-track process for confirmation and resolution, everyone would benefit. With the clear approval cases out of the way, adjudicators could review the remaining, more borderline cases and process them more efficiently than has been possible in the past.
Here, the text was semi-structured, in that the words were not random but focused on a specific topic: “What is wrong with your health?” Still, even with a rich set of keywords in this “health allegation” field, it took significant effort to find the right solution. We had to solve two major technical issues (which I plan to detail later):
- Estimate the probability of approval associated with each keyword
- Combine the keyword probabilities to obtain an application probability
For the former, I came up with a Bayesian technique that I named (very un-dramatically) “non-zero table initialization”, which smoothly transitions from the prior approval probability (1/3) to the keyword’s frequentist estimate as more data accumulates. For the latter, we made a list of desired properties for the equation that would combine evidence (such as “lots of little ailments should not add up to equal one big ailment” and “a little ailment should not detract from a big one”). No “textbook” equation satisfied all of those properties, so we had to engineer a custom equation that met every common-sense criterion.
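To make the two ideas concrete, here is a rough sketch in Python. The smoothing shown is a standard pseudocount scheme consistent with the description above, and the combining rule is merely one way to encode the common-sense properties; the pseudocount weight, the bonus weight, and the particular combining formula are illustrative stand-ins, not the actual equations from the project.

```python
def keyword_probability(approved, total, prior=1/3, pseudocount=10):
    """Smoothed approval probability for a single keyword.

    With little data the estimate stays near the prior (about 1/3,
    the overall approval rate); as counts accumulate it converges
    to the frequentist rate approved / total.  The pseudocount
    weight of 10 is an illustrative choice.
    """
    return (approved + pseudocount * prior) / (total + pseudocount)


def combine_probabilities(probs, prior=1/3, bonus_weight=0.15):
    """Illustrative rule for combining keyword probabilities.

    Anchors on the strongest single keyword, then adds only a small,
    bounded bonus from the remaining keywords, so that a little
    ailment never detracts from a big one and many little ailments
    cannot add up to equal one big ailment.
    """
    if not probs:
        return prior  # no keywords: fall back to the prior
    ordered = sorted(probs, reverse=True)
    strongest, rest = ordered[0], ordered[1:]
    if not rest:
        return strongest
    bonus = bonus_weight * (1.0 - strongest) * (sum(rest) / len(rest))
    return strongest + bonus


# Example: three minor allegations vs. one clearly-approvable allegation
minor = [keyword_probability(5, 40)] * 3   # weak evidence, repeated
major = [keyword_probability(90, 100)]     # strong evidence
print(combine_probabilities(minor))        # stays low (~0.19)
print(combine_probabilities(major))        # stays high (~0.85)
```

Any monotone rule that caps the contribution of secondary keywords would satisfy the same two properties; the point is that the real equation had to be engineered against the full list of common-sense criteria, not pulled from a textbook.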
That solution worked very well and was well received by the client, as it was clear and understandable. And the allegation text was worth more than all the other information put together! It took a lot of hard work to get to that point – to find and create features that capture the information as well as possible and then relate those features to the desired outcome – but the end result was well worth it. The model was more accurate and more consistent than any single adjudicator’s or doctor’s decision, and it allowed 20% of the claims that would eventually be approved to be identified immediately.
[1] I suspect Gartner places it there due to a gap between expectations and outcomes. But with realistic expectations, great outcomes are possible. Many examples are discussed in our award-winning book on Practical Text Mining.