Many organizations have large amounts of unstructured text data that is inaccessible to their analytics systems. Unstructured text carries no tags describing the meaning, connotation, and denotation of its words. Trying to make sense of it is like trying to read this page without a brain trained on the semantics and grammar of English. Text Mining and Natural Language Processing (NLP) provide the machine equivalent of a brain capable of reading, that is, of extracting structured information from text.
PDF files may contain images of rendered characters, called glyphs, rather than the actual text characters, and these must first be processed using optical character recognition (OCR). Once you have machine-readable text, whether from OCR, direct text extraction from PDFs, HTML web pages, word-processing documents, or structured databases, the main work of Text Mining can begin.
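As a concrete illustration, here is a minimal OCR sketch in Python. It assumes the pdf2image and pytesseract packages (and the underlying Tesseract engine) are installed; the file name is a placeholder.

```python
# Minimal OCR sketch: render PDF pages as images, then extract their text.
# Assumes pdf2image, pytesseract, and the Tesseract engine are installed.
from pdf2image import convert_from_path
import pytesseract

pages = convert_from_path("scanned_report.pdf")  # hypothetical file name
text = "\n".join(pytesseract.image_to_string(page) for page in pages)
print(text[:500])  # first 500 characters of the recovered text
```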
Primary Techniques
Text Classification
Text Classification determines the label of a document. A document could be a tweet, a line of text from a financial report, a web page, or a multi-page PDF. The label could denote a sentiment such as positive/neutral/negative, a rating from 1 (worst) to 5 (best), or a type such as Running Header, Footnote, or Section 2. These example labels are mutually exclusive, but you could also allow a document to have more than one label, which is a harder, multi-label problem. We have, for example, trained deep neural networks (DNNs) to classify lines in the pages of an audit, sentiment in financial news, and the severity of health issues in medical notes.
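To make the workflow concrete, here is a minimal single-label classifier sketch using scikit-learn; the tiny training set and labels are made up for illustration, and a production model would need far more data.

```python
# Minimal text-classification sketch: TF-IDF features + logistic regression.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["Great quarter, revenue up sharply.",              # illustrative data
        "Results were disappointing and guidance was cut.",
        "Earnings were roughly in line with expectations."]
labels = ["positive", "negative", "neutral"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(docs, labels)
print(model.predict(["Profits fell and the outlook worsened."]))
```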
Topic Modeling
Topic Modeling is used to explore the topics that your documents contain. One typically sets the number of topics as an argument and then determines which documents focus most on each topic. Popular techniques include Latent Dirichlet Allocation and Non-negative Matrix Factorization. While Topic Modeling can be implemented comparatively quickly, it has the downside that the “topics” the models create are mathematical artifacts built from word co-occurrences that don’t necessarily correspond to what a human would naturally describe as a topic. So the topics discovered may not be relevant to the question you want to ask of your data.
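Here is a minimal Latent Dirichlet Allocation sketch using scikit-learn; the three-document corpus and the choice of two topics are assumptions for illustration only.

```python
# Minimal topic-modeling sketch: word counts + Latent Dirichlet Allocation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = ["interest rates and bond yields rose again",       # illustrative data
          "the audit found weaknesses in inventory controls",
          "auditors noted missing inventory count records"]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(corpus)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[::-1][:5]]  # top 5 words per topic
    print(f"Topic {i}:", ", ".join(top))
```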
We have extensive experience using Topic Modeling to explore and partition documents. For example:
- Web pages – We found that an institutional website contained pages in Spanish that the client was unaware of.
- Audit findings – We partitioned findings into informative sub-categories instead of the catch-all “financial” category. With input from subject matter experts, we then built on these sub-categories to create a Text Classification model that detects the sentiment of financial experts speaking about different market segments.
Document Similarity
You may not be ready to define particular topics, but you know that you want to find more documents like a particularly interesting one. Most advanced text mining techniques pre-process words into embeddings rather than using the raw words. These embeddings turn words into numeric vectors that reflect similarity: words used in similar contexts map to vectors that are close to each other. A common such technique is Word2Vec. Applied to an entire document, an embedding can be used to find other documents that use similar, or equivalent, words in similar contexts. A popular method is Doc2Vec, which we used to help auditors answer the question, “Which other audits have findings similar to this interesting one?” The method can overcome obstacles such as synonyms and differing writing styles that would otherwise obscure the fact that two findings are similar even though they use different words.
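A minimal Doc2Vec sketch with the gensim library is shown below; the toy corpus is an assumption, and real use would involve thousands of documents and tuned hyperparameters.

```python
# Minimal document-similarity sketch with gensim's Doc2Vec.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

texts = ["inventory controls were inadequate",            # illustrative data
         "stock counts were not performed regularly",
         "the website lists upcoming public events"]
corpus = [TaggedDocument(words=t.split(), tags=[i]) for i, t in enumerate(texts)]

model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

# Embed a new "interesting" document and find its nearest neighbors.
query = model.infer_vector("inventory counting was deficient".split())
print(model.dv.most_similar([query], topn=2))
```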
Text Information Extraction
Information extraction parses text to discover named entities (people, organizations), actions and their objects, or other specific targets. This allows you to answer questions like “What other companies are mentioned in this firm’s financial documents?” or to determine which company is reported to have transferred funds to another, regardless of whether the statement is made in the active or passive voice, which can confuse simpler, pattern-based approaches.
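For instance, here is a minimal named-entity extraction sketch with spaCy, assuming its small English model has been downloaded (python -m spacy download en_core_web_sm); the company names in the sample sentence are fictitious.

```python
# Minimal named-entity extraction sketch with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Acme Corp. transferred funds to Globex Ltd. on May 14, 2020.")

# Each entity comes with a predicted type such as ORG, DATE, or MONEY.
for ent in doc.ents:
    print(ent.text, ent.label_)
```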
If the text is highly formatted, a simpler pattern-based approach can work well. For example, if report dates appear in a standard format such as “Report Date: 05/14/2020”, then Natural Language Processing isn’t needed. But NLP is more flexible and extensible when strict formatting isn’t followed.
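That pattern-based alternative can be as simple as a regular expression; the sketch below captures dates that follow the fixed “Report Date: MM/DD/YYYY” format from the example above.

```python
# Minimal pattern-based extraction sketch: a regular expression for dates
# that always follow the fixed "Report Date: MM/DD/YYYY" format.
import re

text = "Quarterly filing. Report Date: 05/14/2020. Prepared by the CFO."
match = re.search(r"Report Date:\s*(\d{2}/\d{2}/\d{4})", text)
if match:
    print(match.group(1))  # 05/14/2020
```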
Audio Transcription
If you want to analyze spoken words, you can transcribe the audio into text to use as input to one of the other techniques described here. But with sufficient data you can train an end-to-end DNN that takes audio as input and directly produces outputs, skipping the intermediate transcription. Still, having a transcript is often useful and can drive many follow-on techniques.
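As one option, here is a minimal transcription sketch using the open-source openai-whisper package; the model size and audio file name are assumptions for illustration.

```python
# Minimal audio-transcription sketch with the openai-whisper package.
import whisper

model = whisper.load_model("base")               # small pre-trained model
result = model.transcribe("call_recording.wav")  # hypothetical audio file
print(result["text"])            # transcript for downstream text mining
```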
Case Studies
Chatbots and Q&A – Many people are happy to text chat with an agent online rather than wait for a person to answer a call. Chatbots can handle some conversations end to end, or they can gather initial information, triage the customer’s needs, and direct them to a human agent.
Text Translation – It is challenging to translate text from one language to another. Commercial players (Google, Microsoft, Amazon) offer state-of-the-art services that can save a lot of time. In specific situations it is possible to train a custom translator, preferably using transfer learning to take advantage of these commercial models trained at great expense. Transfer learning keeps most of a pre-trained model, which has already been trained on enormous sets of documents, and uses your own documents to refine it so the results are more specific to your use case. This may involve replacing parts of the pre-trained model, and while it brings enormous value to the task, it is not easy to do properly.
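Here is a minimal sketch of translating with a pre-trained model via the Hugging Face transformers pipeline; the Helsinki-NLP model named below is one publicly available English-to-Spanish option, not the only choice.

```python
# Minimal translation sketch using a pre-trained model from the
# Hugging Face hub (downloads the model on first run).
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-es")
print(translator("The audit identified three material weaknesses."))
```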
Text Summarization – Text summarization can be done using one of two methods:
1) Extractive Summarization is analogous to using a highlighter on a document to emphasize important sentences. A model determines which sentences are key and should be included in the summary, while the rest are ignored.
2) Abstractive Summarization uses a Text-Translation-style model to “translate” a longer text into a shorter, more concise one in the same language.
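A minimal abstractive-summarization sketch using the transformers pipeline follows; the default model it downloads and the sample text are assumptions for illustration.

```python
# Minimal abstractive-summarization sketch with the transformers pipeline.
from transformers import pipeline

summarizer = pipeline("summarization")  # uses the pipeline's default model
long_text = ("The company reported strong quarterly earnings driven by growth "
             "in its cloud division, while hardware sales declined for the "
             "third straight quarter and management announced a restructuring.")
print(summarizer(long_text, max_length=30, min_length=10))
```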