Deep Neural Networks (DNN) have radically changed the landscape of state-of-the-art performance in Natural Language Processing (NLP) in recent years. These versatile models are used in many applications, including text classification, language generation, question answering, image captioning, language translation, named entity recognition, and speech recognition. The state of the art is changing quickly, sometimes with large leaps in performance when a new architecture is released. In October 2018, Google released BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, which set new state-of-the-art results on 11 different NLP benchmarks upon release. Since then, many more models have added new components or tweaked the approach. In this article we'll review some of the traditional machine learning methods used in deep learning, as well as newer trends such as Transfer Learning and Transformers, to provide a foundation no matter which model is currently leading.
Some Background
Before the advent of DNNs, NLP relied on word counts within text data to classify its topic using Bag of Words models such as Naïve Bayes and Support Vector Machines (SVM). These models represent a document by how often each word appears in it; Naïve Bayes, for example, uses those counts to estimate the probability that a document belongs to a given topic. However, these techniques ignore a critical factor: word order (i.e., context) matters. For example, consider the two statements shown below.
In a bag-of-words model, the two sentences produce identical model input, so this kind of text mining throws away key information. Things get even more complicated when you try to discern sentiment. Consider whether the following statements express a positive, neutral, or negative sentiment about the cookie market.
Because sentiment can be ambiguous and subjective even for humans to discern, it is a notoriously hard machine learning problem.
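To make the bag-of-words limitation concrete, here is a minimal sketch using scikit-learn's CountVectorizer. The two sentences are illustrative stand-ins (not the statements pictured above), chosen so that swapping the word order reverses the meaning while the word counts stay the same.

```python
# A minimal sketch of how a bag-of-words representation discards word order.
# The sentences are illustrative examples, not the statements from the figures above.
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "the dog bit the man",
    "the man bit the dog",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())  # ['bit' 'dog' 'man' 'the']
print(counts.toarray())
# [[1 1 1 2]
#  [1 1 1 2]]  <- identical rows, even though the meanings are opposite
```

Both sentences map to exactly the same count vector, which is precisely the information loss that the models below are designed to avoid.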
The application of deep neural networks to NLP problems began capturing context with Recurrent Neural Networks (RNN), which are designed to process sequential data such as text or time series. However, as RNNs process new data over time they "forget" about previous data: they have a good short-term memory, but don't remember things that happened many steps in the past. To address this, gating mechanisms such as the Gated Recurrent Unit (GRU) and Long Short-Term Memory (LSTM) are added to RNNs to retain information over longer sequences of inputs.
Figure 1. Example of a recurrent neural network
In an RNN you feed text tokens (e.g., sunny, day, etc.) into the neural network sequentially and pass the output of each step forward to be used along with the next input token. This creates a form of memory that retains past information. You can preserve word order and train the model bi-directionally (reading the text sequence from left to right and from right to left) to gain a deeper understanding of language flow and context, as shown in Figure 2.
Figure 2. Bi-directional training of a recurrent neural network
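A minimal PyTorch sketch of that recurrence (the loop pictured in Figure 1) is shown below; the vocabulary and layer sizes are placeholders chosen for illustration. Each token updates a hidden state that is passed along to the next step.

```python
# A toy recurrent loop: the hidden state carries information from earlier tokens forward.
import torch
import torch.nn as nn

vocab = {"what": 0, "a": 1, "sunny": 2, "day": 3}   # illustrative vocabulary
tokens = ["what", "a", "sunny", "day"]

embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)
rnn_cell = nn.RNNCell(input_size=8, hidden_size=16)

hidden = torch.zeros(1, 16)                     # the network's initial "memory"
for tok in tokens:
    x = embed(torch.tensor([vocab[tok]]))       # (1, 8) vector for the current token
    hidden = rnn_cell(x, hidden)                # new state depends on input AND old state

print(hidden.shape)  # torch.Size([1, 16]); a summary of the whole sequence so far
```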
Originally proposed by Hochreiter and Schmidhuber in 1997, the Long Short-Term Memory (LSTM) network architecture captures a sequential state (a type of memory). The concept is, "as a neural network performs millions of calculations, the LSTM code looks for interesting findings and correlations. It adds temporal context to the data analysis, remembering what has come before and drawing conclusions about how that applies to the neural net’s latest findings. This level of sophistication makes it possible for the AI to start building its conclusions into a broader system—teaching itself the nuances of language." In other words, an LSTM keeps a state variable that it curates based on its inputs (Figure 3).
Figure 3. LSTM cell example with state as the top horizontal line passing between cells
For example, if the model sees a word such as “not”, it is going to update the state to negate whatever follows. In translation tasks, LSTM models can keep track of gendered nouns. These states can be passed to other layers but are typically only internal to the LSTM cell.
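Below is a minimal PyTorch sketch of that state, with random vectors standing in for token embeddings. Alongside its per-token outputs, the LSTM returns a cell state (the top horizontal line in Figure 3) that it has been updating at every step.

```python
# A toy LSTM run showing the per-token outputs and the carried cell state ("memory").
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

sequence = torch.randn(1, 5, 8)   # 1 sequence of 5 tokens; random stand-ins for embeddings

outputs, (hidden_state, cell_state) = lstm(sequence)

print(outputs.shape)      # (1, 5, 16): one output vector per token
print(cell_state.shape)   # (1, 1, 16): the long-term state after the final token
```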
All neural networks are great at discovering features that can then be used for labeling or classification tasks, but LSTMs excel at making predictions from sequential (e.g., time-series) data. They can be used with any type of sequential data, from text to audio and video. They are used for language translation, speech recognition, composing music, image captioning, and even in DeepMind's AlphaStar, the first artificial intelligence to defeat a top professional player at StarCraft II, considered one of the most challenging Real-Time Strategy (RTS) games.
LSTMs are a top deep-neural-network architecture for time series problems. However, one of their shortcomings is that they can forget important information as they near the end of a long sequence, replacing key information from the beginning of the sequence with newer, possibly less useful, information from later in the sequence. To help address this issue, the "attention" mechanism was developed. It considers all of the inputs and outputs and focuses on the subset it "learns" is most useful for the specific task. This concept is visually demonstrated in the image captioning example below.
Attention for text classification applies a softmax over all of the LSTM's outputs, producing a set of weights that sum to one. This can be thought of as distributing one unit of attention across the words in the sentence; through training, the model learns which words are most useful for classification. This is easier to understand in the hierarchical attention network example shown in Figure 4.
Figure 4. Hierarchical LSTM model with attention for text classification
This particular network was trained to classify the sentiment of Yelp reviews. The first layer (word encoder/attention) weights the words within each sentence by how informative they are; the second layer (sentence encoder/attention) focuses attention on the sentences that are most predictive of the Yelp star rating. As you can see in Figure 4, it identifies the key words within each line and the key lines within each review.
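A simplified, single-level sketch of that word-attention step (not the full hierarchical network in Figure 4) might look like the following in PyTorch, with a random tensor standing in for the LSTM outputs.

```python
# Word-level attention: score each LSTM output, softmax the scores so they sum to one,
# and take the weighted sum as a sentence vector for classification.
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_size = 16
lstm_outputs = torch.randn(1, 7, hidden_size)   # stand-in outputs for a 7-token sentence

scorer = nn.Linear(hidden_size, 1)              # learns which words matter
scores = scorer(lstm_outputs).squeeze(-1)       # (1, 7): one raw score per word
weights = F.softmax(scores, dim=-1)             # (1, 7): the "one unit of attention"

sentence_vector = (weights.unsqueeze(-1) * lstm_outputs).sum(dim=1)   # (1, 16)
print(weights)                 # how much attention each word received
print(sentence_vector.shape)   # fed to a classifier layer in a full model
```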
Transfer Learning
Deep learning models tend to perform best with large amounts of labeled data, but creating labeled data is expensive because it relies on manual annotators (humans) to create the labels (e.g., text captions for images). Transfer learning helps solve this by first training a language model on large volumes of unlabeled text, using training tasks such as CBOW (Continuous Bag of Words) that require no human labels.
Transfer Learning is a research area in machine learning that focuses on storing knowledge gained from solving one problem and applying it to a different but related problem. For example, knowledge gained while learning to recognize cars could be used when trying to recognize trucks. Transfer Learning became popular in NLP thanks to the strong performance of models such as ULMFiT, ELMo, and BERT. These models are trained using a form of self-supervised learning: they turn any text into a supervised learning activity by creating tasks that don't require human labeling. The most common of these tasks is to predict a word that has been hidden, e.g., "I left milk and _____ for Santa."
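A minimal sketch of how such training examples can be generated from raw text is shown below; the masking function, mask token, and sentence are illustrative, not the recipe used by any particular model.

```python
# Turn unlabeled text into (input, label) pairs by hiding a random word;
# the "label" comes from the text itself, so no human annotation is needed.
import random

def make_masked_example(sentence, mask_token="[MASK]"):
    words = sentence.split()
    position = random.randrange(len(words))
    target = words[position]
    words[position] = mask_token
    return " ".join(words), target

masked, target = make_masked_example("I left milk and cookies for Santa")
print(masked)   # e.g. "I left milk and [MASK] for Santa"
print(target)   # e.g. "cookies", the word the model must predict
```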
For the word embedding problem, word context matters. Consider the meaning of “cookie” in the three statements below:
To understand the meaning of “cookie” in each statement you really need to know the context. ELMo is a pre-trained model that captures the entire context of a sentence before deciding how to embed it. ELMo uses multiple layers of bidirectional LSTMs as shown in Figure 5.
Figure 5. Example of ELMo embedding
ELMo is a transfer learning model that determines the optimal embedding for a task from a linear combination of the outputs of all of its LSTM layers. These dynamic, context-dependent embeddings provide a performance lift for the NLP models you build on top of ELMo.
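A sketch of that layer-combination idea is shown below; the random tensors stand in for the outputs of a pre-trained model's layers, and the scalar weights would be learned jointly with the downstream task.

```python
# ELMo-style trick: learn a task-specific weighted sum of the layer outputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

num_layers, seq_len, dim = 3, 7, 32
layer_outputs = torch.randn(num_layers, seq_len, dim)   # stand-ins for per-layer outputs

layer_weights = nn.Parameter(torch.zeros(num_layers))   # learned with the downstream task
gamma = nn.Parameter(torch.ones(1))                     # overall learned scale

normalized = F.softmax(layer_weights, dim=0)            # layer weights that sum to 1
embedding = gamma * (normalized.view(-1, 1, 1) * layer_outputs).sum(dim=0)

print(embedding.shape)   # (7, 32): one context-dependent vector per token
```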
Shortly after ELMo, Universal Language Model Fine-tuning (ULMFiT) was released by Jeremy Howard of fast.ai and Sebastian Ruder. ULMFiT seeks to create a transfer learning model that can easily adapt to any NLP classification task. It uses techniques such as gradual unfreezing, max/mean pooling, dropout, and adaptive learning rates. Elder Research recently used ULMFiT on a classification project and it performed very well, surpassing the LSTM-with-attention model by about 5% in accuracy. It also runs faster than BERT.
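As an illustration, here is a rough sketch of gradual unfreezing in plain PyTorch; the three layer groups are hypothetical placeholders for the embedding, encoder, and classifier head of a text model, not ULMFiT's actual architecture.

```python
# Gradual unfreezing: start with only the head trainable, then unfreeze earlier
# layer groups one stage at a time, fine-tuning between stages.
import torch.nn as nn

embedding = nn.Embedding(10_000, 64)             # hypothetical group 0
encoder = nn.LSTM(64, 128, batch_first=True)     # hypothetical group 1
head = nn.Linear(128, 2)                         # hypothetical group 2 (classifier)
groups = [embedding, encoder, head]

for group in groups:                             # freeze everything to start
    for p in group.parameters():
        p.requires_grad = False

for stage in range(1, len(groups) + 1):          # unfreeze from the last group backwards
    for group in groups[-stage:]:
        for p in group.parameters():
            p.requires_grad = True
    # ... fine-tune for an epoch or two here before unfreezing the next group ...
```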
Transformers
In 2017, Google released a paper called Attention Is All You Need, which introduced the Transformer architecture. Transformers are built entirely around self-attention, dispensing with convolutions and recurrence. Because they ingest the whole sentence at once rather than token by token, they can be computed in parallel, unlike LSTMs. Later, HarvardNLP released The Annotated Transformer, which walks through the details of the architecture and its implementation in PyTorch. A Transformer applies multiple attention mechanisms to the sentence at the same time (multi-head attention), and this cycle repeats for each of the (N x) encoder and (N x) decoder layers. Transformer implementations have a fixed-length context, so they are limited in the length of text they can handle; for example, BERT and many other implementations are limited to 512 tokens. Google released Transformer-XL in January 2019 as a way of dealing with longer text blocks.
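Below is a minimal single-head sketch of the scaled dot-product self-attention at the heart of the Transformer; a real model splits the vectors across multiple heads and adds positional encodings, residual connections, and feed-forward layers.

```python
# Scaled dot-product self-attention: every token attends to every other token at once,
# so the whole sentence is processed in parallel rather than step by step.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

seq_len, d_model = 6, 64
x = torch.randn(1, seq_len, d_model)   # stand-in embeddings for a 6-token sentence

w_q = nn.Linear(d_model, d_model)
w_k = nn.Linear(d_model, d_model)
w_v = nn.Linear(d_model, d_model)
q, k, v = w_q(x), w_k(x), w_v(x)

scores = q @ k.transpose(-2, -1) / math.sqrt(d_model)   # (1, 6, 6) token-to-token scores
weights = F.softmax(scores, dim=-1)                     # each row sums to 1
attended = weights @ v                                  # (1, 6, 64) context-aware vectors

print(weights.shape, attended.shape)
```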
OpenAI released its Generative Pre-Training model (GPT) in June 2018; it was a transformer-based model and the first to really perform well on a variety of NLP tasks. GPT-2 followed in early 2019 and contains 1.5 billion parameters. It improves generalization and set new records for "zero-shot" task performance (i.e., model performance on tasks and data it was not explicitly trained for). Zero-shot tasks first became widely known as an issue in image classification, where a model would occasionally encounter a class never shown in training.
After GPT came BERT (Bidirectional Encoder Representations from Transformers). BERT, released in October 2018 by Google and made available as open source on TensorFlow Hub, upended the leaderboards for NLP tasks. BERT's innovations include a sentence-pair (next-sentence prediction) pre-training task and a masked-word objective that allows its Transformer encoder to be trained bidirectionally. For some NLP tasks these additions deliver a big jump in model accuracy. The table below reveals BERT's superior average accuracy on several NLP challenges compared to competing methods.
The largest LSTM model Elder Research was using before BERT had ~1.5 million parameters compared to 340 million parameters for the BERT-Large model. BERT is relatively slow, taking minutes to process on a CPU what other models do in a fraction of a second. This can be largely overcome with the use of GPUs, but does require some investment in hardware and/or cloud computing time.
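For readers who want to experiment, the sketch below loads a pre-trained BERT-Base model through the Hugging Face transformers library (assuming a recent version is installed); this is one convenient route, not necessarily the TensorFlow Hub setup referenced above.

```python
# Encode a sentence with pre-trained BERT-Base and inspect its size.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("I left milk and cookies for Santa.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)              # (1, num_tokens, 768) contextual embeddings
print(sum(p.numel() for p in model.parameters()))   # roughly 110M parameters for BERT-Base
```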
Conclusions
BERT was pre-trained on the BooksCorpus (800 million words) and English Wikipedia (2,500 million words) and took 4 days to train on 16 Cloud TPUs (64 TPU chips total), roughly $6,912 worth of compute time, or an estimated 68 days on 4 GPUs. These are only the training costs and times of the final model and don't represent the computing power used during development. With the leading-edge method requiring so much computing power, it is no surprise that the models that have come after BERT (MT-DNN (Microsoft), StructBERT (Alibaba), Snorkel + BERT (Stanford), XLNet (Google Brain), RoBERTa (Facebook AI), Adv-RoBERTa (Microsoft D365 AI & UMD), ALBERT (Google), and others) come from the major tech companies and research institutions. These organizations have made large investments because of the immense value that NLP can bring to a business. But thanks to open source code and published research, these models are available to individuals and smaller companies, allowing everyone to benefit from the computing time already spent. Though these models (and the requisite compute time) might not suit every application, they represent the best available accuracy and versatility.
In the near future we will surely see ideas (like attention) and architectures (like transformers) play important roles in pushing accuracy higher. Almost certainly, models will continue to use transfer learning (whether for training the weights or developing the architecture) and self-supervised training. By leveraging the vast amounts of written text, models will continue to surpass what is possible with small amounts of labeled data.