Tired of the German-French dataset? Look at Yemba, and stand out. The mechanics of LSTM and GRU explained and applied, with powerful visuals and code in Keras.
We gently explain how LSTM cells work by walking you through a funny example. You will understand why Long Short-Term Memory (LSTM) has been so effective and popular for processing sequence data at Apple, Google, Facebook and Amazon. We don’t stop there. We code the first-ever LSTM-based classifier of words from the African language Yemba. No frontiers can stop LSTMs. Transformers, maybe.
Previously we introduced recurrent neural networks (RNNs) and showed how they are successfully used for sentiment analysis.
The issue with RNNs is long-range memory. They are able to predict the next word “sky” in the sentence “the clouds are in the …”. But they fall short in predicting the missing word in the following sentence:
“She grew up in France. Now she has been in China for few months only. She speaks fluent …”
As that gap grows, RNNs become unable to learn to connect the information. Recent information suggests that the next word is probably the name of a language, but if we want to narrow down which language, we need the context of France, from further back. In natural language text, it is entirely possible for the gap between the relevant information and the point where it is needed to be very large.
Why do RNNs have huge problems with long sequences? By design, RNNs take two inputs at each time step: an input vector (e.g. one word from the input sentence), and a hidden state (e.g. a memory representation from previous words). The next RNN step takes the second input vector and first hidden state to create the output of that time step. Therefore, in order to capture semantic meanings in long sequences, we need to run RNNs over many time steps, turning the unrolled RNN into a very deep network. Just like any deep neural network it will then suffer from the vanishing and exploding gradients, thus taking forever to train. Many techniques could alleviate this problem, but not eliminate it:
- initializing parameters carefully,
- using non-saturating activation functions like ReLU,
- applying batch normalization, gradient clipping, dropout,
- using truncated backpropagation through time.
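To see where the depth comes from, here is a single vanilla-RNN step in plain NumPy, unrolled over many time steps. The weights and shapes are purely illustrative; the point is that every step multiplies by the same hidden-to-hidden matrix, which is what makes gradients vanish or explode over long sequences.

```python
import numpy as np

def rnn_step(x, h_prev, Wx, Wh, b):
    """One vanilla-RNN time step: new hidden state from input and previous state."""
    return np.tanh(x @ Wx + h_prev @ Wh + b)

rng = np.random.default_rng(0)
Wx = rng.normal(size=(3, 2))   # input-to-hidden weights (hypothetical shapes)
Wh = rng.normal(size=(2, 2))   # hidden-to-hidden weights, reused at every step
b = np.zeros(2)

h = np.zeros(2)
for x in [np.array([1.0, -1.0, 1.0])] * 50:  # 50 identical steps for illustration
    h = rnn_step(x, h, Wx, Wh, b)
# Backpropagating through these 50 steps multiplies 50 Jacobians involving Wh,
# which is exactly why gradients vanish or explode in long unrolled RNNs.
```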
These workarounds have their limits, still. Additionally, besides the long training time, another problem faced by long running RNNs is the fact that the memory of the first inputs gradually fades away. After a while, the RNN’s state contains virtually no trace of the first inputs. For example, if we want to perform sentiment analysis on a long review that starts with “I loved this product,” but the rest of the review lists the many things that could have made the product even better, then, the RNN will gradually forget the first positive sentiment and will completely misinterpret the review as negative.
In order to solve these RNN problems, various types of cells with long-term memory have been introduced in research. In practice, basic RNNs are rarely used anymore, and most of the work is done with so-called Long Short-Term Memory (LSTM) networks. They were invented by S. Hochreiter and J. Schmidhuber at the Technical University of Munich (a heart-touching detail for me: Hochreiter invented the LSTM in the very faculty where I studied, five years later).
Each single LSTM cell governs what to remember, what to forget and how to update the memory. By doing so, the LSTM network largely solves the exploding and vanishing gradient problem, and mitigates the other problems mentioned above! A key idea in the LSTM is a mechanism called the gate.
The architecture of an LSTM cell is depicted below. Quite impressive, isn’t it? h is the hidden state, representing short-term memory. C is the cell state, representing long-term memory. x is the input. The cell performs only a few matrix transformations and sigmoid and tanh activations in order to magically solve all the RNN problems. We will dive into how this happens in the next sections, by looking at how the cell forgets, remembers and updates its memory.
Let’s explore this ugly diagram with a funny example. Assume that you are the boss, and your employee asks for a salary increase. Will you agree? Well, this will depend, let’s say, on your state of mind. Below we model your mind as an LSTM cell, with no intention to offend your lightning-fast brain.
Your long-term state C will impact your decision. On average, you are in a good mood 70% of the time, and you have 30% of your total budget left. Therefore your cell state is C=[0.7, 0.3]. Recently, things have been going really well for you, boosting your good mood to probability 100%, and you are 100% sure that you still have operating budget left. Therefore your hidden state is h=[1, 1]. Today, three things happened: your kids succeeded at their school exams, you got an ugly review from your boss, but you figured out that you still have plenty of time to complete your work. So today’s input is x=[1, -1, 1]. Based on this situation, will you give a salary increase to your employee?
In the situation described above, your first step will probably be to figure out how the things which happened today (input x) and the things which happened recently (hidden state h) affect your long-term view of the situation (cell state C). The LSTM forgets by using a forget gate to control how much of the past memory is kept.
In the case of your employee’s request for a salary increase, your forget gate will run the following calculation of f_t, whose value will ultimately affect your long-term memory. The weights shown in the picture below are chosen arbitrarily for illustration purposes; their values are normally learned during training. The result [0, 0] indicates that you should erase (completely forget) your long-term memory, because it should not affect your decision in this case.
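In code, the forget gate is just a sigmoid over the concatenated hidden state and input. The weights below are hypothetical, hand-picked so that the gate closes as in the story; in a trained LSTM they would be learned.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

h = np.array([1.0, 1.0])          # hidden state from the story
x = np.array([1.0, -1.0, 1.0])    # today's events
C = np.array([0.7, 0.3])          # long-term cell state

hx = np.concatenate([h, x])       # the LSTM concatenates h and x

# Hypothetical forget-gate weights, chosen so the gate closes completely.
W_f = np.full((2, 5), -10.0)
b_f = np.zeros(2)

f_t = sigmoid(W_f @ hx + b_f)     # values in (0, 1): how much memory to keep
C_forgotten = f_t * C             # elementwise: gate the old cell state
print(np.round(f_t, 2))           # ~[0, 0]: erase the long-term memory
```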
Next, you need to decide which information about what happened recently (hidden state h) and what happened today (input x) you want to record in your long-term view of the situation (cell state C). LSTM decides what to remember by using gates, again.
First, you calculate your input gate values i_t, which fall between 0 and 1 thanks to the sigmoid activation. Next, you compute candidate values, scaled between -1 and 1 by the tanh activation. Finally, you combine both: the input gate values are multiplied elementwise with the candidate values. The result [1, 1] indicates that, based on the recent and current information, you are 100% in a good mood and very likely to have operating budget funds.
Now you know how the things which happened recently affect your state. Next, it is time to update your long-term view of the situation based on the new rationale. When new values come in, the LSTM decides how to update its memory, again by using gates. The gated new values are added to the current memory. This additive operation is what solves the exploding and vanishing gradient problem of simple RNNs: instead of multiplying, the LSTM adds things to compute the new state. The result C_t is stored as the new long-term view of the situation (cell state). The values [1, 1] suggest that you are now in a good mood 100% of the time and 100% likely to have money all the time!
Based on this information, you can update your short-term view of the situation h_t (the next hidden state). The values [0.9, 0.9] indicate that there is a 90% likelihood that you will increase your employee’s salary at the next time step! Congratulations to him!
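The remaining gates can be sketched the same way. The weights below are again hypothetical, hand-picked so that the input gate opens fully and the new cell state lands at [1, 1] as in the story; the forget gate result [0, 0] from the previous step is reused.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

h = np.array([1.0, 1.0])           # short-term memory from the story
x = np.array([1.0, -1.0, 1.0])     # today's input
hx = np.concatenate([h, x])

# Hypothetical weights, picked so the numbers match the story above.
W_i = np.full((2, 5), 10.0)        # input gate: wide open
W_c = np.full((2, 5), 10.0)        # candidate values
W_o = np.full((2, 5), 10.0)        # output gate: wide open
b = np.zeros(2)

f_t = np.zeros(2)                  # forget gate result [0, 0] from the story
i_t = sigmoid(W_i @ hx + b)        # input gate: how much of the candidate to write
c_hat = np.tanh(W_c @ hx + b)      # candidate values in (-1, 1)
C_t = f_t * np.array([0.7, 0.3]) + i_t * c_hat   # additive update: new cell state ~[1, 1]
o_t = sigmoid(W_o @ hx + b)        # output gate: what to reveal
h_t = o_t * np.tanh(C_t)           # new hidden state
```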
Gated Recurrent Unit
A variant of the LSTM cell is called the Gated Recurrent Unit, or GRU. It is a simplified version of the LSTM cell, can be a bit faster than LSTM, and it seems to perform similarly, which explains its growing popularity. GRU was proposed by Kyunghyun Cho et al. in a 2014 paper.
As shown above, both state vectors are merged into a single vector. A single gate controller controls both the forget gate and the input gate. If the gate controller outputs a 1, the input gate is open and the forget gate is closed. If it outputs a 0, the opposite happens. In other words, whenever a memory must be stored, the location where it will be stored is erased first. There is no output gate; the full state vector is output at every time step. However, there is a new gate controller that controls which part of the previous state will be shown to the main layer.
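Under the hood, a GRU step can be sketched in plain NumPy. The weights here are random and purely illustrative, and biases are omitted for brevity; the single update gate z plays both the forget and input roles described above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h, Wz, Wr, Wh):
    """One GRU time step (biases omitted for brevity)."""
    hx = np.concatenate([h, x])
    z = sigmoid(Wz @ hx)             # update gate: forget and input roles in one
    r = sigmoid(Wr @ hx)             # reset gate: how much past state the candidate sees
    h_hat = np.tanh(Wh @ np.concatenate([r * h, x]))  # candidate state
    return (1.0 - z) * h + z * h_hat # z~1: overwrite the memory; z~0: keep it

rng = np.random.default_rng(1)
h = np.zeros(2)                      # single merged state vector
x = np.array([1.0, -1.0, 1.0])
Wz, Wr = rng.normal(size=(2, 5)), rng.normal(size=(2, 5))
Wh = rng.normal(size=(2, 5))
h_next = gru_step(x, h, Wz, Wr, Wh)
```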
Stacking LSTM cells
As we have just seen, an LSTM cell can learn to recognize an important input (that’s the role of the input gate), store it in the long-term state, learn to preserve it for as long as it is needed (that’s the role of the forget gate), and learn to extract it whenever it is needed (that’s the role of the output gate).
By aligning multiple LSTM cells, we can process sequence data, for example a 4-word sentence in the picture below. LSTM units are typically also arranged in layers, so that the output of each unit in one layer is an input to the units in the next layer. In the example below we have 2 layers, each with 4 cells. In this way, the network becomes richer and captures more dependencies.
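In Keras, such a stack can be sketched as follows. The dimensions (4 time steps, 8 input features, 4 units per layer) are hypothetical, chosen only to mirror the figure.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Two stacked LSTM layers over a 4-step sequence (hypothetical dimensions).
# The first layer must return its output at every time step
# (return_sequences=True) so the second layer receives a full sequence.
model = keras.Sequential([
    keras.Input(shape=(4, 8)),             # 4 time steps, 8 features each
    layers.LSTM(4, return_sequences=True), # layer 1: one output per time step
    layers.LSTM(4),                        # layer 2: only the final output
])
```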
RNNs, LSTMs and GRUs are designed to analyze sequences of values. Sometimes it makes sense to analyze a sequence in reverse order. For example, in the sentence “he needs to work harder, the boss said about the employee.”, although “he” appears at the very beginning, it refers to the employee, mentioned at the very end. Therefore we either reverse the order, or better, combine a forward and a backward pass. This bidirectional architecture is depicted in the figure below.
The following diagram further illustrates bidirectional LSTMs. The network at the bottom receives the sequence in the original order, while the network at the top receives the same input in reverse order. The two networks are not necessarily identical. What matters is that their outputs are combined for the final prediction.
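A bidirectional layer can be sketched in Keras like this; the sequence length and feature size are hypothetical.

```python
from tensorflow import keras
from tensorflow.keras import layers

# A Bidirectional wrapper runs one LSTM forward over the sequence and a
# second, separate LSTM backward, then combines their outputs
# (concatenation by default in Keras).
model = keras.Sequential([
    keras.Input(shape=(10, 8)),            # hypothetical: 10 steps, 8 features
    layers.Bidirectional(layers.LSTM(16)), # forward + backward LSTMs
])
# Concatenated output: 16 forward units + 16 backward units = 32 values.
```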
We have had enough theory. Now we will implement an LSTM network that reads a word character by character and predicts a property of the whole sequence, based on the characters observed.
Our sequences will be words from a lesser-known language called Yemba. You might never have heard of that language before. Feel free to look at the pictograms below to get an idea of Yemba writing.
Yemba is an African language spoken today by just a few thousand native speakers. Although Yemba was originally an exclusively spoken language, a writing system was developed about 90 years ago. Like so many languages in the world, Yemba is a tone language, similar to Vietnamese. In tone languages, words are made of consonants, vowels and tones, the variations of musical pitch which accompany the utterance of a syllable. The foundational model of tone orthography in Yemba was put in place by His Majesty Chief Djoumessi Mathias of Foreke-Dschang, the pioneer who designed the first Yemba alphabet in 1928. Later, in 1997, a modern Yemba-French dictionary was created as the result of a joint international research effort.
Our goal is to encode Yemba words as embedding vectors and to build an LSTM network that can predict whether a Yemba word is a noun or a verb. We do not aim to implement full part-of-speech tagging. Rather, we will train the network to learn the groups of letters and tones which commonly appear in Yemba nouns, compared to those which are specific to Yemba verbs.
For this purpose, we use a pre-processed English-Yemba dataset downloaded from the Yemba.net online dictionary. We encourage you to visit the page and try a few translations from English, French, German, Chinese, Spanish or Italian to Yemba. It’s fun.
Above we can see a few words from the dictionary. You can try to read them if you want to put yourself in a good mood. Yemba writing is actually based on the International Phonetic Alphabet, so anyone with knowledge of phonetics should be able to read and speak Yemba. Theoretically. Although we restricted our dataset to nouns and verbs, Yemba also includes adjectives, adverbs, conjunctions, pronouns, etc., though in limited numbers compared to nouns and verbs. The distribution of word types is shown below.
Below we show a few statistics about our dataset.
As we can see below, our Yemba words are built from a 45-letter alphabet (the vocabulary). The vocabulary represents each Yemba letter by a unique integer. This is a typical preprocessing step in natural language processing.
Before feeding the words into an LSTM, we have to tokenize each word by replacing each letter with its index from the vocabulary. This process turns the word into a vector of numbers X. In order to have the same vector size for all words, we pad the vectors to the length of the longest word in our dataset. Our LSTM will learn to match those vectors to the correct word type: 0 for a noun and 1 for a verb. Therefore we also build a vector of labels Y to store the correct classes.
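The tokenization and padding step can be sketched with a toy vocabulary. The letter set below is illustrative (the real dataset maps all 45 Yemba letters to integers the same way); index 0 is reserved for padding.

```python
import numpy as np

# Toy vocabulary: each letter gets a unique integer, 0 is reserved for padding.
vocab = {ch: i for i, ch in enumerate("elfɔkubi", start=1)}

def tokenize(word, max_len):
    """Replace each letter with its vocabulary index and pad with zeros."""
    idx = [vocab[ch] for ch in word]
    return idx + [0] * (max_len - len(idx))

words = ["ekubli", "lefɔ"]           # example words from the article (both nouns)
max_len = max(len(w) for w in words) # pad to the longest word
X = np.array([tokenize(w, max_len) for w in words])
Y = np.array([0, 0])                 # labels: 0 = noun, 1 = verb
```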
Next, we split our vectors into a training set with 2166 words, and a validation set with 542 words.
Below we build a 1-layer LSTM with 100 cells. We do not feed the word vectors directly to the LSTM. Instead, we first learn their embedding representation in an 8-dimensional space. Embeddings are known to capture relationships between the letters building a word. The output of the LSTM is transformed by a fully connected layer with sigmoid activation to produce a probability between 0 and 1. We train the network using binary cross-entropy as the loss function and Adam as the optimizer. Classification accuracy is used as the evaluation metric, since our two classes are fairly well balanced.
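A minimal sketch of this architecture in Keras, assuming a hypothetical padded word length of 12 (in practice it is the length of the longest word in the dataset):

```python
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 45 + 1   # 45 letters plus the padding index
MAX_LEN = 12          # hypothetical padded word length

model = keras.Sequential([
    keras.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, 8),       # learn 8-dimensional letter embeddings
    layers.LSTM(100),                      # 1 layer with 100 cells
    layers.Dense(1, activation="sigmoid"), # probability: 0 = noun, 1 = verb
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
```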
As shown above, the network converges very fast. The LSTM is doing a great job, predicting with 93.91% classification accuracy on the validation set.
We also tried GRUs on our Yemba classification problem, by replacing the LSTM layer with a GRU layer with the same number of units (100). Training was a little faster, and accuracy was a little better: 94.19% on the validation set, instead of 93.91% with the LSTM.
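The swap amounts to a one-line change in the model definition (same hypothetical dimensions as before):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(12,)),              # same hypothetical padded length
    layers.Embedding(46, 8),
    layers.GRU(100),                       # the only change: GRU instead of LSTM
    layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
```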
Let’s now evaluate the trained LSTM network on 100 random nouns and 100 random verbs and check the confusion matrix.
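The evaluation can be sketched with scikit-learn's confusion_matrix. The predicted probabilities below are hypothetical stand-ins for the model's actual outputs on the 200 evaluation words, thresholded at 0.5.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0] * 100 + [1] * 100)  # 100 nouns (0), then 100 verbs (1)

# Hypothetical sigmoid outputs standing in for model.predict(...):
# 2 nouns look like verbs, 1 verb looks like a noun.
y_prob = np.concatenate([np.full(98, 0.1), np.full(2, 0.9),
                         np.full(99, 0.9), np.full(1, 0.1)])
y_pred = (y_prob > 0.5).astype(int)       # threshold the probabilities at 0.5

cm = confusion_matrix(y_true, y_pred)
print(cm)                                 # rows: true class, columns: predicted class
```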
The confusion matrix exhibits a few false positives and false negatives: 1 verb out of 100 was predicted as a noun, and 2 nouns out of 100 were predicted as verbs. The misclassified words are shown below.
In fact, verbs in Yemba usually carry a noticeable prefix “e”, “le” or “li”, which corresponds to the “to” of the English infinitive. It appears that this grammatical construct was correctly picked up by our character-based LSTM. The word “ekubli”, which means “avarice” in English, starts with “e”; therefore the LSTM predicted it as a verb, although it is a noun. The same goes for “lefɔ”, which means “dynasty”: because of its prefix “le”, it was considered to be a verb.
This practical case shows the magic of LSTMs in grasping patterns that carry grammatical meaning, even in a low-resource language with unfamiliar vocabulary and phonetic nuances, such as Yemba.
LSTMs have transformed machine learning and are now available to billions of users through the world’s most valuable public companies like Google, Amazon and Facebook. LSTMs greatly improved speech recognition on over 4 billion Android phones (since mid-2015). They greatly improved machine translation through Google Translate starting in November 2016. Facebook performed over 4 billion LSTM-based translations per day. Siri has been LSTM-based on almost 2 billion iPhones since 2016. The answers of Amazon’s Alexa were based on LSTMs.
And now, even the Yemba language has gone through an LSTM network.
If you want to know even more about LSTMs and GRUs, check this article with amazing animations by Michael Nguyen. For those who prefer to build their own LSTM from scratch, this article might work.
Attention-based sequence-to-sequence models and Transformers go beyond LSTMs and have amazed folks recently with their impressive results in machine translation at Google, text generation at OpenAI. You might want to check this blog.
A comprehensive implementation of text classification using BERT, FastText, TextCNN, Transformer, Seq2seq, etc. can be found in this GitHub repository.
Thanks for reading! More articles from me below.