Representing text in natural language processing

Understanding the written words: gentle review of Word2vec, GloVe, TF-IDF, Bag-of-words, N-grams, 1-hot encoding techniques

This article looks at the representation of language for natural language processing (NLP). If you are one of those rare deep learning gurus, you will likely not learn anything new here. If not, dive with me into the fascinating world of turning words into representations that algorithms can understand. We motivate why we would want to do this, which approaches exist, and how they work on simple examples. We avoid as much mathematics as possible, and we use an easy-going writing style to increase the chances that you actually read the article to the end. Although the article is relatively long, it is a fun read.

The history of language genesis is repeated every time a baby starts to talk. Indeed, language began when humans started naming objects, actions and phenomena that appeared in real life. Even when looking at the belief in divine creation, shared by billions of people, we can think of languages as being as old as humankind. Every time we write a short message, a tweet, a chat, an email, a post or even a web page, an article, a blog or a book, we turn thoughts into words or symbols. Thanks to language, humans are able to turn invisible ideas into visible things. Human thoughts thereby become accessible to other humans, as well as to …, guess what? Computers! If humans are able to reconstruct thoughts from words, could computers do this as well?

With the recent hype around artificial intelligence, it is quite useful to have computers that can process, understand and generate human language. Google Translate is a good example, and a useful one. Google started by scanning a lot of books from university libraries, including their translations, and learning the patterns between them. Today, thanks to Google Translate, we can access thoughts encoded in any language we do not speak yet.

Another example is text categorization. Humans are very good at grouping things into categories, for example ‘good’ and ‘bad’. They have been doing this for ages. At the same time, humans are also very good at generating and recording information in the form of text. According to Google, there are approximately 129,864,880 books in the entire world. We bet you don’t want to categorize them by hand, even if you are given the biggest library in the world, with enough space and shelves. At the same time, you really need to have those books categorized at least by genre: comics, cooking, business, biographies, etc. That’s where you would embrace a computer program that reads a book’s content and detects its genre automatically. Let’s step away from books for a moment. An increasing number of people among us don’t read books and prefer news feeds instead. They usually receive hundreds of messages, posts, or articles every day. Sorting out spam messages and fake news from relevant updates, e.g. about the industry and the market, has become a very relevant task for them, as long as it is done by a computer program.

Question-answering bots are another hype nowadays. Imagine how much time you, as a customer support specialist, would save if you could have a copy of yourself answering the dozens of calls you receive every day from clients asking the same questions over and over. We have all heard about Amazon Alexa, Apple Siri or Google Assistant. These systems automatically answer questions posed by humans in natural language.

Hopefully, the examples above show why having computers process natural language could be an engaging topic for you. If not, looking at further examples of speech recognition, speech imitation and speech synthesis would surely amaze you.

Now, why all this talk about language and computers? You type text in your favorite word processor or email software every day. So why should it be difficult for a computer to understand your text? Because language is ambiguous at all levels: lexical, phrasal and semantic. Language assumes that the listener is aware of the world, the context and the conventions of communication. If you type “mouse info” into a search engine, are you looking for a pet or a computer device? The representation of text is thus very important for the performance of many real-world applications.

Now, how do we turn language into something computer algorithms can digest? At their core, processors perform simple arithmetic such as adding and multiplying numbers. Is that the reason why computers love numbers? Who knows. Anyway, this problem is solved nicely for images. For example, the area marked with a circle in the picture below is represented by three matrices of numbers, one for each color channel: red, green and blue. Each number tells the level of red, green or blue at the pixel’s location. A pixel whose color components are (0,0,0) is displayed as black, and a pixel whose color components are (255,255,255) is displayed as white.


The process of transforming text into numbers, similar to what we did with the image above, is usually performed by building a language model. These models typically assign probabilities, frequencies or some other, more obscure numbers to words, sequences of words, groups of words, sections of documents or whole documents. The most common techniques are: one-hot encoding, N-grams, bag-of-words, vector semantics (TF-IDF) and distributional semantics (Word2vec, GloVe). Let’s see if we understand what all this means. We should be able to. We are not computers.

If a document has a vocabulary of 1000 words, we can represent each word with a one-hot vector. In other words, we use 1000-dimensional representation vectors, and we associate each unique word with an index in this vector. To represent a given word, we set the corresponding component of the vector to 1 and zero out all of the other components.

source: Fundamentals of Deep Learning, N. Buduma, 2017
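As a minimal sketch (the tiny vocabulary below is made up for illustration), one-hot encoding can be written in a few lines of Python:

import numpy as np

# Toy vocabulary; in practice this would contain all unique words of the corpus.
vocabulary = ["please", "eat", "your", "food", "spoon"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's index."""
    vector = np.zeros(len(vocabulary))
    vector[word_to_index[word]] = 1
    return vector

print(one_hot("food"))  # [0. 0. 0. 1. 0.]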

This representation is rather arbitrary. It misses the relationships between words and does not convey any information about their surrounding context. It also becomes extremely inefficient for large vocabularies. In the next few sections, we will look at some more exciting approaches.

We start by looking at the most basic model, the N-gram. Let’s consider our most favorite sentence from our childhood: “please eat your food”. A 2-gram (or bigram) is a two-word sequence like “please eat”, “eat your”, or ”your food”. A 3-gram (or trigram) is a three-word sequence like “please eat your” or “eat your food”. N-gram language models estimate the probability of the last word given the previous words. For example, given the sequence of words “please eat your”, the likelihood of the next word is higher for “food” than for “spoon”. In the latter case, our mom would be less happy. The best way to compute such likelihoods for any pair, triple, quadruple, … of words is to use a large body of text. The image below shows a few probabilities obtained from a relatively small body of text containing questions and answers related to restaurants and food. “I” is frequently followed by the verbs “want”, “eat” or “spend”.

source: Jurafsky et al., 1994
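As a rough sketch of how such probabilities are estimated (the miniature corpus below is made up), we simply count bigrams and divide by the count of the first word:

from collections import Counter

# Made-up miniature corpus; a real model would be trained on millions of sentences.
corpus = [
    "please eat your food",
    "please eat your vegetables",
    "i want to eat food",
]

tokens = [sentence.split() for sentence in corpus]
unigrams = Counter(word for sent in tokens for word in sent)
bigrams = Counter(
    (sent[i], sent[i + 1]) for sent in tokens for i in range(len(sent) - 1)
)

def bigram_probability(w1, w2):
    """Maximum likelihood estimate P(w2 | w1) = count(w1 w2) / count(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

print(bigram_probability("eat", "your"))  # 2/3, since "eat" is followed by "your" in 2 of its 3 occurrences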

Google (again) actually provides a large set of probabilities for 1-grams, 2-grams, 3-grams, 4-grams and 5-grams in multiple languages, calculated on sources printed between 1500 and 2008! The Google Ngram Viewer allows you to download and use this large collection of n-grams for purposes such as spell checking, auto-completion, language identification, text generation and speech recognition.

The longer the context on which we train an N-gram model, the more coherent the sentences we can generate. The image below shows three sentences randomly generated from 1-gram, 2-gram and 3-gram models computed from 40 million words of the Wall Street Journal.

source: Jurafsky et al., 2018

Even with a very large corpus, the N-gram model is in general an insufficient model of language, because language has long-distance dependencies. For example, in the sentence “The computer which I had just put into the machine room on the fifth floor crashed.”, the words “computer” and “crashed” are related although many words separate them. A 5-gram model will miss that link, and our computer administrator might keep thinking that the computer on the fifth floor is running perfectly.

Furthermore, the N-gram model is heavily dependent on the training corpus used to calculate the probabilities. One implication is that the probabilities often encode specific facts about a given training text which may not necessarily apply to a new text. These reasons motivate us to look at further language models.

When we are interested in categorizing text, classifying it by sentiment, or verifying whether it is spam, we often do not want to look at the sequential patterns of words, as N-gram language models do. Rather, we represent the text as a bag of words, as if it were an unordered set of words, ignoring the original position of each word in the text and keeping only its frequency.

source: Jurafsky et al., 2018
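As a minimal sketch (the sentence below is just an example), a bag of words is essentially a word-count dictionary:

from collections import Counter

sentence = "it is great it is fun it is not predictable"
bag_of_words = Counter(sentence.split())

print(bag_of_words)
# Counter({'it': 3, 'is': 3, 'great': 1, 'fun': 1, 'not': 1, 'predictable': 1})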

Let’s illustrate the bag-of-words representation in a simple sentiment analysis example with two classes, positive (+) and negative (-). Below we have 5 sentences (also called documents) with known categories, as well as 1 sentence with an unknown category. The goal is to classify the last sentence as either positive or negative.

source: Jurafsky et al., 2018

This task can be solved by a so-called Naive Bayes classifier, which uses the word frequencies in the bag-of-words of each class to compute the probability of each class c, as well as the conditional probability of each word given a class, as follows.
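In symbols (this is the standard multinomial Naive Bayes formulation with add-one, i.e. Laplace, smoothing, as used in the source textbook):

P(c) = \frac{N_c}{N_{doc}}, \qquad
P(w \mid c) = \frac{\mathrm{count}(w, c) + 1}{\left(\sum_{w' \in V} \mathrm{count}(w', c)\right) + |V|}

Here N_c is the number of training documents with class c, N_doc the total number of training documents, count(w, c) the number of times word w appears in documents of class c, and V the vocabulary.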

In our example, the negative class has probability 3/5 and the positive class has probability 2/5. A bit of algebra shows that the probability of the words “predictable”, “with”, “no” and “fun” given the negative class is higher than the same probability given the positive class. Therefore the sentence “predictable with no fun” will be classified as negative based on the training data.
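To make that “bit of algebra” concrete, here is a small sketch with an illustrative training set in the spirit of the textbook example (the exact sentences in the figure above may differ):

import math
from collections import Counter

# Illustrative training documents; three negative and two positive reviews.
train = [
    ("just plain boring", "-"),
    ("entirely predictable and lacks energy", "-"),
    ("no surprises and very few laughs", "-"),
    ("very powerful", "+"),
    ("the most fun film of the summer", "+"),
]
test = "predictable with no fun"

vocab = {w for text, _ in train for w in text.split()}
docs = {c: [t for t, label in train if label == c] for c in ("-", "+")}
word_counts = {c: Counter(w for t in docs[c] for w in t.split()) for c in docs}

def log_score(text, c):
    """log P(c) plus the sum of log P(w|c) with add-one smoothing; unknown words are skipped."""
    score = math.log(len(docs[c]) / len(train))
    total = sum(word_counts[c].values())
    for w in text.split():
        if w in vocab:
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
    return score

print(log_score(test, "-") > log_score(test, "+"))  # True: the sentence is classified as negative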

Bag-of-words language models rely on the term frequency TF, defined as the number of times a word occurs in a given text or document. Bag-of-words helps in sentiment analysis. It is great at detecting the language a text is written in. It is also used in authorship attribution, for example to infer an author’s gender or age. We can also use the term frequency information to engineer additional features, such as the number of positive lexicon words (“great”, “nice”, “enjoyable”) or the number of first and second person pronouns (“I”, “me”, “you”), and train more complex classifiers based on logistic regression or even neural networks. But let’s not go down that road of headaches for now.

For all their glory, N-gram and bag-of-words models alone do not allow us to draw the useful inferences needed to solve meaning-related tasks like question answering, summarization, and dialogue. This is why we look at semantics in the next section.

How should we represent the meaning of a word? The word “mouse” can be found in a lexical dictionary, but its plural form “mice” will not be described separately. Similarly, “sing” will be described as the lemma for “sing”, “sang”, “sung”, but its tense forms will not. How do we tell a computer that all these words mean the same thing? The word “plant” can have a different meaning depending on the context (e.g. “Tesla is building new plants”, “Climate change has a negative effect on plants”). Vector semantics is currently the best approach to building a computational model that successfully deals with the different aspects of word meaning: senses, hyponyms, hypernyms, antonyms, synonyms, homonyms, similarity, relatedness, lexical fields, lexical frames, connotation. Our apologies for the linguistic jargon. Let’s build up the intuition behind vector semantics by looking at the concept of context.

In our example sentence “Tesla is building new plants”, if we count the words that appear in the context of the word “plant” in many other sentences written by humans, we will tend to see words like “build”, “machine”, “worker”, and even “Tesla”. The fact that these and other similar context words also occur around the word “factory” can help us discover the similarity between “plant” and “factory”. In this case we will not tend to attach the meaning “vegetable” to the word “plant” in our sentence “Tesla is building new plants”. We will rather think that Tesla is building new factories.

Therefore, we can define a word by the other words that occur in its environment, and we can represent the word by a vector, a list of numbers, a point in N-dimensional space. Such a representation is usually called an embedding. Computers can use this cheating trick to understand the meaning of a word in its context.

Word-document representation

To get a better grasp of vector semantics, let’s assume that we have a set of texts (documents) and we want to find documents which are similar to each other. This task is relevant in information retrieval, for example in search engines, where the documents are web pages. As an illustration, each column in the table below represents one of 4 documents, titled “As You Like It”, “Twelfth Night”, “Julius Caesar” and “Henry V”. Words which appear in the documents are represented as rows; these words build our vocabulary. The table tells us, for instance, that the word “battle” occurs 7 times in the document “Julius Caesar”. This table is also called a term-document matrix: each row represents a word in the vocabulary and each column represents a document, a section, a paragraph, a tweet, an SMS, an email or whatever.

source: Jurafsky et al., 2018

Now we can represent each document by a document vector, e.g. [7 62 1 2] for “Julius Caesar”. We can even draw such vectors in a 2-dimensional space for any pair of words. Below we have an example of such a graph: a spatial visualization of the document vectors in the space built by the dimensions “fool” and “battle”. We can conclude that the documents “Henry V” and “Julius Caesar” have similar content, which is more related to “battle” than to “fool”. For information retrieval we would also represent a query by a vector, also of length 4, telling how often the words “battle”, “good”, “fool” and “wit” appear in the query. The search results are then obtained by comparing the query vector with all four document vectors to find how similar they are.

source: Jurafsky et al., 2018

Word-word representation

By looking at the rows of the term-document matrix instead of its columns, we can extract word vectors instead of document vectors. Just as similar documents tend to contain similar words, similar words have similar vectors because they tend to occur in similar documents. If we now use words, instead of documents, as the columns of the matrix, we obtain the so-called word-word matrix or term-term matrix, also called the term-context matrix. Each cell counts the number of times the row (target) word and the column (context) word co-occur in some context in some training corpus. A simple case is when the context is a document, so the cell tells how often the two words appear in the same document. A more frequent case is to count how often the column word appears within a word window around the row word. In the example below, “data” appears 6 times in the context of “information” when a 7-word window around “information” is considered.

source: Jurafsky et al., 2018

The matrix above suggests that “apricot” and “pineapple” are similar to each other because “pinch” and “sugar” tend to appear in their contexts. The similarity between two words v and w can be computed from their corresponding word vectors by calculating the so-called cosine similarity, defined as follows:
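In symbols, this is the dot product of the two word vectors divided by the product of their lengths:

\mathrm{cosine}(\vec{v}, \vec{w}) = \frac{\vec{v} \cdot \vec{w}}{|\vec{v}|\,|\vec{w}|}
= \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\,\sqrt{\sum_{i=1}^{N} w_i^2}}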

In the example below, the similarity between “digital” and “information” equals the cosine similarity between the word vectors [0 1 2] and [1 6 1]. By applying the formula above, we obtain 0.58, which is higher than the cosine similarity of 0.16 between “apricot” and “information”.
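A quick check with NumPy (the vector [2 0 0] for “apricot” is an assumption, taken from the textbook’s term-context matrix):

import numpy as np

def cosine_similarity(v, w):
    """Dot product of the two vectors, normalized by their lengths."""
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

digital = np.array([0, 1, 2])
information = np.array([1, 6, 1])
apricot = np.array([2, 0, 0])  # assumed, following the textbook example

print(round(cosine_similarity(digital, information), 2))  # 0.58
print(round(cosine_similarity(apricot, information), 2))  # 0.16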

Word vectors and cosine similarity are powerful tools for dealing with natural language content. However, there are situations where they do not work out so well, as we will see in the next section.

TF-IDF language model

Vector semantic models use the raw frequency of the co-occurrence of two words. In natural language, raw frequency is very skewed and not very discriminative: the word “the”, for example, is simply a frequent word and has roughly the same high frequency in every document or context.

There are a few ways of dealing with this problem. The TF-IDF algorithm is by far the dominant way of weighting co-occurrence matrices in natural language processing, especially in information retrieval. The TF-IDF weight is computed as the product of the term frequency and the inverse document frequency, and it helps us assign more importance to the more discriminative words. The two components used to calculate the TF-IDF weight of a term t in a document d are described below.

Term frequency

The term (or word) frequency is based on the number of times the word appears in the document. Since a word appearing 100 times in a document doesn’t make that word 100 times more likely to be relevant to the meaning of the document, we take the base-10 logarithm to downweight the raw count a bit: tf = 1 + log10(count) when the count is positive, and 0 otherwise. A word which occurs 10 times in a document thus has tf = 2, 100 times gives tf = 3, 1000 times gives tf = 4, and so on.

Inverse document frequency

The document frequency of a given term or word is the number of documents it occurs in. The inverse document frequency is the ratio of the total number of documents to the document frequency. This gives a higher weight to words that occur only in a few documents. Because of the large number of documents in many collections, the logarithm (base 10 in our examples) of this ratio is usually taken in order to squash the otherwise skewed distribution of IDF values.
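Putting the two components together (this follows the log-weighted variant used in the textbook examples, with base-10 logarithms):

\mathrm{tf}_{t,d} =
\begin{cases}
1 + \log_{10} \mathrm{count}(t,d) & \text{if } \mathrm{count}(t,d) > 0 \\
0 & \text{otherwise}
\end{cases}
\qquad
\mathrm{idf}_t = \log_{10}\!\left(\frac{N}{\mathrm{df}_t}\right)
\qquad
w_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t

where N is the total number of documents and df_t is the number of documents containing the term t.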

On the left side of the image below, we see the TF and IDF calculated for words from the example introduced previously. On the right side of the image we show the raw frequency of each word in a given document in the top table, and its weighted TF-IDF value in the bottom table. Because the word “good” appears with high frequency in all documents, its TF-IDF value drops to zero. This gives more weight to the discriminative word “battle”, which originally had a very low frequency.

source: Jurafsky et al., 2018
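As a small sketch of the computation (the “Julius Caesar” column is taken from the matrix above; the other counts are illustrative, and here the document frequencies are computed over just these 4 documents rather than over all 37 Shakespeare plays as in the textbook):

import numpy as np

words = ["battle", "good", "fool", "wit"]
# Columns: As You Like It, Twelfth Night, Julius Caesar, Henry V (illustrative counts).
counts = np.array([
    [1, 0, 7, 13],      # battle
    [114, 80, 62, 89],  # good
    [36, 58, 1, 4],     # fool
    [20, 15, 2, 3],     # wit
], dtype=float)

# Log-weighted term frequency: 1 + log10(count) for non-zero counts, 0 otherwise.
tf = np.zeros_like(counts)
nonzero = counts > 0
tf[nonzero] = 1 + np.log10(counts[nonzero])

df = np.count_nonzero(counts, axis=1)   # in how many documents each word appears
idf = np.log10(counts.shape[1] / df)    # "good" appears everywhere, so its idf is 0
tfidf = tf * idf[:, np.newaxis]

print(np.round(tfidf, 2))  # only "battle" keeps a non-zero weight in this tiny collection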

Although bag-of-words models, augmented with TF-IDF, are great models, there are semantic nuances they are not able to capture. Let’s show this with the following sentences: “The cat sat on the broken wall” and “The dog jumped over the brick structure”. Although the two sentences describe separate events, their semantic meanings are similar to one another. A dog is similar to a cat, because both are animals. A wall can be viewed as similar to a brick structure. So while the sentences discuss different events, they are semantically related to one another. In the classical bag-of-words model (where each word is encoded in its own dimension), encoding such a semantic similarity is not possible. Additionally, such models run into computational issues when the vocabulary is large and the word vectors become long. We introduce a better approach in the next section.

Representing words or documents by long, sparse vectors is not efficient in practice. Those vectors are sparse because many positions are filled with zeros, and they are long because their dimensionality equals the size of the vocabulary or of the document collection.

What are word embeddings?

As opposed to sparse vectors, dense vectors can be included more successfully as features in many machine learning systems, and they also mean fewer parameters to estimate. In the previous section, we saw how to compute the TF and TF-IDF values for any given word in our corpus. If, instead of counting how often each word w occurs in the context of another word v, we train a classifier on the binary prediction task “Is word w likely to show up near v?”, then we can use the learned classifier weights as the word embeddings. These become continuous vector representations of words. If we want to calculate some semantic similarity between words, we can do it by looking at the angle between two embeddings, i.e. their cosine similarity, as we saw previously.

How do we compute word embeddings?

Embedding vectors typically need at least around 100 dimensions to be effective, which means there are tens of millions of numbers for the classifier to tweak when we have a 100,000-word vocabulary. How do we determine those embeddings for a given corpus? That’s exactly what the word2vec language model is designed to do.

Word2vec comes in two flavors: the CBOW model and the skip-gram model (proposed by Mikolov et al., 2013, at Google (again!)). In the skip-gram variant, the context words are predicted from the target word, while in the CBOW variant the target word is predicted from the surrounding context words. Typically, people use an existing word2vec implementation together with embeddings pre-trained on large corpora.

We will now develop the intuition behind the skip-gram variant of the word2vec model. Consider the sentence below, where “apricot” is the target word and “tablespoon”, “of”, “jam” and “a” are its context words when considering a 2-word window.

We can build a binary classifier which takes any pair of words (t, c) as input and predicts True if c is a real context word for t and False otherwise. For example, the classifier will return True for (apricot, tablespoon) and False for (apricot, pinch). Below we see more examples from the positive and negative classes when considering “apricot” as the target.

Taking this a step further, we split each (context window, target) pair into (input, output) examples where the input is the target and the output is one of the words from its context. For illustration purposes, the pair ([of, apricot, a, pinch], jam) would produce the positive example (jam, apricot), because apricot is a real context word of jam, while (jam, lemon) would be a negative example. We apply this operation to every (context window, target) pair to build our dataset. Finally, we replace each word with its unique index in the vocabulary in order to work with one-hot vector representations.
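A minimal sketch of how such (target, context) pairs can be generated from a tokenized sentence (the sentence fragment below is illustrative):

import random

def skipgram_pairs(tokens, window=2):
    """Yield (target, context) positive pairs within a +/- window of each position."""
    for i, target in enumerate(tokens):
        start, end = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                yield (target, tokens[j])

tokens = "a tablespoon of apricot jam a pinch".split()
positive_pairs = list(skipgram_pairs(tokens))
print(positive_pairs[:4])
# [('a', 'tablespoon'), ('a', 'of'), ('tablespoon', 'a'), ('tablespoon', 'of')]

# Negative examples pair each target with randomly sampled vocabulary words; in practice
# one also checks that the sample is not a true context word and draws it from a
# weighted unigram distribution rather than uniformly.
vocabulary = list(set(tokens))
negative_pairs = [(t, random.choice(vocabulary)) for t, _ in positive_pairs]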

Given the set of positive and negative training examples, and an initial set of embeddings W for target words and C for context words, the goal of the classifier is to adjust those embeddings so as to maximize the similarity (dot product) of the embeddings of the positive examples and minimize the similarity of the embeddings of the negative examples. Typically, a logistic regression solves this task; alternatively, a neural network with a softmax output is used.

d-dimensional embeddings trained on a vocabulary of size V — source: Jurafsky et al., 2018
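In practice, few people implement this training loop by hand; a library such as Gensim does it for us. A minimal sketch (the parameter names follow recent Gensim versions, and the toy corpus is made up):

from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences. A real model needs millions of tokens.
sentences = [
    ["a", "tablespoon", "of", "apricot", "jam"],
    ["a", "pinch", "of", "salt"],
    ["please", "eat", "your", "food"],
]

model = Word2Vec(
    sentences,
    vector_size=100,  # dimensionality d of the embeddings
    window=2,         # context window of +/- 2 words
    sg=1,             # 1 = skip-gram, 0 = CBOW
    negative=5,       # number of negative samples per positive example
    min_count=1,      # keep even rare words in this tiny corpus
)

print(model.wv["apricot"].shape)  # (100,) — the learned embedding for "apricot"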

Why do we like Word2vec?

Surprisingly, the Word2vec embeddings allow for exploring interesting mathematical relationships between words. For example, if we subtract the vector for the word “man” from the vector for the word “king” and then add the vector for the word “woman” to the result, we arrive at (approximately) the vector for the word “queen”.

We can also use element-wise addition of embedding vectors to ask questions such as “German + airlines” and, by looking at the tokens closest to the composite vector, come up with impressive answers, as shown below.

Word2vec vectors should have similar structure when trained on comparable corpora in different languages, allowing us to perform machine translation.

Embeddings are also used to find the closest words to a user-specified word. They can also capture the gender relationship and the singular-plural relationship between words. The word vectors can also be used to derive word classes from huge data sets, e.g. by performing K-means clustering on top of the word vectors.

In this section, we talked about the Word2vec variants CBOW and skip-gram, which are local context window methods. They work well with small amounts of training data and can represent even words that are rare. However, they do not fully exploit the global statistics of word co-occurrences. Therefore, let’s look at another approach that claims to solve this issue.

GloVe (Global Vectors) takes a different approach than Word2vec. Instead of extracting the embeddings from a neural network or a logistic regression designed to perform a classification task (predicting neighbouring words), GloVe optimizes the embeddings directly so that the dot product of two word vectors equals the log of the number of times the two words occur near each other (within a 2-word window, for example). This forces the embedding vectors to encode the frequency distribution of the words that occur near them.
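For the curious, the objective GloVe minimizes can be written roughly as the following weighted least-squares loss (simplified from the original GloVe paper by Pennington et al., 2014):

J = \sum_{i,j=1}^{V} f(X_{ij}) \left( \vec{w}_i \cdot \tilde{\vec{w}}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2

where X_ij is the co-occurrence count of words i and j, w and w̃ are the word and context embeddings, b and b̃ are bias terms, and f is a weighting function that damps the influence of very frequent co-occurrences.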

Below we illustrate how to use word embeddings from a pre-trained GloVe model, leveraging the Python package Gensim.

We use the pre-trained word vectors glove.6B, which were trained on a corpus of 6 billion tokens and cover a vocabulary of 400 thousand words, obtained from different massive web datasets (Wikipedia among others). Each word is represented by a vector of dimension 50, 100, 200 or 300; we use the 100-dimensional vectors. The first step is to convert the GloVe file format to the word2vec file format; the only difference between the two is a small header line. This can be done by calling the glove2word2vec() function. Each line of the file starts with the word, followed by all the values of the corresponding vector, each separated by a space.
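A minimal sketch of the conversion and loading steps with Gensim (the file names assume the downloaded glove.6B archive sits in the working directory; recent Gensim versions also let you skip the conversion and pass no_header=True to load_word2vec_format instead):

from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors

# Convert the GloVe text format to the word2vec text format (adds the header line).
glove2word2vec("glove.6B.100d.txt", "glove.6B.100d.word2vec.txt")

# Load the 400k x 100 embedding matrix as a set of keyed vectors.
model = KeyedVectors.load_word2vec_format("glove.6B.100d.word2vec.txt", binary=False)

print(model["king"].shape)  # (100,)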

As we explained in the previous section, these real-valued word vectors have proven to be useful for all sorts of natural language processing tasks, including parsing, named entity recognition and machine translation. For example, we find that claims like the following hold for the associated word vectors:

king − man + woman ≈ queen

That is, “queen” is the closest word when we subtract the notion of “man” from “king” and add the word “woman”. The “man-ness” in “king” is replaced with “woman-ness” to give us “queen”.

>>> result = model.most_similar(positive=['woman', 'king'],
...                             negative=['man'],
...                             topn=1)
>>> print(result)
[('queen', 0.7698541283607483)]

Using the “analogy” gadget, we can ask which word is to “russia” as “paris” is to “france”?

>>> result = model.most_similar(positive=['russia', 'paris'],
...                             negative=['france'],
...                             topn=1)
>>> print(result)
[('moscow', 0.8845715522766113)]

Let’s look at the embeddings for the four words “russia”, “moscow”, “france” and “paris”. We can project the 100-dimensional vectors into a 2D space using the first 2 components of their PCA transformation, and we can make a similar plot using the t-Distributed Stochastic Neighbor Embedding (t-SNE) method. The PCA plot below shows that a line between “russia” and “moscow” would be almost parallel to a line between “france” and “paris”. With t-SNE we obtain lines which are even more parallel than those obtained with PCA. t-SNE is a technique for dimensionality reduction that is particularly well suited to visualizing high-dimensional datasets; contrary to PCA, it is not a deterministic linear projection but a probabilistic technique.
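A sketch of the PCA projection used for such a plot (the plotting details are simplified, and the model variable is the one loaded above):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

words = ["russia", "moscow", "france", "paris"]
vectors = np.array([model[w] for w in words])

# Project the 100-dimensional embeddings onto their first two principal components.
coords = PCA(n_components=2).fit_transform(vectors)

plt.scatter(coords[:, 0], coords[:, 1])
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y))
plt.show()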

The 100-dimensional vectors are also very good at answering analogy questions of the form “a is to b as c is to ?”. For example:

  • what is to russia what paris is to france?
  • what is to woman what king is to man?
  • what is to dog what kitten is to cat?
  • what is to chef what stone is to sculptor?
  • what is to bees what den is to bears?

The plot above clearly shows that the word vectors are able to find “Moscow” as the capital of “Russia”, given that “Paris” is the capital of “France”. Similarly, “hive” is to “bees” what “den” is to “bears”.

In this article we motivated why representing text in numerical form is important for natural language processing. We also introduced the most commonly used approaches for doing so, gently and intuitively covering the key aspects of the underlying algorithms. We covered several language models: one-hot encoding, N-gram, bag-of-words, TF-IDF, Word2vec and GloVe. There are a few popular language models which were not introduced. For example, fastText, an extension of the word2vec model, represents each word by the character n-grams it contains instead of learning a vector for the word directly. This is particularly useful for capturing suffixes and prefixes. Another area not explored in this article is topic modeling. There are various techniques for topic modeling, and most of them involve some form of matrix decomposition of the term-document matrix (Latent Semantic Indexing, Latent Dirichlet Allocation).

Representing text properly can be really helpful in making new friends. In order to be able to properly capitalize our friend’s name in the sentence below, we need to recognize the first “stone” as a person name and the second as an object. This topic is further developed in my other article Truecasing in natural language processing.

If you are really interested in the problem of representing natural language text, I would recommend the following book as further reading: Speech and Language Processing, 3rd Ed. by Dan Jurafsky and James Martin, 2018. I drew heavily on insights from that book in this article.

Thank you for reading.