Expressing and processing the nuance and wildness of language—while achieving the strong transfer of information that language is intended to achieve—makes representing words an endlessly fascinating problem.

To better compute with words, we need to represent them as vectors. So we move to some methods. Remember our goal.

Goal

Learn rich representations of complex objects from data to get the word vectors

Independent Word Vectors

One-hot Vectors: The simplest way to represent words is to use a vector of length , where is the number of words in the vocabulary. Each word is represented by a vector of length , where the element of the vector is 1 if the word is the word in the vocabulary, and 0 otherwise.

For example, a vocabulary set will be represented as:

Limitation: Well, one-hot vector encoding is very simple to build, but it has a big problem—sparsity—every two words are orthogonal with NO similarities at all. But in fact there are many words that share the similar meaning or context. Then we go next step.


Human-annotated Word Vectors

There is grammatical information, like plurality, there’s derivational information, like how the runners is something like the verb to run plus a notion of “doer”, or agent (think one who runs.) There’s also semantic information, like how runners might be a hyponym of humans, or animals, or entities. (A hyponym is a member of an is-a relationship; e.g., a runner is a human.)

  1. WordNet 1: it annotates for synonyms, hyponyms, and other semantic relations;
  2. UniMorph 2: it annotates for morphology (subword structure) information across many languages.

Limitation

  • Updating these resources is costly and they’re always incomplete.
  • A very high-dimensional vector (much larger than the vocabulary size) to represent all of these categories.

Distributional Vectors

Take part of the data (maybe a word in a sentence) and attempt to predict other parts of the data (other words) with it. While simple, this is one of the most influential and successful ideas in all of modern NLP.

Co-occurrence matrices

This is a document-level method. It is a matrix (, a row in the matrix) where each row and column represents a word in the corpus. The cell (i, j) represents the number of times word i and word j appear together in a specific window size throughout the corpus.

Feature

  • Larger notions of co-occurrence (e.g., large windows or documents) lead to more semantic or even topic - encoding representations; shorter windows lead to more syntax - encoding representations.
  • It’s a sparse representation based on statistics.

Limitation

  • - sized vectors are unweildy for large vocabularies (usually > 10k).
  • Overemphesize of the very common words like: The, a, etc.

Pointwise mutual information (PMI) matrix

PMI3 is the log of the ratio of the joint probability of a word pair to the product of the individual probabilities of each word.

For original PMI, when , we set it to 0 in a common way. A more consistent way is to use the PPMI (positive PMI)4

Limitation:

  • A rare word pair may have a high PMI, but it’s not a good representation of the word pair.
  • For example, A rare context that co-occurred with a target word even once, will often yield relatively high PMI score because

Latent semantic analysis (LSA)5

In traditional LSA, we use the co-occurrence matrix (originally, the row represents word and the colomn represents doc) to compute a lower dimension vector for word. But we can also use PMI matrix to do so.

Word2Vec model6

Ok, rethink about the problem we are facing now and our goal of building word vector. We want it to be:

1) Smaller (better in hundreds dimensions)
2) and Representing Ability (syntatic & semantic)

Skip-gram

Idea

Predict the context words using the center word.

graph TD
    loves((loves))
    loves--> It((It))
    loves --> seems((seems))
    loves --> she((she))
    loves --> you((you))
    loves --> so((so))
    loves --> much((much))
  • Given a corpus, build a low-dimensional vector representation for each word.
  1. divide the corpus into sentences.
  2. set a window size.
  3. extract the center word and the context words around it within the window size.
  4. build two embedding matrices (low dimension), for center words and for context words.
  5. OUR GOAL: find the best and make the context words close to the center words which is resonable semantically.
  6. suppose that these words within a window size in a document should have something meaningful in common. So the probability of the co-occurrence of these words shall be as high as it could be. Then do some math ↓ .

? ?

  • Indeed, there will be two embedding matrices, so each word has two embedding vectors: one as center word and the other as context word.
  • However, practically we only use center word embedding matrix or we use average of and .

Algorithm

for some convenience, here as the center word and as the context words, equals that ; is the window size and is the start index of the window. is the length of the sentence and is one sentence in the corpus .

  1. Objective is the probability of context word given center word as:
  • Then we want to maximize the probability:
  1. Maximize , which is equivalent to minimize the negative log-likelihood, and for the whole corpus , define the negative log-likelihood Loss:
  1. Work out the gradient of Loss with respect to :
  1. Propogate the gradient gradually (change back), and with Eqn4 we can get:
  1. Part A:
  1. Part B:

> and both represent the each word in the vocabulary, counting independently in the same equation.

  1. Put two parts together, we get the derivative of Loss with respect to :

Summary

  • Now that we’ve got a very simple model as shown in Eqn 1~9.
  • However when we implement it on computers, it’s not possible to go through all the words in the vocabulary for each center word to calculate .
  • Negative sampling is a method to address this problem.
  • Discussed in Traing methods

Continuous Bag-of-Words (CBOW)

Idea

Predict the center word using the context words.

graph TD
    loves((loves))
    It((It)) --> loves
    seems((seems)) --> loves
    she((she)) --> loves
    you((you)) --> loves
    so((so)) --> loves
    much((much)) --> loves

Algorithm CBOW is similar to Skip-gram, but predicting is inverse.

Similarly

  1. Objective:
  1. Suppose and , then we can get the objective:
  1. Negative log-likelihood Loss (take one element):
  1. Gradient:

Global Vectors (GloVe) 7

GloVe Idea

This is a great idea that combines both matrix factorization and shallow window-based methods.

  1. Notation: Let the matrix of word-word co-occurrence counts be denoted by , whose entries tabulate the number of times word j occurs in the context of word i. Let be the number of times any word appears in the context of word i. Finally, let be the probability that word j appear in the context of word i.
  2. Co-occurrence probabilities: If we take three words like , and , then we expect the ratio to be some value bigger when k is close to i and smaller when k is close to j.

Where F is . We can get the relations between word vectors and the co-occurrence counts ↓ Objective

For Eqn. 17, there is one drawback: it weighs the co-occurrence counts equally. But in fact we want give less weight to less common words and fixed weights to the most common words (like the, a).

  1. Loss:

Where and

Relations with Word2Vec

  1. The Eqn.5 can be:
  1. If we use the co-occrence matrix , we can get:
  1. While cross_entropy is not good to capture the distribution with long tails. Then introduce the least square loss:

Training Methods 8

  1. Dynamic context window: The window size is dynamic and the weights of the context words are different. For example, a size-5 window will weigh its contexts by 5/5, 4/5, 3/5, 2/5, 1/5.
  2. Subsampling: This is used to diluting very frequent words with a threshold of . .
  3. Delete rare words.
  4. Context distribution smoothing: In order to smooth the original contexts’ distribution, all context counts are raised to the power of
  5. Number of negative samples:
  6. Add context vectors or not
  7. Vector normalization: None, Row, Column, Both

Negative sampling 9

Let’s re-look at the probability of the context word given the center word :

  1. Notice that:
  • The so-called probability here is not the same with the statistical probability. We just borrow the concept and want is as high as possible. There are two components:
  • The function ensure the value to be positive while the partition term is to make the sum of all the values to be 1. The non-negative function can also be done by sigmoid .
  • encourages the model to make more like ; encourages the model to make all the other for less like . We could still use the window sized context words to encorage but only sample other words in the vocabulary to discourage instead.
  1. Then the Objective could be:
  1. Negative log-likelihood Loss (take one element):

Evaluation 10

There are two schemes to evaluate the model:

  1. Extransic
    • Use word embeddings as input to a downstream task and evaluate the performance like part-of-speech tagging and named-entity recognition.
    • It is not clear how it connects to other measures.
  2. Intransisic
    • Test the syntactic or semantic relationship between words directly using query inventories.

Cosine similarity

Absolute intransic evaluation

  1. Relatedness These datasets contain relatedness scores for pairs of words; the cosine similarity of the embeddings for two words should have high correlation (Spearman or Pearson) with human relatedness scores.

  2. Analogy This task was popularized by Mikolov. The goal is to find a term x for a given term y so that x : y best resembles a sample relationship a : b.

  3. Categorization The goal is to recover a clustering of words into different categories. To do this, the corresponding word vectors of all words in a dataset are clustered and the purity of the returned clusters is computed with respect to the labeled dataset.

  4. Selectional preference The goal is to determine how typical a noun is for a verb either as a subject or as an object (e.g., people eat, but we rarely eat people).

For Relatedness

  • Query inventory: WordSim, MEN.
  • (1) the frequency of rare words (2) the parts of speech of words and (3) abstractness vs. concreteness.

Comparative intransic evaluation

In comparative evaluation, users give direct feedback on the embeddings themselves, so we do not have to define a metric that compares scored word pairs.

Extrinsic evaluation

Extrinsic evaluations measure the contribution of a word embedding model to a specific task. There is an implicit assumption in the use of such evaluations that there is a consistent, global ranking of word embedding quality, and that higher quality embeddings will necessarily improve results on any downstream task. Different tasks favor different embeddings

Code Example

Simple code for skip-gram & CBOW with negative sampling

References

Footnotes

  1. WordNet: a lexical database for English

  2. UniMorph 4.0: Universal Morphology

  3. Word association norms, mutual information, and lexicography

  4. Extracting Semantic Representations from Word Co-occurrence Statistics: A Computational Study

  5. Latent Semantic Analysis

  6. Efficient Estimation of Word Representations in Vector Space

  7. GloVe: Global Vectors for Word Representation

  8. Improving Distributional Similarity with Lessons Learned from Word Embeddings

  9. Negative Sampling for Word Embeddings

  10. Evaluation methods for unsupervised word embeddings