Expressing and processing the nuance and wildness of language—while achieving the strong transfer of information that language is intended to achieve—makes representing words an endlessly fascinating problem.

To compute with words, we need to represent them as vectors. Before looking at specific methods, recall the goal.

Goal

Learn rich representations of complex objects (here, word vectors) from data.

Independent Word Vectors

One-hot Vectors: The simplest way to represent words is to use a vector of length $V$, where $V$ is the number of words in the vocabulary. The $i$-th element of the vector is 1 if the word is the $i$-th word in the vocabulary, and 0 otherwise.

For example, words from a vocabulary set $\{\dots, tea, \dots, coffee, \dots, cat, \dots, dog\}$ will be represented as:

$$v_{tea} = [0, 0, 1, \dots, 0], \qquad v_{coffee} = [\dots, 0, 0, 1, \dots] \tag{1}$$

$$v_{tea}^{\top} v_{coffee} = v_{coffee}^{\top} v_{tea} = 0 \tag{2}$$

Limitation: One-hot encoding is very simple to build, but it has a big problem: the vectors are sparse, and any two distinct words are orthogonal, so no similarity between them is encoded at all. In fact, many words share a similar meaning or context, which motivates the next step.
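The orthogonality in Eqn. 2 is easy to check by hand. The snippet below is a minimal sketch, assuming a toy four-word vocabulary (the word order and the helper names `one_hot` and `dot` are illustrative choices, not part of the notes).

```python
def one_hot(word, vocab):
    """Return a list of length len(vocab) with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


# Toy vocabulary (an assumption made for illustration).
vocab = ["tea", "coffee", "cat", "dog"]
v_tea = one_hot("tea", vocab)
v_coffee = one_hot("coffee", vocab)

print(v_tea)                 # [1, 0, 0, 0]
print(dot(v_tea, v_coffee))  # 0: distinct words are always orthogonal
print(dot(v_tea, v_tea))     # 1
```

However related tea and coffee are in meaning, their dot product stays 0, which is exactly the limitation described above.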

Human-annotated Word Vectors

Words carry several kinds of information. There is grammatical information, like plurality. There is derivational information, like how runners is something like the verb to run plus a notion of “doer”, or agent (think: one who runs). There is also semantic information, like how runners might be a hyponym of humans, or animals, or entities. (A hyponym is the more specific member of an is-a relationship; e.g., a runner is a human.)

WordNet ¹: it annotates for synonyms, hyponyms, and other semantic relations;
UniMorph ²: it annotates for morphology (subword structure) information across many languages.

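$$v_{tea} = \begin{bmatrix} 0 \\ 0 \\ 1 \\ \vdots \\ 1 \end{bmatrix} \begin{matrix} \text{(plural noun)} \\ \text{(3rd singular verb)} \\ \text{(hyponym-of-beverage)} \\ \vdots \\ \text{(synonym-of-chai)} \end{matrix} \tag{3}$$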

Limitation

Updating these resources is costly and they’re always incomplete.
Representing all of these categories requires a very high-dimensional vector (much larger than the vocabulary size).

Distributional Vectors

Take part of the data (maybe a word in a sentence) and attempt to predict other parts of the data (other words) with it. While simple, this is one of the most influential and successful ideas in all of modern NLP.

Co-occurrence matrices

This is a document-level method. It builds a matrix $X$ in which each row and each column corresponds to a word in the vocabulary, so a row such as $X_{tea} \in \mathbb{R}^{|V|}$ is the vector for that word. The cell $(i, j)$ holds the number of times word $i$ and word $j$ appear together within a window of a chosen size throughout the corpus.

$$[\text{It's hot and delicious. } [\text{I poured } [\text{the } \underbrace{\text{tea}}_{\text{center word}} \text{ for}]_{1} \text{ my uncle}]_{3}]_{document}$$
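To make the counting concrete, here is a minimal sketch that builds window-based co-occurrence counts. It assumes pre-tokenized, lowercased sentences and a symmetric window; the function name `cooccurrence_counts` and the dictionary layout are illustrative choices.

```python
from collections import defaultdict


def cooccurrence_counts(sentences, window=1):
    """Count how often each (word, context) pair occurs within `window` tokens."""
    counts = defaultdict(int)
    for tokens in sentences:
        for i, center in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a word is not its own context
                    counts[(center, tokens[j])] += 1
    return counts


# The second sentence of the example above, tokenized by hand.
corpus = [["i", "poured", "the", "tea", "for", "my", "uncle"]]

X = cooccurrence_counts(corpus, window=1)
print(X[("tea", "the")], X[("tea", "for")], X[("tea", "uncle")])  # 1 1 0

X3 = cooccurrence_counts(corpus, window=3)
print(X3[("tea", "uncle")])  # 1: the wider window now reaches "uncle"
```

Widening the window from 1 to 3 changes which words count as context for tea, which is the knob discussed in the feature list that follows.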

Feature

Larger notions of co-occurrence (e.g., large windows or documents) lead to more semantic or even topic-encoding representations; shorter windows lead to more syntax-encoding representations.
It’s a sparse representation based on statistics.

Limitation

$V$-sized vectors are unwieldy for large vocabularies (usually > 10k words).
Very common words such as the and a are overemphasized.

Pointwise mutual information (PMI) matrix

PMI³ is the log of the ratio of the joint probability of a word pair to the product of the individual probabilities of each word.

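For the original PMI, a pair that never co-occurs gives $PMI(w, c) = \log 0 = -\infty$; a common workaround is to set it to 0. A more consistent approach is to use positive PMI (PPMI)⁴, which clips every negative value to 0. Here $N(w, c)$ is the number of times the pair is observed, $N(w)$ and $N(c)$ are the counts of the word and the context, and $N$ is the total number of observed pairs.

$$\begin{aligned} PMI(w, c) &= \ln \frac{p(w, c)}{p(w)\, p(c)} = \ln \frac{N(w, c) \cdot N}{N(w)\, N(c)} \\ PPMI(w, c) &= \max(PMI(w, c),\, 0) \end{aligned} \tag{4}$$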
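As a worked instance of Eqn. 4, the sketch below computes PPMI from a small table of pair counts. The counts are made up for illustration, and the function name `ppmi` and the dictionary layout are assumptions; unseen pairs are simply absent from the result, which corresponds to a score of 0.

```python
import math


def ppmi(pair_counts):
    """pair_counts maps (word, context) -> N(w, c); returns PPMI per observed pair."""
    total = sum(pair_counts.values())  # N
    n_w, n_c = {}, {}
    for (w, c), n in pair_counts.items():
        n_w[w] = n_w.get(w, 0) + n  # N(w)
        n_c[c] = n_c.get(c, 0) + n  # N(c)
    scores = {}
    for (w, c), n in pair_counts.items():
        if n == 0:
            continue  # log 0 is undefined; treat as PPMI = 0 by omission
        pmi = math.log(n * total / (n_w[w] * n_c[c]))
        scores[(w, c)] = max(pmi, 0.0)
    return scores


# Made-up counts: "the" is a frequent context, "hot" is specific to "tea".
counts = {("tea", "hot"): 4, ("tea", "the"): 4, ("cat", "the"): 8}
scores = ppmi(counts)

print(scores[("tea", "hot")] == math.log(2))  # True: ln(4*16 / (8*4)) = ln 2
print(scores[("tea", "the")])                 # 0.0: negative PMI is clipped
```

The pair (tea, the) is as frequent as (tea, hot), yet its score is clipped to 0 because the is common everywhere, which is how PMI counters the overemphasis on very frequent words.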

Limitation:

A rare word pair may have a high PMI even though the score is not a reliable representation of the pair.
For example, a rare context $c$ that co-occurred with a target word $w$ even once will often yield a relatively high PMI score, because $P(c)$, which sits in the denominator of PMI, is very small.

Latent semantic analysis (LSA)⁵

In traditional LSA, we factorize the co-occurrence matrix (originally a word-document matrix, where rows represent words and columns represent documents) with a truncated SVD to compute a lower-dimensional vector for each word. The same can be done with the PMI matrix.

Word2Vec model⁶

Now, let us revisit the problems seen so far and our goal in building word vectors. We want them to be:

1) Smaller (ideally a few hundred dimensions)
2) Expressive (capturing both syntactic and semantic information)

Skip-gram

Idea

Predict the context words using the center word. $P(\text{'it'}, \text{'seems'}, \text{'she'}, \text{'you'}, \text{'so'}, \text{'much'} \mid \text{'loves'})$

graph TD
    loves((loves))
    loves--> It((It))
    loves --> seems((seems))
    loves --> she((she))
    loves --> you((you))
    loves --> so((so))
    loves --> much((much))

Given a corpus, build a low-dimensional vector representation for each word.

Divide the corpus into sentences.
Set a window size.
Extract each center word and the context words around it within the window size.
Build two low-dimensional embedding matrices, $V$ for center words and $U$ for context words.
OUR GOAL: find the $V$ and $U$ that place context words close to their center words, which is reasonable semantically.
Assume that the words within a window in a document have something meaningful in common, so the probability of their co-occurrence should be as high as possible. Then do some math ↓.
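The first three steps above can be sketched as follows. This is a minimal illustration, assuming sentences are already tokenized; the function name `skipgram_pairs` is a hypothetical helper, and the sentence is the one used in the diagram.

```python
def skipgram_pairs(sentences, window=2):
    """Return (center, context) pairs for every position in every sentence."""
    pairs = []
    for tokens in sentences:
        for i, center in enumerate(tokens):
            for j in range(-window, window + 1):
                # Skip the center itself and positions outside the sentence.
                if j != 0 and 0 <= i + j < len(tokens):
                    pairs.append((center, tokens[i + j]))
    return pairs


sentences = [["it", "seems", "she", "loves", "you", "so", "much"]]
pairs = skipgram_pairs(sentences, window=3)

print([ctx for center, ctx in pairs if center == "loves"])
# ['it', 'seems', 'she', 'you', 'so', 'much']
```

Each pair is one training example: the model is asked to give a high probability to the context word given the center word.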

Why both $V$ and $U$?

Indeed, there will be two embedding matrices, so each word has two embedding vectors: one as center word and the other as context word.

However, in practice we use only the center word embedding matrix $V$, or the average of $U$ and $V$.

Algorithm

For convenience, write $v_{c}$ for the vector of the center word $w_{i}$ and $u_{o}$ for the vector of a context word $w_{i + j}$, so that $P_{U, V} (u_{o} \mid v_{c}) = P_{U, V} (w_{i + j} \mid w_{i})$. Here $k$ is the window size, so $i - k$ is the start index of the window; $T$ is the length of the sentence $d$, and $d$ is one sentence in the corpus $D$.

The objective is the probability of context word $u_{o}$ given center word $v_{c}$:

$$P_{U, V} (u_{o} \mid v_{c}) = \frac{\exp(u_{o}^{\top} v_{c})}{\sum_{w \in V} \exp(u_{w}^{\top} v_{c})} \tag{4a}$$
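The softmax in Eqn. 4a can be sketched in a few lines. The vectors below are made-up two-dimensional examples, `context_probs` is a hypothetical helper name, and subtracting the maximum score is a standard numerical-stability trick that leaves the probabilities unchanged.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def context_probs(v_c, U):
    """Softmax over u_w^T v_c for every context vector u_w (the rows of U)."""
    scores = [dot(u_w, v_c) for u_w in U]
    m = max(scores)  # shift scores so exp() cannot overflow
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)    # the partition term in the denominator
    return [e / z for e in exps]


# Made-up context vectors for a three-word vocabulary and one center vector.
U = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
v_c = [2.0, 0.0]
probs = context_probs(v_c, U)

print(probs[0] > probs[1] > probs[2])  # True: larger dot product, larger probability
```

Note that the denominator touches every row of $U$, so a single probability costs one pass over the whole vocabulary.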

Then we want to maximize the probability:

$$\prod_{d \in D} \prod_{i = 1}^{T} \prod_{\substack{j = -k \\ j \neq 0}}^{k} P_{U, V} (w_{i + j} \mid w_{i}) \tag{4b}$$

Maximizing $P_{U, V} (u_{o} \mid v_{c})$ over all such pairs is equivalent to minimizing the negative log-likelihood. For the whole corpus $D$, define the negative log-likelihood loss:

$$L (U, V) = - \sum_{d \in D} \sum_{i = 1}^{T} \sum_{\substack{j = -k \\ j \neq 0}}^{k} \log P_{U, V} (w_{i + j} \mid w_{i}) \tag{5}$$

Work out the gradient of the loss with respect to $v_{c}$:

$$\nabla_{v_{c}} L (U, V) = - \sum_{d \in D} \sum_{i = 1}^{T} \sum_{\substack{j = -k \\ j \neq 0}}^{k} \underbrace{\nabla_{v_{c}} \log P_{U, V} (w_{i + j} \mid w_{i})}_{\text{focus on this}} \tag{6}$$

Work through the gradient step by step (switching $P_{U, V} (w_{i + j} \mid w_{i})$ back to $P_{U, V} (u_{o} \mid v_{c})$). With Eqn. 4a we get:

$$\nabla_{v_{c}} \log P_{U, V} (u_{o} \mid v_{c}) = \nabla_{v_{c}} \log \frac{\exp(u_{o}^{\top} v_{c})}{\sum_{w \in V} \exp(u_{w}^{\top} v_{c})} = \underbrace{\nabla_{v_{c}} \log \exp(u_{o}^{\top} v_{c})}_{\text{Part A}} - \underbrace{\nabla_{v_{c}} \log \sum_{w \in V} \exp(u_{w}^{\top} v_{c})}_{\text{Part B}} \tag{7}$$

Part A:

$$\nabla_{v_{c}} \log \exp(u_{o}^{\top} v_{c}) = \nabla_{v_{c}} u_{o}^{\top} v_{c} = u_{o} \tag{8a}$$

Part B:

$$\begin{aligned}
\nabla_{v_{c}} \log \sum_{w \in V} \exp(u_{w}^{\top} v_{c})
&= \frac{1}{\sum_{w \in V} \exp(u_{w}^{\top} v_{c})} \nabla_{v_{c}} \sum_{x \in V} \exp(u_{x}^{\top} v_{c}) \\
&= \frac{1}{\sum_{w \in V} \exp(u_{w}^{\top} v_{c})} \sum_{x \in V} \nabla_{v_{c}} \exp(u_{x}^{\top} v_{c}) \\
&= \frac{1}{\sum_{w \in V} \exp(u_{w}^{\top} v_{c})} \sum_{x \in V} \exp(u_{x}^{\top} v_{c}) \nabla_{v_{c}} u_{x}^{\top} v_{c} \\
&= \frac{1}{\sum_{w \in V} \exp(u_{w}^{\top} v_{c})} \sum_{x \in V} \exp(u_{x}^{\top} v_{c})\, u_{x} \\
&= \sum_{x \in V} \underbrace{\frac{\exp(u_{x}^{\top} v_{c})}{\sum_{w \in V} \exp(u_{w}^{\top} v_{c})}}_{\text{Eqn. 4a}} u_{x} \\
&= \underbrace{\sum_{x \in V} P_{U, V} (u_{x} \mid v_{c})\, u_{x}}_{\text{Expectation}} \\
&= \mathbb{E}_{x \in V} [u_{x}]
\end{aligned} \tag{8b}$$

> $w$ and $x$ both range over every word in the vocabulary; they are independent summation indices within the same equation.

Putting the two parts together, we get the gradient of the loss with respect to $v_{c}$:

$$\nabla_{v_{c}} L (U, V) = \sum_{d \in D} \sum_{i = 1}^{T} \sum_{\substack{j = -k \\ j \neq 0}}^{k} - \left(u_{o} - \mathbb{E}_{w \in V} [u_{w}]\right) \tag{9}$$
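A quick way to gain confidence in a derivation like this is to compare the analytic gradient with a finite-difference estimate. The sketch below does this for a single (center, context) pair; the vectors are made up, and all function names are illustrative.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax_probs(v_c, U):
    scores = [dot(u, v_c) for u in U]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def neg_log_lik(v_c, U, o):
    """-log P(u_o | v_c) for one pair, where o indexes a row of U."""
    return -math.log(softmax_probs(v_c, U)[o])


def grad_v_c(v_c, U, o):
    """Analytic gradient from Parts A and B: -(u_o - E[u_x])."""
    probs = softmax_probs(v_c, U)
    dim = len(v_c)
    expected = [sum(p * u[d] for p, u in zip(probs, U)) for d in range(dim)]
    return [-(U[o][d] - expected[d]) for d in range(dim)]


def numeric_grad(v_c, U, o, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = []
    for d in range(len(v_c)):
        plus, minus = list(v_c), list(v_c)
        plus[d] += eps
        minus[d] -= eps
        grad.append((neg_log_lik(plus, U, o) - neg_log_lik(minus, U, o)) / (2 * eps))
    return grad


# Made-up parameters: three context vectors and one center vector.
U = [[0.5, -0.2], [0.1, 0.8], [-0.7, 0.3]]
v_c = [0.4, -0.6]
analytic = grad_v_c(v_c, U, 0)
numeric = numeric_grad(v_c, U, 0)

print(all(abs(a - n) < 1e-6 for a, n in zip(analytic, numeric)))  # True
```

The expectation inside `grad_v_c` is the expensive part: it needs the probability of every vocabulary word, which is the cost that negative sampling later avoids.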

Summary

We now have a very simple model, as shown in Eqns. 4a to 9.

However, when we implement it, it is too expensive to go through all the words in the vocabulary for every center word to calculate $\sum_{w \in V} P_{U, V} (u_{w} \mid v_{c})\, u_{w}$.

Negative sampling is a method to address this problem.

It is discussed under Training Methods.

Continuous Bag-of-Words (CBOW)

Idea

Predict the center word using the context words. $P(\text{'loves'} \mid \text{'it'}, \text{'seems'}, \text{'she'}, \text{'you'}, \text{'so'}, \text{'much'})$

graph TD
    loves((loves))
    It((It)) --> loves
    seems((seems)) --> loves
    she((she)) --> loves
    you((you)) --> loves
    so((so)) --> loves
    much((much)) --> loves

Algorithm

CBOW is similar to skip-gram, but the direction of prediction is reversed. The derivation proceeds similarly.

Objective:

$$P_{U, V} (u_{c} \mid v_{o_{1}}, \dots, v_{o_{2k}}) = \frac{\exp\left(\frac{1}{2k} u_{c}^{\top} (v_{o_{1}} + \dots + v_{o_{2k}})\right)}{\sum_{w \in V} \exp\left(\frac{1}{2k} u_{w}^{\top} (v_{o_{1}} + \dots + v_{o_{2k}})\right)} \tag{12}$$

Let $W_{o} = \{o_{1}, \dots, o_{2k}\}$ and $\bar{v}_{o} = \frac{1}{2k} (v_{o_{1}} + \dots + v_{o_{2k}})$; then the objective becomes:

$$P_{U, V} (u_{c} \mid W_{o}) = \frac{\exp(u_{c}^{\top} \bar{v}_{o})}{\sum_{w \in V} \exp(u_{w}^{\top} \bar{v}_{o})} \tag{13}$$

Negative log-likelihood Loss (take one element):

$$L (U, V) = - \log P_{U, V} (u_{c} \mid W_{o}) = - \left(u_{c}^{\top} \bar{v}_{o} - \log \sum_{w \in V} \exp(u_{w}^{\top} \bar{v}_{o})\right) \tag{14}$$

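Gradient (first with respect to the averaged context vector $\bar{v}_{o}$, then with respect to each individual context vector $v_{o_{i}}$, which picks up the factor $\frac{1}{2k}$ from the average):

$$\begin{aligned}
\nabla_{\bar{v}_{o}} L (U, V) &= - \left(u_{c} - \sum_{x \in V} P_{U, V} (u_{x} \mid W_{o})\, u_{x}\right) = - \left(u_{c} - \mathbb{E} [u_{w}]\right) \\
\nabla_{v_{o_{i}}} L (U, V) &= - \frac{1}{2k} \left(u_{c} - \sum_{x \in V} P_{U, V} (u_{x} \mid W_{o})\, u_{x}\right) = - \frac{1}{2k} \left(u_{c} - \mathbb{E} [u_{w}]\right)
\end{aligned} \tag{15}$$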
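The CBOW forward pass and the gradient with respect to a single context vector can be sketched as below. The vectors are made up, the function names are illustrative, and the gradient uses the chain rule through the average, so each context vector receives $\frac{1}{n}$ of the gradient for the averaged vector, with $n = 2k$ context words.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cbow_probs(context_vecs, U):
    """Softmax over u_w^T v_bar, where v_bar is the mean of the context vectors."""
    n = len(context_vecs)  # n = 2k context words
    dim = len(context_vecs[0])
    v_bar = [sum(v[d] for v in context_vecs) / n for d in range(dim)]
    scores = [dot(u, v_bar) for u in U]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def cbow_loss(context_vecs, U, c):
    """Negative log-likelihood of center word c (an index into U)."""
    return -math.log(cbow_probs(context_vecs, U)[c])


def cbow_grad_context(context_vecs, U, c):
    """Gradient of the loss w.r.t. any single context vector v_{o_i}."""
    probs = cbow_probs(context_vecs, U)
    n = len(context_vecs)
    dim = len(U[0])
    expected = [sum(p * u[d] for p, u in zip(probs, U)) for d in range(dim)]
    return [-(U[c][d] - expected[d]) / n for d in range(dim)]


# Made-up parameters: three center-word vectors, two context vectors (k = 1).
U = [[0.5, -0.2], [0.1, 0.8], [-0.7, 0.3]]
context = [[0.4, -0.6], [0.2, 0.9]]
grad = cbow_grad_context(context, U, 1)
print(len(grad))  # 2: one entry per embedding dimension
```

Every context vector in the window gets the same gradient, because each one enters the loss only through the shared average.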

Global Vectors (GloVe) ⁷

GloVe Idea

GloVe combines the strengths of matrix factorization and shallow window-based methods.

Notation: Let the matrix of word-word co-occurrence counts be denoted by $X$, whose entries $X_{ij}$ tabulate the number of times word $j$ occurs in the context of word $i$. Let $X_{i} = \sum_{k} X_{ik}$ be the number of times any word appears in the context of word $i$. Finally, let $P_{ij} = P (j \mid i) = X_{ij} / X_{i}$ be the probability that word $j$ appears in the context of word $i$.
Co-occurrence probabilities: Take two words $i = ice$ and $j = steam$, and probe words $k$ such as solid or gas. We expect the ratio $P_{ik} / P_{jk}$ to be large when $k$ is related to $i$ but not $j$ (e.g., solid), and small when $k$ is related to $j$ but not $i$ (e.g., gas).

$$\frac{P_{ik}}{P_{jk}} = F (w_{i}, w_{j}, \tilde{w}_{k}) = F \left((w_{i} - w_{j})^{\top} \tilde{w}_{k}\right) = \frac{F (w_{i}^{\top} \tilde{w}_{k})}{F (w_{j}^{\top} \tilde{w}_{k})} = \frac{P_{ik} / P_{k}}{P_{jk} / P_{k}} \tag{16}$$

where $F$ is $\exp$. This gives the relation between the word vectors and the co-occurrence counts ↓

Objective

$$\begin{aligned}
w_{i}^{\top} \tilde{w}_{k} &= \log (P_{ik}) = \log (X_{ik}) - \log (X_{i}) \\
w_{i}^{\top} \tilde{w}_{k} + b_{i} + \tilde{b}_{k} &= \log X_{ik} \qquad \text{(to erase } \log (X_{i}) \text{ and keep symmetry)}
\end{aligned} \tag{17}$$

Eqn. 17 has one drawback: it weighs all co-occurrence counts equally. In fact, we want to give less weight to rare co-occurrences and a fixed, capped weight to the most frequent ones (such as those involving the or a).

Loss:

$$J = \sum_{i, j = 1}^{V} f (X_{ij}) \left(w_{i}^{\top} \tilde{w}_{j} + b_{i} + \tilde{b}_{j} - \log X_{ij}\right)^{2}, \qquad f (x) = \begin{cases} (x / x_{max})^{\alpha}, & \text{if } x < x_{max} \\ 1, & \text{otherwise} \end{cases} \tag{18}$$

where $\alpha = 3/4$ and $x_{max} = 100$.
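The weighting function and the loss of Eqn. 18 translate almost directly into code. The sketch below assumes the counts are stored as a dictionary of nonzero entries and that vectors are plain lists; the names `glove_weight` and `glove_loss` are illustrative.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(x): grows as (x / x_max)^alpha and is capped at 1 from x_max on."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def glove_loss(X, W, W_tilde, b, b_tilde):
    """Weighted least-squares loss over the nonzero counts X[(i, j)]."""
    total = 0.0
    for (i, j), x in X.items():
        if x <= 0:
            continue  # log X_ij is undefined for zero counts, and f(0) = 0 anyway
        pred = dot(W[i], W_tilde[j]) + b[i] + b_tilde[j]
        total += glove_weight(x) * (pred - math.log(x)) ** 2
    return total


# Made-up example: one observed pair, zero vectors, and a bias that fits it exactly.
X = {(0, 1): 10.0}
W = [[0.0, 0.0], [0.0, 0.0]]
W_tilde = [[0.0, 0.0], [0.0, 0.0]]
b = [math.log(10.0), 0.0]
b_tilde = [0.0, 0.0]

print(glove_loss(X, W, W_tilde, b, b_tilde))   # 0.0: prediction equals log X_ij
print(glove_weight(100.0), glove_weight(500.0))  # 1.0 1.0: weight is capped
```

Because only nonzero counts are visited, the cost scales with the number of observed pairs rather than with $V^{2}$.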

Relations with Word2Vec

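Eqn. 5 can be rewritten as follows, where $Q_{ij}$ is the model's softmax probability (as in Eqn. 4a) that word $j$ appears in the context of word $i$:

$$J = - \sum_{\substack{i \in corpus \\ j \in context(i)}} \log Q_{ij} \tag{19}$$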

If we group identical pairs using the co-occurrence matrix $X$, we get:

$$J = - \sum_{i = 1}^{V} \sum_{j = 1}^{V} X_{ij} \log Q_{ij} = - \sum_{i = 1}^{V} X_{i} \sum_{j = 1}^{V} P_{ij} \log Q_{ij} = \sum_{i = 1}^{V} X_{i}\, \text{cross\_entropy} (P_{i}, Q_{i}) \tag{20}$$

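However, cross-entropy is not well suited to distributions with long tails. A least-squares loss is introduced instead, where $\hat{P}_{ij} = X_{ij}$ and $\hat{Q}_{ij} = \exp (w_{i}^{\top} \tilde{w}_{j})$ are unnormalized and the squared error is then taken between their logarithms. Finally, the weight $X_{i}$ is replaced by the more general $f (X_{ij})$, and the bias terms are added:

$$\begin{aligned}
\hat{J} &= \sum_{i, j} X_{i} \left(\hat{P}_{ij} - \hat{Q}_{ij}\right)^{2} \\
\hat{J} &= \sum_{i, j} X_{i} \left(w_{i}^{\top} \tilde{w}_{j} - \log X_{ij}\right)^{2} \\
&\to \sum_{i, j} f (X_{ij}) \left(w_{i}^{\top} \tilde{w}_{j} \,(+\, b_{i} + \tilde{b}_{j}) - \log X_{ij}\right)^{2}
\end{aligned} \tag{21}$$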

Training Methods ⁸

Dynamic context window: The window size is dynamic and the weights of the context words are different. For example, a size-5 window will weigh its contexts by 5/5, 4/5, 3/5, 2/5, 1/5.
Subsampling: This dilutes very frequent words. A word whose relative frequency $f$ exceeds a threshold $t$ is dropped with probability $p = 1 - \sqrt{\frac{t}{f}}$.
Delete rare words.
Context distribution smoothing: In order to smooth the original contexts’ distribution, all context counts are raised to the power of $α = 0.75$
Number of negative samples: $N = 5$
Add context vectors or not
Vector normalization: None, Row, Column, Both
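Two of the items above, subsampling and context distribution smoothing, are simple enough to sketch directly. The subsampling function uses the square-root form $1 - \sqrt{t / f}$ from the word2vec paper; the default threshold `t=1e-5`, the toy counts, and the function names are assumptions made for illustration.

```python
import math


def discard_prob(freq, t=1e-5):
    """Probability of dropping a token whose relative corpus frequency is `freq`.

    Words rarer than the threshold t are never dropped (probability 0).
    """
    return max(0.0, 1.0 - math.sqrt(t / freq))


def smoothed_context_dist(context_counts, alpha=0.75):
    """Raise counts to the power alpha and renormalize into a distribution."""
    powered = {c: n ** alpha for c, n in context_counts.items()}
    z = sum(powered.values())
    return {c: p / z for c, p in powered.items()}


print(discard_prob(1e-6))  # 0.0: a rare word is always kept

# Toy counts (an assumption): "the" is 81 times more frequent than "tea".
dist = smoothed_context_dist({"the": 81, "tea": 1})
print(dist["tea"] > 1 / 82)  # True: smoothing shifts mass toward the rare word
```

With $\alpha = 0.75$ the share of the rare word rises from 1/82 to about 1/28, so rare contexts are sampled more often than their raw counts would suggest.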

Negative sampling ⁹

Let’s look again at the probability of the context word $u_{o}$ given the center word $v_{c}$:

$$P_{U, V} (u_{o} \mid v_{c}) = \frac{\exp(u_{o}^{\top} v_{c})}{\sum_{w \in V} \exp(u_{w}^{\top} v_{c})} \tag{22}$$

Notice that:

The so-called probability here is not the same as a statistical probability. We just borrow the concept and want $P_{U, V} (u_{o} \mid v_{c})$ to be as high as possible. There are two components:
The $\exp$ function ensures that the value is positive, while the partition term makes all the $\exp$ values sum to 1. Positivity can also be achieved with the sigmoid $\sigma (x) = \frac{1}{1 + \exp(- x)}$.
$\exp(u_{o}^{\top} v_{c})$ encourages the model to make $u_{o}$ more like $v_{c}$; $\sum_{w \in V} \exp(u_{w}^{\top} v_{c})$ encourages the model to make all the other $u_{w}$ for $w \neq o$ less like $v_{c}$. We can still use the context words within the window to encourage similarity, but sample only $k$ other words from the vocabulary to discourage, instead of summing over all of them.

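Then the objective becomes a product of sigmoids, one for the observed context word and one for each of the $k$ sampled negative words $u_{l}$:

$$P_{U, V} (u_{o} \mid v_{c}) = \sigma (u_{o}^{\top} v_{c}) \prod_{l = 1}^{k} \sigma (- u_{l}^{\top} v_{c}) \tag{23}$$

Negative log-likelihood loss (for one pair):

$$L (U, V) = - \log P_{U, V} (u_{o} \mid v_{c}) = - \log \left(\sigma (u_{o}^{\top} v_{c}) \prod_{l = 1}^{k} \sigma (- u_{l}^{\top} v_{c})\right) = - \log \sigma (u_{o}^{\top} v_{c}) - \sum_{l = 1}^{k} \log \sigma (- u_{l}^{\top} v_{c}) \tag{24}$$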
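The loss for one training pair can be sketched as below. It uses the standard form of negative sampling, in which the negatives enter as a product of sigmoids so that the log turns into a sum of log-sigmoids. The vectors are made up, and the negatives are passed in directly rather than sampled, to keep the example deterministic.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def neg_sampling_loss(v_c, u_o, negatives):
    """-log sigma(u_o^T v_c) - sum over negatives of log sigma(-u_l^T v_c)."""
    loss = -math.log(sigmoid(dot(u_o, v_c)))       # pull the true context closer
    for u_l in negatives:
        loss -= math.log(sigmoid(-dot(u_l, v_c)))  # push each negative away
    return loss


# Made-up vectors: the true context points the same way as the center word.
v_c = [1.0, 0.0]
u_o = [2.0, 0.0]
negatives = [[-2.0, 0.0], [0.0, 1.0]]

good = neg_sampling_loss(v_c, u_o, negatives)
bad = neg_sampling_loss(v_c, [-2.0, 0.0], [[2.0, 0.0], [0.0, 1.0]])
print(good < bad)  # True: a well-aligned pair has the lower loss
```

The cost per pair is now proportional to $k + 1$ dot products instead of one per vocabulary word.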

Evaluation ¹⁰

There are two schemes to evaluate the model:

Extrinsic
- Use word embeddings as input to a downstream task, such as part-of-speech tagging or named-entity recognition, and evaluate the performance on that task.
- It is not clear how it connects to other measures.
Intrinsic
- Test the syntactic or semantic relationships between words directly using query inventories.

Cosine similarity

$$similarity (a, b) = \frac{a \cdot b}{\lVert a \rVert \cdot \lVert b \rVert} \tag{25}$$
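As a small illustration of Eqn. 25, here is a sketch of cosine similarity on plain lists. The function name and the example vectors are illustrative, and the function assumes neither vector is all zeros.

```python
import math


def cosine_similarity(a, b):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)  # assumes both norms are nonzero


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0: orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0: opposite directions
```

Because the norms are divided out, only the direction of the vectors matters, not their length.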

Absolute intrinsic evaluation

Relatedness: These datasets contain relatedness scores for pairs of words; the cosine similarity of the embeddings for two words should have high correlation (Spearman or Pearson) with human relatedness scores.
Analogy: This task was popularized by Mikolov. The goal is to find a term x for a given term y so that x : y best resembles a sample relationship a : b. In the formulas below the analogy is written $a : a^{*} :: b : b^{*}$, and $b^{*}$ is the term to find.
$$\begin{aligned} \text{CosAdd} &= \arg\max_{b^{*}} \cos (b^{*}, a^{*} - a + b) = \arg\max_{b^{*}} \left[\cos (b^{*}, a^{*}) - \cos (b^{*}, a) + \cos (b^{*}, b)\right] \\ \text{CosMul} &= \arg\max_{b^{*}} \frac{\cos (b^{*}, a^{*}) \cdot \cos (b^{*}, b)}{\cos (b^{*}, a) + \varepsilon} \end{aligned} \tag{26}$$
Categorization: The goal is to recover a clustering of words into different categories. To do this, the corresponding word vectors of all words in a dataset are clustered and the purity of the returned clusters is computed with respect to the labeled dataset.
Selectional preference: The goal is to determine how typical a noun is for a verb either as a subject or as an object (e.g., people eat, but we rarely eat people).
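The CosAdd objective for the analogy task can be sketched as a search over the vocabulary. The three-dimensional embeddings below are hand-made for illustration (they are not trained vectors), `cos_add` is a hypothetical helper, and excluding the three query words from the candidates is a common convention that is assumed here.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))


def cos_add(embeddings, a, a_star, b):
    """Solve a : a_star :: b : ? by maximizing cos(b*, a*) - cos(b*, a) + cos(b*, b)."""
    best_word, best_score = None, float("-inf")
    for word, vec in embeddings.items():
        if word in (a, a_star, b):
            continue  # assumed convention: never return one of the query words
        score = (cosine(vec, embeddings[a_star])
                 - cosine(vec, embeddings[a])
                 + cosine(vec, embeddings[b]))
        if score > best_score:
            best_word, best_score = word, score
    return best_word


# Hand-made toy embeddings; dimensions loosely mean (royal, male, female).
embeddings = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "tea": [0.1, 0.5, 0.5],
}

print(cos_add(embeddings, "man", "woman", "king"))  # queen
```

CosMul would replace the sum of cosines with the ratio shown in Eqn. 26, using the same loop.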

For Relatedness

Query inventory: WordSim, MEN.
(1) the frequency of rare words (2) the parts of speech of words and (3) abstractness vs. concreteness.

Comparative intrinsic evaluation

In comparative evaluation, users give direct feedback on the embeddings themselves, so we do not have to define a metric that compares scored word pairs.

Extrinsic evaluation

Extrinsic evaluations measure the contribution of a word embedding model to a specific task. There is an implicit assumption in the use of such evaluations that there is a consistent, global ranking of word embedding quality, and that higher quality embeddings will necessarily improve results on any downstream task. In practice, however, different tasks favor different embeddings.

Code Example

Simple code for skip-gram & CBOW with negative sampling
