Content

In Assignment 3, we explore why multi-head self-attention and positional encoding are useful, and we work with minGPT, a small but powerful pre-trained model.

Attention exploration

For simple single-head self-attention, the output vector $c$ and the attention weights $α_{i}$ are computed with the following formula:

$$c = \sum_{i=1}^{n} v_{i} α_{i}, \qquad α_{i} = \frac{\exp(k_{i}^{⊤} q)}{\sum_{j=1}^{n} \exp(k_{j}^{⊤} q)} \tag{1}$$

Single head

Suppose that all the keys are orthogonal, i.e., $k_{i}^{⊤} k_{j} = 0$ for all $i \neq j$.

If $q$ is closely aligned with $k_{i}$ and has a large norm, the attention weight $α_{i}$ will be close to 1, so the output vector $c$ will be very similar to $v_{i}$.
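The copying behaviour can be checked numerically. The following is a minimal NumPy sketch of Eq. (1), not the assignment's code; the function name `attention`, the toy keys (rows of the identity matrix) and the toy values are assumptions chosen only for illustration.

```python
import numpy as np

def attention(q, K, V):
    """Single-head attention as in Eq. (1); rows of K and V are k_i and v_i."""
    scores = K @ q                            # k_i^T q for every i
    weights = np.exp(scores - scores.max())   # subtract the max for numerical stability
    weights /= weights.sum()                  # softmax gives the weights alpha_i
    return weights @ V, weights               # c = sum_i alpha_i v_i

# Toy data (assumed): orthonormal keys and arbitrary 2-d values.
K = np.eye(4)
V = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0]])

q = 20.0 * K[2]               # query strongly aligned with the third key
c, alpha = attention(q, K, V)
print(alpha.round(4), c.round(4))
```

With a large enough scale on $q$, almost all of the weight lands on the aligned key, so `c` is numerically indistinguishable from the third value.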

Suppose we want a query vector $q$ that combines information from two values $v_{a}$ and $v_{b}$, with corresponding keys $k_{a}$ and $k_{b}$:

$$q = s \cdot \frac{k_{a} + k_{b}}{\lVert k_{a} + k_{b} \rVert} \tag{2}$$

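where $s$ is a large positive scalar. With this combination, and keys of equal norm, the attention weights on $k_{a}$ and $k_{b}$ are each close to $\frac{1}{2}$ (the weights sum to 1), so we achieve $c \approx \frac{1}{2}(v_{a} + v_{b})$.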
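As a quick sanity check of Eq. (2), the sketch below builds the combined query for two orthonormal keys and inspects the result. Because the softmax weights sum to 1, the output is the average of the two values rather than their plain sum. The toy keys, values and the scale `s = 50` are assumptions for illustration.

```python
import numpy as np

def attention(q, K, V):
    """Single-head attention as in Eq. (1); rows of K and V are k_i and v_i."""
    scores = K @ q
    weights = np.exp(scores - scores.max())   # stable softmax
    weights /= weights.sum()
    return weights @ V, weights

# Toy data (assumed): orthonormal keys, so k_a and k_b have equal norm.
K = np.eye(4)
V = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0]])
a, b, s = 0, 1, 50.0

q = s * (K[a] + K[b]) / np.linalg.norm(K[a] + K[b])   # Eq. (2)
c, alpha = attention(q, K, V)
print(alpha.round(4), c)
```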

However, suppose one key vector, say $k_{a}$, has large variance in its magnitude. When $k_{a}$ happens to have a large magnitude, the score $k_{a}^{⊤} q$ dominates all the others, so the attention weight $α_{a}$ will be very high and the output vector $c$ will be very similar to $v_{a}$ alone.

Multi-head

Suppose we have two heads. We can then give each head its own query, with $μ_{a}$ and $μ_{b}$ denoting the means of the keys $k_{a}$ and $k_{b}$, and combine the two values as before:

$$q_{1} = s \cdot \frac{μ_{a}}{\lVert μ_{a} \rVert}, \qquad q_{2} = s \cdot \frac{μ_{b}}{\lVert μ_{b} \rVert} \tag{3}$$

In this case, even if the key along $μ_{a}$ sometimes reaches a very large magnitude, the only impact is that the first head's output $c_{a}$ focuses even more sharply on $v_{a}$. Averaging the two heads, the output vector $c$ will still be very similar to $\frac{1}{2}(v_{a} + v_{b})$.
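The difference between the single-head query of Eq. (2) and the two-head queries of Eq. (3) can be illustrated with a small experiment. This is a sketch under assumptions that are not in the original notes: the mean keys are orthonormal, the sampled key is $k_{a} = γ μ_{a}$ for a few hand-picked values of $γ$, and the two heads are combined by averaging.

```python
import numpy as np

def attention(q, K, V):
    """Single-head attention as in Eq. (1); returns the output vector c."""
    scores = K @ q
    weights = np.exp(scores - scores.max())   # stable softmax
    weights /= weights.sum()
    return weights @ V

mu = np.eye(4)   # assumed mean keys mu_i (orthonormal)
V = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0]])
a, b, s = 0, 1, 50.0

q_single = s * (mu[a] + mu[b]) / np.linalg.norm(mu[a] + mu[b])   # Eq. (2)
q1 = s * mu[a] / np.linalg.norm(mu[a])                           # Eq. (3), head 1
q2 = s * mu[b] / np.linalg.norm(mu[b])                           # Eq. (3), head 2

target = 0.5 * (V[a] + V[b])
errors = {}
for gamma in (0.5, 1.0, 1.5):
    K = mu.copy()
    K[a] = gamma * mu[a]          # k_a keeps its direction, its magnitude varies
    c_single = attention(q_single, K, V)
    c_multi = 0.5 * (attention(q1, K, V) + attention(q2, K, V))  # average of heads
    errors[gamma] = (np.linalg.norm(c_single - target),
                     np.linalg.norm(c_multi - target))
    print(gamma, errors[gamma])
```

In this toy setting the single-head output drifts toward $v_{a}$ or $v_{b}$ as soon as the magnitude of $k_{a}$ changes, while the averaged two-head output stays at the midpoint.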

Positional embedding

Note

Without positional embeddings, self-attention has no notion of word order. If the input is permuted by a permutation matrix $P$, the attention output is simply permuted in the same way and is otherwise unchanged:

$H_{\text{perm}} = P H$

This is problematic because the meaning of a sentence often depends on the order of its words. For example, “The cat sat on the mat” has a different meaning than “The mat sat on the cat”, although both contain exactly the same words. If the model cannot distinguish between these two sequences because it ignores order, it may fail to capture syntactic and semantic information that is necessary for understanding and generating coherent text.
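The relation $H_{\text{perm}} = P H$ can be verified numerically. The sketch below uses a plain (unscaled) self-attention layer with random projection matrices; the names `self_attention`, `Wq`, `Wk`, `Wv` and the sizes are assumptions for illustration only.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Self-attention over the rows of X, with no positional information."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))  # stable softmax
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
T, d = 5, 8
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(T)
P = np.eye(T)[perm]            # permutation matrix: (P @ X)[i] == X[perm[i]]

H = self_attention(X, Wq, Wk, Wv)
H_perm = self_attention(P @ X, Wq, Wk, Wv)
print(np.allclose(H_perm, P @ H))
```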

Fixed method

$$Φ_{(t, 2i)} = \sin\left(t / 10000^{2i/d}\right), \qquad Φ_{(t, 2i+1)} = \cos\left(t / 10000^{2i/d}\right) \tag{4}$$

where $t \in \{0, 1, \dots, T - 1\}$ and $i \in \{0, 1, \dots, d/2 - 1\}$.
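Eq. (4) translates almost directly into array code. The following NumPy sketch builds the full $T \times d$ table; the function name `sinusoidal_encoding` and the example sizes are assumptions, and $d$ is assumed to be even.

```python
import numpy as np

def sinusoidal_encoding(T, d):
    """Return the (T, d) table Phi of Eq. (4); d is assumed to be even."""
    assert d % 2 == 0, "d must be even"
    t = np.arange(T)[:, None]              # positions 0..T-1, shape (T, 1)
    i = np.arange(d // 2)[None, :]         # frequency index 0..d/2-1, shape (1, d/2)
    angles = t / 10000.0 ** (2 * i / d)    # shape (T, d/2)
    Phi = np.zeros((T, d))
    Phi[:, 0::2] = np.sin(angles)          # even columns 2i
    Phi[:, 1::2] = np.cos(angles)          # odd columns 2i+1
    return Phi

Phi = sinusoidal_encoding(T=16, d=8)
print(Phi.shape)
```

The table is typically added to the token embeddings before the first attention layer.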

Learnable method

Alternatively, we can simply treat the positional encoding as a learnable parameter that is trained together with the rest of the model.

Rotary Positional Encoding

$$\text{RoPE}(x_{t}^{(1)}, x_{t}^{(2)}, t) = \begin{pmatrix} \cos tθ & -\sin tθ \\ \sin tθ & \cos tθ \end{pmatrix} \begin{pmatrix} x_{t}^{(1)} \\ x_{t}^{(2)} \end{pmatrix} \tag{5}$$

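where $t$ is the position index and $θ$ is the rotation angle. For a $d$-dimensional vector, the $i$-th pair of coordinates is rotated with its own angle $θ_{i} = 10000^{-2(i - 1)/d}$, $i \in \{1, \dots, d/2\}$: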

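$$\begin{pmatrix}
\cos tθ_{1} & -\sin tθ_{1} & 0 & 0 & \cdots & 0 & 0 \\
\sin tθ_{1} & \cos tθ_{1} & 0 & 0 & \cdots & 0 & 0 \\
0 & 0 & \cos tθ_{2} & -\sin tθ_{2} & \cdots & 0 & 0 \\
0 & 0 & \sin tθ_{2} & \cos tθ_{2} & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & 0 & \cdots & \cos tθ_{d/2} & -\sin tθ_{d/2} \\
0 & 0 & 0 & 0 & \cdots & \sin tθ_{d/2} & \cos tθ_{d/2}
\end{pmatrix}
\begin{pmatrix} x_{t}^{(1)} \\ x_{t}^{(2)} \\ x_{t}^{(3)} \\ x_{t}^{(4)} \\ \vdots \\ x_{t}^{(d-1)} \\ x_{t}^{(d)} \end{pmatrix} \tag{6}$$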

$$\begin{pmatrix} \cos tθ_{1} + i \sin tθ_{1} \\ \cos tθ_{2} + i \sin tθ_{2} \\ \vdots \\ \cos tθ_{d/2} + i \sin tθ_{d/2} \end{pmatrix} \odot \begin{pmatrix} x_{t}^{(1)} + i x_{t}^{(2)} \\ x_{t}^{(3)} + i x_{t}^{(4)} \\ \vdots \\ x_{t}^{(d-1)} + i x_{t}^{(d)} \end{pmatrix} \tag{7}$$

Eq. (6) and Eq. (7) are equivalent. This positional embedding is a relative method: the dot product between a rotated query and a rotated key depends only on the difference between their positions.
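Both claims can be checked numerically. The sketch below implements the block-diagonal matrix form and the complex form side by side; the function names `rope_matrix` and `rope_complex` are hypothetical, and the angles follow $θ_{i} = 10000^{-2(i-1)/d}$ with an even $d$.

```python
import numpy as np

def rope_matrix(x, t):
    """Matrix form: multiply x by the block-diagonal rotation matrix of Eq. (6)."""
    d = x.shape[0]
    theta = 10000.0 ** (-2 * np.arange(d // 2) / d)   # theta_i, with i - 1 = 0..d/2-1
    R = np.zeros((d, d))
    for j, th in enumerate(theta):
        c, s = np.cos(t * th), np.sin(t * th)
        R[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[c, -s], [s, c]]
    return R @ x

def rope_complex(x, t):
    """Complex form of Eq. (7): pair up coordinates and multiply by e^{i t theta}."""
    d = x.shape[0]
    theta = 10000.0 ** (-2 * np.arange(d // 2) / d)
    z = (x[0::2] + 1j * x[1::2]) * np.exp(1j * t * theta)
    out = np.empty(d)
    out[0::2], out[1::2] = z.real, z.imag
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

same_result = np.allclose(rope_matrix(q, 5), rope_complex(q, 5))
# Positions (7, 3) and (14, 10) share the same offset of 4.
score_1 = rope_matrix(q, 7) @ rope_matrix(k, 3)
score_2 = rope_matrix(q, 14) @ rope_matrix(k, 10)
print(same_result, np.isclose(score_1, score_2))
```

The complex form avoids building a $d \times d$ matrix, which is presumably why it is the more convenient one to implement in practice.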

Language pre-trained model

In this part, we have several tasks to do. All the details are in the Code part.

Implement finetuning and make predictions;
Define a span corruption function;
Pretrain, finetune and make predictions;
Implement RoPE function;
Pretrain, finetune and make predictions;
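To give a rough idea of what a span corruption function does, here is a minimal character-level sketch. It is not the assignment's implementation: the mask character `MASK`, the span-length rule and the output layout (prefix, mask, suffix, mask, removed span) are all assumptions made for illustration.

```python
import random

MASK = "⁇"   # hypothetical mask character, assumed not to occur in the text

def span_corrupt(text, rng, mask=MASK):
    """Cut a random span out of `text` and move it to the end.

    Output layout (assumed): prefix + mask + suffix + mask + removed span.
    """
    if len(text) < 2:
        raise ValueError("text is too short to corrupt")
    span_len = rng.randint(1, max(1, len(text) // 4))   # assumed span-length rule
    start = rng.randint(0, len(text) - span_len)        # randint is inclusive
    prefix = text[:start]
    span = text[start:start + span_len]
    suffix = text[start + span_len:]
    return prefix + mask + suffix + mask + span

rng = random.Random(0)
original = "the quick brown fox jumps"
corrupted = span_corrupt(original, rng)
print(corrupted)
```

A model pretrained on such strings has to predict the removed span from its surrounding context, which is the general idea behind span corruption objectives.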

Results

Type	Accuracy
Baseline	5.00 %
Vanilla (no pretraining)	1.40 %
Vanilla (pretraining)	23.20 %
RoPE	38.60 %

Note

Tasks today shall be done today.

Code

Assign 4 Completion
