Content

In Assignment 3, we explore why multi-head self-attention and positional encoding are needed, and then work with minGPT, a small but powerful pre-trained model.

Attention exploration

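For simple single-head self-attention with a query $q$, keys $k_i$ and values $v_i$, the attention weights $\alpha_i$ and the output $c$ are computed as:

$$\alpha_i = \frac{\exp(k_i^\top q)}{\sum_j \exp(k_j^\top q)}, \qquad c = \sum_i \alpha_i v_i$$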

Single head

Suppose that all the keys are orthogonal, i.e., $k_i^\top k_j = 0$ for all $i \neq j$.

  • If the query $q$ is very closely aligned with a key $k_j$, then the attention weight $\alpha_j$ will be very high, and the output vector $c$ will be very similar to the value $v_j$.
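The bullet above can be checked numerically. The following is a minimal sketch in plain Python, assuming unscaled dot-product attention for a single query; the toy keys, values and the helper names (`softmax`, `dot`, `attention`) are illustrative and not taken from the assignment code.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(query, keys, values):
    """Single-head dot-product attention for one query (no scaling)."""
    weights = softmax([dot(key, query) for key in keys])
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output

# Hypothetical toy data: three mutually orthogonal keys and arbitrary values.
keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# A query strongly aligned with the second key.
query = [0.0, 20.0, 0.0]
weights, output = attention(query, keys, values)
print(weights)  # almost all of the weight falls on the second key
print(output)   # very close to the second value vector
```

Because the other keys are orthogonal to the query, their scores are zero and the softmax concentrates on the aligned key.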

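Now suppose we want a query vector that incorporates information from two value vectors $v_a$ and $v_b$, with corresponding keys $k_a$ and $k_b$:

$$q = \beta (k_a + k_b)$$

where $\beta$ is a large positive scalar. With this combination (and keys of equal norm), we can achieve an output $c \approx \frac{1}{2}(v_a + v_b)$.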
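As a rough numerical check of this averaging behaviour, the sketch below builds a query as a large multiple of the sum of two orthogonal unit-norm keys. The data, the value of `beta` and the helper `attend` are assumptions made for illustration only.

```python
import math

def attend(query, keys, values):
    """Single-head dot-product attention output for one query."""
    scores = [sum(k_d * q_d for k_d, q_d in zip(key, query)) for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e / total * v[d] for e, v in zip(exps, values)) for d in range(dim)]

# Hypothetical orthogonal, unit-norm keys; the first two play the roles of k_a and k_b.
keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [7.0, 7.0]]

beta = 20.0  # a large positive scalar
query = [beta * (a + b) for a, b in zip(keys[0], keys[1])]

output = attend(query, keys, values)
print(output)  # close to the average of the first two value vectors
```

The two selected keys receive the same large score, so the softmax splits the weight almost evenly between them and nearly ignores the third.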

  • However, suppose one key vector, let’s say $k_a$, has large variance in its magnitude. When $k_a$ happens to have a large magnitude, the query vector will be very close to the direction of $k_a$, and the attention weight on $k_a$ will be very high. The output vector will then be very similar to $v_a$ instead of the average.

Multi-head

Suppose we have two heads. Each head computes its own attention output as before, and the two outputs are then combined:

  • In this case, even if $k_a$ sometimes reaches a very large magnitude, the only effect is that one head focuses even more sharply on $v_a$. The combined output vector will still be very similar to $\frac{1}{2}(v_a + v_b)$.
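The contrast between one head and two heads can be sketched as follows. This is an illustration under stated assumptions, not the assignment's solution: the queries are built from fixed "mean" keys, the noise on $k_a$ is modelled as a simple rescaling (`scale_a`), and the two heads are combined by averaging. All names and numbers are hypothetical.

```python
import math

def attend(query, keys, values):
    """Single-head dot-product attention output for one query."""
    scores = [sum(k_d * q_d for k_d, q_d in zip(key, query)) for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e / total * v[d] for e, v in zip(exps, values)) for d in range(dim)]

MEAN_KEYS = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
VALUES = [[1.0, 0.0], [0.0, 1.0], [7.0, 7.0]]
BETA = 20.0

def noisy_keys(scale_a):
    # Only the magnitude of the first key (k_a) varies.
    return [[scale_a * x for x in MEAN_KEYS[0]], MEAN_KEYS[1], MEAN_KEYS[2]]

def single_head(scale_a):
    # One query that tries to attend to both k_a and k_b at once.
    query = [BETA * (a + b) for a, b in zip(MEAN_KEYS[0], MEAN_KEYS[1])]
    return attend(query, noisy_keys(scale_a), VALUES)

def two_heads(scale_a):
    # Assumption: head 1 queries for k_a only, head 2 for k_b only,
    # and the final output is the mean of the two head outputs.
    keys = noisy_keys(scale_a)
    head_a = attend([BETA * x for x in MEAN_KEYS[0]], keys, VALUES)
    head_b = attend([BETA * x for x in MEAN_KEYS[1]], keys, VALUES)
    return [(x + y) / 2 for x, y in zip(head_a, head_b)]

for scale_a in (0.5, 1.0, 1.5):
    print(scale_a, single_head(scale_a), two_heads(scale_a))
```

In this toy setup the single-head output swings towards one value or the other as the magnitude of $k_a$ changes, while the two-head average stays near the midpoint of the two values.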

Positional embedding

Note

  • Without positional embeddings, self-attention cannot see token order. If the input is permuted, the attention scores and outputs are simply permuted in the same way; their values do not otherwise change.

This will be problematic because the meaning of a sentence often depends on the order of the words. For example, “The cat sat on the mat” has a different meaning than “On the mat sat the cat”. If the model cannot distinguish between these two sequences due to its permutation invariance, it may fail to capture important syntactic and semantic information necessary for understanding and generating coherent text.
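The permutation behaviour can be verified with a small sketch. It assumes a stripped-down self-attention layer with identity projections (queries, keys and values are all the raw token vectors) and no positional information; the token vectors and the permutation are arbitrary examples.

```python
import math

def self_attention(tokens):
    """Self-attention with identity projections (Q = K = V = tokens)."""
    outputs = []
    for query in tokens:
        scores = [sum(a * b for a, b in zip(query, key)) for key in tokens]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        dim = len(query)
        outputs.append(
            [sum(e / total * v[d] for e, v in zip(exps, tokens)) for d in range(dim)]
        )
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perm = [2, 0, 1]  # new position j holds the token from old position perm[j]
permuted = [tokens[i] for i in perm]

out = self_attention(tokens)
out_perm = self_attention(permuted)

# Each permuted output should match the original output of the same token.
max_diff = max(
    abs(a - b)
    for j in range(len(perm))
    for a, b in zip(out_perm[j], out[perm[j]])
)
print(max_diff)  # only floating-point rounding noise
```

Each token gets the same output vector wherever it appears, so any order-independent summary of the outputs is identical for both orderings.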

Fixed method

Where and
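One common fixed scheme is the sinusoidal encoding from the original Transformer paper. The sketch below assumes that this is the fixed method meant here, which the surviving text does not confirm; the function name, the interleaved sin/cos layout and the base of 10000 are assumptions.

```python
import math

def sinusoidal_encoding(num_positions, dim, base=10000.0):
    """Return a num_positions x dim table of fixed sinusoidal encodings.

    Even columns hold sin(pos / base^(2i/dim)), odd columns the matching cos.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(dim // 2):
            angle = pos / (base ** (2 * i / dim))
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table

table = sinusoidal_encoding(num_positions=8, dim=4)
for pos, row in enumerate(table[:3]):
    print(pos, [round(x, 4) for x in row])
```

The table is computed once and added to the token embeddings. Since it has no trainable parameters, it can be generated for any sequence length.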

Learnable method

In some cases, we can just set the positional encoding as a learnable parameter.

Rotary Positional Encoding

Where $t$ is the position index and $\theta_i$ is the angle:

Eq. (6) and Eq. (7) above are equivalent. This positional embedding is a relative method: the attention score between two tokens depends only on the offset between their positions.
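The relative property can be illustrated with a small stdlib-only sketch. It assumes the common RoPE formulation in which consecutive pairs of dimensions are rotated by `position * theta_i` with `theta_i = base ** (-2 * i / dim)` (0-based `i`); the pairing convention and the function name `rope` are assumptions and may differ from the assignment code.

```python
import math

def rope(x, position, base=10000.0):
    """Rotate each pair (x[2i], x[2i+1]) by the angle position * theta_i."""
    dim = len(x)
    if dim % 2 != 0:
        raise ValueError("dimension must be even")
    out = []
    for i in range(dim // 2):
        theta = base ** (-2 * i / dim)
        angle = position * theta
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        a, b = x[2 * i], x[2 * i + 1]
        out.extend([a * cos_a - b * sin_a, a * sin_a + b * cos_a])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Arbitrary example query and key vectors.
q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]

# Same offset (3) at two different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_shifted = dot(rope(q, 105), rope(k, 102))
print(score_near, score_shifted)  # equal up to floating-point error
```

Rotating the query by an angle proportional to its position and the key by an angle proportional to its own position leaves a dot product that depends only on the difference of the two angles.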

Language pre-trained model

In this part, we have several tasks to do. All the details are in the Code part.

  1. Implement finetuning and make predictions;
  2. Define a span corruption function;
  3. Pretrain, finetune and make predictions;
  4. Implement the RoPE function;
  5. Pretrain, finetune and make predictions again, this time with RoPE.
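To give a feel for step 2, here is a simplified span corruption sketch. It is not the assignment's reference implementation: the mask and padding characters, the truncation range and the span-length rule are all hypothetical choices made for illustration.

```python
import random

MASK_CHAR = "\u2047"  # hypothetical mask symbol
PAD_CHAR = "\u25a1"   # hypothetical padding symbol

def corrupt_span(text, block_size, rng):
    """Return (input, target) strings of length block_size.

    The text is truncated, one span is cut out and moved to the end, and the
    result is laid out as: prefix MASK suffix MASK span PAD...
    """
    if block_size < 8 or len(text) < 4:
        raise ValueError("this sketch assumes block_size >= 8 and len(text) >= 4")
    # 1. Truncate to a random length so the model sees varied contexts.
    doc = text[: rng.randint(4, max(4, block_size * 7 // 8))]
    # 2. Pick a random span to hide.
    span_len = rng.randint(1, max(1, len(doc) // 2))
    start = rng.randint(0, len(doc) - span_len)
    prefix = doc[:start]
    masked = doc[start:start + span_len]
    suffix = doc[start + span_len:]
    # 3. Rearrange and pad to block_size + 1 characters.
    sequence = prefix + MASK_CHAR + suffix + MASK_CHAR + masked
    sequence += PAD_CHAR * (block_size + 1 - len(sequence))
    # 4. Input and target are the same sequence shifted by one character.
    return sequence[:-1], sequence[1:]

rng = random.Random(0)
text = "The cat sat on the mat and purred."
x, y = corrupt_span(text, block_size=32, rng=rng)
print(repr(x))
print(repr(y))
```

Moving the hidden span to the end lets an ordinary left-to-right language model learn to fill in the gap using context from both sides of it.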

Results

Type                       Accuracy
Baseline                   5.00 %
Vanilla (no pretraining)   1.40 %
Vanilla (pretraining)      23.20 %
RoPE                       38.60 %

Note

  1. Tasks today shall be done today.

Code

Assign 4 Completion