Content
In Assignment 3, we explore why multi-head self-attention and positional encoding are used, and we work with minGPT, a small but powerful pre-trained model.
Attention exploration
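For simple single-head self-attention with a query $q$, keys $k_i$, and values $v_i$, we can use the following formulas to calculate the attention weights $\alpha_i$ and the output $c$:

$$\alpha_i = \frac{\exp(k_i^\top q)}{\sum_j \exp(k_j^\top q)}, \qquad c = \sum_i \alpha_i v_i$$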
Single head
Suppose that all the keys are orthogonal, i.e., $k_i^\top k_j = 0$ for all $i \neq j$.
- If $q$ is very closely aligned with $k_j$, then the attention weight $\alpha_j$ will be very high, and the output vector $c$ will be very similar to $v_j$.
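Now suppose we want a query vector $q$ that incorporates information from two value vectors $v_a$ and $v_b$, with corresponding keys $k_a$ and $k_b$:

$$q = \beta (k_a + k_b)$$

where $\beta$ is a large positive scalar. With this combination, we obtain an output $c \approx \frac{1}{2}(v_a + v_b)$.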
- However, suppose one key vector, let's say $k_a$, has large variance in its magnitude. When $k_a$ happens to have a large magnitude, the query is aligned much more strongly with $k_a$ than with $k_b$, so the attention weight $\alpha_a$ will be very high and the output vector $c$ will be very similar to $v_a$ instead of the average.
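The single-head behaviour described above can be checked numerically. The following is a minimal sketch in plain Python, not the assignment's code: the helper `attend`, the toy keys and values, and `beta = 20` are all illustrative assumptions. It shows the query $\beta(k_a + k_b)$ averaging two values when the keys have equal length, and collapsing onto one value when one key grows longer.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(q, keys, values):
    """Single-query attention: returns (c, alphas) with c = sum_i alpha_i * v_i."""
    alphas = softmax([dot(q, k) for k in keys])
    dim = len(values[0])
    c = [sum(a * v[d] for a, v in zip(alphas, values)) for d in range(dim)]
    return c, alphas

# Orthogonal unit keys and toy values (illustrative, not from the assignment).
k_a, k_b, k_c = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
v_a, v_b, v_c = [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]
beta = 20.0  # assumed "large positive scalar"

# Query that targets both k_a and k_b.
q = [beta * (x + y) for x, y in zip(k_a, k_b)]
c_avg, alphas_avg = attend(q, [k_a, k_b, k_c], [v_a, v_b, v_c])

# Same query, but k_a is now 1.5 times longer: the softmax tips towards v_a.
k_a_big = [1.5 * x for x in k_a]
c_skewed, alphas_skewed = attend(q, [k_a_big, k_b, k_c], [v_a, v_b, v_c])
```

With equal-length keys, `c_avg` is close to the average of `v_a` and `v_b`; with the longer key, almost all of the weight moves to `v_a`.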
Multi-head
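Suppose we have two heads. We can give each head its own query and average the two head outputs $c_1$ and $c_2$ to combine the two values as before:

$$q_1 = \beta k_a, \qquad q_2 = \beta k_b, \qquad c = \frac{1}{2}(c_1 + c_2)$$

- In this case, even if $k_a$ sometimes reaches a very large magnitude, the only effect is that the first head focuses even more sharply on $v_a$, so $c_1 \approx v_a$ and $c_2 \approx v_b$. The output vector $c$ will still be very similar to $\frac{1}{2}(v_a + v_b)$.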
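The two-head argument can be sketched the same way. This is an illustrative example under assumed toy values, with one query per head and the final output taken as the average of the two head outputs; `two_head_output` is a hypothetical helper, not part of the assignment.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Single-query attention output c = sum_i alpha_i * v_i."""
    alphas = softmax([sum(a * b for a, b in zip(q, k)) for k in keys])
    dim = len(values[0])
    return [sum(a * v[d] for a, v in zip(alphas, values)) for d in range(dim)]

def two_head_output(scale, beta=20.0):
    """Average of two heads when k_a is stretched by `scale` (toy setup)."""
    k_a = [scale, 0.0, 0.0]          # key whose magnitude varies
    k_b, k_c = [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
    v_a, v_b, v_c = [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]
    keys, values = [k_a, k_b, k_c], [v_a, v_b, v_c]
    q1 = [beta, 0.0, 0.0]            # head 1 looks for the k_a direction
    q2 = [0.0, beta, 0.0]            # head 2 looks for the k_b direction
    c1 = attend(q1, keys, values)
    c2 = attend(q2, keys, values)
    return [(x + y) / 2 for x, y in zip(c1, c2)]

# The averaged output stays near (v_a + v_b) / 2 for every scale tried.
outputs = {scale: two_head_output(scale) for scale in (0.5, 1.0, 1.5)}
```

Because each head's softmax only has to pick out a single key, changing the length of `k_a` makes head 1 sharper or softer but does not shift weight between `v_a` and `v_b`.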
Positional embedding
Note
- Without positional embeddings, permuting the input does not change the attention scores themselves. If the input is permuted, the attention scores, and therefore the outputs, are simply permuted in the same way.
This will be problematic because the meaning of a sentence often depends on the order of the words. For example, "The cat sat on the mat" has a different meaning than "On the mat sat the cat". If the model cannot distinguish between these two sequences due to its permutation invariance, it may fail to capture important syntactic and semantic information necessary for understanding and generating coherent text.
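The permutation property can be illustrated with a small sketch. It assumes the simplest possible self-attention, with identity projections so that each token vector serves as its own query, key, and value; the token vectors are made up for the example.

```python
import math

def self_attention(xs):
    """Self-attention with identity projections (q_i = k_i = v_i = x_i)."""
    dim = len(xs[0])
    outputs = []
    for q in xs:
        scores = [sum(a * b for a, b in zip(q, k)) for k in xs]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        outputs.append([sum(e / z * v[d] for e, v in zip(exps, xs))
                        for d in range(dim)])
    return outputs

# Toy token embeddings without any positional information.
tokens = {"the": [1.0, 0.0], "cat": [0.0, 1.0], "sat": [1.0, 1.0]}
order_1 = ["the", "cat", "sat"]
order_2 = ["sat", "the", "cat"]

# Map each word to its output vector under both orderings.
out_1 = dict(zip(order_1, self_attention([tokens[w] for w in order_1])))
out_2 = dict(zip(order_2, self_attention([tokens[w] for w in order_2])))
```

Each word receives the same output vector in both orderings, so nothing in the outputs tells the model which word came first.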
Fixed method
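The fixed method uses sinusoidal functions of the position:

$$PE_{(t,\,2i)} = \sin\left(\frac{t}{10000^{2i/d}}\right), \qquad PE_{(t,\,2i+1)} = \cos\left(\frac{t}{10000^{2i/d}}\right)$$

where $t$ is the position index and $i \in \{0, \dots, d/2 - 1\}$ indexes pairs of dimensions of the $d$-dimensional embedding.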
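As a concrete illustration, here is a minimal sketch of the standard sinusoidal encoding from the original Transformer, which is assumed here to be the fixed method meant above. The function name `sinusoidal_encoding` and the interleaved sin/cos layout are choices made for this example.

```python
import math

def sinusoidal_encoding(t, d):
    """Fixed positional encoding for position t and an even embedding size d.

    Dimension 2i holds sin(t / 10000^(2i/d)) and dimension 2i+1 holds the
    matching cosine (interleaved layout, assumed for this sketch).
    """
    if d % 2 != 0:
        raise ValueError("d must be even")
    pe = []
    for i in range(d // 2):
        angle = t / (10000 ** (2 * i / d))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

# Encodings for the first few positions of a 4-dimensional embedding.
table = [sinusoidal_encoding(t, 4) for t in range(3)]
```

The encoding has no trainable parameters, so it can be computed once for every position and added to the token embeddings.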
Learnable method
In some cases, we can simply make the positional encoding a learnable parameter that is trained together with the rest of the model.
Rotary Positional Encoding
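RoPE rotates each pair of embedding dimensions by an angle that depends on the position. For the $k$-th pair it can be written either as a rotation matrix or as a complex multiplication:

$$\text{RoPE}\left(x_t^{(2k-1)}, x_t^{(2k)}, t\right) = \begin{pmatrix} \cos t\theta_k & -\sin t\theta_k \\ \sin t\theta_k & \cos t\theta_k \end{pmatrix} \begin{pmatrix} x_t^{(2k-1)} \\ x_t^{(2k)} \end{pmatrix} \tag{6}$$

$$\text{RoPE}\left(x_t^{(2k-1)}, x_t^{(2k)}, t\right) = \left(x_t^{(2k-1)} + \mathrm{i}\, x_t^{(2k)}\right) e^{\mathrm{i} t \theta_k} \tag{7}$$

where $t$ is the position index and $\theta_k$ is the angle:

$$\theta_k = 10000^{-2(k-1)/d}, \qquad k \in \{1, \dots, d/2\}$$

Equations (6) and (7) above are equivalent, and this positional embedding is a relative method: the dot product between a rotated query and a rotated key depends only on the difference between their positions.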
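Both claims can be checked numerically. The sketch below assumes that the two equivalent forms are the 2x2 rotation-matrix form and the complex-multiplication form of RoPE, that adjacent dimensions are paired, and that the angles follow the usual $10000^{-2(k-1)/d}$ schedule. The function names and test vectors are hypothetical.

```python
import cmath
import math

def rope_angles(d, base=10000.0):
    # One angle per pair of dimensions (0-indexed pair number k).
    return [base ** (-2 * k / d) for k in range(d // 2)]

def rope_rotation(x, t):
    """Rotate each adjacent pair (x[2k], x[2k+1]) with a 2x2 rotation matrix."""
    out = []
    for k, theta in enumerate(rope_angles(len(x))):
        x1, x2 = x[2 * k], x[2 * k + 1]
        c, s = math.cos(t * theta), math.sin(t * theta)
        out.extend([c * x1 - s * x2, s * x1 + c * x2])
    return out

def rope_complex(x, t):
    """Same rotation, written as (x1 + i*x2) * exp(i * t * theta)."""
    out = []
    for k, theta in enumerate(rope_angles(len(x))):
        z = complex(x[2 * k], x[2 * k + 1]) * cmath.exp(1j * t * theta)
        out.extend([z.real, z.imag])
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q = [0.3, -1.2, 0.7, 2.0]   # arbitrary query vector
k = [1.1, 0.4, -0.5, 0.9]   # arbitrary key vector

# Same relative offset (2) at two very different absolute positions.
score_near = dot(rope_rotation(q, 5), rope_rotation(k, 3))
score_shifted = dot(rope_rotation(q, 105), rope_rotation(k, 103))
```

`rope_rotation` and `rope_complex` agree up to floating-point error, and `score_near` matches `score_shifted` because only the offset between the two positions enters the dot product.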
Language pre-trained model
In this part, we have several tasks to do. All the details are in the Code part.
- Implement finetuning and make predictions;
- Define a span corruption function;
- Pretrain, finetune and make predictions;
- Implement the RoPE function;
- Pretrain, finetune and make predictions with RoPE.
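To make the span corruption step more concrete, here is a minimal character-level sketch of one common form of it. It is not the assignment's implementation: the mask and pad characters, the fixed span length of a quarter of the text, and the `prefix + MASK + suffix + MASK + masked` layout are all assumptions made for illustration.

```python
import random

MASK_CHAR = "\u2047"  # hypothetical mask character
PAD_CHAR = "\u25a1"   # hypothetical padding character

def span_corrupt(text, block_size, rng):
    """Cut one span out of `text` and move it to the end.

    Returns (corrupted, masked), where corrupted has exactly block_size
    characters laid out as: prefix, MASK, suffix, MASK, masked span, padding.
    """
    text = text[: block_size - 2]          # leave room for two mask characters
    span_len = max(1, len(text) // 4)      # assumed span length
    start = rng.randint(0, len(text) - span_len)
    prefix = text[:start]
    masked = text[start : start + span_len]
    suffix = text[start + span_len :]
    corrupted = prefix + MASK_CHAR + suffix + MASK_CHAR + masked
    corrupted += PAD_CHAR * (block_size - len(corrupted))
    return corrupted, masked

rng = random.Random(0)  # seeded so the example is repeatable
sentence = "The cat sat on the mat"
corrupted, masked = span_corrupt(sentence, block_size=32, rng=rng)
```

During pretraining, a model trained on such strings has to predict the moved span from the surrounding context, which is the part that carries over to the finetuning task.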
Results
| Type | Accuracy |
|---|---|
| Baseline | 5.00 % |
| Vanilla (no pretraining) | 1.40 % |
| Vanilla (pretraining) | 23.20 % |
| RoPE | 38.60 % |
Note
- Tasks today shall be done today.