Content

In Assignment 2, we build a neural machine translation (NMT) model using an RNN with attention, then analyze NMT systems.

Model structure

seq2seq

Word embedding layer
Source CNN layer: 1D convolution along the sentence-length dimension, kernel size = 2
Encoder: bidirectional LSTM

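$$h_{i}^{enc} = [\overleftarrow{h_{i}^{enc}}; \overrightarrow{h_{i}^{enc}}] \quad \text{where } h_{i}^{enc} \in \mathbb{R}^{2h \times 1},\ \overleftarrow{h_{i}^{enc}}, \overrightarrow{h_{i}^{enc}} \in \mathbb{R}^{h \times 1},\ i \in [1, m] \tag{1}$$

$$c_{i}^{enc} = [\overleftarrow{c_{i}^{enc}}; \overrightarrow{c_{i}^{enc}}] \quad \text{where } c_{i}^{enc} \in \mathbb{R}^{2h \times 1},\ \overleftarrow{c_{i}^{enc}}, \overrightarrow{c_{i}^{enc}} \in \mathbb{R}^{h \times 1},\ i \in [1, m] \tag{2}$$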

Decoder: LSTM

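$$h_{0}^{dec} = W_{h} [\overleftarrow{h_{1}^{enc}}; \overrightarrow{h_{m}^{enc}}] \quad \text{where } h_{0}^{dec} \in \mathbb{R}^{h \times 1},\ W_{h} \in \mathbb{R}^{h \times 2h} \tag{3}$$

$$c_{0}^{dec} = W_{c} [\overleftarrow{c_{1}^{enc}}; \overrightarrow{c_{m}^{enc}}] \quad \text{where } c_{0}^{dec} \in \mathbb{R}^{h \times 1},\ W_{c} \in \mathbb{R}^{h \times 2h} \tag{4}$$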
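The decoder initialization can be sketched in a few lines. This is a minimal illustration with plain Python lists rather than the assignment's actual tensor code. It assumes the usual bidirectional convention that the backward LSTM finishes at position 1 and the forward LSTM at position m. The names `matvec` and `init_decoder_state` are hypothetical helpers introduced only for this example.

```python
def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def init_decoder_state(backward_states, forward_states, W):
    """backward_states / forward_states: m vectors of length h; W: h x 2h.

    Assumption: the backward pass ends at position 1 and the forward pass at
    position m, so those two states summarize the whole source sentence.
    """
    summary = backward_states[0] + forward_states[-1]  # list concat, length 2h
    return matvec(W, summary)

# Toy sizes: h = 2, m = 3.
backward = [[1.0, 2.0], [0.0, 0.0], [0.0, 0.0]]
forward = [[0.0, 0.0], [0.0, 0.0], [3.0, 4.0]]
W_h = [[1.0, 0.0, 1.0, 0.0],
       [0.0, 1.0, 0.0, 1.0]]
h0_dec = init_decoder_state(backward, forward, W_h)
print(h0_dec)
```

The same function applies to the cell state by passing the cell vectors and W_c instead.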

At step $t$, we look up the embedding of the target word $y_{t}$ and concatenate it with the previous combined output vector $o_{t-1} \in \mathbb{R}^{h \times 1}$ to produce the decoder input $\overline{y_{t}} \in \mathbb{R}^{(h+e) \times 1}$, where $e$ is the embedding size.

$$h_{t}^{dec}, c_{t}^{dec} = \text{Decoder}(\overline{y_{t}}, h_{t-1}^{dec}, c_{t-1}^{dec}) \quad \text{where } h_{t}^{dec}, c_{t}^{dec} \in \mathbb{R}^{h \times 1} \tag{5}$$

Attention: multiplicative attention

$$e_{t,i} = (h_{t}^{dec})^{\top} W_{attProj} h_{i}^{enc} \quad \text{where } e_{t} \in \mathbb{R}^{m \times 1},\ W_{attProj} \in \mathbb{R}^{h \times 2h},\ i \in [1, m] \tag{6}$$

$$\alpha_{t} = \text{softmax}(e_{t}) \quad \text{where } \alpha_{t} \in \mathbb{R}^{m \times 1} \tag{7}$$

$$a_{t} = \sum_{i=1}^{m} \alpha_{t,i} h_{i}^{enc} \quad \text{where } a_{t} \in \mathbb{R}^{2h \times 1} \tag{8}$$
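The three attention steps (scores, softmax, weighted sum) can be traced on a toy example. The sketch below uses plain Python lists so it runs without dependencies. It is not the assignment's PyTorch code, and `multiplicative_attention` and its helpers are hypothetical names chosen for illustration.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def multiplicative_attention(h_dec, enc_hiddens, W_att_proj):
    """h_dec: length h; enc_hiddens: m vectors of length 2h; W_att_proj: h x 2h."""
    # Score for each source position i: h_dec^T (W_attProj h_i^enc).
    e_t = [dot(h_dec, matvec(W_att_proj, h_i)) for h_i in enc_hiddens]
    alpha_t = softmax(e_t)
    # Attention output: weighted sum of encoder states, a vector of length 2h.
    a_t = [sum(alpha * h_i[d] for alpha, h_i in zip(alpha_t, enc_hiddens))
           for d in range(len(enc_hiddens[0]))]
    return alpha_t, a_t

# Toy sizes: h = 2 (encoder states have 2h = 4 entries), m = 3 source positions.
h_dec = [1.0, 0.0]
enc_hiddens = [[1.0, 0.0, 0.0, 0.0],
               [0.0, 1.0, 0.0, 0.0],
               [3.0, 0.0, 0.0, 0.0]]
W_att_proj = [[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]]
alpha_t, a_t = multiplicative_attention(h_dec, enc_hiddens, W_att_proj)
print(alpha_t, a_t)
```

With these toy values the raw scores are 1, 0, and 3, so the third source position receives the largest attention weight.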

Combined output

$$u_{t} = [a_{t}; h_{t}^{dec}] \quad \text{where } u_{t} \in \mathbb{R}^{3h \times 1} \tag{9}$$

$$v_{t} = W_{u} u_{t} \quad \text{where } v_{t} \in \mathbb{R}^{h \times 1},\ W_{u} \in \mathbb{R}^{h \times 3h} \tag{10}$$

$$o_{t} = \text{dropout}(\tanh(v_{t})) \quad \text{where } o_{t} \in \mathbb{R}^{h \times 1} \tag{11}$$

Output layer

$$P_{t} = \text{softmax}(W_{vocab} o_{t}) \quad \text{where } P_{t} \in \mathbb{R}^{V \times 1},\ W_{vocab} \in \mathbb{R}^{V \times h} \tag{12}$$
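To make the shapes concrete, here is a small sketch of the combined output and the output layer using plain Python lists. Dropout is left out, which corresponds to evaluation mode where it acts as the identity. The function names are hypothetical and the weights are toy values, not trained parameters.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def combined_output(a_t, h_dec, W_u):
    """a_t: length 2h; h_dec: length h; W_u: h x 3h."""
    u_t = a_t + h_dec                    # list concat, length 3h
    v_t = matvec(W_u, u_t)               # project back down to length h
    return [math.tanh(x) for x in v_t]   # dropout omitted (assumed eval mode)

def vocab_distribution(o_t, W_vocab):
    """o_t: length h; W_vocab: V x h. Returns a probability per vocabulary entry."""
    return softmax(matvec(W_vocab, o_t))

# Toy sizes: h = 1, so a_t has 2 entries and u_t has 3; vocabulary size V = 3.
a_t = [0.5, -0.5]
h_dec = [1.0]
W_u = [[1.0, 1.0, 1.0]]
W_vocab = [[2.0], [0.0], [-2.0]]
o_t = combined_output(a_t, h_dec, W_u)
P_t = vocab_distribution(o_t, W_vocab)
print(o_t, P_t)
```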

Results

BLEU Score: 20.11
Epochs: 5, patience: 1; all other hyperparameters are left at their defaults.

Attention score

$$\text{Dot product: } \alpha = q^{\top} k \qquad \text{Multiplicative: } \alpha = q^{\top} W_{k} k \qquad \text{Additive: } \alpha = W_{v} \tanh(W_{k} k + W_{q} q) \tag{13}$$

How to choose?

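Dot product: simple and fast, but it has no learned parameters, so it captures less complex interactions between query and key.
Multiplicative: works well for long sentences; slightly slower than dot product because of the learned matrix, but still efficient.
Additive: good at capturing complex interactions, but slow and expensive to compute.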
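The three scoring functions can be compared side by side. The sketch below uses plain Python lists and treats the additive form's W_v as a single weight vector `w_v`, which is an assumption made so that the score comes out as a scalar. All function names are hypothetical.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dot_product_score(q, k):
    # No parameters; q and k must have the same length.
    return dot(q, k)

def multiplicative_score(q, k, W_k):
    # W_k maps k into q's space, so q and k may differ in length.
    return dot(q, matvec(W_k, k))

def additive_score(q, k, W_q, W_k, w_v):
    # A one-hidden-layer network: project both, add, squash, then reduce.
    hidden = [math.tanh(a + b) for a, b in zip(matvec(W_k, k), matvec(W_q, q))]
    return dot(w_v, hidden)

identity = [[1.0, 0.0], [0.0, 1.0]]
q = [1.0, 2.0]
k = [3.0, 4.0]
s_dot = dot_product_score(q, k)
s_mul = multiplicative_score(q, k, identity)
s_add = additive_score(q, k, identity, identity, [1.0, 1.0])

# Multiplicative attention with a key of a different size (3 instead of 2).
k3 = [1.0, 0.0, 1.0]
W_k3 = [[1.0, 0.0, 0.0],
        [0.0, 1.0, 1.0]]
s_mul3 = multiplicative_score(q, k3, W_k3)
print(s_dot, s_mul, s_add, s_mul3)
```

With an identity matrix the multiplicative score reduces to the dot product, which shows that the dot product is the special case with nothing to learn.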

Analysis of NMT

1D CNN

Adding a 1D Convolutional layer can help capture local patterns and relationships between adjacent characters in the input sequence. This is particularly useful in Mandarin Chinese, where characters can combine to form words or morphemes. The convolutional layer can learn to recognize these combinations and extract relevant features, which can then be passed to the bidirectional encoder for further processing.
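A kernel of size 2 means each output position mixes a character embedding with its right neighbor. The sketch below illustrates this with plain Python lists. It assumes one zero vector is padded at the end so the output keeps length m, which may differ from the padding used in the actual model. `conv1d_k2` is a hypothetical name, and the kernel is split into `W_left` and `W_right` for readability.

```python
def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def conv1d_k2(embeddings, W_left, W_right, bias):
    """embeddings: m vectors of length e; W_left, W_right: f x e; bias: length f.

    Returns m vectors of length f. Output t depends on positions t and t + 1.
    """
    e = len(embeddings[0])
    padded = embeddings + [[0.0] * e]  # assumption: zero-pad one step at the end
    outputs = []
    for t in range(len(embeddings)):
        left = matvec(W_left, padded[t])
        right = matvec(W_right, padded[t + 1])
        outputs.append([l + r + b for l, r, b in zip(left, right, bias)])
    return outputs

# Toy sizes: m = 3 characters, embedding size e = 2, output channels f = 2.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
features = conv1d_k2(embeddings, identity, identity, [0.0, 0.0])
print(features)
```

With identity weights each output is simply the sum of two adjacent embeddings, so a two-character word produces a feature that neither character produces alone.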

Improvement

More data exposure: let the model learn from more training data.
More complex model: more layers or more attention heads, so the model can capture more complex patterns.
Advanced mechanisms: Transformer architectures, self-attention, etc.
Other training techniques: more epochs, adaptive learning rate, more dropout, etc.

BLEU

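Suppose we have $k$ reference sentences $r_{1}, \dots, r_{k}$ and a candidate sentence $c$, and we want to calculate the BLEU score of $c$ given the references. Here we use n-grams up to size 4, and $\lambda_{n}$ is the weight of the $n$-gram precision $p_{n}$. In the brevity penalty, $len(c)$ is the length of the candidate and $len(r)$ is the length of the reference closest in length to $c$.

$$p_{n} = \frac{\sum_{ngram \in c} \min\left(\max_{i = 1, \dots, k} Count_{r_{i}}(ngram),\ Count_{c}(ngram)\right)}{\sum_{ngram \in c} Count_{c}(ngram)} \tag{14}$$

$$BP = \begin{cases} 1, & \text{if } len(c) \geq len(r) \\ \exp\left(1 - \frac{len(r)}{len(c)}\right), & \text{otherwise} \end{cases} \tag{15}$$

$$BLEU = BP \times \exp\left(\sum_{n=1}^{4} \lambda_{n} \log p_{n}\right) \tag{16}$$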
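The formulas above can be checked by hand on a short sentence. The following is a minimal standard-library sketch, not a replacement for an established BLEU implementation. The function names are hypothetical, and returning 0 when any precision is 0 is a choice made here to avoid taking the log of zero.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Clipped n-gram precision p_n of the candidate against all references."""
    cand = ngram_counts(candidate, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    refs = [ngram_counts(ref, n) for ref in references]
    clipped = sum(min(max(rc[gram] for rc in refs), count)
                  for gram, count in cand.items())
    return clipped / total

def brevity_penalty(candidate, references):
    c = len(candidate)
    if c == 0:
        return 0.0
    # Reference length closest to c; the shorter one wins a tie.
    r = min((len(ref) for ref in references), key=lambda n: (abs(n - c), n))
    return 1.0 if c >= r else math.exp(1 - r / c)

def bleu(candidate, references, weights):
    """weights[n - 1] is lambda_n; len(weights) sets the largest n-gram size."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, len(weights) + 1)]
    if min(precisions) == 0.0:
        return 0.0  # sketch-level choice: log(0) is undefined
    log_sum = sum(w * math.log(p) for w, p in zip(weights, precisions))
    return brevity_penalty(candidate, references) * math.exp(log_sum)

reference = "the cat is on the mat".split()
candidate = "the cat sat on the mat".split()
# Toy setting with unigrams and bigrams only: p_1 = 5/6 and p_2 = 3/5.
score = bleu(candidate, [reference], [0.5, 0.5])
print(score)
```

In this toy case five of the six candidate unigrams and three of the five bigrams appear in the reference, and the lengths match, so the score is the geometric mean of 5/6 and 3/5. Passing `[0.25] * 4` as the weights gives the 4-gram setting described above.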

The BLEU score is not a perfect metric for evaluating NMT translations. It is useful for comparing systems against the same fixed set of reference translations, but it only rewards n-gram overlap with those references, so a valid translation that is worded differently from every reference is penalized.

Advantages

It is fast and cheap to compute, as it does not require human annotators.
It is objective and reproducible, as it is based on a fixed set of reference translations.

Disadvantages

Surface n-gram overlap only: poor at judging meaning/adequacy, paraphrases, and long-range reordering; weak at the sentence level.
Precision-heavy with a simple brevity penalty: can be gamed and struggles with morphology/rarer words and semantic correctness.

Note

In this assignment, we build the NMT model from scratch. I think it's a good way to understand the whole picture of the model structure through both the math and the code.

I also see some limitations of NMT evaluation from this assignment. For example, the BLEU score is a useful metric for comparing translations against the same references, but it is not a perfect measure of translation quality.

Code

Assign 3 Completion
