Grading
Info
- Simple: 0.60824
- Medium: 0.70375
- strong: 0.77750
- Boss: 0.86500
Solution
| Model | HyperParameters | Train acc | Valid acc | Public Score | Private Score | Timer |
|---|---|---|---|---|---|---|
| all | 1-5 | 1.00 | 0.9306 | 0.84325 | 0.84200 | ~35 m |
| all | 1-6 head 8 | 1.00 | 0.9460 | 0.83675 | 0.83550 | … |
| all | 1-5 + 7, 1w steps | 1.00 | 0.9140 | 0.86250 | 0.86050 | 47 m |
| all | 1-7, 1w steps | 1.00 | 0.9325 | 0.83850 | 0.83325 | 138 m |
| all | 1-5 + 7 | 1.00 | 0.9085 | 0.86400 | 0.86100 | 36 m |
| all | 1-5 + 7, 1w steps | 1.00 | 0.9140 | 0.86250 | 0.86050 | 47 m |
| all | 1-5 + 8 | 1.00 | 0.9384 | 0.87150 | 0.87325 | 35 m |
| all | 1-5 + 8, 3 layers | 1.00 | 0.9433 | 0.88125 | 0.87900 | 47 m |
Techniques
- d_model = 80 ⇒ 256
- drop_out = 0.1 ⇒ 0.3
- Conformer Layer
from torchaudio.models import Conformer ...... self.conformer = Conformer( input_dim=d_model, num_heads=4, ffn_dim=512, num_layers=2, depthwise_conv_kernel_size=31, dropout=dropout ) - self attention pooling
class SelfAttentionPooling(nn.Module): def __init__(self, input_dim): super().__init__() self.attention = nn.Linear(input_dim, 1, bias=False) def forward(self, batch_rep, mask=None): # scores: (N, T) scores = self.attention(batch_rep).squeeze(-1) if mask is not None: scores = scores.masked_fill(~mask.bool(), float("-inf")) att_w = F.softmax(scores, dim=1) # (N, T) utter_rep = torch.bmm(att_w.unsqueeze(1), batch_rep).squeeze(1) # (N, H) return utter_rep - ASMsoftmax
class AMSoftmaxCE(nn.Module): def __init__(self, in_features, n_classes, s=30.0, m=0.35): super().__init__() self.W = nn.Linear(in_features, n_classes, bias=False) self.s, self.m = float(s), float(m) def forward(self, x, y=None): x = F.normalize(x, dim=-1) W = F.normalize(self.W.weight, dim=-1) logits = F.linear(x, W) # cos logits = logits * self.s if y is None: return logits logits[torch.arange(x.size(0)), y] -= self.s * self.m return F.cross_entropy(logits, y) - batch_size = 32 ⇒ 128
- concat out.max + batchnorm(out) + dropout
class Classifier(nn.Module): ...... def forward(self, x): out = self.prenet(mels) # (batch_size, length, d_model) out, _ = self.conformer(out, torch.full((out.shape[0],),out.shape[1]).cuda()) # self attention pooling (batch_size, d_model) stats = self.pooling(out) out = out.max(dim=1)[0] # batchnorm out = self.bn(torch.cat([stats, out], dim=1)) out = self.drop(out) out = self.classier(out) - batchnorm(out) only
Reflection
Note
- As presented in the intro part of this homework by AT, I just adopted the idea of using conformer, self attention pooling and ASMsoftmax to solve the problem. Surprisingly, the result is very good. Then I tried to use more layers, more heads and other teches, but the result is not ideal, even worse.
- Then I thought maybe because self attention pooling dropped too much information, so I decided to calculate the self attention pooling and max pooling after conformer layers, and concatenate them together and feed into the ASMsoftmax. Finally I got 0.86 which is so close to boss line.
- When I increased the batch size from 32 to 128, the test accuracy increased, but the score decreased with much more time of training.
- Finally, I just realized that why I combined max pooling and self attention pooling together. So I deleted the max pooling and got the score 0.87 over boss line. Then I tried to plus one more conformer layer and got the score 0.88!!!
- I think the model is kind of good already if just using strategy 1 to 5, and only plus batch norm layer after the self attention pooling, we will get over the boss line easily within 35 mins. And with one more conformer layer, we can get 0.88.
- Next time: focus on boss line strategy; use one more change one time; then record.
Not Help
- batch size
- train steps
- more complexity (as 4 conformer layers will get around 0.865)