Tip
Computational Complexity
- Terminology:
  - MACs = multiply-accumulate operations.
  - FLOPs = floating-point operations. Many sources treat 1 MAC ≈ 2 FLOPs (one multiply + one add). Accounting conventions differ, so state your convention when comparing.
  In most modern packages, MACs are the more commonly reported metric.
- Output size for 2D conv:
  H_out = floor((H + 2p - k_h)/s) + 1, W_out = floor((W + 2p - k_w)/s) + 1.
- Typical forward-pass MACs (inference):
  - Linear: MACs ≈ d_in * d_out
  - Conv2d: MACs ≈ H_out * W_out * C_out * (k_h * k_w * C_in)
  - Grouped conv: divide the Conv2d count by groups
  - Depthwise separable:
    MACs_depthwise ≈ H_out * W_out * (k_h * k_w * C_in)
    MACs_pointwise ≈ H_out * W_out * (C_in * C_out)
  - Self-Attention (seq n, dim d, heads h): ≈ 4*n*d^2 (projections) + 2*n^2*d (attn scores & apply V)
  - Transformer FFN (d→4d→d): ≈ 8*n*d^2
- Training usually costs ≈ 2–3× the forward MACs.
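The MAC formulas above can be sketched in plain Python. This is only an illustration under stated assumptions, not a profiler: the helper names (`conv2d_out`, `conv2d_macs`, `attention_macs`, `ffn_macs`) are hypothetical, padding and stride are assumed equal in both spatial dimensions, and bias adds, softmax, and activations are not counted.

```python
# Minimal sketch of the MAC formulas (hypothetical helper names).
# Assumptions: same padding/stride for height and width; bias, softmax
# and activation costs are ignored.

def conv2d_out(size: int, k: int, p: int = 0, s: int = 1) -> int:
    """Output length along one spatial dim: floor((size + 2p - k)/s) + 1."""
    return (size + 2 * p - k) // s + 1


def conv2d_macs(h, w, c_in, c_out, k_h, k_w, p=0, s=1, groups=1):
    """H_out * W_out * C_out * (k_h * k_w * C_in), divided by groups."""
    h_out = conv2d_out(h, k_h, p, s)
    w_out = conv2d_out(w, k_w, p, s)
    return h_out * w_out * c_out * (k_h * k_w * c_in) // groups


def attention_macs(n: int, d: int) -> int:
    """4*n*d^2 for the Q/K/V/output projections + 2*n^2*d for scores and V."""
    return 4 * n * d * d + 2 * n * n * d


def ffn_macs(n: int, d: int) -> int:
    """Two linear layers d -> 4d -> d applied to n tokens: 8*n*d^2."""
    return 8 * n * d * d


conv_macs = conv2d_macs(224, 224, 3, 64, 3, 3, p=1)
print(f"3x3 conv, 3->64 channels, 224x224 input: {conv_macs:,} MACs")
print(f"same layer in FLOPs (1 MAC = 2 FLOPs):   {2 * conv_macs:,}")
print(f"attention, n=16, d=512: {attention_macs(16, 512):,} MACs")
print(f"FFN,       n=16, d=512: {ffn_macs(16, 512):,} MACs")
```

Note that for short sequences (n much smaller than d) the projection term 4*n*d^2 dominates attention, while the 2*n^2*d term takes over as n grows.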
Total Parameters
Counting parameters is straightforward: sum the sizes of each layer's weight and bias tensors.
Linear (Fully-Connected)
Formula: params = d_in * d_out + d_out (include the bias term d_out only if bias is enabled)
Quick check (summary):
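import torch.nn as nn
from torchinfo import summary  # assumption: summary() is torchinfo's (input_size includes the batch dim)

layer = nn.Linear(512, 1000, bias=True)
summary(layer, input_size=(1, 512))
# (batch, d_in) → params should be 512*1000 + 1000 = 513000
Standard 2D Convolution (no groups)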
Formula: params = k_h * k_w * C_in * C_out + C_out (drop the C_out bias term if bias is disabled)
Quick check (summary):
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
summary(conv, input_size=(1, 3, 224, 224))  # params should be 3*3*3*64 + 64 = 1792
Grouped Convolution (groups = g)
Formula: params = k_h * k_w * (C_in / g) * C_out + C_out
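A minimal sketch of the convolution parameter count in plain Python, with `groups` as an argument so the standard case (groups = 1) and the grouped case share one function. The name `conv2d_params` is hypothetical; the divisibility check mirrors the usual requirement that both channel counts be divisible by `groups`.

```python
# Hypothetical helper: parameter count of a 2D convolution.
# Each output channel only sees C_in / groups input channels.

def conv2d_params(c_in, c_out, k_h, k_w, groups=1, bias=True):
    if c_in % groups or c_out % groups:
        raise ValueError("c_in and c_out must both be divisible by groups")
    weights = k_h * k_w * (c_in // groups) * c_out
    return weights + (c_out if bias else 0)


print(conv2d_params(3, 64, 3, 3))             # standard conv: 1792
print(conv2d_params(32, 64, 3, 3))            # groups=1:      18496
print(conv2d_params(32, 64, 3, 3, groups=2))  # groups=2:      9280
```

With groups = 2 the weight count halves while the bias count stays at C_out, which is why 9280 is slightly more than half of 18496.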
Quick check (summary):
gconv = nn.Conv2d(32, 64, kernel_size=3, padding=1, groups=2)
summary(gconv, input_size=(1, 32, 56, 56))  # 3*3*(32/2)*64 + 64 = 9280
Depthwise-Separable Convolution
Formula:
- Depthwise: k_h * k_w * C_in + C_in
- Pointwise (1×1): C_in * C_out + C_out
- Total = depthwise + pointwise
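To make the saving concrete, the sketch below compares a depthwise-separable block with a standard convolution of the same shape. It assumes a depthwise stage with one filter per input channel (channel multiplier 1) and a bias in both stages; `depthwise_separable_params` is a hypothetical helper name.

```python
# Hypothetical helper: parameters of a depthwise + pointwise (1x1) pair.
# Assumption: depthwise stage keeps C_in channels (groups = C_in).

def depthwise_separable_params(c_in, c_out, k_h, k_w, bias=True):
    depthwise = k_h * k_w * c_in + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise, pointwise


dw, pw = depthwise_separable_params(32, 64, 3, 3)
standard = 3 * 3 * 32 * 64 + 64  # ordinary 3x3 conv, 32 -> 64 channels

print(dw, pw, dw + pw)  # 320 2112 2432
print(f"standard conv: {standard}, ratio: {standard / (dw + pw):.1f}x")
```

For this shape the separable block needs roughly 7.6 times fewer parameters than the standard 3×3 convolution.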
Quick check (summary):
depthwise_separable = nn.Sequential(
    nn.Conv2d(32, 32, kernel_size=3, padding=1, groups=32),  # depthwise
    nn.Conv2d(32, 64, kernel_size=1),                        # pointwise
)
summary(depthwise_separable, input_size=(1, 32, 56, 56))
# (3*3*32 + 32) + (32*64 + 64) = 2432
BatchNorm / LayerNorm (trainable parts only)
Rule of thumb: count only the learnable γ and β (running statistics are buffers, not parameters).
- BatchNorm2d(C): 2 * C
- LayerNorm(D): 2 * D
Quick check (summary):
bn = nn.BatchNorm2d(64)
summary(bn, input_size=(1, 64, 32, 32))  # expect 2*64 params (γ, β)
ln = nn.LayerNorm(128)
summary(ln, input_size=(1, 10, 128))  # last dim = 128 → expect 2*128 params
LSTM (per layer; shared across time)
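Formula (h = hidden size):
Theory: params = 4 * (d_in * h + h * h + h) (4 gates; one bias vector per gate)
Practice: params = 4 * (d_in * h + h * h + 2 * h) (PyTorch stacks the gate matrices and keeps two bias vectors, b_ih and b_hh)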
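The gap between the textbook count and the PyTorch count comes only from the second bias vector. A small sketch, assuming a single unidirectional layer without a projection; `lstm_params` and its `biases_per_gate` argument are hypothetical names used for illustration.

```python
# Hypothetical helper: LSTM parameters for one unidirectional layer.
# biases_per_gate=1 is the textbook form; biases_per_gate=2 matches
# PyTorch, which stores separate input-hidden and hidden-hidden biases.

def lstm_params(d_in: int, h: int, biases_per_gate: int = 2) -> int:
    per_gate = d_in * h + h * h + biases_per_gate * h
    return 4 * per_gate  # input, forget, cell, output gates


theory = lstm_params(128, 256, biases_per_gate=1)
practice = lstm_params(128, 256, biases_per_gate=2)

print(theory)             # 394240
print(practice)           # 395264
print(practice - theory)  # 1024 = 4 * 256 extra bias values
```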
Quick check (summary):
class LSTMWrap(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=128, hidden_size=256, num_layers=1, batch_first=True)

    def forward(self, x):
        y, _ = self.lstm(x)  # drop the (h_n, c_n) state tuple
        return y

model = LSTMWrap()
summary(model, input_size=(32, 10, 128))  # (batch, seq, d_in)
# 4*(128*256 + 256*256 + 2*256) = 395264
GRU (per layer; shared across time)
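Formula: params = 3 * (d_in * h + h * h + 2 * h) (3 gates; PyTorch keeps two bias vectors per gate, as for the LSTM)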
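As a worked instance, the sketch below evaluates the GRU count for the layer used in the quick check (d_in = 128, h = 256) and compares it with an LSTM of the same size. It assumes a single unidirectional layer with PyTorch's two-bias layout; `gated_rnn_params` is a hypothetical helper, not a library function.

```python
# Hypothetical helper: parameters of a gated RNN layer with `gates`
# stacked gate blocks and two bias vectors per gate (PyTorch layout).

def gated_rnn_params(d_in: int, h: int, gates: int) -> int:
    return gates * (d_in * h + h * h + 2 * h)


gru = gated_rnn_params(128, 256, gates=3)   # reset, update, candidate
lstm = gated_rnn_params(128, 256, gates=4)  # input, forget, cell, output

print(gru)         # 296448
print(lstm)        # 395264
print(gru / lstm)  # 0.75
```

A GRU therefore has exactly three quarters of the parameters of an LSTM with the same input and hidden sizes.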
Quick check (summary):
class GRUWrap(nn.Module):
    def __init__(self):
        super().__init__()
        self.gru = nn.GRU(input_size=128, hidden_size=256, num_layers=1, batch_first=True)

    def forward(self, x):
        y, _ = self.gru(x)  # drop the final hidden state
        return y

model = GRUWrap()
summary(model, input_size=(32, 10, 128))
Multi-Head Self-Attention (params only; not counting MLP)
Heuristic: params ≈ 4 * d^2 + 4 * d (Q, K, V and output projections, each a d×d matrix plus a bias of size d).
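A short sketch of this count, assuming queries, keys, and values all have dimension d and every projection has a bias. `mha_params` is a hypothetical helper. The number of heads does not appear because the heads split the d dimensions among themselves rather than adding new weights.

```python
# Hypothetical helper: parameters of multi-head self-attention.
# Assumption: Q, K, V and output projections are all d x d with bias.

def mha_params(d: int, bias: bool = True) -> int:
    per_projection = d * d + (d if bias else 0)
    return 4 * per_projection  # Q, K, V, output


print(mha_params(512))              # 1050624
print(mha_params(512, bias=False))  # 1048576
```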
Quick check (summary):
class MHAWrap(nn.Module):
    def __init__(self):
        super().__init__()
        self.mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

    def forward(self, x):
        y, _ = self.mha(x, x, x)  # self-attention: query = key = value
        return y

model = MHAWrap()
summary(model, input_size=(32, 16, 512))  # (batch, seq, d)
# 4 * 512 * 512 + 4 * 512 = 1050624
Transformer
Full encoder layer:
enc = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048, batch_first=True)
summary(enc, input_size=(32, 16, 512))
# 3,152,384
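The encoder-layer total can be broken down by component. The sketch below assumes the layer consists of one self-attention block, a two-layer feed-forward network, and two LayerNorms, all with biases; `encoder_layer_params` is a hypothetical helper name.

```python
# Hypothetical helper: per-component parameter count of one
# Transformer encoder layer (attention + FFN + two LayerNorms).

def encoder_layer_params(d: int, d_ff: int) -> dict:
    attention = 4 * (d * d + d)                  # Q, K, V, output projections
    ffn = (d * d_ff + d_ff) + (d_ff * d + d)     # d -> d_ff -> d, with biases
    layer_norms = 2 * (2 * d)                    # gamma and beta, two norms
    return {
        "attention": attention,
        "ffn": ffn,
        "layer_norms": layer_norms,
        "total": attention + ffn + layer_norms,
    }


counts = encoder_layer_params(d=512, d_ff=2048)
for name, value in counts.items():
    print(f"{name:12s} {value:>10,}")
```

For d = 512 and d_ff = 2048 this gives 1,050,624 for attention, 2,099,712 for the feed-forward network, and 2,048 for the norms, which sums to the 3,152,384 above. About two thirds of the parameters sit in the feed-forward network.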