Info

Actually, there is not much math that we need to do. However, everything is easier to follow once we know how to calculate the basics, i.e. derivatives and gradients.

Modern neural networks are powerful in large part because they rely on matrix calculations, typically driven from Python (even though Python is not very efficient in some respects). So here we do some matrix calculus.

Simple NN

---
config:
  layout: dagre
  look: handDrawn
---
flowchart LR
 subgraph subGraph0["Input Layer"]
    direction LR
        x1(("x₁"))
        x2(("x₂"))
        x3(("x₃"))
        x4(("x₄"))
        x5(("x₅"))
        xb[("b")]
  end
 subgraph subGraph1["Hidden Layer"]
    direction LR
        h1(("h₁"))
        h2(("h₂"))
        h3(("h₃"))
  end
    subGraph0 --> subGraph1
    subGraph1 --> s(("s"))
    style subGraph0 fill:#C8E6C9
    style subGraph1 fill:#E1BEE7
    style s fill:#BBDEFB

input layer ⇒ hidden layer ⇒ output layer
$σ$ is the activation function, which is usually ReLU or sigmoid.

$$
\begin{aligned}
x &= [x_{1}, x_{2}, \dots, x_{n}]^{T} \\
z &= W x + b \\
h &= σ(z) \\
s &= u^{T} h
\end{aligned} \tag{1}
$$

Note

For this simple neural network, we can use the formula above to calculate the output $s$. The input layer is the input vector $x$, the hidden layer is the hidden vector $h$, and the output layer is the output scalar $s$.

Throughout the network, we need the gradient of the loss function with respect to each parameter: $\frac{\partial L}{\partial W}, \frac{\partial L}{\partial b}, \frac{\partial L}{\partial u}$. For simplicity, we differentiate the output $s$ instead of the loss $L$, so below we compute $\frac{\partial s}{\partial W}, \frac{\partial s}{\partial b}, \frac{\partial s}{\partial u}$.
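The forward computation in (1) can be sketched in plain Python. This is only an illustrative sketch: it assumes the sigmoid for $σ$, uses nested lists instead of a matrix library, and the numeric values for `W`, `b`, `u`, and `x` are made up (5 inputs and 3 hidden units, matching the diagram).

```python
import math


def sigmoid(t):
    """Logistic sigmoid, assumed here as the activation σ."""
    return 1.0 / (1.0 + math.exp(-t))


def forward(W, b, u, x):
    """Compute z = Wx + b, h = σ(z), s = uᵀh using plain lists."""
    z = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
         for row, b_i in zip(W, b)]
    h = [sigmoid(z_i) for z_i in z]
    s = sum(u_i * h_i for u_i, h_i in zip(u, h))
    return z, h, s


# Hypothetical values: x has 5 entries, W is 3x5, b and u have 3 entries.
x = [1.0, 0.5, -1.0, 2.0, 0.0]
W = [[0.1, -0.2, 0.3, 0.0, 0.5],
     [0.4, 0.1, -0.1, 0.2, -0.3],
     [-0.5, 0.2, 0.1, 0.1, 0.0]]
b = [0.0, 0.1, -0.1]
u = [1.0, -1.0, 0.5]

z, h, s = forward(W, b, u, x)
print("z =", z)
print("h =", h)
print("s =", s)
```

Note that `z` and `h` have one entry per hidden unit, while `s` is a single number.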

Tools

Jacobian Matrix

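$$
\frac{\partial f}{\partial x} =
\begin{bmatrix}
\frac{\partial f_{1}}{\partial x_{1}} & \dots & \frac{\partial f_{1}}{\partial x_{n}} \\
\vdots & \ddots & \vdots \\
\frac{\partial f_{m}}{\partial x_{1}} & \dots & \frac{\partial f_{m}}{\partial x_{n}}
\end{bmatrix} \tag{2}
$$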
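One way to get a feel for the Jacobian is to approximate it numerically. The sketch below uses central differences; the helper name `numerical_jacobian` and the test function `f` are made up for illustration and are not part of the lecture material.

```python
def numerical_jacobian(f, x, eps=1e-6):
    """Approximate the m x n Jacobian of f at x with central differences."""
    m = len(f(x))
    n = len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        x_plus = list(x)
        x_plus[j] += eps
        x_minus = list(x)
        x_minus[j] -= eps
        f_plus, f_minus = f(x_plus), f(x_minus)
        for i in range(m):
            # Entry (i, j) is the derivative of output i w.r.t. input j.
            J[i][j] = (f_plus[i] - f_minus[i]) / (2 * eps)
    return J


def f(x):
    """Hypothetical function from R^2 to R^3."""
    return [x[0] * x[1], x[0] + x[1], x[0] ** 2]


J = numerical_jacobian(f, [2.0, 3.0])
# The analytic Jacobian at (2, 3) is [[3, 2], [1, 1], [4, 0]].
for row in J:
    print(row)
```

The result has one row per output and one column per input, matching the layout in (2).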

Chain Rule

$$
\frac{\partial f}{\partial x} = \frac{\partial f}{\partial y} \frac{\partial y}{\partial x} \tag{3}
$$
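For vector-valued functions, the chain rule (3) becomes a product of Jacobians. The sketch below checks this on a small made-up example, where `y = A x` and `f(y) = y_0^2 + y_1`; the matrix `A` and the functions `y_of` and `f_of` are assumptions chosen only for illustration.

```python
def matmul(A, B):
    """Multiply two matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


A = [[1.0, 2.0], [3.0, -1.0]]


def y_of(x):
    """y = A x, so the Jacobian dy/dx is A itself."""
    return [sum(a * x_j for a, x_j in zip(row, x)) for row in A]


def f_of(y):
    """Scalar function of y, so df/dy is the row vector [2*y_0, 1]."""
    return y[0] ** 2 + y[1]


x = [0.5, -1.5]
y = y_of(x)

df_dy = [[2 * y[0], 1.0]]          # 1 x 2 Jacobian
dy_dx = A                          # 2 x 2 Jacobian
df_dx = matmul(df_dy, dy_dx)[0]    # chain rule: 1 x 2 result

# Compare against central differences on the composed function.
eps = 1e-6
numeric = []
for j in range(len(x)):
    x_plus = list(x)
    x_plus[j] += eps
    x_minus = list(x)
    x_minus[j] -= eps
    numeric.append((f_of(y_of(x_plus)) - f_of(y_of(x_minus))) / (2 * eps))

print("chain rule:", df_dx)
print("numeric:   ", numeric)
```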

Calculation

Going back to Simple NN and using Tools, we have:

What is $\frac{\partial h}{\partial z}$? Because $σ$ is applied element-wise, $h_{i} = σ (z_{i})$, so each $h_{i}$ depends only on $z_{i}$:

$$
\left( \frac{\partial h}{\partial z} \right)_{i, j} = \frac{\partial h_{i}}{\partial z_{j}} = \frac{\partial σ(z_{i})}{\partial z_{j}} =
\begin{cases}
σ'(z_{i}) & \text{if } i = j \\
0 & \text{otherwise}
\end{cases} \tag{4}
$$

$$
\frac{\partial h}{\partial z} =
\begin{bmatrix}
σ'(z_{1}) & \dots & 0 \\
\vdots & \ddots & \vdots \\
0 & \dots & σ'(z_{m})
\end{bmatrix} = \operatorname{diag}(σ'(z)) \tag{5}
$$
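The diagonal structure can be checked numerically. The sketch below assumes the sigmoid for $σ$ (whose derivative is $σ(t)(1 - σ(t))$) and uses made-up values for `z`; the helper `dh_dz` is a hypothetical name.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def sigmoid_prime(t):
    """Derivative of the sigmoid: σ(t) * (1 - σ(t))."""
    s = sigmoid(t)
    return s * (1.0 - s)


def dh_dz(z):
    """Jacobian of the element-wise map h = σ(z): diag(σ'(z))."""
    m = len(z)
    return [[sigmoid_prime(z[i]) if i == j else 0.0 for j in range(m)]
            for i in range(m)]


z = [-0.3, 1.05, -0.4]   # hypothetical pre-activations
J = dh_dz(z)

# Central-difference check: perturb z_j and watch every h_i.
eps = 1e-6
max_err = 0.0
for j in range(len(z)):
    z_plus = list(z)
    z_plus[j] += eps
    z_minus = list(z)
    z_minus[j] -= eps
    for i in range(len(z)):
        numeric = (sigmoid(z_plus[i]) - sigmoid(z_minus[i])) / (2 * eps)
        max_err = max(max_err, abs(numeric - J[i][j]))

print("largest difference from finite differences:", max_err)
```

Off-diagonal entries are exactly zero because perturbing $z_{j}$ leaves every other $h_{i}$ unchanged.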

Other Jacobians

$$
\begin{aligned}
\frac{\partial}{\partial x} (Wx + b) &= W \\
\frac{\partial}{\partial W} (Wx + b) &= x^{T} \\
\frac{\partial}{\partial b} (Wx + b) &= I \\
\frac{\partial}{\partial u} (u^{T} h) &= h^{T}
\end{aligned} \tag{6}
$$

For example:

What is $\frac{\partial s}{\partial b}$ ?

$$
\frac{\partial s}{\partial b} = \frac{\partial s}{\partial h} \frac{\partial h}{\partial z} \frac{\partial z}{\partial b} = u^{T} \operatorname{diag}(σ'(z)) I = u^{T} \odot σ'(z) = δ \tag{7}
$$
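As a sanity check on (7), the sketch below computes $δ$ as the element-wise product of $u$ and $σ'(z)$ and compares it with a finite-difference estimate of $\frac{\partial s}{\partial b}$. It assumes the sigmoid for $σ$; the helper `output` and all numeric values are made up for illustration.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def output(W, b, u, x):
    """Return (s, z) for z = Wx + b, h = σ(z), s = uᵀh."""
    z = [sum(w * x_j for w, x_j in zip(row, x)) + b_i
         for row, b_i in zip(W, b)]
    s = sum(u_i * sigmoid(z_i) for u_i, z_i in zip(u, z))
    return s, z


# Hypothetical small network: 2 inputs, 3 hidden units.
x = [1.0, -2.0]
W = [[0.5, 0.1], [-0.3, 0.2], [0.0, 0.4]]
b = [0.1, 0.0, -0.2]
u = [1.0, -1.0, 2.0]

s, z = output(W, b, u, x)

# δ_i = u_i * σ'(z_i), with σ'(t) = σ(t) * (1 - σ(t)) for the sigmoid.
delta = [u_i * sigmoid(z_i) * (1.0 - sigmoid(z_i)) for u_i, z_i in zip(u, z)]

# Central differences on each bias entry.
eps = 1e-6
numeric = []
for i in range(len(b)):
    b_plus = list(b)
    b_plus[i] += eps
    b_minus = list(b)
    b_minus[i] -= eps
    diff = output(W, b_plus, u, x)[0] - output(W, b_minus, u, x)[0]
    numeric.append(diff / (2 * eps))

print("delta:  ", delta)
print("numeric:", numeric)
```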

What is $\frac{\partial s}{\partial W}$ ?

$$
\frac{\partial s}{\partial W} = \underbrace{\frac{\partial s}{\partial h} \frac{\partial h}{\partial z}}_{\text{same as before}} \frac{\partial z}{\partial W} = u^{T} \operatorname{diag}(σ'(z)) \, x^{T} = δ^{T} x^{T} \in \mathbb{R}^{m \times n} \tag{8}
$$
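The result in (8) is an outer product: entry $(i, j)$ is $δ_{i} x_{j}$, so the gradient has the same $m \times n$ shape as $W$. The sketch below builds it and checks every entry against finite differences. It assumes the sigmoid for $σ$, and the helper `output` and the numbers are made up for illustration.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def output(W, b, u, x):
    """Return (s, z) for z = Wx + b, h = σ(z), s = uᵀh."""
    z = [sum(w * x_j for w, x_j in zip(row, x)) + b_i
         for row, b_i in zip(W, b)]
    s = sum(u_i * sigmoid(z_i) for u_i, z_i in zip(u, z))
    return s, z


# Hypothetical small network: n = 2 inputs, m = 3 hidden units.
x = [1.0, -2.0]
W = [[0.5, 0.1], [-0.3, 0.2], [0.0, 0.4]]
b = [0.1, 0.0, -0.2]
u = [1.0, -1.0, 2.0]

s, z = output(W, b, u, x)
delta = [u_i * sigmoid(z_i) * (1.0 - sigmoid(z_i)) for u_i, z_i in zip(u, z)]

# Outer product of δ (length m) and x (length n): an m x n matrix.
grad_W = [[d_i * x_j for x_j in x] for d_i in delta]

# Central-difference check of every entry of W.
eps = 1e-6
max_err = 0.0
for i in range(len(W)):
    for j in range(len(W[0])):
        W_plus = [list(row) for row in W]
        W_plus[i][j] += eps
        W_minus = [list(row) for row in W]
        W_minus[i][j] -= eps
        diff = output(W_plus, b, u, x)[0] - output(W_minus, b, u, x)[0]
        max_err = max(max_err, abs(diff / (2 * eps) - grad_W[i][j]))

print("shape:", len(grad_W), "x", len(grad_W[0]))
print("largest difference from finite differences:", max_err)
```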

Shape of derivatives

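Jacobian form: the derivative of a scalar with respect to a column vector is a row vector, which makes the chain rule a plain matrix product.

Shape convention: the gradient has the same shape as the parameter (for example, a column vector for a column-vector parameter), which makes SGD updates easy during training.

There are two ways to reconcile them:

- Use Jacobian form as much as possible, and reshape to follow the shape convention at the end.
- Always follow the shape convention.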
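A small sketch of why the shape convention helps: once the gradient has the same shape as the parameter, the SGD update is a plain element-wise subtraction. The values of `b`, the Jacobian-form derivative, and the learning rate `lr` below are made up for illustration.

```python
def transpose(M):
    """Transpose a matrix stored as nested lists."""
    return [list(col) for col in zip(*M)]


# b is a 3 x 1 column vector (hypothetical values).
b = [[0.0], [0.1], [-0.1]]

# In Jacobian form, ds/db is a 1 x 3 row vector (hypothetical values).
ds_db_jacobian = [[0.2, -0.1, 0.05]]

# Reshape at the end so the gradient matches the shape of b.
grad_b = transpose(ds_db_jacobian)   # now 3 x 1

# SGD step: each entry of b moves against its own gradient entry.
lr = 0.1
b_new = [[b_i[0] - lr * g_i[0]] for b_i, g_i in zip(b, grad_b)]

print("grad_b:", grad_b)
print("b_new: ", b_new)
```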

Backpropagation

Backpropagation is one of the most important parts of deep learning (DL). There are two steps:

Using upstream derivatives and local derivatives to calculate downstream derivatives.
Re-using these derivatives to update the parameters.

Calculate all gradients at once, reusing shared terms in the same way $δ$ was reused in the hand calculation above.


How are the gradients calculated in an NN framework?

Forward propagate and record the intermediate values
Backpropagate and calculate the gradients
- Use the basic explicit formula for each step's gradient
- Calculate the upstream gradient
- Then calculate the local gradient
- Calculate the downstream gradient
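The procedure above can be sketched for the simple network from this page. This is a minimal illustration, not how any particular framework implements it: it assumes the sigmoid for $σ$, differentiates the output $s$ directly, and the function names `forward` and `backward`, the `cache` dictionary, and all numbers are made up.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def forward(params, x):
    """Forward pass that records the intermediate values it computes."""
    W, b, u = params
    z = [sum(w * x_j for w, x_j in zip(row, x)) + b_i
         for row, b_i in zip(W, b)]
    h = [sigmoid(z_i) for z_i in z]
    s = sum(u_i * h_i for u_i, h_i in zip(u, h))
    cache = {"x": x, "z": z, "h": h}
    return s, cache


def backward(params, cache):
    """Backward pass: downstream gradient = upstream gradient * local gradient."""
    W, b, u = params
    x, h = cache["x"], cache["h"]
    upstream = 1.0                                  # ds/ds at the output
    grad_u = [upstream * h_i for h_i in h]          # local: ds/du = h
    grad_h = [upstream * u_i for u_i in u]          # local: ds/dh = u
    # Local gradient of the sigmoid is h * (1 - h); this is δ.
    grad_z = [g * h_i * (1.0 - h_i) for g, h_i in zip(grad_h, h)]
    grad_b = list(grad_z)                           # local: dz/db = I
    grad_W = [[d * x_j for x_j in x] for d in grad_z]
    grad_x = [sum(W[i][j] * grad_z[i] for i in range(len(W)))
              for j in range(len(x))]               # local: dz/dx = W
    return {"W": grad_W, "b": grad_b, "u": grad_u, "x": grad_x}


# Hypothetical small network: 2 inputs, 3 hidden units.
x = [1.0, -2.0]
params = ([[0.5, 0.1], [-0.3, 0.2], [0.0, 0.4]],    # W
          [0.1, 0.0, -0.2],                         # b
          [1.0, -1.0, 2.0])                         # u

s, cache = forward(params, x)
grads = backward(params, cache)
for name, value in grads.items():
    print(name, value)
```

The vector `grad_z` is computed once and then reused for `grad_b`, `grad_W`, and `grad_x`, which is where the savings come from.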

Reference

CS224N Lecture 3 PPT
