Info

  • Actually, there is not that much math we need to do. However, it's going to be easier to follow if we understand how to calculate the basic things, i.e. derivatives / gradients.
  • A big part of what makes modern neural networks practical is that they are expressed as matrix calculations, typically driven from Python (even though Python itself is not so efficient in some ways). So here we do some matrix calculus.

Simple NN

---
config:
  layout: dagre
  look: handDrawn
---
flowchart LR
 subgraph subGraph0["Input Layer"]
    direction LR
        x1(("x₁"))
        x2(("x₂"))
        x3(("x₃"))
        x4(("x₄"))
        x5(("x₅"))
        xb[("b")]
  end
 subgraph subGraph1["Hidden Layer"]
    direction LR
        h1(("h₁"))
        h2(("h₂"))
        h3(("h₃"))
  end
    subGraph0 --> subGraph1
    subGraph1 --> s(("s"))
    style subGraph0 fill:#C8E6C9
    style subGraph1 fill:#E1BEE7
    style s fill:#BBDEFB
  • input layer $x$ ⇒ hidden layer $h = f(Wx + b)$ ⇒ output layer $s = u^\top h$
  • $f$ is the activation function, which is usually ReLU or sigmoid.

Note

  • So for this simple neural network, we can use the above formula to calculate the output $s$. The input layer is the input vector $x$, the hidden layer is the hidden vector $h$, and the output layer is the output scalar $s$.
  • Throughout the network, we need to calculate the gradient of the loss function with respect to each parameter, which is $\partial J / \partial \theta$. For simplicity, we can just use $\partial s / \partial \theta$, the gradient of the output $s$, to represent the gradient of the loss function with respect to each parameter.
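To make the forward computation concrete, here is a minimal sketch in plain Python. It assumes the usual form for this architecture, z = Wx + b, h = f(z), s = uᵀh, with sigmoid as f. The names `W`, `b`, `u`, `z` and all numeric values are illustrative assumptions, sized to match the diagram (5 inputs, 3 hidden units). In practice a matrix library would do this in one line; plain lists are used only to keep the sketch dependency-free.

```python
import math

def sigmoid(t):
    """Logistic sigmoid, one of the activations mentioned above."""
    return 1.0 / (1.0 + math.exp(-t))

def forward(W, b, u, x):
    """Compute z = Wx + b, h = f(z), s = u^T h for one input vector x."""
    z = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
         for row, b_i in zip(W, b)]
    h = [sigmoid(z_i) for z_i in z]
    s = sum(u_i * h_i for u_i, h_i in zip(u, h))
    return z, h, s

# Hypothetical values: 5 inputs and 3 hidden units, as in the diagram.
x = [1.0, 0.5, -1.0, 2.0, 0.0]
W = [[0.1, -0.2, 0.3, 0.0, 0.5],
     [0.4, 0.1, -0.1, 0.2, -0.3],
     [-0.5, 0.2, 0.0, 0.1, 0.1]]
b = [0.0, 0.1, -0.1]
u = [1.0, -1.0, 0.5]

z, h, s = forward(W, b, u, x)
print(len(z), len(h), s)  # hidden vectors have 3 entries, s is a scalar
```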

Tools

  1. Jacobian Matrix
  2. Chain Rule
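For reference, the two tools can be written out as follows. These are the standard textbook definitions; the generic functions $F$ and $G$ are placeholder names introduced here, not notation from the original notes.

```latex
% Jacobian of y = F(x), with F : R^n -> R^m (an m x n matrix)
\frac{\partial y}{\partial x} =
\begin{bmatrix}
\frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n}
\end{bmatrix},
\qquad
\left( \frac{\partial y}{\partial x} \right)_{ij} = \frac{\partial y_i}{\partial x_j}

% Chain rule for y = F(v), v = G(x): multiply the Jacobians
\frac{\partial y}{\partial x} = \frac{\partial y}{\partial v} \, \frac{\partial v}{\partial x}
```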

Calculation

Going back to Simple NN and using Tools, we have:

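  1. What is $\partial h / \partial z$, where $z = Wx + b$ and $h = f(z)$? Because $f$ is element-wise, $h_i = f(z_i)$, which means $\partial h_i / \partial z_j = f'(z_i)$ if $i = j$ and $0$ otherwise, i.e. $\partial h / \partial z = \mathrm{diag}(f'(z))$.
  2. Other Jacobians: $\partial (Wx + b) / \partial x = W$, $\partial (Wx + b) / \partial b = I$, and $\partial (u^\top h) / \partial u = h^\top$.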
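For an element-wise activation, the Jacobian of h = f(z) with respect to z is diagonal, with entries f'(zᵢ). The sketch below checks this numerically with central differences. It assumes sigmoid as the activation; the values in `z` and the helper `numerical_jacobian` are hypothetical, introduced only for this check.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def f(z):
    """Element-wise activation: h_i = sigmoid(z_i)."""
    return [sigmoid(z_i) for z_i in z]

def numerical_jacobian(func, z, eps=1e-6):
    """Central-difference estimate of J[i][j] = d func(z)_i / d z_j."""
    n = len(z)
    m = len(func(z))
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        z_plus = list(z)
        z_plus[j] += eps
        z_minus = list(z)
        z_minus[j] -= eps
        out_plus, out_minus = func(z_plus), func(z_minus)
        for i in range(m):
            J[i][j] = (out_plus[i] - out_minus[i]) / (2 * eps)
    return J

z = [-0.3, 1.05, 0.2]  # hypothetical pre-activations
J = numerical_jacobian(f, z)

# Analytic form: diag(f'(z)), using sigmoid'(t) = sigmoid(t) * (1 - sigmoid(t)).
analytic = [[sigmoid(z_i) * (1 - sigmoid(z_i)) if i == j else 0.0
             for j in range(len(z))]
            for i, z_i in enumerate(z)]

max_err = max(abs(J[i][j] - analytic[i][j])
              for i in range(len(z)) for j in range(len(z)))
print(max_err)  # should be tiny if the diagonal form is right
```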

For example:

  1. What is $\partial s / \partial b$?
  2. What is $\partial s / \partial W$?
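Taking the two quantities to be $\partial s / \partial b$ and $\partial s / \partial W$ (an assumption based on the network above, with $z = Wx + b$, $h = f(z)$, $s = u^\top h$), the chain rule gives roughly the following. The symbol $\delta$ for the shared factor and $\circ$ for the element-wise product are notational choices made here.

```latex
% Bias: multiply the three local Jacobians
\frac{\partial s}{\partial b}
  = \frac{\partial s}{\partial h}\,\frac{\partial h}{\partial z}\,\frac{\partial z}{\partial b}
  = u^\top \, \mathrm{diag}(f'(z)) \, I
  = \left( u \circ f'(z) \right)^\top
  \equiv \delta

% Weights: the first two factors are the same, so \delta is reused
\frac{\partial s}{\partial W}
  = \underbrace{\frac{\partial s}{\partial h}\,\frac{\partial h}{\partial z}}_{\delta}\,
    \frac{\partial z}{\partial W}
  \quad\Longrightarrow\quad
  \frac{\partial s}{\partial W} = \delta^\top x^\top
  \ \text{(written in the same shape as } W \text{)}
```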

Shape of derivatives

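  • Jacobian Form: the derivative of a scalar with respect to a vector is a row vector, which makes the chain rule easy to apply.
  • Shape Convention: the gradient has the same shape as the parameter (a column vector for a column-vector parameter), which makes SGD updates easy during training.

Two options for working with them:

  1. Use Jacobian form as much as possible, reshape to follow the shape convention at the end.
  2. Always follow the shape convention.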
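A small sketch of why the shape convention helps: once each gradient has the same shape as its parameter, the update is a plain element-wise subtraction. The values of `delta_row`, `x`, and `lr` are hypothetical, and the helpers `outer` and `shape` are written here only for illustration.

```python
def outer(col, row):
    """Outer product of two vectors as a len(col) x len(row) matrix."""
    return [[c * r for r in row] for c in col]

def shape(M):
    """Shape of a list-of-lists matrix as (rows, columns)."""
    return (len(M), len(M[0]))

# Hypothetical values: delta is ds/dz in Jacobian form (a 1 x 3 row vector).
delta_row = [[0.2, -0.1, 0.05]]
x = [1.0, 0.5, -1.0, 2.0, 0.0]       # input with 5 entries
W = [[0.0] * 5 for _ in range(3)]    # 3 x 5 weights
b = [[0.0], [0.0], [0.0]]            # 3 x 1 column vector

# Reshape at the end: transpose the 1 x 3 row into a 3 x 1 column to match b.
grad_b = [[d] for d in delta_row[0]]

# ds/dW in the shape convention is delta^T x^T, a 3 x 5 matrix like W.
grad_W = outer(delta_row[0], x)

assert shape(grad_b) == shape(b)
assert shape(grad_W) == shape(W)

# Matching shapes make the update an element-wise subtraction.
# (A descent step on s itself is used purely to illustrate the shapes.)
lr = 0.1
W = [[w - lr * g for w, g in zip(w_row, g_row)]
     for w_row, g_row in zip(W, grad_W)]
b = [[b_i[0] - lr * g_i[0]] for b_i, g_i in zip(b, grad_b)]
print(shape(W), shape(b))
```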

Backpropagation

Backpropagation is arguably one of the most important parts of deep learning. There are two steps:

  1. Using upstream derivatives and local derivatives to calculate downstream derivatives.
  2. Re-using these derivatives to update the parameters.
  • Calculate all gradients at once, just like reusing the shared error term $\delta$ when deriving the gradients by hand.
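The rule "downstream = upstream × local" can be illustrated on a tiny scalar computation graph. The function below, (x + y) · max(y, z), and its inputs are a hypothetical example chosen for this sketch; the point is only how each node multiplies the gradient arriving from above by its own local derivative.

```python
def forward_backward(x, y, z):
    # Forward pass: record the intermediate values.
    a = x + y
    m = max(y, z)
    out = a * m

    # Backward pass: downstream = upstream * local at every node.
    d_out = 1.0                    # gradient of out with respect to itself
    d_a = d_out * m                # local derivative of a*m w.r.t. a is m
    d_m = d_out * a                # local derivative of a*m w.r.t. m is a
    d_x = d_a * 1.0                # a = x + y passes the gradient through
    d_y_from_a = d_a * 1.0
    d_y_from_m = d_m * (1.0 if y > z else 0.0)  # max routes to the larger input
    d_z = d_m * (1.0 if z > y else 0.0)
    d_y = d_y_from_a + d_y_from_m  # y feeds two nodes, so its gradients add
    return out, (d_x, d_y, d_z)

out, grads = forward_backward(1.0, 2.0, 0.0)
print(out, grads)  # 6.0 (2.0, 5.0, 0.0)
```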


How are the gradients calculated in an NN framework?

  1. Forward propagate and record the intermediate values
  2. Backpropagate and calculate the gradients
    • Load the basic explicit formula for each step's gradient
    • Calculate the upstream gradient
    • Then calculate the local gradient
    • Calculate the downstream gradient
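The two steps above can be sketched for the simple network as follows. This is a hand-written illustration, not how any particular framework is implemented. It assumes z = Wx + b, h = sigmoid(z), s = uᵀh; the `cache` dictionary, the function names, and all numeric values are hypothetical. A finite-difference check at the end compares the backpropagated gradients against numerical estimates.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def forward(W, b, u, x):
    """Step 1: forward propagate and record the intermediate values."""
    z = [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]
    h = [sigmoid(zi) for zi in z]
    s = sum(ui * hi for ui, hi in zip(u, h))
    return {"x": x, "z": z, "h": h, "s": s}

def backward(u, cache):
    """Step 2: backpropagate, reusing the values recorded in the cache."""
    h, x = cache["h"], cache["x"]
    ds_dh = list(u)                                  # upstream gradient at h
    dh_dz = [hi * (1.0 - hi) for hi in h]            # local gradient of sigmoid
    delta = [g * l for g, l in zip(ds_dh, dh_dz)]    # downstream gradient ds/dz
    grad_b = list(delta)                             # dz/db is the identity
    grad_W = [[d * xj for xj in x] for d in delta]   # outer product of delta and x
    grad_u = list(h)                                 # ds/du is h
    return grad_W, grad_b, grad_u

# Hypothetical values: 5 inputs and 3 hidden units.
x = [1.0, 0.5, -1.0, 2.0, 0.0]
W = [[0.1, -0.2, 0.3, 0.0, 0.5],
     [0.4, 0.1, -0.1, 0.2, -0.3],
     [-0.5, 0.2, 0.0, 0.1, 0.1]]
b = [0.0, 0.1, -0.1]
u = [1.0, -1.0, 0.5]

cache = forward(W, b, u, x)
grad_W, grad_b, grad_u = backward(u, cache)

# Finite-difference check on one weight and one bias.
eps = 1e-6
W_plus = [row[:] for row in W]
W_plus[1][3] += eps
W_minus = [row[:] for row in W]
W_minus[1][3] -= eps
num_W = (forward(W_plus, b, u, x)["s"] - forward(W_minus, b, u, x)["s"]) / (2 * eps)

b_plus = list(b)
b_plus[0] += eps
b_minus = list(b)
b_minus[0] -= eps
num_b = (forward(W, b_plus, u, x)["s"] - forward(W, b_minus, u, x)["s"]) / (2 * eps)

print(abs(num_W - grad_W[1][3]), abs(num_b - grad_b[0]))  # both should be tiny
```

Note that `delta` is computed once and reused for both `grad_b` and `grad_W`, which is the reuse of derivatives described in the steps above.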

Reference

CS224N Lecture 3 PPT