Interview Preparation
Optimization & Training
Brief notes prepared for technical interviews
Topics: Backpropagation · Optimizers · Initialization · Activations

These notes cover the mechanics that make deep networks trainable: how gradients are computed, how parameters are updated efficiently, how to choose initial values, and how activations shape gradient flow.

Backpropagation

Chain rule

Computation flow

Dynamic programming
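These three ideas fit together: backpropagation is the chain rule applied along the computation flow in reverse topological order, with forward intermediates cached and reused (the dynamic-programming part) instead of recomputed. A minimal sketch on a hypothetical two-layer ReLU network (all names and shapes here are illustrative):

```python
import numpy as np

# Toy network: y = W2 @ relu(W1 @ x), loss = 0.5 * ||y - t||^2.
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
x, t = rng.normal(size=3), rng.normal(size=2)

# Forward pass: cache every intermediate (the "DP table").
z1 = W1 @ x              # pre-activation
h = np.maximum(z1, 0)    # ReLU
y = W2 @ h
loss = 0.5 * np.sum((y - t) ** 2)

# Backward pass: chain rule in reverse order; each local gradient
# multiplies the cached upstream gradient exactly once.
dy = y - t               # dL/dy
dW2 = np.outer(dy, h)    # dL/dW2 reuses cached h
dh = W2.T @ dy           # dL/dh
dz1 = dh * (z1 > 0)      # ReLU gate reuses cached z1
dW1 = np.outer(dz1, x)   # dL/dW1
```

Each cached value (z1, h, y) is consumed by exactly one backward step, which is why reverse-mode differentiation costs only a small constant factor more than the forward pass.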

Gradient Accumulation
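Gradient accumulation simulates a large batch when memory only allows small micro-batches: sum gradients over several micro-batches and apply one optimizer step. A framework-agnostic sketch (the objective and `grad_fn` below are illustrative stand-ins, not any particular library's API):

```python
# Accumulate gradients over `accum_steps` micro-batches, then step once
# with the averaged gradient -- equivalent in expectation to one large batch.
def sgd_with_accumulation(theta, micro_batches, grad_fn, lr=0.1, accum_steps=4):
    accum = 0.0
    for i, batch in enumerate(micro_batches, start=1):
        accum += grad_fn(theta, batch)           # accumulate, don't step yet
        if i % accum_steps == 0:
            theta -= lr * (accum / accum_steps)  # average, then update once
            accum = 0.0
    return theta

# Toy objective per micro-batch x: (theta - x)^2, gradient 2 * (theta - x).
grad = lambda th, x: 2.0 * (th - x)
theta = sgd_with_accumulation(5.0, [1.0, 1.0, 1.0, 1.0], grad)
```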

Gradient Checkpointing
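Gradient checkpointing trades compute for memory: store only every k-th activation during the forward pass, then re-run the forward computation from the nearest checkpoint when the backward pass needs the dropped activations. A simplified sketch of the bookkeeping (function names are illustrative; real frameworks wire this into autograd):

```python
# Forward through a chain of layers, keeping only every k-th activation.
def forward_with_checkpoints(funcs, x, every=2):
    checkpoints = [(0, x)]
    for i, f in enumerate(funcs):
        x = f(x)
        if (i + 1) % every == 0:
            checkpoints.append((i + 1, x))  # O(n/k) memory instead of O(n)
    return x, checkpoints

# During backward, rebuild the dropped activations of one segment by
# re-running forward from the checkpoint that precedes it.
def recompute_activations(funcs, start, stop, x):
    acts = [x]
    for f in funcs[start:stop]:
        x = f(x)
        acts.append(x)
    return acts
```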

Optimizer

Gradient Descent

\[\theta_{t+1} = \theta_t - \eta\, \nabla \mathcal{L}(\theta_t)\]
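The update rule above, applied to a one-dimensional quadratic as a worked example (the objective L(θ) = (θ − 3)², with gradient 2(θ − 3), is an illustrative choice):

```python
# Plain gradient descent: repeatedly step against the gradient,
# scaled by the learning rate eta.
def gradient_descent(grad, theta, eta=0.1, steps=100):
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta

theta_star = gradient_descent(lambda th: 2.0 * (th - 3.0), theta=0.0)
```

On this quadratic each step shrinks the error by a constant factor (1 − 2η), so the iterate converges geometrically to the minimizer θ = 3.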

Learning Rate
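In practice the learning rate is rarely constant; a common pattern is to decay it on a schedule. A minimal sketch of step decay (the constants and name `step_decay` are arbitrary illustrative choices, not a specific library's API):

```python
# Step decay: multiply the base rate by `factor` every `drop_every` epochs.
def step_decay(eta0, epoch, drop_every=10, factor=0.5):
    return eta0 * (factor ** (epoch // drop_every))

lrs = [step_decay(0.1, e) for e in (0, 9, 10, 25)]
```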

Step Size & Gradient Clipping
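Gradient clipping caps the effective step size when gradients explode. The usual global-norm variant rescales the whole gradient vector when its L2 norm exceeds a threshold, preserving direction; a minimal sketch:

```python
import math

# Global-norm clipping: if ||g||_2 > max_norm, scale g down to norm max_norm.
def clip_by_global_norm(grads, max_norm):
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        grads = [g * scale for g in grads]
    return grads
```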

Stochastic Gradient Descent (SGD)
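SGD replaces the full-dataset gradient with the gradient of a randomly sampled minibatch, trading exactness for much cheaper steps. A sketch on a toy least-squares problem (dataset and constants are illustrative; the minimizer of the mean squared distance is the data mean):

```python
import random

# Minibatch SGD: each step uses the gradient of a random sample of the data.
def sgd(data, theta=0.0, lr=0.05, batch_size=2, steps=500, seed=0):
    rng = random.Random(seed)
    for _ in range(steps):
        batch = rng.sample(data, batch_size)
        grad = sum(2.0 * (theta - x) for x in batch) / batch_size
        theta -= lr * grad
    return theta

theta_hat = sgd([1.0, 2.0, 3.0, 4.0])  # hovers near the mean, 2.5
```

With a constant learning rate the iterate does not converge exactly: it fluctuates around the minimizer with variance proportional to the learning rate and the minibatch gradient noise, which is why learning-rate decay is paired with SGD.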

Momentum

\[v_t = \beta v_{t-1} + \nabla \mathcal{L}(\theta_t), \qquad \theta_{t+1} = \theta_t - \eta v_t\]
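The update above keeps a velocity v that is an exponentially decaying sum of past gradients, which damps oscillation across steep directions and accelerates along consistent ones. A sketch on the same illustrative quadratic L(θ) = (θ − 3)²:

```python
# Heavy-ball momentum: v accumulates gradients with decay beta,
# and the parameter moves along -eta * v.
def momentum_sgd(grad, theta, eta=0.05, beta=0.9, steps=200):
    v = 0.0
    for _ in range(steps):
        v = beta * v + grad(theta)
        theta = theta - eta * v
    return theta

theta_m = momentum_sgd(lambda th: 2.0 * (th - 3.0), theta=0.0)
```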

RMSProp

\[s_t = \rho\, s_{t-1} + (1-\rho)\, g_t^2, \qquad \theta_{t+1} = \theta_t - \eta\, \frac{g_t}{\sqrt{s_t + \epsilon}}\]
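The update above divides each coordinate's step by a running root-mean-square of its gradients, so coordinates with persistently large gradients take smaller steps. A direct transcription of the formulas (quadratic objective is the same illustrative example):

```python
import math

# RMSProp: s is an EMA of squared gradients; steps are gradient / RMS.
def rmsprop(grad, theta, eta=0.1, rho=0.9, eps=1e-8, steps=300):
    s = 0.0
    for _ in range(steps):
        g = grad(theta)
        s = rho * s + (1 - rho) * g * g
        theta = theta - eta * g / math.sqrt(s + eps)
    return theta

theta_r = rmsprop(lambda th: 2.0 * (th - 3.0), theta=0.0)
```

Note that near the optimum the normalized step magnitude stays close to η, so RMSProp hovers in a small neighborhood of the minimizer rather than converging exactly at a fixed learning rate.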

Adam

\[m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t, \quad v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2\] \[\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}\] \[\theta_{t+1} = \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}\]
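Adam combines momentum's first-moment EMA with RMSProp's second-moment EMA, and bias-corrects both (since m and v start at zero, early estimates are biased toward zero; dividing by 1 − βᵗ undoes this). A direct sketch on the same illustrative quadratic:

```python
import math

# Adam: EMAs of gradient (m) and squared gradient (v), bias-corrected,
# then an RMSProp-style normalized step using the corrected moments.
def adam(grad, theta, eta=0.1, b1=0.9, b2=0.999, eps=1e-8, steps=500):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** t)        # bias correction
        v_hat = v / (1 - b2 ** t)
        theta = theta - eta * m_hat / (math.sqrt(v_hat) + eps)
    return theta

theta_a = adam(lambda th: 2.0 * (th - 3.0), theta=0.0)
```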

Activation Function (Sigmoid / Tanh / ReLU)
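For interviews the derivatives matter as much as the functions, since they determine gradient flow: sigmoid's derivative peaks at 0.25 and tanh's at 1 (both saturate for large |x|, causing vanishing gradients in deep stacks), while ReLU passes gradients unchanged for positive inputs but is dead for negative ones. A scalar sketch of all three:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)            # max 0.25 at x = 0; vanishes when stacked

def tanh(x):
    return math.tanh(x)

def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2  # max 1 at x = 0; still saturates

def relu(x):
    return max(0.0, x)

def d_relu(x):
    return 1.0 if x > 0 else 0.0    # no saturation for x > 0; dead for x < 0
```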

Initialization

Random Initialization

\[W_{ij} \sim \mathcal{N}(0, \sigma^2) \quad \text{or} \quad \mathcal{U}(-a, a)\]
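Naive random initialization draws i.i.d. weights with a fixed scale, independent of layer width; because activation variance then grows or shrinks with fan-in, this is what Xavier and He initialization below correct. A sketch of the Gaussian variant (σ = 0.01 is an illustrative choice):

```python
import numpy as np

# Naive init: i.i.d. N(0, sigma^2) with a width-independent sigma.
def random_init(fan_in, fan_out, sigma=0.01, seed=0):
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, sigma, size=(fan_out, fan_in))

W = random_init(256, 128)
```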

Xavier Initialization

\[\text{Var}(W) = \frac{2}{\text{fan}_{\text{in}} + \text{fan}_{\text{out}}}, \qquad W \sim \mathcal{U}\!\left(-\sqrt{\tfrac{6}{\text{fan}_{\text{in}} + \text{fan}_{\text{out}}}},\, \sqrt{\tfrac{6}{\text{fan}_{\text{in}} + \text{fan}_{\text{out}}}}\right)\]
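A direct transcription of the uniform form above: a uniform distribution on (−a, a) has variance a²/3, so the bound a = √(6 / (fan_in + fan_out)) yields exactly the target variance 2 / (fan_in + fan_out), balancing forward and backward signal variance for roughly linear (sigmoid/tanh) activations:

```python
import numpy as np

# Xavier/Glorot uniform: bound chosen so Var(W) = 2 / (fan_in + fan_out).
def xavier_uniform(fan_in, fan_out, seed=0):
    a = np.sqrt(6.0 / (fan_in + fan_out))
    rng = np.random.default_rng(seed)
    return rng.uniform(-a, a, size=(fan_out, fan_in))

W = xavier_uniform(400, 200)
```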

He (Kaiming) Initialization

\[\text{Var}(W) = \frac{2}{\text{fan}_{\text{in}}}, \qquad W \sim \mathcal{N}\!\left(0, \frac{2}{\text{fan}_{\text{in}}}\right)\]
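A sketch of the Gaussian form above. The factor 2 (versus Xavier's averaged fan terms) compensates for ReLU zeroing out half of its inputs, preserving activation variance layer to layer in ReLU networks:

```python
import numpy as np

# He/Kaiming normal: Var(W) = 2 / fan_in, matched to ReLU's halved variance.
def he_normal(fan_in, fan_out, seed=0):
    std = np.sqrt(2.0 / fan_in)
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, std, size=(fan_out, fan_in))

W = he_normal(512, 256)
```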