Interview Preparation
Sequence Models
Brief notes prepared for technical interviews
RNN / LSTM · Transformer · Attention Variants · Vision Transformer

These notes cover the evolution from recurrent to attention-based sequence models, the mechanics of self-attention, the variants used to control its cost, and the adaptation of Transformers to images via patch tokens.

RNN

A recurrent neural network processes a sequence one element at a time, keeping a hidden state that summarizes everything seen so far; at each step the new state is computed from the current input and the previous state:

\[h_t = f(W_x x_t + W_h h_{t-1} + b)\]
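
A minimal NumPy sketch of this recurrence, assuming tanh as the nonlinearity f; the names and sizes below are illustrative, not from the notes:

    import numpy as np

    def rnn_step(x_t, h_prev, W_x, W_h, b):
        # One recurrence step: h_t = tanh(W_x x_t + W_h h_{t-1} + b)
        return np.tanh(W_x @ x_t + W_h @ h_prev + b)

    rng = np.random.default_rng(0)
    d_in, d_h, T = 8, 16, 5                        # toy sizes (assumed)
    W_x = rng.normal(scale=0.1, size=(d_h, d_in))
    W_h = rng.normal(scale=0.1, size=(d_h, d_h))
    b = np.zeros(d_h)

    h = np.zeros(d_h)                              # initial hidden state
    for x_t in rng.normal(size=(T, d_in)):         # unroll over the sequence
        h = rnn_step(x_t, h, W_x, W_h, b)
    print(h.shape)                                 # (16,)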

Backpropagation Through Time (BPTT)

Gradients are computed by unrolling the recurrence over time and applying the chain rule backward through every step; in practice the unroll is usually truncated to a fixed number of steps to bound cost.

RNN Limitations

Sequential processing prevents parallelization across time steps, and repeated multiplication through the recurrence makes gradients vanish or explode, so long-range dependencies are hard to learn.

LSTM / GRU

The LSTM adds input, forget, and output gates around an additive cell state, giving gradients a path that avoids repeated multiplication; the GRU simplifies this to update and reset gates with fewer parameters.

RNN vs Transformer

RNN: tokens are processed one step at a time, so computation cannot be parallelized across the sequence and information about distant tokens must survive many recurrent updates.

Transformer: every token attends to every other token in parallel, giving direct paths between distant positions and much better hardware utilization, at the cost of attention that is quadratic in sequence length.

Transformer

The Transformer removes recurrence entirely: stacked encoder and decoder layers built from multi-head self-attention and position-wise feed-forward networks, with residual connections and layer normalization around each sub-layer.

Encoder

Each encoder layer applies multi-head self-attention over the whole input sequence followed by a position-wise feed-forward network; all positions are processed in parallel.

Decoder

Each decoder layer uses masked self-attention (a position may only attend to earlier positions), cross-attention over the encoder outputs, and a feed-forward network; generation is autoregressive, one token at a time.

Scaled Dot-Product Attention

Each query is compared with every key by a dot product; the softmax-normalized scores then weight the values:

\[\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V\]

Why divide by $\sqrt{d_k}$?

For roughly unit-variance entries, the dot product of two $d_k$-dimensional vectors has variance about $d_k$; without rescaling, large scores push the softmax into saturated regions with vanishing gradients, so dividing by $\sqrt{d_k}$ keeps the scores near unit variance.
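
A minimal NumPy sketch of the attention formula above; the toy shapes and the optional boolean mask argument are assumptions for illustration:

    import numpy as np

    def scaled_dot_product_attention(Q, K, V, mask=None):
        # softmax(Q K^T / sqrt(d_k)) V, with an optional boolean mask (True = may attend)
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                         # (n_q, n_k)
        if mask is not None:
            scores = np.where(mask, scores, -1e9)               # blocked positions get ~0 weight
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)          # row-wise softmax
        return weights @ V                                      # (n_q, d_v)

    rng = np.random.default_rng(0)
    Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
    print(scaled_dot_product_attention(Q, K, V).shape)          # (4, 8)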

Multi-Head Attention

Q, K, and V are projected into $h$ lower-dimensional subspaces, scaled dot-product attention runs in each head in parallel, and the head outputs are concatenated and linearly projected; different heads can specialize in different relations.
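
A sketch of the same idea with several heads, again with assumed toy sizes; a real implementation would also include masking and dropout:

    import numpy as np

    def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
        # Split projections into heads, attend per head, concatenate, project.
        n, d_model = X.shape
        d_head = d_model // n_heads
        def split(W):
            return (X @ W).reshape(n, n_heads, d_head).transpose(1, 0, 2)  # (heads, n, d_head)
        Q, K, V = split(W_q), split(W_k), split(W_v)
        scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)                # (heads, n, n)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        heads = weights @ V                                                # (heads, n, d_head)
        concat = heads.transpose(1, 0, 2).reshape(n, d_model)              # re-join the heads
        return concat @ W_o

    rng = np.random.default_rng(0)
    n, d_model, n_heads = 6, 32, 4                                         # toy sizes (assumed)
    W_q, W_k, W_v, W_o = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(4))
    X = rng.normal(size=(n, d_model))
    print(multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads).shape)      # (6, 32)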

Attention Complexity

Every query attends to every key, so time scales as $O(n^2 d)$ in sequence length $n$ and the attention matrix needs $O(n^2)$ memory; this quadratic cost is the main obstacle for long contexts.

Long-Context Attention Variants

Several families of methods reduce the quadratic cost: kernel approximations of softmax (linear attention), restricting attention to a local window (sliding window), fixed sparse attention patterns, and exact attention reorganized for memory efficiency (FlashAttention).

Linear Attention

Replacing the softmax similarity with a kernel feature map $\phi$ lets the product be re-associated as $\phi(Q)\,(\phi(K)^\top V)$, so the cost becomes linear in sequence length:

\[\operatorname{softmax}(Q K^\top) \approx \phi(Q)\, \phi(K)^\top\]
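
A sketch of the reordered computation; the feature map $\phi(x) = \mathrm{elu}(x) + 1$ and the per-query normalization are assumptions following a common linear-attention formulation:

    import numpy as np

    def linear_attention(Q, K, V):
        # phi(Q) (phi(K)^T V), computed without ever forming the n x n attention matrix
        phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))    # elu(x) + 1 (assumed feature map)
        Qp, Kp = phi(Q), phi(K)
        kv = Kp.T @ V                                          # (d_k, d_v): all keys/values summarized once
        z = Kp.sum(axis=0)                                     # normalizer phi(K)^T 1
        return (Qp @ kv) / (Qp @ z)[:, None]                   # O(n * d_k * d_v) overall

    rng = np.random.default_rng(0)
    Q, K, V = rng.normal(size=(6, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 4))
    print(linear_attention(Q, K, V).shape)                     # (6, 4)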

Sliding Window Attention

Each token attends only to a fixed-size window of neighboring tokens, cutting cost from $O(n^2)$ to $O(n \cdot w)$; stacking layers enlarges the effective receptive field. See the mask sketch below.
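
A sketch of the window constraint as a boolean mask (the window size is an assumed toy value); it can be passed as the mask argument of the scaled dot-product attention sketch above:

    import numpy as np

    def sliding_window_mask(n, window):
        # Position i may attend to position j only if |i - j| <= window
        idx = np.arange(n)
        return np.abs(idx[:, None] - idx[None, :]) <= window

    print(sliding_window_mask(6, 1).astype(int))
    # [[1 1 0 0 0 0]
    #  [1 1 1 0 0 0]
    #  ...
    #  [0 0 0 0 1 1]]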

Sparse / Flash Attention

Sparse attention limits each query to a fixed pattern of positions (for example local blocks plus a few global tokens). FlashAttention keeps attention exact but computes it in on-chip tiles, so the full $n \times n$ score matrix is never materialized in memory.

Positional Encoding

Sinusoidal Positional Encoding

Self-attention is permutation-invariant, so position information is injected by adding a fixed sinusoidal vector to each token embedding; each dimension oscillates at a different frequency:

\[\text{PE}_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad \text{PE}_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)\]
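
A NumPy sketch of the table of encodings; it assumes an even model dimension, and the sizes below are illustrative:

    import numpy as np

    def sinusoidal_positional_encoding(n_positions, d_model):
        # PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))
        pos = np.arange(n_positions)[:, None]                   # (n, 1)
        i = np.arange(d_model // 2)[None, :]                    # (1, d/2)
        angles = pos / np.power(10000.0, 2 * i / d_model)       # (n, d/2)
        pe = np.zeros((n_positions, d_model))
        pe[:, 0::2] = np.sin(angles)
        pe[:, 1::2] = np.cos(angles)
        return pe

    print(sinusoidal_positional_encoding(50, 16).shape)         # (50, 16)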

Rotary Positional Encoding (RoPE)

RoPE rotates each pair of query/key dimensions by an angle proportional to the token's position, so the attention dot product depends only on the relative offset between query and key:

\[\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}, \qquad z_i(p) = z_i \cdot e^{i \theta_i p}\]
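
A sketch that applies the pairwise rotation to a matrix of per-position vectors, assuming the standard frequencies $\theta_i = 10000^{-2i/d}$:

    import numpy as np

    def rope(x, base=10000.0):
        # Rotate each (x_{2i}, x_{2i+1}) pair of every position p by angle theta_i * p
        n, d = x.shape
        theta = base ** (-np.arange(0, d, 2) / d)        # (d/2,) per-pair frequencies
        angles = np.arange(n)[:, None] * theta[None, :]  # (n, d/2): theta_i * p
        cos, sin = np.cos(angles), np.sin(angles)
        x1, x2 = x[:, 0::2], x[:, 1::2]
        out = np.empty_like(x)
        out[:, 0::2] = x1 * cos - x2 * sin               # 2x2 rotation applied pairwise
        out[:, 1::2] = x1 * sin + x2 * cos
        return out

    rng = np.random.default_rng(0)
    q = rng.normal(size=(5, 8))                          # toy query matrix (assumed)
    print(rope(q).shape)                                 # (5, 8)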

Vision Transformer (ViT)

ViT applies a plain Transformer encoder to images: the image is split into fixed-size patches, each patch is linearly embedded into a token, positional information is added, and self-attention mixes information across all patches.

Key trade-off

ViT gives up the convolutional inductive biases of locality and translation equivariance, so it generally needs large-scale pretraining to match CNNs on smaller datasets, but it scales well with data and model size.

Patch Embedding

The image is split into fixed-size patches (e.g. 16×16 pixels); each patch is flattened and linearly projected to the model dimension, turning the image into a sequence of patch tokens:
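
A NumPy sketch of the reshape-and-project step; the image size, patch size, and model width below are assumed toy values:

    import numpy as np

    def patch_embed(image, patch_size, W, b):
        # Split an (H, W, C) image into patches, flatten each, and project to d_model
        H, Wd, C = image.shape
        p = patch_size
        patches = (image.reshape(H // p, p, Wd // p, p, C)
                        .transpose(0, 2, 1, 3, 4)
                        .reshape(-1, p * p * C))        # (num_patches, patch_dim)
        return patches @ W + b                          # (num_patches, d_model)

    rng = np.random.default_rng(0)
    img = rng.normal(size=(32, 32, 3))                  # toy 32x32 RGB image (assumed)
    W = rng.normal(scale=0.02, size=(8 * 8 * 3, 64))    # 8x8 patches -> d_model = 64
    print(patch_embed(img, 8, W, np.zeros(64)).shape)   # (16, 64)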

Positional Encoding

ViT typically adds learned 1D positional embeddings to the patch tokens; when the input resolution changes, the embeddings are interpolated to match the new number of patches.

[CLS] Token

A learnable [CLS] token is prepended to the patch sequence; its representation after the final layer serves as the image-level feature fed to the classification head.