- These notes were prepared while studying for technical interviews (e.g., Snap Inc., Krafton).
- Each entry contains a concise English summary, key math expressions, and excerpts from my original handwritten/typed study notes.
These notes cover the convolutional building blocks that power modern computer-vision architectures, and the residual learning idea that made training very deep networks possible.
Convolutional Neural Networks (CNN)

- convolution kernels with weight sharing
- local receptive field → local information
- translation equivariance
- parameter-efficient compared to fully connected layers
- same kernel → applied across spatial locations
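A minimal PyTorch sketch of the weight-sharing and parameter-efficiency points above (the channel counts and the 56×56 feature-map size are arbitrary, chosen only for the comparison):

```python
import torch
import torch.nn as nn

# One 3x3 kernel per (in, out) channel pair is reused at every spatial
# location (weight sharing), so parameters do not grow with image size.
conv = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1)
conv_params = sum(p.numel() for p in conv.parameters())  # 64*128*3*3 + 128 ≈ 74k

# A fully connected layer mapping the same 64x56x56 input to a 128x56x56
# output would need one weight per (input unit, output unit) pair.
fc_params = (64 * 56 * 56) * (128 * 56 * 56)  # ≈ 8 * 10^10

print(f"conv: {conv_params:,} params, fc: {fc_params:,} params")

x = torch.randn(1, 64, 56, 56)
print(conv(x).shape)  # torch.Size([1, 128, 56, 56])
```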
Downsampling


- reducing spatial resolution
- pooling
- avoid heavy pooling by using strided convolutions
- stride & padding: output size $= \lfloor (H_{\text{in}} + 2p - k_H) / s \rfloor + 1$ (checked in the sketch after this list)
- batch normalization → optimization ↑ (also acts as a regularizer)
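A quick numeric check of the output-size formula with a strided convolution (the 224×224 input, 7×7 kernel, stride 2, padding 3 values are just an example):

```python
import torch
import torch.nn as nn

H_in, k, s, p = 224, 7, 2, 3
H_out = (H_in + 2 * p - k) // s + 1  # floor((224 + 6 - 7) / 2) + 1 = 112

conv = nn.Conv2d(3, 64, kernel_size=k, stride=s, padding=p)  # strided downsampling
x = torch.randn(1, 3, H_in, H_in)
print(conv(x).shape, H_out)  # torch.Size([1, 64, 112, 112]) 112
```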
Convolution Variants

Standard convolution
- parameters: $C_{\text{in}} \cdot C_{\text{out}} \cdot k_H \cdot k_W$
Depthwise Convolution
- one $k_H \times k_W$ kernel per input channel (spatial mixing only, no channel mixing)
- parameters: $C_{\text{in}} \cdot k_H \cdot k_W$
Pointwise Convolution
- $1 \times 1$ convolution
- parameters: $C_{\text{in}} \cdot C_{\text{out}}$
Depthwise Separable Convolution
- depthwise + pointwise → mixing spatial and channel-wise computations cheaply
- parameters: $k_H k_W C_{\text{in}} + C_{\text{in}} C_{\text{out}}$
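A depthwise separable convolution sketch in PyTorch (channel counts are arbitrary); setting `groups=c_in` is what makes the first convolution depthwise:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        # Depthwise: one k x k kernel per input channel (groups=c_in),
        # spatial mixing only, no cross-channel mixing.
        self.depthwise = nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in)
        # Pointwise: 1x1 convolution that mixes channels.
        self.pointwise = nn.Conv2d(c_in, c_out, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

count = lambda m: sum(p.numel() for p in m.parameters())
sep = DepthwiseSeparableConv(64, 128)
std = nn.Conv2d(64, 128, 3, padding=1)
print(count(sep), count(std))  # ≈9.0k vs ≈73.9k parameters (incl. biases)
```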
Adaptive Average Pooling

- `nn.AdaptiveAvgPool2d((1, 1))` is commonly used to remove the spatial dimensions before an FC layer
- accepts variable input sizes
- produces a fixed-size representation
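A short sketch of why this gives a fixed-size representation: the same classifier head works for several input resolutions (the 512-channel / 10-class numbers are placeholders):

```python
import torch
import torch.nn as nn

head = nn.Sequential(
    nn.AdaptiveAvgPool2d((1, 1)),  # (N, C, H, W) -> (N, C, 1, 1) for any H, W
    nn.Flatten(),                  # (N, C)
    nn.Linear(512, 10),            # fixed-size FC classifier
)

for size in [(7, 7), (14, 14), (9, 13)]:
    x = torch.randn(2, 512, *size)
    print(head(x).shape)  # torch.Size([2, 10]) regardless of spatial size
```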
ResNet

- CNN architecture built from residual blocks
- enables training of very deep networks
Key idea
- learn a residual function instead of a full mapping
- output: $y = F(x) + x$
- where $F(x)$ is a stack of conv + norm + activation layers
Why it helps (intuition)
- if the optimal mapping is close to identity, it is easier to learn $F(x) \approx 0$
- gradients can flow through the skip path directly → vanishing gradient ↓
Typical block types
- Basic block: two $3 \times 3$ conv layers
- Bottleneck block: $1 \times 1$ → $3 \times 3$ → $1 \times 1$ (compute-efficient)
Common details
- if shapes differ, use a projection shortcut
- $y = F(x) + W_s x$ (e.g., $1 \times 1$ conv with stride)
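A sketch of a basic residual block with a projection shortcut (layer ordering follows the conv → norm → activation pattern above; exact details differ between ResNet variants):

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        # F(x): two 3x3 conv layers with batch norm.
        self.conv1 = nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(c_out)
        self.conv2 = nn.Conv2d(c_out, c_out, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(c_out)
        self.relu = nn.ReLU(inplace=True)
        # Projection shortcut W_s x when shapes differ (stride or channel change).
        if stride != 1 or c_in != c_out:
            self.shortcut = nn.Sequential(
                nn.Conv2d(c_in, c_out, 1, stride=stride, bias=False),
                nn.BatchNorm2d(c_out),
            )
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.shortcut(x))  # y = F(x) + W_s x

block = BasicBlock(64, 128, stride=2)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 128, 28, 28])
```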
Residual Connection


- add an identity (skip) path to the main transformation
Effects
- optimization becomes easier
- skip path provides a low-resistance gradient route
- improves conditioning
- reduces sensitivity to depth in practice
- allows deeper models without degradation
Where used
- ResNet blocks
- Transformers (residual around attention and MLP)
- Diffusion U-Nets (skip connections across scales)
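The same pattern outside CNNs, as a minimal sketch: a pre-norm residual wrapper around an MLP sublayer, roughly how Transformer blocks wrap attention and MLP (dimensions here are arbitrary):

```python
import torch
import torch.nn as nn

class ResidualMLP(nn.Module):
    """x + MLP(LayerNorm(x)): the skip path is a plain identity add."""
    def __init__(self, d_model=256, d_hidden=1024):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x):
        return x + self.mlp(self.norm(x))

x = torch.randn(4, 128, 256)   # (batch, tokens, d_model)
print(ResidualMLP()(x).shape)  # shape preserved: torch.Size([4, 128, 256])
```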