Momentum & Nesterov
Learn how momentum accelerates optimization by accumulating gradients over time
Introduction
Momentum is one of the most important innovations in optimization, transforming slow, oscillating SGD into a faster, smoother optimizer. Think of it like a ball rolling down a hill: it builds up speed in consistent directions and dampens oscillations. Nesterov momentum takes this further by "looking ahead" before making updates, providing even better convergence properties.
The Problem with Vanilla SGD
Standard SGD suffers from several issues:
- Slow convergence in directions with small but consistent gradients
- Oscillations in directions with large gradients
- Poor conditioning when different parameters have different scales
- Zigzagging behavior instead of taking direct paths to the minimum
Momentum addresses all of these problems elegantly.
How Momentum Works
The Physical Analogy
Imagine a ball rolling down a hill:
- Gravity (the gradient) pulls it toward the bottom
- Inertia (the velocity term) keeps it moving in the same direction
- Friction (controlled by 1 - β) gradually dissipates velocity, preventing runaway acceleration
Mathematical Formulation
Standard SGD:
θ = θ - α∇L(θ)
SGD with Momentum:
v = βv + ∇L(θ)
θ = θ - αv
Where:
- v is the velocity (accumulated gradients)
- β is the momentum coefficient (typically 0.9)
- α is the learning rate
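The update above can be sketched in a few lines of Python. This is a minimal 1-D illustration on the quadratic loss L(θ) = ½θ², whose gradient is simply θ; the step count and constants are illustrative, not tuned:

```python
# SGD with momentum on the 1-D quadratic L(theta) = 0.5 * theta**2.
def momentum_sgd(theta0, alpha=0.1, beta=0.9, steps=200):
    theta, v = theta0, 0.0
    for _ in range(steps):
        grad = theta            # dL/dtheta for L = 0.5 * theta**2
        v = beta * v + grad     # v = beta*v + grad: accumulate velocity
        theta = theta - alpha * v
    return theta

print(momentum_sgd(5.0))  # approaches the minimum at 0
```

Setting `beta=0.0` recovers plain SGD, which makes the two easy to compare side by side.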
Key Insights
- Acceleration: Consistent gradient directions build up velocity
- Damping: Oscillating gradients cancel out in the velocity
- Memory: Past gradients influence current updates
- Smoothing: Velocity provides a smoothed version of recent gradients
Nesterov Accelerated Gradient (NAG)
Nesterov momentum improves on standard momentum by "looking ahead":
Standard Momentum:
- Compute gradient at current position
- Update velocity with this gradient
- Move using the velocity
Nesterov Momentum:
- Look ahead to where momentum would take us
- Compute gradient at that lookahead position
- Update velocity with the lookahead gradient
- Move using the velocity
Mathematical Formulation
v = βv + ∇L(θ - αβv)
θ = θ - αv
The key difference: gradient is computed at θ - αβv (lookahead position) instead of θ.
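Continuing the 1-D quadratic sketch from above, the only change for Nesterov is where the gradient is evaluated (again, the constants are illustrative):

```python
# Nesterov momentum on the 1-D quadratic L(theta) = 0.5 * theta**2.
# The only change from standard momentum: the gradient is evaluated at
# the lookahead point theta - alpha*beta*v instead of at theta.
def nesterov_sgd(theta0, alpha=0.1, beta=0.9, steps=200):
    theta, v = theta0, 0.0
    for _ in range(steps):
        lookahead = theta - alpha * beta * v
        grad = lookahead                 # dL/dtheta at the lookahead point
        v = beta * v + grad
        theta = theta - alpha * v
    return theta

print(nesterov_sgd(5.0))  # approaches the minimum at 0
```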
Why Nesterov Works Better
- Anticipation: Sees where momentum is taking us before committing
- Correction: Can slow down before overshooting
- Stability: More stable near the minimum
- Theory: Better convergence guarantees
Interactive Demo
Experiment with different momentum settings:
- Compare Methods:
- None: See vanilla SGD behavior
- Standard: Watch momentum smooth the path
- Nesterov: Observe the lookahead advantage
- Adjust Momentum Coefficient:
- β = 0.0: No momentum (equivalent to SGD)
- β = 0.5: Light momentum
- β = 0.9: Strong momentum (typical default)
- β = 0.99: Very strong momentum
- Test Different Landscapes:
- Quadratic Bowl: See basic acceleration
- Elongated Valley: Watch momentum handle ill-conditioning
- Rosenbrock: Observe navigation of complex landscapes
- Saddle Point: See how momentum escapes saddle points
- Zigzag: Watch momentum smooth oscillatory behavior
Understanding the Visualizations
Optimization Path
- No Momentum: Zigzag pattern following gradients directly
- With Momentum: Smoother, more direct path to minimum
- Nesterov: Often takes slightly different, more efficient path
Velocity Evolution
- Buildup: Velocity grows in consistent gradient directions
- Oscillation: Velocity oscillates less than raw gradients
- Decay: Velocity decreases as gradients get smaller near minimum
Gradient vs Velocity Magnitude
- Early Training: Velocity magnitude grows as momentum builds
- Later Training: Velocity becomes smoother than raw gradients
- Convergence: Both decrease as minimum is approached
When Momentum Helps
Excellent for:
- Ill-conditioned problems (elongated valleys)
- Noisy gradients (stochastic settings)
- Saddle points (momentum helps escape)
- Long, consistent gradient directions
Less helpful for:
- Very noisy objectives (momentum can accumulate noise)
- Frequently changing optimal directions
- Very well-conditioned problems (already fast)
Choosing the Momentum Coefficient
β = 0.0 (No Momentum)
- Equivalent to standard SGD
- Use when momentum hurts performance
β = 0.5 (Light Momentum)
- Gentle acceleration
- Good for noisy or changing objectives
β = 0.9 (Standard Momentum)
- Most common choice
- Good balance of acceleration and stability
β = 0.99 (Heavy Momentum)
- Strong acceleration
- Risk of overshooting
- Good for very consistent gradients
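To see why larger β is so much more aggressive: if the gradient were constant at g, the velocity would converge to the geometric sum g·(1 + β + β² + ...) = g/(1 - β), so momentum amplifies the effective step size by 1/(1 - β). A short check (pure Python, values chosen for illustration):

```python
# With a constant gradient g, velocity converges to v_inf = g / (1 - beta),
# i.e. momentum multiplies the effective step size by 1 / (1 - beta).
def steady_state_velocity(g, beta, steps=1000):
    v = 0.0
    for _ in range(steps):
        v = beta * v + g
    return v

for beta in (0.5, 0.9, 0.99):
    print(beta, steady_state_velocity(1.0, beta))  # ~ 2, 10, 100
```

This is why β = 0.99 risks overshooting: it behaves like a 100x larger learning rate in consistent gradient directions.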
Momentum vs Other Methods
| Method | Acceleration | Memory | Complexity | When to Use |
|---|---|---|---|---|
| SGD | None | None | Lowest | Simple problems |
| Momentum | Yes | Velocity | Low | Most problems |
| Nesterov | Yes+ | Velocity | Low | When momentum works |
| Adam | Yes | 2 vectors | Medium | Adaptive needs |
Common Pitfalls
1. Too Much Momentum
- Problem: Overshooting, instability
- Solution: Reduce β or learning rate
2. Wrong Learning Rate
- Problem: Momentum amplifies learning rate effects
- Solution: Often need smaller learning rates with momentum
3. Ignoring Problem Structure
- Problem: Using momentum when gradients change direction frequently
- Solution: Monitor velocity vs gradient alignment
4. Not Giving It Time
- Problem: Momentum needs time to build up
- Solution: Run for more iterations to see benefits
Best Practices
1. Start with β = 0.9
- Works well for most problems
- Adjust based on performance
2. Reduce Learning Rate
- Momentum amplifies effective learning rate
- Often need α 2-10x smaller than SGD
3. Monitor Velocity
- Watch velocity evolution plots
- Ensure velocity aligns with desired direction
4. Combine with Other Techniques
- Learning rate scheduling: Reduce α over time
- Gradient clipping: Prevent velocity explosion
- Warmup: Start with low momentum, increase gradually
Theoretical Insights
Convergence Rates
- Gradient descent: O(1/k) on smooth convex functions
- Heavy-ball momentum: linear convergence on strongly convex quadratics, with a rate depending on √κ rather than κ
- Nesterov: O(1/k²) on smooth convex functions, the optimal rate for first-order methods
Optimal Momentum
For quadratic functions, optimal momentum is:
β* = (√κ - 1)/(√κ + 1)
where κ is the condition number.
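The formula is easy to evaluate; the κ values below are hypothetical examples to show how β* approaches 1 as conditioning worsens:

```python
import math

# Optimal heavy-ball momentum for a quadratic with condition number kappa:
# beta* = (sqrt(kappa) - 1) / (sqrt(kappa) + 1)
def optimal_momentum(kappa):
    s = math.sqrt(kappa)
    return (s - 1) / (s + 1)

for kappa in (1, 10, 100, 1000):
    print(kappa, optimal_momentum(kappa))
```

For a well-conditioned problem (κ = 1) the optimal momentum is 0, while κ = 100 already calls for β* ≈ 0.82, which is close to the β = 0.9 default.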
Relationship to Physical Systems
Momentum optimization corresponds to:
- Heavy ball method in continuous time
- Symplectic integrators in physics
- Second-order dynamics with friction
Advanced Variants
Adaptive Momentum
- Adjust β based on gradient consistency
- Higher β when gradients align
- Lower β when gradients change direction
Restart Schemes
- Reset momentum when progress stalls
- Helps escape from poor trajectories
- Useful for non-convex optimization
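One common restart criterion is function-value based: if the loss went up, zero the velocity and continue from the current point. A minimal sketch on the same 1-D quadratic used earlier (the loss and constants are illustrative):

```python
# Momentum with a simple function-value restart: if the loss increased,
# reset the velocity to zero and keep optimizing from the current point.
def momentum_with_restart(theta0, alpha=0.1, beta=0.95, steps=300):
    loss = lambda t: 0.5 * t * t
    theta, v, prev = theta0, 0.0, loss(theta0)
    for _ in range(steps):
        v = beta * v + theta          # gradient of the quadratic is theta
        theta = theta - alpha * v
        cur = loss(theta)
        if cur > prev:                # progress stalled: reset momentum
            v = 0.0
        prev = cur
    return theta

print(momentum_with_restart(5.0))  # approaches the minimum at 0
```

Gradient-based restarts (reset when the velocity points against the current gradient) are a popular alternative with similar behavior.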
Momentum with Variance Reduction
- Combine with SVRG or SAGA
- Reduces noise while maintaining acceleration
- Better theoretical guarantees
Implementation Tips
1. Initialize Velocity to Zero
- Standard practice
- Momentum builds up naturally
2. Apply Momentum to All Parameters
- Usually apply same β to all parameters
- Can use different β for different parameter groups
3. Save Velocity State
- Important for checkpointing
- Needed to resume training properly
4. Monitor Gradient-Velocity Alignment
- cos(angle) between gradient and velocity
- Should be positive for good performance
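The alignment check in tip 4 is a plain cosine similarity between the gradient and velocity vectors; a minimal standard-library version might look like:

```python
import math

# Cosine of the angle between the current gradient and velocity vectors.
# Values near +1 mean momentum is pushing in the gradient's direction;
# persistently negative values suggest momentum is fighting the gradient.
def alignment(grad, velocity):
    dot = sum(g * v for g, v in zip(grad, velocity))
    ng = math.sqrt(sum(g * g for g in grad))
    nv = math.sqrt(sum(v * v for v in velocity))
    if ng == 0.0 or nv == 0.0:
        return 0.0                    # undefined angle; report neutral
    return dot / (ng * nv)

print(alignment([1.0, 2.0], [2.0, 4.0]))  # parallel vectors: 1.0
print(alignment([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors: 0.0
```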
Further Reading
- On the Momentum Term in Gradient Descent Learning Algorithms (Qian, 1999)
- A Method for Solving the Convex Programming Problem with Convergence Rate O(1/k²) (Nesterov, 1983) - the original Nesterov paper
- Why Momentum Really Works (Goh, Distill, 2017)
- An Overview of Gradient Descent Optimization Algorithms (Ruder, 2016)
Key Takeaways
- Momentum accelerates optimization by accumulating gradients over time
- It smooths oscillations and speeds up convergence in consistent directions
- Nesterov momentum provides better convergence by looking ahead
- β = 0.9 is a good default momentum coefficient for most problems
- Momentum often requires smaller learning rates than vanilla SGD
- Visualization of velocity helps understand momentum dynamics
- Momentum is most beneficial for ill-conditioned and noisy optimization problems
- Understanding momentum is crucial for appreciating modern optimizers like Adam