SGD

Stochastic Gradient Descent (SGD) is the fundamental algorithm for training neural networks. It updates the model parameters θ using this rule:

θₜ₊₁ ← θₜ − αₜ ∇L(θₜ; Bₜ)

Where:

  • θₜ = current parameters
  • αₜ = learning rate at step t
  • ∇L(θₜ; Bₜ) = gradient of the loss on batch Bₜ
  • Bₜ = random batch of training data

It works like this:

  • Start with random parameters.
  • Each training step:
    • Sample a batch of data.
    • Compute how much the loss would change if you nudged each parameter (the gradient).
    • Move the parameters in the opposite direction of the gradient to reduce the loss.
    • Repeat 🔁
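The loop above can be sketched without any framework (pure Python; the toy one-parameter line-fit problem, `grad_fn`, and all constants here are illustrative assumptions, not from the notes):

```python
import random

def sgd(data, grad_fn, theta, lr=0.1, steps=200, batch_size=4):
    """Plain SGD: sample a random batch, average its gradients, step downhill."""
    for _ in range(steps):
        batch = random.sample(data, batch_size)                  # random batch of data
        g = sum(grad_fn(theta, x) for x in batch) / batch_size   # batch gradient
        theta -= lr * g                                          # move against the gradient
    return theta

# Toy problem: learn theta ≈ 3 from the loss (theta*x - 3*x)**2,
# whose gradient w.r.t. theta is 2*(theta*x - 3*x)*x.
random.seed(0)
data = [random.uniform(-1, 1) for _ in range(100)]
theta = sgd(data, lambda th, x: 2 * (th * x - 3 * x) * x, theta=0.0)
```

Each batch only gives a noisy estimate of the full-dataset gradient, but on average the steps still point downhill, so theta ends up near the minimizer.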

This is what it looks like in Python:

import torch

class SGD(torch.optim.Optimizer):
    def __init__(self, params, lr=1e-3):
        # Store parameters and hyperparameters
        super().__init__(params, {"lr": lr})

    def step(self):
        # Update each parameter opposite its gradient, scaled by the learning rate
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is not None:
                    p.data -= group["lr"] * p.grad.data

The Decaying Learning Rate Variant: implements SGD with learning rate decay; one common schedule is

αₜ ← α₀ / (1 + γt)

where α₀ is the initial learning rate and γ is the decay rate. This reduces the learning rate over time, taking smaller steps as training progresses.
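A minimal sketch of such a variant (pure Python; the 1/(1 + γt) schedule, the `decay` name, and the toy objective f(θ) = θ² are assumptions for illustration):

```python
def sgd_decay(grad, theta, lr0=0.3, decay=0.1, steps=100):
    """Gradient descent with a decaying learning rate: lr_t = lr0 / (1 + decay*t)."""
    for t in range(steps):
        lr_t = lr0 / (1 + decay * t)   # step size shrinks as training progresses
        theta -= lr_t * grad(theta)    # standard SGD update, with lr_t instead of lr
    return theta

# Toy objective f(theta) = theta**2, gradient 2*theta: theta decays toward 0.
theta = sgd_decay(lambda th: 2 * th, theta=10.0)
```

Early steps are large (fast initial progress); later steps are small (fine-tuning near the minimum).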

Points to Note:

  • SGD is “stochastic” because it uses random batches, not the full dataset
  • The learning rate controls the step size: too large causes instability, too small is slow
  • Modern optimizers like AdamW (below) build on SGD’s foundation but add momentum and adaptive learning rates
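The learning-rate tradeoff is easy to see on a toy quadratic (a sketch; the function and values are illustrative, not from the notes):

```python
def gd(lr, steps=20, theta=1.0):
    """Full-batch gradient descent on f(theta) = theta**2 (gradient 2*theta)."""
    for _ in range(steps):
        theta -= lr * 2 * theta   # each step multiplies theta by (1 - 2*lr)
    return theta

small = gd(lr=0.1)   # |1 - 2*lr| = 0.8 < 1: theta shrinks toward 0
large = gd(lr=1.1)   # |1 - 2*lr| = 1.2 > 1: theta blows up (instability)
```

For this problem any lr with |1 − 2·lr| < 1 converges; past that threshold each step overshoots the minimum by more than the last.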

AdamW

(meme image: “Anyone knows adam?”)

Adam is a more sophisticated optimizer. It maintains running estimates of both first and second moments of gradients, making it more stable and faster-converging than basic SGD.

m ← β₁m + (1 − β₁)g               # First moment (momentum)
v ← β₂v + (1 − β₂)g²              # Second moment (variance)
αₜ ← α · √(1 − β₂ᵗ) / (1 − β₁ᵗ)   # Bias correction
θ ← θ − αₜ · m / √(v + ε)         # Parameter update
θ ← θ − αλθ                       # Weight decay (AdamW's key difference)
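The update above can be sketched in plain Python for a single scalar parameter (all names are illustrative; the √(v + ε) placement follows the notation in these notes):

```python
import math

def adamw_step(theta, g, state, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One AdamW update for a scalar parameter; `state` holds m, v, and step t."""
    state["t"] += 1
    t = state["t"]
    # Running estimates of the gradient's first and second moments
    state["m"] = beta1 * state["m"] + (1 - beta1) * g
    state["v"] = beta2 * state["v"] + (1 - beta2) * g * g
    # Fold both bias corrections into the step size
    lr_t = lr * math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    # Adam step, then decoupled weight decay (the "W" in AdamW)
    theta -= lr_t * state["m"] / math.sqrt(state["v"] + eps)
    theta -= lr * weight_decay * theta
    return theta

# Minimize f(theta) = theta**2 (gradient 2*theta), starting from theta = 5.0
state = {"m": 0.0, "v": 0.0, "t": 0}
theta = 5.0
for _ in range(2000):
    theta = adamw_step(theta, 2 * theta, state, lr=0.1)
```

Note the last line of the update: the weight decay is applied directly to θ rather than folded into the gradient, which is what separates AdamW from Adam with L2 regularization.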

9/17/25