SGD

Stochastic Gradient Descent (SGD) is the fundamental algorithm for training neural networks. It updates the model parameters θ using this rule:

θₜ₊₁ ← θₜ − αₜ ∇L(θₜ; Bₜ)

Where:

  • θₜ = current parameters
  • αₜ = learning rate at step t
  • ∇L(θₜ; Bₜ) = gradient of the loss on batch Bₜ
  • Bₜ = random batch of training data

It works like this:

  • Start with random parameters.
  • Each training step:
    • Sample a batch of data.
    • Compute how much the loss would change if you nudged each parameter (the gradient).
    • Move the parameters in the opposite direction of the gradient to reduce the loss.
    • Repeat 🔁
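The loop above can be sketched without any framework (pure Python; the toy one-parameter line-fit problem, `grad_fn`, and all constants here are illustrative assumptions, not from the notes):

```python
import random

def sgd(data, grad_fn, theta, lr=0.1, steps=200, batch_size=4):
    """Plain SGD: sample a random batch, average its gradients, step downhill."""
    for _ in range(steps):
        batch = random.sample(data, batch_size)                  # random batch of data
        g = sum(grad_fn(theta, x) for x in batch) / batch_size   # batch gradient
        theta -= lr * g                                          # move against the gradient
    return theta

# Toy problem: learn theta ≈ 3 from the loss (theta*x - 3*x)**2,
# whose gradient w.r.t. theta is 2*(theta*x - 3*x)*x.
random.seed(0)
data = [random.uniform(-1, 1) for _ in range(100)]
theta = sgd(data, lambda th, x: 2 * (th * x - 3 * x) * x, theta=0.0)
```

Each batch only gives a noisy estimate of the full-dataset gradient, but on average the steps still point downhill, so theta ends up near the minimizer.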

This is what it looks like in Python:

import torch

class SGD(torch.optim.Optimizer):
    def __init__(self, params, lr=1e-3):
        # Store parameters and hyperparameters
        super().__init__(params, {"lr": lr})

    def step(self):
        # Update each parameter opposite its gradient, scaled by the learning rate
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is not None:
                    p.data -= group["lr"] * p.grad.data

The Decaying Learning Rate Variant: implements SGD with learning rate decay; one common schedule is

αₜ ← α₀ / (1 + γt)

where α₀ is the initial learning rate and γ is the decay rate. This reduces the learning rate over time, taking smaller steps as training progresses.
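A minimal sketch of such a variant (pure Python; the 1/(1 + γt) schedule, the `decay` name, and the toy objective f(θ) = θ² are assumptions for illustration):

```python
def sgd_decay(grad, theta, lr0=0.3, decay=0.1, steps=100):
    """Gradient descent with a decaying learning rate: lr_t = lr0 / (1 + decay*t)."""
    for t in range(steps):
        lr_t = lr0 / (1 + decay * t)   # step size shrinks as training progresses
        theta -= lr_t * grad(theta)    # standard SGD update, with lr_t instead of lr
    return theta

# Toy objective f(theta) = theta**2, gradient 2*theta: theta decays toward 0.
theta = sgd_decay(lambda th: 2 * th, theta=10.0)
```

Early steps are large (fast initial progress); later steps are small (fine-tuning near the minimum).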

Points to Note:

  • SGD is “stochastic” because it uses random batches, not the full dataset
  • The learning rate controls the step size: too large causes instability, too small is slow
  • Modern optimizers like AdamW (below) build on SGD’s foundation but add momentum and adaptive learning rates
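The learning-rate tradeoff is easy to see on a toy quadratic (a sketch; the function and values are illustrative, not from the notes):

```python
def gd(lr, steps=20, theta=1.0):
    """Full-batch gradient descent on f(theta) = theta**2 (gradient 2*theta)."""
    for _ in range(steps):
        theta -= lr * 2 * theta   # each step multiplies theta by (1 - 2*lr)
    return theta

small = gd(lr=0.1)   # |1 - 2*lr| = 0.8 < 1: theta shrinks toward 0
large = gd(lr=1.1)   # |1 - 2*lr| = 1.2 > 1: theta blows up (instability)
```

For this problem any lr with |1 − 2·lr| < 1 converges; past that threshold each step overshoots the minimum by more than the last.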

AdamW

(meme image: “Anyone knows adam?”)

Adam is a more sophisticated optimizer. It maintains running estimates of both first and second moments of gradients, making it more stable and faster-converging than basic SGD.

m ← β₁m + (1 − β₁)g               # First moment (momentum)
v ← β₂v + (1 − β₂)g²              # Second moment (variance)
αₜ ← α · √(1 − β₂ᵗ) / (1 − β₁ᵗ)   # Bias correction
θ ← θ − αₜ · m / √(v + ε)         # Parameter update
θ ← θ − αλθ                       # Weight decay (AdamW's key difference)
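The update above can be sketched in plain Python for a single scalar parameter (all names are illustrative; the √(v + ε) placement follows the notation in these notes):

```python
import math

def adamw_step(theta, g, state, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One AdamW update for a scalar parameter; `state` holds m, v, and step t."""
    state["t"] += 1
    t = state["t"]
    # Running estimates of the gradient's first and second moments
    state["m"] = beta1 * state["m"] + (1 - beta1) * g
    state["v"] = beta2 * state["v"] + (1 - beta2) * g * g
    # Fold both bias corrections into the step size
    lr_t = lr * math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    # Adam step, then decoupled weight decay (the "W" in AdamW)
    theta -= lr_t * state["m"] / math.sqrt(state["v"] + eps)
    theta -= lr * weight_decay * theta
    return theta

# Minimize f(theta) = theta**2 (gradient 2*theta), starting from theta = 5.0
state = {"m": 0.0, "v": 0.0, "t": 0}
theta = 5.0
for _ in range(2000):
    theta = adamw_step(theta, 2 * theta, state, lr=0.1)
```

Note the last line of the update: the weight decay is applied directly to θ rather than folded into the gradient, which is what separates AdamW from Adam with L2 regularization.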

9/17/25