SGD
Stochastic Gradient Descent (SGD) is the fundamental algorithm for training neural networks. It updates the model parameters θ using this rule:

θ ← θ - αₜ ∇L(θ; Bₜ)

Where:
- θ = current parameters
- αₜ = learning rate at step t
- ∇L(θ; Bₜ) = gradient of the loss on batch Bₜ
- Bₜ = random batch of training data
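To make the update rule concrete, here is a single step on the toy loss L(θ) = θ² (a hypothetical one-parameter example, where the gradient is 2θ):

```python
# One SGD step on the toy loss L(theta) = theta**2 (gradient = 2*theta).
theta = 4.0  # current parameter
lr = 0.1     # learning rate alpha_t

grad = 2 * theta           # dL/dtheta at theta = 4.0 -> 8.0
theta = theta - lr * grad  # move against the gradient: 4.0 - 0.1 * 8.0

print(theta)  # 3.2 -- the loss at 3.2 is lower than at 4.0
```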
It works like this:
- Start with random parameters.
- At each training step:
    - Sample a batch of data.
    - Compute how much the loss would change if you nudged each parameter (the gradient).
    - Move the parameters in the opposite direction of the gradient to reduce the loss.
- Repeat 🔁
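The loop above can be sketched end-to-end on a toy linear-regression problem (the data, shapes, and learning rate here are illustrative assumptions, not from the notes):

```python
import numpy as np

# A minimal sketch of the SGD loop on toy linear regression.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))            # toy inputs
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w                           # toy targets (noise-free)

w = rng.normal(size=3)                   # start with random parameters
lr = 0.1                                 # learning rate

for step in range(200):
    idx = rng.integers(0, len(X), size=32)    # sample a random batch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)  # gradient of batch MSE
    w -= lr * grad                              # step against the gradient

print(np.round(w, 2))  # w has converged close to true_w
```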
This is what it looks like in Python:

```python
import torch

class SGD(torch.optim.Optimizer):
    def __init__(self, params, lr=1e-3):
        # Store parameters and hyperparameters
        super().__init__(params, {"lr": lr})

    def step(self):
        # Update each parameter using its gradient
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is not None:
                    p.data -= group["lr"] * p.grad.data
```

The Decaying Learning Rate Variant adds learning rate decay on top of SGD.
This reduces the learning rate over time, taking smaller steps as training progresses.
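One way to sketch the decaying variant is below; the inverse-square-root schedule lr / √(t + 1) is an assumption for illustration, since the notes do not pin down the exact decay rule:

```python
import math
import torch

class SGDDecay(torch.optim.Optimizer):
    """SGD with a decaying learning rate (sketch; schedule is assumed)."""

    def __init__(self, params, lr=1e-3):
        super().__init__(params, {"lr": lr})
        self.t = 0  # global step counter

    def step(self):
        for group in self.param_groups:
            # Decayed step size: shrinks as training progresses.
            lr_t = group["lr"] / math.sqrt(self.t + 1)
            for p in group["params"]:
                if p.grad is not None:
                    p.data -= lr_t * p.grad.data
        self.t += 1
```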
Points to Note:
- SGD is “stochastic” because it uses random batches, not the full dataset
- The learning rate controls the step size: too large causes instability, too small makes training slow
- Modern optimizers like AdamW (below) build on SGD’s foundation but add momentum and adaptive learning rates
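The step-size point can be seen on the same toy loss L(θ) = θ², where the update θ ← θ - lr·2θ converges for small learning rates but blows up for large ones (a hypothetical illustration, not from the notes):

```python
# Gradient descent on L(theta) = theta**2: theta <- theta - lr * 2 * theta.
def run(lr, steps=20, theta=1.0):
    for _ in range(steps):
        theta = theta - lr * 2 * theta
    return theta

print(abs(run(0.4)))  # small lr: |theta| shrinks toward 0
print(abs(run(1.1)))  # too-large lr: |theta| grows every step (unstable)
```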
AdamW
Adam is a more sophisticated optimizer. It maintains running estimates of both first and second moments of gradients, making it more stable and faster-converging than basic SGD.
```
m ← β₁m + (1 - β₁)g              # First moment (momentum)
v ← β₂v + (1 - β₂)g²             # Second moment (variance)
αₜ ← α · √(1 - β₂ᵗ) / (1 - β₁ᵗ)  # Bias correction
θ ← θ - αₜ · m / (√v + ε)        # Parameter update
θ ← θ - αλθ                      # Weight decay (AdamW's key difference)
```

9/17/25
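The update rules can be sketched as a single-array function (hyperparameter defaults follow common Adam settings; this is an illustrative sketch, not a drop-in replacement for `torch.optim.AdamW`):

```python
import numpy as np

def adamw_step(theta, g, m, v, t, lr=1e-3,
               beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    """One AdamW step for parameter array theta with gradient g (t >= 1)."""
    m = beta1 * m + (1 - beta1) * g       # first moment (momentum)
    v = beta2 * v + (1 - beta2) * g**2    # second moment (variance)
    lr_t = lr * np.sqrt(1 - beta2**t) / (1 - beta1**t)  # bias correction
    theta = theta - lr_t * m / (np.sqrt(v) + eps)       # adaptive update
    theta = theta - lr * weight_decay * theta  # decoupled weight decay
    return theta, m, v
```

Note that the weight-decay term is applied directly to θ, decoupled from the adaptive scaling; Adam's original L2-penalty formulation would instead fold λθ into the gradient g.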