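Given a vector $a \in \mathbb{R}^{d_\text{model}}$ of activations¹, RMSNorm will rescale each activation $a_i$ as follows:

$$\text{RMSNorm}(a_i) = \frac{a_i}{\text{RMS}(a)} \, g_i$$

Where,

$$\text{RMS}(a) = \sqrt{\frac{1}{d_\text{model}} \sum_{i=1}^{d_\text{model}} a_i^2 + \varepsilon}$$

Here, $g_i$ is a learnable “gain” parameter (there are $d_\text{model}$ such parameters total). $\varepsilon$ is a hyperparameter that is often fixed at 1e-5. The $\frac{1}{d_\text{model}} \sum a_i^2$ part is just a mean function (the mean of the squared activations) by the way :P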
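A rough sketch of the rescaling in plain Python, assuming the standard RMSNorm definition: divide each activation by the square root of (mean of squared activations + eps), then multiply by that position's gain. The function name `rms_norm`, the list-based inputs, and initialising every gain to 1 are choices made for this example only, not something these notes specify.

```python
import math

def rms_norm(a, gain, eps=1e-5):
    """Apply RMSNorm to one activation vector (a plain list of floats)."""
    if len(a) != len(gain):
        raise ValueError("a and gain must both have length d_model")
    d_model = len(a)
    # Mean of the squared activations (this is the "mean function" part).
    mean_sq = sum(x * x for x in a) / d_model
    # eps sits inside the square root so an all-zero vector does not divide by 0.
    rms = math.sqrt(mean_sq + eps)
    # Rescale each activation by 1/RMS, then by its own learnable gain.
    return [x / rms * g for x, g in zip(a, gain)]

a = [1.0, 2.0, 3.0, 4.0]
gain = [1.0] * len(a)  # assumption for the example: gains start at 1
out = rms_norm(a, gain)
```

With all gains at 1, the output has a root mean square very close to 1 (slightly under, because of eps), and scaling the input by a constant barely changes the output. In a real model this would be done on tensors along the last dimension, with `gain` stored as a trainable parameter.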

9/12/25

Footnotes

  1. def. $a \in \mathbb{R}^{d_\text{model}}$. That is, a vector that exists in the $d_\text{model}$-dimensional space of real numbers, $\mathbb{R}^{d_\text{model}}$.