RMS Layer Normalization

Given a vector of activations $a \in \mathbb{R}^{d_{model}}$ ¹, RMSNorm rescales each activation $a_{i}$ as follows:

$$\text{RMSNorm}(a_{i}) = \frac{a_{i}}{\text{RMS}(a)} \, g_{i}$$

where

$$\text{RMS}(a) = \sqrt{\frac{1}{d_{model}} \sum_{i=1}^{d_{model}} a_{i}^{2} + \epsilon}$$

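Here, $g_{i}$ is a learnable “gain” parameter (there are $d_{model}$ such parameters in total), and $\epsilon$ is a hyperparameter that is often fixed at 1e-5. By the way, the $\frac{1}{d_{model}} \sum$ part is just the mean of the squared activations, which is where the "mean square" in RMS comes from :P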
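To make the formula concrete, here is a minimal pure-Python sketch of RMSNorm on a single activation vector. It assumes the standard definition, where RMS is the square root of the mean squared activation plus $\epsilon$. The function name `rms_norm`, the example values, and the choice to set every gain to 1 are illustrative assumptions, not something taken from a particular library.

```python
import math

def rms_norm(a, g, eps=1e-5):
    """Apply RMSNorm to a list of activations `a` with per-element gains `g`."""
    if len(a) != len(g):
        raise ValueError("a and g must both have length d_model")
    d_model = len(a)
    # Mean of the squared activations; eps is added for numerical stability.
    mean_sq = sum(x * x for x in a) / d_model
    rms = math.sqrt(mean_sq + eps)
    # Divide each activation by RMS(a), then scale by its gain g_i.
    return [x / rms * g_i for x, g_i in zip(a, g)]

a = [1.0, 2.0, 3.0, 4.0]
g = [1.0, 1.0, 1.0, 1.0]  # assumption: gains set to 1 for illustration
out = rms_norm(a, g)
```

With all gains equal to 1, the mean of the squared outputs comes out very close to 1, which is the point of the normalization. In a real model the same computation would be applied along the last axis of a tensor, with `g` stored as a learnable parameter.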

9/12/25

¹ That is, $a$ is a vector with $d_{model}$ real-valued components, where $d_{model}$ is the model dimension.

vishwiki
