Cross-entropy loss measures how far your model’s predictions are from the correct answers. For language modeling, at each position, the model predicts a probability distribution over the vocabulary, and you want high probability on the correct next token.

For each position i, the Transformer produces raw scores (logits) o_i ∈ R^vocab_size. We then apply softmax to get p(x_{i+1} | x_{1:i}) = softmax(o_i)[x_{i+1}], take the negative log-probability of the correct token, −log p(x_{i+1} | x_{1:i}), and finally sum over all positions and sequences and average. Mathematically, this looks as follows:
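The steps above can be sketched in a few lines of NumPy. This is a minimal illustration with made-up shapes (a toy vocabulary of 10 tokens and a sequence of 5 positions), not the efficient or numerically safe way to compute the loss:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, seq_len = 10, 5

# o_i for each position i: raw scores over the vocabulary
logits = rng.normal(size=(seq_len, vocab_size))
# x_{i+1}: index of the correct next token at each position
targets = rng.integers(0, vocab_size, size=seq_len)

# softmax over the vocabulary dimension -> p(x_{i+1} | x_{1:i})
exp_logits = np.exp(logits)
probs = exp_logits / exp_logits.sum(axis=-1, keepdims=True)

# negative log-probability of the correct token at each position
nll = -np.log(probs[np.arange(seq_len), targets])

# average over positions (and, with more sequences, over the batch)
loss = nll.mean()
print(loss)
```

Note that this naive version exponentiates raw logits, which overflows for large scores; the numerically stable formulation is discussed below.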

Recall that the Transformer language model defines a distribution p(x_{i+1} | x_{1:i}) for each sequence x of length m and each position 1 ≤ i ≤ m. Given a training set D consisting of sequences of length m, we define the standard cross-entropy (negative log-likelihood) loss function:

ℓ(D) = (1 / (|D| · m)) Σ_{x ∈ D} Σ_{i=1}^{m} −log p(x_{i+1} | x_{1:i})

(Note that a single forward pass in the Transformer yields p(x_{i+1} | x_{1:i}) for all i = 1, …, m.)

In particular, the Transformer computes logits o_i ∈ R^vocab_size for each position i, which results in:

p(x_{i+1} | x_{1:i}) = softmax(o_i)[x_{i+1}]

The cross-entropy loss is generally defined with respect to the vector of logits o_i ∈ R^vocab_size and the target x_{i+1}.

Implementing the cross-entropy loss requires some care with numerical issues, just as in the case of softmax.
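A standard way to handle this (a sketch, not the only option) is to never materialize the probabilities at all: since −log softmax(o_i)[x_{i+1}] = logsumexp(o_i) − o_i[x_{i+1}], one can compute the log-sum-exp with the usual max-subtraction trick. With logits as large as 1000, naively exponentiating would overflow to inf, while this version stays finite:

```python
import numpy as np

def stable_cross_entropy(logits: np.ndarray, targets: np.ndarray) -> float:
    """Mean cross-entropy from logits (seq_len, vocab_size) and target indices."""
    # log-sum-exp trick: subtract the per-row max before exponentiating
    m = logits.max(axis=-1, keepdims=True)
    lse = m.squeeze(-1) + np.log(np.exp(logits - m).sum(axis=-1))
    # -log softmax(o_i)[target] = logsumexp(o_i) - o_i[target]
    return float((lse - logits[np.arange(len(targets)), targets]).mean())

# extreme logits that would overflow a naive exp(logits)
logits = np.array([[1000.0, 0.0],
                   [0.0, 1000.0]])
targets = np.array([0, 1])  # each row puts all its mass on the correct token
loss = stable_cross_entropy(logits, targets)
print(loss)  # finite and effectively 0: the model is certain and correct
```

Subtracting the row maximum leaves the softmax (and hence the loss) unchanged, because softmax is invariant to adding a constant to all logits.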

9/17/25