Cross-entropy loss measures how far your model’s predictions are from the correct answers. For language modeling, at each position, the model predicts a probability distribution over the vocabulary, and you want high probability on the correct next token.

For each position i, the Transformer produces raw scores (logits) o_i ∈ R^vocab_size. We then apply softmax to get p(x_{i+1} | x_{1:i}) = softmax(o_i)[x_{i+1}], take the negative log-probability of the correct token, −log p(x_{i+1} | x_{1:i}), and finally sum over all positions and sequences and average. Mathematically, this looks as follows:
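The steps above can be sketched in a few lines of NumPy. This is a minimal illustration with made-up shapes (a toy vocabulary of 10 tokens and a sequence of 5 positions), not the efficient or numerically safe way to compute the loss:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, seq_len = 10, 5

# o_i for each position i: raw scores over the vocabulary
logits = rng.normal(size=(seq_len, vocab_size))
# x_{i+1}: index of the correct next token at each position
targets = rng.integers(0, vocab_size, size=seq_len)

# softmax over the vocabulary dimension -> p(x_{i+1} | x_{1:i})
exp_logits = np.exp(logits)
probs = exp_logits / exp_logits.sum(axis=-1, keepdims=True)

# negative log-probability of the correct token at each position
nll = -np.log(probs[np.arange(seq_len), targets])

# average over positions (and, with more sequences, over the batch)
loss = nll.mean()
print(loss)
```

Note that this naive version exponentiates raw logits, which overflows for large scores; the numerically stable formulation is discussed below.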

Recall that the Transformer language model defines a distribution p(x_{i+1} | x_{1:i}) for each sequence x of length m and each position 1 ≤ i ≤ m. Given a training set D consisting of sequences of length m, we define the standard cross-entropy (negative log-likelihood) loss function:

ℓ(D) = (1 / (|D| · m)) Σ_{x ∈ D} Σ_{i=1}^{m} −log p(x_{i+1} | x_{1:i})

(Note that a single forward pass in the Transformer yields p(x_{i+1} | x_{1:i}) for all i = 1, …, m.)

In particular, the Transformer computes logits o_i ∈ R^vocab_size for each position i, which results in:

p(x_{i+1} | x_{1:i}) = softmax(o_i)[x_{i+1}]

The cross-entropy loss is generally defined with respect to the vector of logits o_i ∈ R^vocab_size and the target x_{i+1}.

Implementing the cross-entropy loss requires some care with numerical issues, just as in the case of softmax.
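A standard way to handle this (a sketch, not the only option) is to never materialize the probabilities at all: since −log softmax(o_i)[x_{i+1}] = logsumexp(o_i) − o_i[x_{i+1}], one can compute the log-sum-exp with the usual max-subtraction trick. With logits as large as 1000, naively exponentiating would overflow to inf, while this version stays finite:

```python
import numpy as np

def stable_cross_entropy(logits: np.ndarray, targets: np.ndarray) -> float:
    """Mean cross-entropy from logits (seq_len, vocab_size) and target indices."""
    # log-sum-exp trick: subtract the per-row max before exponentiating
    m = logits.max(axis=-1, keepdims=True)
    lse = m.squeeze(-1) + np.log(np.exp(logits - m).sum(axis=-1))
    # -log softmax(o_i)[target] = logsumexp(o_i) - o_i[target]
    return float((lse - logits[np.arange(len(targets)), targets]).mean())

# extreme logits that would overflow a naive exp(logits)
logits = np.array([[1000.0, 0.0],
                   [0.0, 1000.0]])
targets = np.array([0, 1])  # each row puts all its mass on the correct token
loss = stable_cross_entropy(logits, targets)
print(loss)  # finite and effectively 0: the model is certain and correct
```

Subtracting the row maximum leaves the softmax (and hence the loss) unchanged, because softmax is invariant to adding a constant to all logits.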

9/17/25