idk, each token has a query, a key, and a value; you look at how closely the queries and keys align, then softmax to get weights :3
bruh idk
Source: Attention in transformers, visually explained | Chapter 6, Deep Learning, 3Blue1Brown
Scaled Dot Product Attention
We can now define the Attention operation mathematically as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

where $Q \in \mathbb{R}^{n \times d_k}$, $K \in \mathbb{R}^{m \times d_k}$, and $V \in \mathbb{R}^{m \times d_v}$. Here, $Q$, $K$, and $V$ are all inputs to this operation; note that these are the outputs of learnable linear transformations, not the parameters themselves.
The actual learnable parameters are the weight matrices that produce Q, K, V:
```python
Q = input @ W_Q  # W_Q is learnable
K = input @ W_K  # W_K is learnable
V = input @ W_V  # W_V is learnable
```

The reason it's `input @ W_Q` instead of `W_Q @ input` is, well, the same row-major vs column-major issue I've encountered with linear layers. The mathematical notation assumes column vectors, but this implementation uses row-major tensors. Haha
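To make the row-major version concrete, here's a minimal numpy sketch of scaled dot-product attention (the helper names `softmax` and `attention` are mine, not from any particular library):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating, for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q: (n, d_k) queries, K: (m, d_k) keys, V: (m, d_v) values.
    Row-major layout, hence Q @ K.T rather than the column-vector form.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)  # (n, m) alignment scores
    weights = softmax(scores, axis=-1)              # each row sums to 1
    return weights @ V                              # (n, d_v)
```

Each output row is a convex combination of the value rows, weighted by how well that query aligns with each key.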
Masking: It is sometimes convenient to mask¹ the output of an attention operation.
A mask should have the shape $(n, m)$, and each row of this boolean matrix indicates which keys the query should attend to. Canonically (and slightly confusingly), a value of True at position $(i, j)$ indicates that query $i$ does attend to key $j$, and a value of False indicates that query $i$ does not attend to key $j$. In other words, "information flows" at pairs with value True. So masks are like, the opposite of what a mask is, basically. For example, consider a $1 \times 3$ mask matrix with entries $[\text{True}, \text{True}, \text{False}]$. The single query vector attends only to the first two keys.
Masks are useful for a number of reasons. First, they prevent the model from “cheating” by seeing future tokens during training. They also prevent attention to meaningless padding tokens. We can also implement other constraints (like preventing attention across sentence boundaries). Neat!
Masks are not learned. Causal masking is always the same lower-triangular pattern (`tril()`); again, this prevents the model from seeing future tokens. Padding masks are based on which positions are padding in your input; this varies per batch but is not learned. There are also fixed rules like "only attend to tokens within 512 positions". Again, I reiterate: masks implement architectural constraints, not learned behaviors. Some advanced architectures do have learnable attention patterns, but standard transformer masks are fixed architectural choices.
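For instance, a causal mask for a length-4 sequence is just the lower-triangular boolean pattern (a numpy sketch; PyTorch's `torch.tril` works the same way):

```python
import numpy as np

seq_len = 4
# True on and below the diagonal: query i may attend only to keys j <= i,
# so no information flows from future tokens.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
```

Row $i$ has exactly $i+1$ True entries, matching the "each row says which keys this query may attend to" convention.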
The mask says "you CAN attend to these positions" but the learned weights determine "you SHOULD attend to these positions with this strength."
Computationally, it will be much more efficient to use masking than to compute attention on subsequences, and we can do this by taking the pre-softmax values and adding $-\infty$ to any entry whose mask value is False.
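A sketch of that trick (the function name is mine): set masked-out pre-softmax scores to $-\infty$, and since $e^{-\infty} = 0$, the softmax assigns them exactly zero weight:

```python
import numpy as np

def masked_softmax(scores, mask):
    # Where mask is False, replace the score with -inf; exp(-inf) = 0, so
    # those positions receive exactly zero attention weight.
    # (Each row needs at least one True entry, or the whole row is -inf.)
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)
```

With equal scores and one of three positions masked out, the remaining two positions split the weight evenly.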
Causal Multi-Head Self-Attention
We will implement multi-head self-attention as described in section 3.2.2 of Vaswani et al. (2017). Recall that, mathematically, the operation of applying multi-head attention is defined as follows:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h), \qquad \mathrm{head}_i = \mathrm{Attention}(Q_i, K_i, V_i)$$

with $Q_i$, $K_i$, $V_i$ being slice number $i$, of size $d_k = d_{\text{model}}/h$ or $d_v = d_{\text{model}}/h$ of the embedding dimension, for $Q$, $K$, and $V$ respectively, and with $\mathrm{Attention}$ being the scaled dot-product attention operation. From this we can form the multi-head self-attention operation:

$$\mathrm{MultiHeadSelfAttention}(x) = \mathrm{MultiHead}(x W_Q,\ x W_K,\ x W_V)\, W_O$$
Here, the learnable parameters are $W_Q$, $W_K$, $W_V$, and $W_O$. Since the $Q_i$'s, $K_i$'s, and $V_i$'s are sliced in the multi-head attention operation, we can think of $W_Q$, $W_K$, and $W_V$ as being separated for each head along the output dimension. When you have this working, you should be computing the key, value, and query projections in a total of three matrix multiplies.
This implementation should prevent the model from attending to future tokens in the sequence, and RoPE should be applied to the query and key vectors, but not the value vectors. Also, the head dimension should be handled as a batch dimension, because in multi-head attention, attention is being applied independently for each head. This means that precisely the same RoPE rotation should be applied to the query and key vectors for each head.
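Putting the pieces together, here's a numpy sketch of causal multi-head self-attention (RoPE omitted for brevity; all names are mine, not a reference implementation). Note the three total projection matmuls, and the head dimension handled as a batch dimension:

```python
import numpy as np

def multi_head_self_attention(x, W_Q, W_K, W_V, W_O, num_heads):
    """Causal multi-head self-attention sketch.

    x: (seq_len, d_model); each weight matrix is (d_model, d_model).
    Each of Q, K, V is computed in a single matmul, then the heads are
    split out and attention runs independently per head (batch dim).
    """
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(x @ W_Q), split_heads(x @ W_K), split_heads(x @ W_V)

    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (h, n, n)
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    scores = np.where(causal, scores, -np.inf)            # block future tokens
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    out = weights @ V                                       # (h, n, d_head)
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model)  # concat heads
    return out @ W_O
```

Because of the causal mask, the output at position $i$ depends only on inputs at positions $\le i$.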
8/30/24 9/16/25
Footnotes
- def. A mask in attention is a boolean matrix that controls which tokens can attend to which other tokens. ↩