idk, each token has a query, a key, and a value; you look at how closely the queries and keys align, then softmax to get weights :3
bruh idk
Source: Attention in transformers, visually explained | Chapter 6, Deep Learning, 3Blue1Brown
Scaled Dot Product Attention
We can now define the Attention operation mathematically as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

where $Q \in \mathbb{R}^{n \times d_k}$, $K \in \mathbb{R}^{m \times d_k}$, and $V \in \mathbb{R}^{m \times d_v}$. Here, $Q$, $K$, and $V$ are all inputs to this operation; note that these are the outputs of learnable linear transformations, not the parameters themselves.
The actual learnable parameters are the weight matrices that produce Q, K, V:
```python
Q = input @ W_Q  # W_Q is learnable
K = input @ W_K  # W_K is learnable
V = input @ W_V  # W_V is learnable
```

The reason it's `input @ W_Q` instead of `W_Q @ input` is, well, the same row-major vs column-major issue I've encountered with linear layers. The mathematical notation assumes column vectors, but this implementation uses row-major tensors. Haha
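To make the row-major version concrete, here's a minimal numpy sketch of scaled dot-product attention (the helper names `softmax` and `attention` are mine, not from any particular library):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating, for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q: (n, d_k) queries, K: (m, d_k) keys, V: (m, d_v) values.
    Row-major layout, hence Q @ K.T rather than the column-vector form.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)  # (n, m) alignment scores
    weights = softmax(scores, axis=-1)              # each row sums to 1
    return weights @ V                              # (n, d_v)
```

Each output row is a convex combination of the value rows, weighted by how well that query aligns with each key.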
Masking: It is sometimes convenient to mask¹ the output of an attention operation.
A mask should have the shape $(n, m)$, and each row of this boolean matrix indicates which keys the query should attend to. Canonically (and slightly confusingly), a value of True at position $(i, j)$ indicates that query $i$ does attend to key $j$, and a value of False indicates that query $i$ does not attend to key $j$. In other words, "information flows" at pairs with value True. So masks are like, the opposite of what a mask is, basically. For example, consider a $1 \times 3$ mask matrix with entries $[\text{True}, \text{True}, \text{False}]$. The single query vector attends only to the first two keys.
Masks are useful for a number of reasons. First, they prevent the model from “cheating” by seeing future tokens during training. They also prevent attention to meaningless padding tokens. We can also implement other constraints (like preventing attention across sentence boundaries). Neat!
Masks are not learned. Causal masking is always the same lower-triangular pattern (`tril()`); again, this prevents the model from seeing future tokens. Padding masks are based on which positions are padding in your input; this varies per batch but is not learned. There are also fixed rules like "only attend to tokens within 512 positions". Again, I reiterate: masks implement architectural constraints, not learned behaviors. Some advanced architectures do have learnable attention patterns, but standard transformer masks are fixed architectural choices.
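For instance, a causal mask for a length-4 sequence is just the lower-triangular boolean pattern (a numpy sketch; PyTorch's `torch.tril` works the same way):

```python
import numpy as np

seq_len = 4
# True on and below the diagonal: query i may attend only to keys j <= i,
# so no information flows from future tokens.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
```

Row $i$ has exactly $i+1$ True entries, matching the "each row says which keys this query may attend to" convention.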
The mask says "you CAN attend to these positions" but the learned weights determine "you SHOULD attend to these positions with this strength."
Computationally, it will be much more efficient to use masking than to compute attention on subsequences, and we can do this by taking the pre-softmax values and adding $-\infty$ to any entry whose mask value is False.
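A sketch of that trick (the function name is mine): set masked-out pre-softmax scores to $-\infty$, and since $e^{-\infty} = 0$, the softmax assigns them exactly zero weight:

```python
import numpy as np

def masked_softmax(scores, mask):
    # Where mask is False, replace the score with -inf; exp(-inf) = 0, so
    # those positions receive exactly zero attention weight.
    # (Each row needs at least one True entry, or the whole row is -inf.)
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)
```

With equal scores and one of three positions masked out, the remaining two positions split the weight evenly.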
Causal Multi-Head Self-Attention
We will implement multi-head self-attention as described in section 3.2.2 of Vaswani et al. (2017). Recall that, mathematically, the operation of applying multi-head attention is defined as follows:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h), \qquad \mathrm{head}_i = \mathrm{Attention}(Q_i, K_i, V_i)$$

with $Q_i$, $K_i$, $V_i$ being slice number $i$, of size $d_k = d_{\text{model}}/h$ or $d_v = d_{\text{model}}/h$ of the embedding dimension, for $Q$, $K$, and $V$ respectively, and with $\mathrm{Attention}$ being the scaled dot-product attention operation. From this we can form the multi-head self-attention operation:

$$\mathrm{MultiHeadSelfAttention}(x) = \mathrm{MultiHead}(x W_Q,\ x W_K,\ x W_V)\, W_O$$
Here, the learnable parameters are $W_Q$, $W_K$, $W_V$, and $W_O$. Since the $Q_i$'s, $K_i$'s, and $V_i$'s are sliced in the multi-head attention operation, we can think of $W_Q$, $W_K$, and $W_V$ as being separated for each head along the output dimension. When you have this working, you should be computing the key, value, and query projections in a total of three matrix multiplies.
This implementation should prevent the model from attending to future tokens in the sequence, and RoPE should be applied to the query and key vectors, but not the value vectors. Also, the head dimension should be handled as a batch dimension, because in multi-head attention, attention is being applied independently for each head. This means that precisely the same RoPE rotation should be applied to the query and key vectors for each head.
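Putting the pieces together, here's a numpy sketch of causal multi-head self-attention (RoPE omitted for brevity; all names are mine, not a reference implementation). Note the three total projection matmuls, and the head dimension handled as a batch dimension:

```python
import numpy as np

def multi_head_self_attention(x, W_Q, W_K, W_V, W_O, num_heads):
    """Causal multi-head self-attention sketch.

    x: (seq_len, d_model); each weight matrix is (d_model, d_model).
    Each of Q, K, V is computed in a single matmul, then the heads are
    split out and attention runs independently per head (batch dim).
    """
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(x @ W_Q), split_heads(x @ W_K), split_heads(x @ W_V)

    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (h, n, n)
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    scores = np.where(causal, scores, -np.inf)            # block future tokens
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    out = weights @ V                                       # (h, n, d_head)
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model)  # concat heads
    return out @ W_O
```

Because of the causal mask, the output at position $i$ depends only on inputs at positions $\le i$.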
8/30/24 9/16/25
Footnotes
- def. A mask in attention is a boolean matrix that controls which tokens can attend to which other tokens. ↩