Position-Wise Feed-Forward Network
The original Transformer paper had the feed-forward network consist of two linear transformations with a ReLU activation between them. We're not going to do that, instead taking after more modern LLMs (such as LLaMA 3 and Qwen 2.5) by combining the SiLU (often called Swish) activation with a gating mechanism called a Gated Linear Unit (GLU). This gets us SwiGLU. We will also omit the bias terms sometimes used in linear layers, following most modern LLMs since PaLM. The SiLU or Swish activation function is defined as follows:

$$\mathrm{SiLU}(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}}$$
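Numerically, SiLU is a one-liner. Here's a minimal NumPy sketch (the function name `silu` is mine):

```python
import numpy as np

def silu(x):
    # SiLU/Swish: x * sigmoid(x). Smooth everywhere, unlike ReLU's kink at 0.
    return x / (1.0 + np.exp(-x))

x = np.array([-10.0, 0.0, 10.0])
# For large positive x, SiLU(x) ~ x (like ReLU); SiLU(0) = 0;
# for large negative x it decays smoothly toward 0 from below
# rather than clamping hard to 0 the way ReLU does.
out = silu(x)
```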
Note that it's similar to the ReLU function but is smooth at zero. Now, Gated Linear Units (GLUs) were originally defined by Dauphin et al. (2017) as the element-wise product of a linear transformation passed through a sigmoid function and another linear transformation:

$$\mathrm{GLU}(x, W, V) = \sigma(xW) \odot xV$$
Note that $\odot$ represents element-wise multiplication.
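A GLU is just two matrix multiplies and an element-wise product; a minimal NumPy sketch (names are mine, and biases are dropped, following the no-bias convention above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glu(x, W, V):
    # GLU: sigmoid(x @ W) gates the linear branch x @ V element-wise.
    # Bias terms omitted, matching the no-bias convention used here.
    return sigmoid(x @ W) * (x @ V)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4))   # (batch, d_in)
W = rng.normal(size=(4, 6))   # gate projection
V = rng.normal(size=(4, 6))   # value projection
out = glu(x, W, V)            # shape (2, 6)
```

When the gate's pre-activation is very negative, the sigmoid saturates near zero and the corresponding value-branch entries are suppressed, which is the gating behavior in action.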
Gated Linear Units are suggested to “reduce the vanishing gradient problem for deep architectures by providing a linear path for the gradients while retaining non-linear capabilities.” Putting the SiLU/Swish and the GLU together, we get SwiGLU, which we will use for our feed-forward networks:

$$\mathrm{FFN}_{\mathrm{SwiGLU}}(x) = \left(\mathrm{SiLU}(x W_1) \odot x W_3\right) W_2$$
where $W_1, W_3 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$ and $W_2 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$.
$d_{\text{ff}}$, btw, merely means the inner feed-forward dimension, typically larger than $d_{\text{model}}$. For SwiGLU the common convention is $d_{\text{ff}} \approx \tfrac{8}{3} d_{\text{model}}$, which keeps the parameter count roughly equal to that of a standard two-matrix FFN with $d_{\text{ff}} = 4 d_{\text{model}}$, since SwiGLU carries three weight matrices instead of two.
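Putting the pieces together, a bias-free SwiGLU feed-forward block is only a few lines. This is a NumPy sketch (not a trainable module); the weight names follow the $W_1, W_2, W_3$ in the equation above, and the specific dimensions are illustrative:

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, W1, W2, W3):
    # FFN(x) = (SiLU(x @ W1) * (x @ W3)) @ W2, with no bias terms.
    return (silu(x @ W1) * (x @ W3)) @ W2

d_model, d_ff = 12, 32                 # d_ff roughly (8/3) * d_model
rng = np.random.default_rng(0)
x  = rng.normal(size=(3, d_model))     # (sequence, d_model)
W1 = rng.normal(size=(d_model, d_ff))  # SiLU branch
W3 = rng.normal(size=(d_model, d_ff))  # gate's linear branch
W2 = rng.normal(size=(d_ff, d_model))  # down-projection
y = swiglu_ffn(x, W1, W2, W3)          # output keeps shape (3, d_model)
```

Note the shape round-trip: the input is projected up to $d_{\text{ff}}$, gated element-wise, and projected back down to $d_{\text{model}}$, so the block can be dropped into a residual stream unchanged.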
9/12/25