Position-Wise Feed-Forward Network
The original Transformer paper had the feed-forward network consist of two linear transformations with a ReLU activation between them. We're not going to do that, instead taking after more modern LLMs (such as LLaMA 3 and Qwen 2.5) by combining the SiLU (often called Swish) activation with a gating mechanism called a Gated Linear Unit (GLU). This gets us SwiGLU. We will also omit the bias terms sometimes used in linear layers, following most modern LLMs since PaLM. The SiLU or Swish activation function is defined as follows:

$$\mathrm{SiLU}(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}}$$
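Numerically, SiLU is a one-liner. Here's a minimal NumPy sketch (the function name `silu` is mine):

```python
import numpy as np

def silu(x):
    # SiLU/Swish: x * sigmoid(x). Smooth everywhere, unlike ReLU's kink at 0.
    return x / (1.0 + np.exp(-x))

x = np.array([-10.0, 0.0, 10.0])
# For large positive x, SiLU(x) ~ x (like ReLU); SiLU(0) = 0;
# for large negative x it decays smoothly toward 0 from below
# rather than clamping hard to 0 the way ReLU does.
out = silu(x)
```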
Note that it's similar to the ReLU function but is smooth at zero. Now, Gated Linear Units (GLUs) were originally defined by Dauphin et al. (2017) as the element-wise product of a linear transformation passed through a sigmoid function and another linear transformation:

$$\mathrm{GLU}(x, W, V) = \sigma(xW) \odot xV$$
Note that $\odot$ represents element-wise multiplication.
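A GLU is just two matrix multiplies and an element-wise product; a minimal NumPy sketch (names are mine, and biases are dropped, following the no-bias convention above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glu(x, W, V):
    # GLU: sigmoid(x @ W) gates the linear branch x @ V element-wise.
    # Bias terms omitted, matching the no-bias convention used here.
    return sigmoid(x @ W) * (x @ V)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4))   # (batch, d_in)
W = rng.normal(size=(4, 6))   # gate projection
V = rng.normal(size=(4, 6))   # value projection
out = glu(x, W, V)            # shape (2, 6)
```

When the gate's pre-activation is very negative, the sigmoid saturates near zero and the corresponding value-branch entries are suppressed, which is the gating behavior in action.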
Gated Linear Units are suggested to “reduce the vanishing gradient problem for deep architectures by providing a linear path for the gradients while retaining non-linear capabilities.” Putting the SiLU/Swish and the GLU together, we get SwiGLU, which we will use for our feed-forward networks:

$$\mathrm{FFN}_{\mathrm{SwiGLU}}(x) = \left(\mathrm{SiLU}(x W_1) \odot x W_3\right) W_2$$
where $W_1, W_3 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$ and $W_2 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$.
$d_{\text{ff}}$, btw, merely means the inner feed-forward dimension, typically larger than $d_{\text{model}}$. For SwiGLU the common convention is $d_{\text{ff}} \approx \tfrac{8}{3} d_{\text{model}}$, which keeps the parameter count roughly equal to that of a standard two-matrix FFN with $d_{\text{ff}} = 4 d_{\text{model}}$, since SwiGLU carries three weight matrices instead of two.
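Putting the pieces together, a bias-free SwiGLU feed-forward block is only a few lines. This is a NumPy sketch (not a trainable module); the weight names follow the $W_1, W_2, W_3$ in the equation above, and the specific dimensions are illustrative:

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, W1, W2, W3):
    # FFN(x) = (SiLU(x @ W1) * (x @ W3)) @ W2, with no bias terms.
    return (silu(x @ W1) * (x @ W3)) @ W2

d_model, d_ff = 12, 32                 # d_ff roughly (8/3) * d_model
rng = np.random.default_rng(0)
x  = rng.normal(size=(3, d_model))     # (sequence, d_model)
W1 = rng.normal(size=(d_model, d_ff))  # SiLU branch
W3 = rng.normal(size=(d_model, d_ff))  # gate's linear branch
W2 = rng.normal(size=(d_ff, d_model))  # down-projection
y = swiglu_ffn(x, W1, W2, W3)          # output keeps shape (3, d_model)
```

Note the shape round-trip: the input is projected up to $d_{\text{ff}}$, gated element-wise, and projected back down to $d_{\text{model}}$, so the block can be dropped into a residual stream unchanged.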
9/12/25