The Transformer Architecture Explained: The 15-Page Paper That Changed AI Forever
The paper that introduced the transformer architecture has over 93,000 citations. GPT, BERT, Claude, Gemini, LLaMA, Stable Diffusion: nearly every major AI system in production today builds, at least in part, on a design first described by eight Google researchers in 2017. The paper was 15 pages long. It replaced the dominant architecture of the previous decade.
That’s not how machine learning typically works. Architectures don’t get replaced. They get refined, extended, combined. The transformer didn’t get combined with RNNs. It made them obsolete.
Understanding why requires understanding what was actually wrong with recurrent networks, what the transformer replaced them with, and why those specific design decisions were made the way they were, because the “why” is what most explanations skip.
What Was Actually Broken About RNNs
Before the transformer, the dominant approach for any task involving sequences — translation, summarization, language modeling — was the recurrent neural network, specifically LSTMs and GRUs. The architecture worked by processing sequences one token at a time, left to right, maintaining a hidden state that carried information forward.
The problem wasn’t that RNNs were wrong. The problem was structural. Because each step depended on the output of the previous step, the computation was fundamentally sequential. You couldn’t compute position 10 until you’d finished computing position 9. This meant training couldn’t be parallelized meaningfully across the length of a sequence. At longer sequence lengths, this became a hard bottleneck — memory constraints limited batching, and training times scaled badly.
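To make the sequential dependency concrete, here is a minimal sketch of a recurrent update. It assumes a toy scalar hidden state and made-up weights (`w_h` and `w_x` are illustrative constants, not values from any real model). The point is only that the loop body for step t cannot start until step t-1 has produced `h`.

```python
import math

def rnn_forward(inputs, w_h=0.5, w_x=1.0):
    """Run a toy scalar RNN; w_h and w_x are illustrative constants."""
    h = 0.0
    states = []
    for x in inputs:                      # one step per token, strictly in order
        h = math.tanh(w_h * h + w_x * x)  # step t needs h from step t-1
        states.append(h)
    return states

states = rnn_forward([0.1, 0.2, 0.3, 0.4])
print(len(states))  # one hidden state per input token
```

No amount of extra hardware lets you compute the last state without first computing every state before it, which is the bottleneck the paragraph above describes.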
There was a second, subtler problem. When a model processes a long sequence step by step, information from early tokens has to travel through every subsequent hidden state to influence a prediction at the end. This creates long dependency paths. In practice, learning relationships between words far apart in a sentence was genuinely difficult because the gradient signal had to travel the full length of the sequence backward through time. LSTMs were designed to address this, and they helped, but the fundamental path length problem remained.
The transformer paper quantified this directly. In a recurrent layer, both the number of sequential operations and the maximum path length between any two positions are O(n): they scale with sequence length. In a self-attention layer, both are O(1). Every position can directly attend to every other position in a single operation. That’s the core insight the paper built on.
Self-Attention: What It Actually Computes
Self-attention is the mechanism that allows a transformer to directly relate any two positions in a sequence, regardless of how far apart they are. The paper’s formulation is precise and worth understanding in full.
At its core, attention is a function that takes a query and a set of key-value pairs and produces an output. The output is a weighted sum of the values, where the weight for each value is determined by how compatible its key is with the query.
The specific variant the paper introduces is called Scaled Dot-Product Attention:
Attention(Q, K, V) = softmax(QKᵀ / √dk) · V
Where Q is the matrix of queries, K is the matrix of keys, V is the matrix of values, and dk is the dimension of the keys.
The matrix multiplication QKᵀ produces a score for every pair of positions in the sequence — a measure of how relevant each position is to every other position. These scores are passed through softmax to get a probability distribution, then used to weight a sum over the values V.
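The formula is short enough to write out by hand. The following is a rough sketch in plain Python lists rather than a tensor library, to keep it dependency-free. It assumes Q, K, and V are already given as lists of row vectors. In a real transformer they come from learned projections of the input, which are omitted here, and the toy input `X` is made up.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Score this query against every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)  # one probability distribution per query
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Toy self-attention: three positions, d_k = 2, with Q = K = V for brevity.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = attention(X, X, X)
```

Each row of `Y` is a blend of all three rows of `X`, weighted by how strongly that position's query matches each key.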
The scaling factor — dividing by √dk — is not decorative. The paper explains the reason explicitly: when dk is large, the dot products grow large in magnitude. Large dot products push the softmax function into regions where its gradients become extremely small. You’re essentially in a near-flat region of the function, which makes training slow or unstable. Dividing by √dk keeps the dot products in a range where gradients remain useful. This is one of those decisions that looks arbitrary until you understand the gradient mechanics behind it, and it’s the kind of thing that only shows up clearly in the original paper.
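A small numeric sketch shows the effect. The paper notes that if the components of a query and key are independent with mean 0 and variance 1, their dot product has variance dk, so with dk = 64 a raw score of magnitude 8 is only one standard deviation out. The three scores below are hypothetical values chosen on that basis, not numbers from the paper.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64
raw_scores = [8.0, 0.0, -8.0]  # hypothetical unscaled dot products
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]

p_raw = softmax(raw_scores)
p_scaled = softmax(scaled_scores)

# For softmax, d p_i / d s_i = p_i * (1 - p_i); it vanishes as p_i nears 1.
grad_raw = p_raw[0] * (1 - p_raw[0])
grad_scaled = p_scaled[0] * (1 - p_scaled[0])
print(p_raw[0], grad_raw)
print(p_scaled[0], grad_scaled)
```

Without scaling, almost all of the probability mass lands on one position and the gradient through the softmax is close to zero. With scaling, the distribution stays soft enough to learn from.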
Why Multiple Heads?
A single attention operation gives you one perspective on the sequence. But natural language — and most sequences — contains multiple simultaneous patterns. Word order. Syntax. Coreference. Semantic similarity. A single attention function, averaging over all of these, loses the ability to capture them separately.
Multi-head attention addresses this by running the attention function multiple times in parallel, each time with different learned projections of the queries, keys, and values. In the original paper, the model uses 8 parallel attention heads, each operating in a 64-dimensional subspace (the full model dimension is 512, divided by 8 heads).
The outputs from all 8 heads are concatenated and projected back to the full model dimension. The total computational cost is roughly equivalent to single-head attention at full dimensionality, because each individual head works at lower dimension.
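The dimension bookkeeping can be checked with simple arithmetic. The sketch below counts projection weights (ignoring biases) for the 8-head setup and for a hypothetical single head at full dimensionality. The `split_heads` and `concat_heads` helpers are illustrative names. They mimic a common implementation shortcut of slicing one projected vector into per-head chunks, whereas the paper describes each head as having its own projection matrices.

```python
d_model = 512
num_heads = 8
d_head = d_model // num_heads  # 64 dimensions per head

def split_heads(vector, num_heads):
    """Split one d_model-sized vector into num_heads equal chunks."""
    size = len(vector) // num_heads
    return [vector[i * size:(i + 1) * size] for i in range(num_heads)]

def concat_heads(chunks):
    """Concatenate per-head outputs back into one d_model-sized vector."""
    return [x for chunk in chunks for x in chunk]

# Q, K, V projections for every head, plus the final output projection.
multi_head_params = num_heads * 3 * d_model * d_head + d_model * d_model
# One head working at full dimensionality, with the same output projection.
single_head_params = 3 * d_model * d_model + d_model * d_model

token = [float(i) for i in range(d_model)]
chunks = split_heads(token, num_heads)
print(multi_head_params == single_head_params)
```

The two weight counts come out equal, which is one way to see why the cost is roughly the same as a single full-width head.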
The key claim in the paper: multi-head attention “allows the model to jointly attend to information from different representation subspaces at different positions.” A single head can’t do this because averaging collapses those subspaces together. With multiple heads, the model can simultaneously track syntactic structure in one head, semantic similarity in another, and positional proximity in a third.
The Problem Attention Doesn’t Solve By Itself
Self-attention, as described above, is completely order-agnostic. The score between position 1 and position 10 is computed the same way as the score between position 1 and position 2. Without positional information, the sequences “the cat sat on the mat” and “mat the on sat cat the” would produce the same attention scores for each pair of words, just in a different order, so nothing in the computation would reflect which word came first.
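This property is easy to check on a toy example. The sketch below assumes made-up two-dimensional embeddings with no positional information and uses the raw embeddings as queries, keys, and values, with no learned projections. Shuffling the input rows shuffles the output rows in exactly the same way and changes nothing else.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X."""
    d_k = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in X]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, X))
                    for j in range(d_k)])
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up embeddings, no positions
order = [2, 0, 1]
shuffled = [tokens[i] for i in order]

out = self_attention(tokens)
out_shuffled = self_attention(shuffled)

# Row i of the shuffled output should match row order[i] of the original output.
same = all(
    math.isclose(a, b, abs_tol=1e-12)
    for i, src in enumerate(order)
    for a, b in zip(out_shuffled[i], out[src])
)
print(same)
```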
This matters because transformers process the full input simultaneously, not sequentially. Unlike an RNN, there’s no inherent ordering in how tokens are processed. Order information has to be injected explicitly.
The solution is positional encoding. Before the input sequence is processed, a vector is added to each token’s embedding that encodes its position. The paper uses sinusoidal functions with different frequencies:
PE(pos, 2i) = sin(pos / 10000^(2i / dmodel))
PE(pos, 2i+1) = cos(pos / 10000^(2i / dmodel))
Here pos is the position and i indexes the dimension. The sinusoidal choice was deliberate, and the paper makes the design reasoning explicit. For any fixed offset k, PE(pos + k) can be expressed as a linear function of PE(pos), which the authors hypothesized would make it easy for the model to learn to attend by relative positions. More practically, because the encoding is computed from continuous functions rather than looked up in a learned embedding table, it is defined for any position: the encoding for position 1000 can be computed even if the training data only went to position 512. The paper suggests this may allow the model to extrapolate to sequence lengths longer than those seen during training, though it also reports that learned positional embeddings produced nearly identical results in its experiments.
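The two formulas translate almost directly into code. The sketch below assumes an even `d_model`. The `shift` helper is an illustrative addition, not something from the paper. It applies the fixed rotation that takes the encoding at one position to the encoding k positions later, which is one concrete reading of the "linear function" property.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding for one position (d_model assumed even)."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle))  # dimension 2i
        pe.append(math.cos(angle))  # dimension 2i + 1
    return pe

def shift(pe, k, d_model):
    """Map PE(pos) to PE(pos + k) using a rotation that depends only on k."""
    out = []
    for i in range(d_model // 2):
        w = 1 / (10000 ** (2 * i / d_model))
        s, c = pe[2 * i], pe[2 * i + 1]
        out.append(s * math.cos(k * w) + c * math.sin(k * w))
        out.append(c * math.cos(k * w) - s * math.sin(k * w))
    return out

pe_0 = positional_encoding(0, 8)
pe_1000 = positional_encoding(1000, 8)  # computable for any position
shifted = shift(positional_encoding(5, 8), 3, 8)
direct = positional_encoding(8, 8)
```

`shifted` and `direct` agree up to floating-point error, and `shift` never looks at the starting position, only at the offset.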
The Encoder-Decoder Structure
The full transformer architecture follows an encoder-decoder design, which was already common in sequence-to-sequence tasks before the paper. The transformer’s contribution was replacing every recurrent component in that design with attention.
The encoder takes the input sequence and produces a rich contextual representation. It’s composed of 6 identical layers stacked on top of each other. Each layer has two sub-layers: a multi-head self-attention mechanism, and a position-wise feed-forward network. Each sub-layer is wrapped with a residual connection and layer normalization.
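The paper writes the wrapping of each sub-layer as LayerNorm(x + Sublayer(x)). Here is a minimal sketch of that pattern for a single position. It leaves out layer normalization's learned gain and bias, the `eps` value is an assumed default, and `toy_feed_forward` is a purely illustrative stand-in for a real sub-layer.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize one vector to zero mean and unit variance (no gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer):
    """Compute LayerNorm(x + Sublayer(x)) for one position."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def toy_feed_forward(x):
    """Stand-in for a real sub-layer; the doubling is purely illustrative."""
    return [2.0 * v for v in x]

y = add_and_norm([1.0, 2.0, 3.0, 4.0], toy_feed_forward)
```

The residual path means the sub-layer only has to learn a correction to its input, and the normalization keeps the scale of activations steady from layer to layer.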
The decoder generates the output sequence token by token. It’s also 6 stacked layers, but each decoder layer has three sub-layers, not two. The extra sub-layer is what makes the decoder different: it performs cross-attention over the encoder’s output. The queries come from the decoder’s previous layer; the keys and values come from the encoder’s output. This is how the decoder reads and focuses on the input while generating each output token.
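In code, cross-attention is the same attention function with its arguments drawn from two different places. The sketch below uses made-up vectors for the encoder output and decoder states and skips the learned projections. Note that the result has one row per decoder position, not per encoder position.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Made-up values: 4 source positions and 2 target positions.
encoder_output = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
decoder_states = [[0.2, 0.8], [0.9, 0.1]]

# Queries from the decoder; keys and values from the encoder.
context = attention(decoder_states, encoder_output, encoder_output)
```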
The decoder also uses masked self-attention in its first sub-layer. During training, the decoder is given the full target sequence at once, but the prediction for position i should depend only on the outputs at positions before i, never on position i itself or anything after it. Without the mask, the model could “cheat” by looking at future tokens. The paper prevents this in two ways: the decoder’s input is the target sequence shifted right by one position, and the mask sets the attention scores for illegal positions to negative infinity before the softmax, so their attention weights come out as zero. This is what preserves the auto-regressive property: each output position is generated based only on previously generated outputs.
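The mask itself is only a few lines. This sketch uses 0-based indices and made-up raw scores. Row i may attend to columns j ≤ i, which, because the decoder input is the target shifted right by one position, corresponds to using only the outputs that come before the one being predicted.

```python
import math

def softmax(row):
    """Numerically stable softmax; exp(-inf) evaluates to 0.0."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_mask(scores):
    """Set scores[i][j] to -inf wherever j > i (a future position)."""
    n = len(scores)
    return [
        [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        for i in range(n)
    ]

scores = [[0.5, 1.0, 0.2], [0.1, 0.3, 0.9], [0.4, 0.4, 0.4]]  # made-up raw scores
weights = [softmax(row) for row in causal_mask(scores)]
for row in weights:
    print(row)
```

Every weight above the diagonal comes out as exactly zero, and each row still sums to one over the positions it is allowed to see.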
The Three Variants That Came Out of It
Understanding the original encoder-decoder architecture makes BERT and GPT easier to understand — because they’re not new architectures. They’re deliberate simplifications of the original.
BERT (Bidirectional Encoder Representations from Transformers) uses only the encoder stack. Because the encoder attends to the full input sequence in both directions simultaneously, BERT is deeply bidirectional. This makes it excellent for tasks where understanding the full context of an input matters: classification, named entity recognition, question answering. What BERT can’t do naturally is generate text, because it has no decoder and no auto-regressive generation mechanism.
GPT uses only the decoder stack, with its masked self-attention. Because each position can only attend to previous positions, GPT is naturally set up for left-to-right text generation. Pre-train it on next-token prediction, and you get a model that can generate coherent sequences. GPT gives up bidirectionality to gain generative capability.
The original translation model uses both stacks. The encoder reads the source language; the decoder generates the target language, attending to the encoder’s output via cross-attention at every step.
These aren’t three different ideas. They’re three different ways to use the same core components depending on what task you need.
What the Paper’s Results Actually Showed
The transformer’s performance numbers from the original paper deserve a second look. On WMT 2014 English-to-German translation, the model achieved a BLEU score of 28.4, improving over the previous best results, including ensembles, by more than 2 BLEU points. On English-to-French, it hit 41.8 BLEU, a new single-model state of the art at the time.
The training cost is what makes these numbers striking. The English-to-French model trained in 3.5 days on 8 GPUs. The paper reports this as a small fraction of the training cost of the best previous models, which were recurrent or convolutional. The transformer was both better and faster to train.
The paper also tested the model on English constituency parsing, a structurally different task from translation, to show the architecture generalizes beyond the specific problem it was designed for. It performed well there too.
The full paper is worth reading directly: Attention Is All You Need (Vaswani et al., 2017). The architecture section is dense but precise, and the “Why Self-Attention” section explains the design tradeoffs in terms of path length, computational cost, and parallelization in a way that no blog explanation fully captures.
What the Architecture Doesn’t Do
One thing worth being clear-eyed about: the original transformer architecture as described in the paper has quadratic memory complexity in the sequence length. The attention matrix has to store a score for every pair of positions, which scales as O(n²). For the translation tasks in the paper, this wasn’t a problem. For modern language models with context windows in the tens or hundreds of thousands of tokens, it became the central engineering challenge.
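Some back-of-the-envelope arithmetic shows how quickly this grows. The sketch assumes the full score matrix is stored explicitly in 32-bit floats, for a single head in a single layer. The two sequence lengths are illustrative choices, not figures from the paper.

```python
def attention_matrix_bytes(n, bytes_per_score=4):
    """Bytes to store one full n x n score matrix (one head, one layer)."""
    return n * n * bytes_per_score

small = attention_matrix_bytes(512)      # a translation-scale sequence
large = attention_matrix_bytes(100_000)  # a long-context sequence

print(small / 2**20, "MiB")
print(large / 10**9, "GB")
```

Growing the sequence by a factor of about 200 grows the matrix by a factor of about 40,000, which is why the quadratic term dominates at long context lengths.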
Sparse attention, linear attention approximations, and various other modifications have addressed this, but those are modifications of the original design. The fundamental architecture remains: embeddings plus positional encoding, stacked layers of self-attention and feed-forward networks, with residual connections and layer normalization throughout.
For a deeper look at how the attention mechanism specifically computes relationships between tokens and why the query-key-value framework is the right abstraction, the attention mechanism post goes into that in more detail.
FAQ
What is the transformer architecture used for?
The transformer architecture is used as the foundation for virtually every large-scale natural language processing system, including language models, translation systems, summarization tools, and code generation models. It’s also been extended to computer vision (Vision Transformer), protein structure prediction (AlphaFold), and audio generation. The architecture’s core strengths — parallel computation and direct modeling of long-range dependencies — make it broadly applicable wherever sequences or structured relationships need to be processed.
What is the difference between the transformer encoder and decoder?
The encoder processes the full input sequence simultaneously and produces a rich contextual representation that captures relationships between all tokens. The decoder generates output tokens auto-regressively, one at a time, using masked self-attention to prevent looking at future positions, and cross-attention to read from the encoder’s output. BERT uses only the encoder; GPT uses only the decoder; the original translation model uses both.
Why did transformers replace RNNs?
Two main reasons. First, RNNs process sequences sequentially, which prevents parallelization during training and creates a hard bottleneck at long sequence lengths. Transformers process the full input simultaneously, enabling far more efficient use of modern parallel hardware. Second, RNNs require information to travel through every intermediate hidden state to relate distant positions, making long-range dependencies genuinely difficult to learn. Self-attention directly connects any two positions in a single operation, regardless of distance, reducing the maximum path length for gradient signals from O(n) to O(1).
The transformer didn’t succeed because it was clever. It succeeded because it identified the right constraints — parallelization and path length — and made design decisions that directly attacked them. The scaling factor in attention, the sinusoidal positional encoding, the masking in the decoder: each of these was a deliberate response to a specific problem in the architecture. That’s what makes the paper still worth reading eight years later. The decisions are explained, not just presented.

