How Tokenizers Are Built
The Tokenization page explained what a token is: the discrete unit an LLM actually reads. This page is about the tokenizer that produces them: how it's built, the three approaches you can take, and why nearly every modern LLM lands on the same one, subword tokenization built with Byte-Pair Encoding (BPE) or a close variant.
Two phases: training and inference
A tokenizer has two distinct moments in its life: the training phase, where its vocabulary is built once, and the inference phase, where that fixed vocabulary is used over and over.
Training phase
During training you feed the tokenizer a huge pile of cleaned text. It splits that text into small units, then collects every unique unit into a vocabulary — a table that maps each token to an integer ID.
The first step is text splitting: break the long stream of text into smaller pieces. The second is building the vocabulary: gather the unique pieces and assign each one an ID. Every unique unit is what we call a token, and the token-to-ID table is the vocabulary.
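The two training steps can be sketched in a few lines. This is a minimal illustration, not how production tokenizers are trained: the regular expression, the sorted ID assignment, and the `build_vocab` name are all assumptions made for the example.

```python
import re

def build_vocab(corpus: str) -> dict[str, int]:
    """Split text into units, then map each unique unit to an integer ID."""
    # Step 1, text splitting: runs of word characters, or single punctuation marks.
    units = re.findall(r"\w+|[^\w\s]", corpus)
    # Step 2, building the vocabulary: unique units, numbered in sorted order.
    return {unit: idx for idx, unit in enumerate(sorted(set(units)))}

vocab = build_vocab("the cat sat. the cat ran.")
print(vocab)
```

The result is only a lookup table. Nothing in it captures what the units mean, which is why the model still has to be trained separately.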
Don't confuse training the tokenizer with training the model. Building the vocabulary just produces that lookup table. The model is trained separately — it learns statistical patterns between token IDs by optimizing billions of neural-network parameters.
Inference phase
Once the vocabulary exists, it's frozen. At inference time the tokenizer just applies it — in both directions:
When you send text to an LLM, the tokenizer encodes it into a sequence of token IDs. The model turns those IDs into embedding vectors, runs its computation, and predicts the next token ID one at a time. The tokenizer then decodes the generated IDs back into text.
A tokenizer is reversible: encoding turns text into IDs, decoding turns IDs back into text. The same vocabulary drives both directions. The model itself never works with words — only with the numeric IDs and their embeddings.
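A toy round trip makes the two directions concrete. The three-entry vocabulary and the `encode`/`decode` helpers below are hypothetical, chosen only so the example stays self-contained.

```python
import re

# Toy, hypothetical vocabulary: token -> ID, plus the reverse table ID -> token.
TOY_VOCAB = {"Perfectly": 0, " ": 1, "fine": 2}
ID_TO_TOKEN = {token_id: token for token, token_id in TOY_VOCAB.items()}

def encode(text: str) -> list[int]:
    # Split into runs of non-whitespace and runs of whitespace, then look up each piece.
    pieces = re.findall(r"\S+|\s+", text)
    return [TOY_VOCAB[piece] for piece in pieces]

def decode(ids: list[int]) -> str:
    # Map each ID back to its token and concatenate.
    return "".join(ID_TO_TOKEN[token_id] for token_id in ids)

ids = encode("Perfectly fine")
text = decode(ids)
print(ids, repr(text))
```

Both functions read from the same table, so decoding the encoded IDs returns the original string.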
Three families of tokenizer
The part that actually differs between LLMs is the splitting algorithm: how text becomes tokens. There are three broad approaches: word-level, character-level, and subword-level.
Word-level and character-level tokenizers are rarely used to train modern frontier models. Almost every current LLM uses subword-level tokenization — the other two have drawbacks that make them a poor fit at scale.
Word-level
A word-level tokenizer splits text on whitespace and punctuation, so each token is a whole word. Perfectly fine becomes [Perfectly, fine], then maps to IDs like [52141, 7060].
It looks natural, but the vocabulary explodes. The internet contains an enormous number of distinct words — across languages, plus jargon, company and product names, personal names, URLs, typos, and code. The vocabulary balloons to hundreds of thousands or even millions of entries:
| Token | ID |
|---|---|
| a | 0 |
| about | 1 |
| after | 2 |
| … | … |
| zebra | 270030 |
| … | … |
| ! | 270131 |
A huge vocabulary inflates memory: the embedding matrix and output layer must cover every token. Worse is the out-of-vocabulary (OOV) problem — the model can hit a word it never saw in training and has no way to represent it.
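The OOV problem is easy to reproduce with a toy word-level tokenizer. In this sketch the vocabulary is made up, and the `<UNK>` entry is an assumption: a common workaround that maps every unseen word to one catch-all ID, which loses the word entirely.

```python
import re

def word_tokenize(text: str) -> list[str]:
    # Split into words and standalone punctuation marks; whitespace is dropped.
    return re.findall(r"\w+|[^\w\s]", text)

# Toy vocabulary; "<UNK>" is a hypothetical catch-all entry for unseen words.
vocab = {"<UNK>": 0, "Perfectly": 1, "fine": 2, "!": 3}

def encode_words(text: str) -> list[int]:
    # Unknown words fall back to the <UNK> ID, so their identity is not preserved.
    return [vocab.get(token, vocab["<UNK>"]) for token in word_tokenize(text)]

known = encode_words("Perfectly fine!")
unseen = encode_words("Perfectly tokenizable!")
print(known, unseen)
```

Every word missing from the table collapses to the same ID, so the model cannot tell one unseen word from another.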
Character-level
A character-level tokenizer splits text into individual characters. Perfectly fine becomes 14 tokens: [P, e, r, f, e, c, t, l, y, (space), f, i, n, e]. Each character — letter, digit, space, punctuation, or any Unicode symbol — gets an ID.
| Token | ID |
|---|---|
| a | 0 |
| b | 1 |
| … | … |
| A | 26 |
| … | … |
| ! | 57 |
| <SPACE> | 105 |
The vocabulary is tiny and there's no OOV problem. But sequences get very long — 14 tokens where word-level used 2. Long sequences mean far more computation, since the Transformer has to process and relate every token. The model also has to learn how characters combine into words and how those words carry meaning, which makes learning harder.
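The length difference can be checked directly. The sketch below also squares each length as a rough proxy for cost, on the assumption that standard self-attention compares every token with every other token; real costs depend on the architecture.

```python
text = "Perfectly fine"

char_tokens = list(text)     # one token per character, including the space
word_tokens = text.split()   # whitespace split, for comparison

# A character vocabulary only needs the distinct characters that occur.
char_vocab = {ch: idx for idx, ch in enumerate(sorted(set(text)))}
char_ids = [char_vocab[ch] for ch in char_tokens]

# Rough proxy (assumption): full self-attention relates every pair of tokens.
char_pairs = len(char_tokens) ** 2
word_pairs = len(word_tokens) ** 2

print(len(char_tokens), len(word_tokens), char_pairs, word_pairs)
```

For this two-word string the sequence is 7 times longer, and the pairwise comparison count is 49 times larger.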
Subword-level
Subword-level tokenization is the compromise, and today's standard. Frequent words get their own token; rarer or unseen words are broken into smaller subword pieces. Perfectly fine might become [Perfect, ly, fine].
| Token | ID |
|---|---|
| the | 0 |
| of | 1 |
| home | 2 |
| … | … |
| ##ing | 50252 |
| ##ed | 50253 |
| ##able | 50254 |
| <EOS> | 50255 |
| <SPACE> | 50256 |
The ## prefix (a convention borrowed from WordPiece) marks a piece that continues the previous token, so walking → [walk, ##ing]. This gives a moderate vocabulary, manageable sequence lengths, and graceful handling of brand-new words.
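One way to split a word with such a vocabulary is greedy longest-match-first, the strategy WordPiece-style tokenizers use. The sketch below assumes a tiny hand-written vocabulary and a hypothetical `<UNK>` fallback; it is not the splitting rule BPE uses, which replays learned merges instead.

```python
def subword_tokenize(word: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match split; non-initial pieces carry the ## prefix."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: fall back to a hypothetical unknown token.
            return ["<UNK>"]
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"walk", "talk", "##ing", "##ed", "##able"}
for word in ["walking", "talked", "walkable"]:
    print(word, subword_tokenize(word, toy_vocab))
```

"walkable" never appears in the vocabulary as a whole word, yet it is still representable from pieces that do.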
| Approach | Vocabulary size | Sequence length | Unknown words | Used for frontier models |
|---|---|---|---|---|
| Word-level | Very large (100K–millions) | Short | Breaks (OOV) | Rarely |
| Character-level | Tiny (~hundreds) | Very long | No OOV, but no word sense | Rarely |
| Subword-level | Moderate (50K–200K) | Moderate | Composed from subwords | Standard |
Modern LLMs typically use vocabularies between 50,000 and 200,000 tokens: GPT-2 used ~50K, GPT-4's cl100k ~100K, and GPT-4o's o200k ~200K. It's a deliberate trade-off between vocabulary size and sequence length.
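The vocabulary-size side of that trade-off can be put into rough numbers. The embedding matrix has one row per token, so its size grows linearly with the vocabulary. The hidden size of 4096 and the 2 bytes per parameter below are assumptions picked for illustration, not the configuration of any particular model.

```python
def embedding_bytes(vocab_size: int, d_model: int, bytes_per_param: int = 2) -> int:
    """Size of a vocab_size x d_model embedding matrix in bytes."""
    return vocab_size * d_model * bytes_per_param

D_MODEL = 4096  # hypothetical hidden size

for label, vocab_size in [
    ("~50K", 50_000),
    ("~100K", 100_000),
    ("~200K", 200_000),
    ("word-level, 1M", 1_000_000),
]:
    megabytes = embedding_bytes(vocab_size, D_MODEL) / 1e6
    print(f"{label:>16}: {megabytes:,.0f} MB")
```

An output layer that also covers every token adds a matrix of the same shape unless the weights are shared with the embedding.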
Byte-Pair Encoding, step by step
The most popular algorithm for building a subword vocabulary is Byte-Pair Encoding (BPE). (WordPiece and SentencePiece are close relatives.) BPE was originally a text-compression algorithm; Sennrich, Haddow & Birch adapted it to NLP tokenization, and OpenAI adopted it for tokenizing the GPT models.
The key idea: start from individual characters, then repeatedly merge the most frequent adjacent pair into a new, larger token.
Say the training data contains these words:
low
lower
lowest
BPE first represents each word as a sequence of characters:
low → l o w
lower → l o w e r
lowest → l o w e s t
It then scans the whole corpus and counts how often each adjacent pair occurs. The goal isn't to find whole words; it's to find patterns that repeat often. The pairs (l, o) and (o, w) each appear in all three words, so they tie for most frequent. Taking (l, o) first, it's merged into a new token lo:
lo w
lo w e r
lo w e s t
It recounts pairs and merges again. If (lo, w) is now the most frequent pair, it becomes low:
low
low e r
low e s t
This repeats thousands, sometimes tens of thousands, of times over huge amounts of text, until the vocabulary reaches its target size. Very frequent patterns earn their own tokens. Sometimes those are whole words (low, house, computer); sometimes they're word-parts (ing, tion, ment, able).
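The count-and-merge loop above can be sketched as follows. This is a simplified, character-based version for illustration: the function names are made up, ties are broken by first occurrence, and real implementations add pre-tokenization, byte-level handling, and much faster data structures.

```python
from collections import Counter

def count_pairs(words: dict[tuple[str, ...], int]) -> Counter:
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(words: dict[tuple[str, ...], int], pair: tuple[str, str]) -> dict[tuple[str, ...], int]:
    """Replace every occurrence of `pair` with a single merged symbol."""
    left, right = pair
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus_words: list[str], num_merges: int):
    """Learn up to `num_merges` merges, starting from single characters."""
    words = dict(Counter(tuple(word) for word in corpus_words))
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # on a tie, the first pair counted wins
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

merges, words = train_bpe(["low", "lower", "lowest"], num_merges=2)
print(merges)
print(list(words))
```

With two merges on the three example words, the learned merges are (l, o) and then (lo, w), matching the walkthrough. In practice `num_merges` is set by the target vocabulary size.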
Why start from characters?
It seems odd that BPE starts from characters when character-level tokenization has the long-sequence problem. But characters are only the starting point for building the vocabulary — BPE's whole job is to merge them into larger, useful units. Take programming:
Character-level: p r o g r a m m i n g (11 tokens)
BPE: program ming (2 tokens)
If programming is common enough, BPE may even keep it as a single token. Text that character-level tokenization would shatter into thousands of tokens can often be represented by BPE in far fewer. And a brand-new word the model never saw can still be built from existing subword pieces, such as running → [run, ning], so there's no hard OOV wall.
Shorter sequences mean less memory, less computation, and more efficient training and inference. That's why BPE became the sweet spot between word-level and character-level approaches, and why most modern LLMs use some variant of subword tokenization.
BPE merges are learned at build time. The pair-counting and merging described above runs while the tokenizer is created — not every time you send a message. At inference the learned merges are simply applied to your text.
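Applying the merges at inference can be sketched as replaying them, in the order they were learned, over the characters of each incoming word. The `apply_merges` helper and the two-entry merge list are assumptions for illustration; production tokenizers use faster, rank-based lookups.

```python
def apply_merges(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Tokenize one word by replaying learned merges in order. No counting happens here."""
    symbols = list(word)
    for left, right in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Merges as they might come out of training on low / lower / lowest.
learned_merges = [("l", "o"), ("lo", "w")]

for word in ["lowly", "slow"]:
    print(word, apply_merges(word, learned_merges))
```

Neither "lowly" nor "slow" was in the training words, but both still tokenize: the learned piece low is reused and the remaining characters stay as single-character tokens.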
Don't build your own
You rarely need to build a tokenizer from scratch. Mature open-source tokenizers exist — the best known is OpenAI's tiktoken, a BPE tokenizer you can drop into your own training or inference pipeline.
To see tokenization happen live across different models, try the Tiktokenizer playground — paste text and watch how each tokenizer splits it and assigns IDs.
Further reading
- Neural Machine Translation of Rare Words with Subword Units — Sennrich, Haddow & Birch (2016), the paper that introduced BPE to NLP tokenization.
- Hugging Face NLP Course — Building a tokenizer, block by block — builds BPE, WordPiece, and Unigram tokenizers from scratch; the hands-on counterpart to "don't build your own."
- Hugging Face NLP Course — WordPiece tokenization — full walkthrough of WordPiece and the ## continuation prefix this page uses.
- google/sentencepiece — Google's language-independent subword tokenizer (BPE + Unigram), one of the "close relatives" named above.
- Let's build the GPT Tokenizer — Karpathy codes BPE training and encode/decode end to end; the live version of this page's step-by-step.
Related: Tokenization · How LLMs Are Built