Mojo-tokenizer: Fastest AI Token Output → Readable Text on Apple Silicon?


mojo-tokenizer decodes tokens at 144M tokens/sec on Apple Silicon — 3.1x faster than tiktoken and 1.2x faster than rs-bpe.

We benchmarked 3 BPE implementations across 5 diverse text files (23 MB total) using the GPT-4 tokenizer vocabulary. Every claim is backed by reproducible benchmarks with statistical analysis.

The Numbers

Implementation    Language      Decoding   Encoding (full pipeline)
mojo-tokenizer    Mojo          144 M/s    8.0 M/s
rs-bpe            Rust          121 M/s    — (raw BPE only)
tiktoken          Rust/Python   47 M/s     5.1 M/s

20 iterations, 3 warmup. sherlock.txt (607 KB, 143K tokens). Apple Silicon.

Note: tiktoken is implemented in Rust with Python bindings — we’re comparing against Rust-accelerated code, not pure Python.

Why decoding matters: Every LLM inference call decodes output tokens. At 144M tok/s, mojo-tokenizer can decode GPT-4’s entire 128K context window in under 1ms.

Decoding Performance Across Files

File                  Type          mojo-tokenizer   rs-bpe    tiktoken   vs tiktoken
sherlock.txt          Literary      144 M/s          121 M/s   47 M/s     3.1x
war_and_peace.txt     Literary      141 M/s          114 M/s   43 M/s     3.3x
les_miserables.txt    Literary      140 M/s          122 M/s   43 M/s     3.3x
arxiv_abstracts.txt   Scientific    117 M/s          105 M/s   42 M/s     2.8x
wikitext2_train.txt   Encyclopedia  128 M/s          116 M/s   45 M/s     2.8x

Speedup is consistent: 2.8x to 3.3x across all content types.

Encoding Performance (Fair Comparison)

Implementation       Full Pipeline   Raw BPE
mojo-tokenizer       8.0 M/s         9.2 M/s
rs-bpe (Rust)        —               10.0 M/s
tiktoken (Rust/Py)   5.1 M/s         —

Note: rs-bpe’s raw BPE is faster (10.0 M/s). But production tokenizers need pretokenization to match tiktoken output. For the full pipeline, mojo-tokenizer beats tiktoken by 1.6x.

What We Test

Metric       Description
Vocabulary   cl100k_base (100,256 tokens) — GPT-4, ChatGPT
Test data    5 files, 607 KB to 10.7 MB (23 MB total, 5.3M tokens)
Algorithm    O(n) backtracking BPE
Validation   100% exact token match with tiktoken on all files

Supported Formats

mojo-tokenizer currently focuses on OpenAI-style BPE tokenization:

Encoding          Vocab Size   Models                               Status
o200k_base        199,998      gpt-5.2, gpt-oss-120B, gpt-oss-20B   ✓ Verified
HuggingFace BPE   varies       Qwen, Llama, Mistral                 Experimental

"Hello, world!" → [13225, 11, 2375, 0]  (o200k_base)

mojo-tokenizer produces exact token matches with tiktoken on o200k_base. HuggingFace BPE format loading is implemented but not yet validated against HuggingFace tokenizers.
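
If tiktoken is installed, the example above can be cross-checked directly against the reference implementation (the expected IDs are the ones quoted above):

import tiktoken

# Cross-check the example above against tiktoken's o200k_base encoding.
enc = tiktoken.get_encoding("o200k_base")
ids = enc.encode("Hello, world!")
print(ids)                                 # expected per the example: [13225, 11, 2375, 0]
assert enc.decode(ids) == "Hello, world!"  # round-trips back to the original text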

How It Works

FlatTokenStorage: The Decoding Secret

The key to 144M tok/s decoding is FlatTokenStorage — a flat byte array containing all token bytes contiguously:

Traditional storage:
  token_0 → [72, 101, 108, 108, 111]      # "Hello" - separate allocation
  token_1 → [32, 119, 111, 114, 108, 100] # " world" - separate allocation
  ...100K more allocations...

FlatTokenStorage:
  data:    [72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, ...]
  offsets: [0, 5, 11, ...]  # Where each token starts

Decoding becomes a series of memcpy() calls from one contiguous buffer, instead of pointer chasing across ~100K separately allocated token buffers:

# For each token ID, copy its bytes directly
memcpy(dest_ptr, flat_data + offsets[token_id], lengths[token_id])

Source: src/flat_token_storage.mojo
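
For readers outside the Mojo ecosystem, here is a minimal Python sketch of the same layout; names and shapes are illustrative, not the actual src/flat_token_storage.mojo API:

# Illustrative Python sketch of the FlatTokenStorage layout (not the Mojo API).
class FlatTokenStorage:
    def __init__(self, vocab: list[bytes]):
        # One contiguous buffer holding every token's bytes back to back.
        self.data = b"".join(vocab)
        # offsets[i] is where token i starts; offsets[i + 1] is where it ends.
        self.offsets = [0]
        for token in vocab:
            self.offsets.append(self.offsets[-1] + len(token))

    def decode(self, token_ids: list[int]) -> bytes:
        # Every token is a slice of one flat buffer, so decoding is just
        # sequential copies (memcpy calls in the Mojo implementation).
        out = bytearray()
        for tid in token_ids:
            out += self.data[self.offsets[tid]:self.offsets[tid + 1]]
        return bytes(out)

# Toy usage:
storage = FlatTokenStorage([b"Hello", b" world", b"!"])
print(storage.decode([0, 1, 2]))  # b'Hello world!'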

O(n) Backtracking BPE (Encoding)

Ported from rs-bpe. Single forward pass with merge table instead of priority queue:

Traditional BPE: O(n log n) - priority queue for merges
Backtracking BPE: O(n) - single pass with precomputed tables
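
For contrast, here is a minimal Python sketch of the classic merge-loop formulation of BPE encoding (the baseline that the backtracking encoder replaces, not the backtracking algorithm itself):

# Classic merge-loop BPE, shown only to illustrate what the single-pass
# backtracking encoder avoids. The naive scan below is O(n^2); a priority
# queue brings it to O(n log n), the backtracking encoder to O(n).
def classic_bpe_encode(text: bytes, ranks: dict[bytes, int]) -> list[bytes]:
    parts = [bytes([b]) for b in text]  # start from individual bytes
    while True:
        # Find the adjacent pair whose merged bytes have the lowest rank.
        best_rank, best_i = None, None
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_rank, best_i = rank, i
        if best_i is None:
            return parts  # no mergeable pair left
        # Merge the winning pair in place and repeat.
        parts[best_i:best_i + 2] = [parts[best_i] + parts[best_i + 1]]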

PairCache1000: O(1) Merge Lookup

For token pairs where both token IDs are below 1,000 (covering ~80% of merges):

# HashMap: merge_rank = self.merges.get(pair_key)  # O(1) amortized
# Array:   merge_rank = self.pair_cache[t1][t2]    # O(1) guaranteed

The 4 MB 2D array eliminates hashing overhead, yielding a +21% encoding speedup.
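
A hedged Python sketch of the idea (the real table lives inside the Mojo encoder; names here are illustrative):

CACHE_SIZE = 1000   # covers pairs where both token IDs are below 1000
NO_MERGE = -1

class PairCache:
    def __init__(self, merges: dict[tuple[int, int], int]):
        # Dense 2D table for the hot pairs. Stored as 32-bit ints,
        # 1000 * 1000 * 4 bytes is the ~4 MB mentioned above.
        self.table = [[NO_MERGE] * CACHE_SIZE for _ in range(CACHE_SIZE)]
        self.merges = merges  # hash map fallback for the long tail
        for (t1, t2), rank in merges.items():
            if t1 < CACHE_SIZE and t2 < CACHE_SIZE:
                self.table[t1][t2] = rank

    def rank(self, t1: int, t2: int) -> int | None:
        if t1 < CACHE_SIZE and t2 < CACHE_SIZE:
            r = self.table[t1][t2]            # O(1) guaranteed, no hashing
            return None if r == NO_MERGE else r
        return self.merges.get((t1, t2))      # O(1) amortized hash lookup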

Zero-Copy Borrowed References

BacktrackEncoderRef borrows vocabulary instead of copying:

struct BacktrackEncoderRef[origin: Origin]:
    var vocab: UnsafePointer[Vocabulary, origin]  # Borrowed

Eliminates ~105MB copy per encode call. Compiler-enforced lifetime safety.

Statistical Rigor

All measurements include:

  • 20 iterations per file (3 warmup)
  • Standard deviation (CV < 5% indicates stable results)
  • Percentiles (p50, p95, p99)
  • 5 diverse text types (literary, scientific, encyclopedic)

Example (sherlock.txt decoding):
  Mean: 143.7 +/- 4.3 M/s
  p50=144.4, p95=148.0, p99=148.3 M/s
  CV=3.0%
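
The summary statistics can be reproduced from the raw per-iteration throughputs with a few lines of Python (a sketch, not the actual benchmark harness):

import statistics

def summarize(samples: list[float]) -> dict:
    # samples: per-iteration throughput in M tokens/s, warmup excluded
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    q = statistics.quantiles(samples, n=100, method="inclusive")
    return {
        "mean": round(mean, 1),
        "stdev": round(stdev, 1),
        "cv_percent": round(100 * stdev / mean, 1),  # CV < 5% => stable run
        "p50": round(q[49], 1),
        "p95": round(q[94], 1),
        "p99": round(q[98], 1),
    }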

When Tokenization Matters

Cloud APIs (OpenAI, Claude, etc.) handle tokenization internally — you send text, they tokenize behind the scenes. But fast tokenization matters when:

  1. Running models locally — MLX, PyTorch, or llama.cpp require you to tokenize before inference
  2. Context window management — Counting tokens to fit within limits (128K for GPT-4, 200K for Claude)
  3. Cost estimation — Counting tokens before API calls to estimate costs
  4. Token-level operations — Prompt caching, continuation from specific positions
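
For the context-window and cost-estimation cases (items 2 and 3), a token count with tiktoken, the reference implementation benchmarked above, takes a few lines; any tokenizer loading the same vocabulary yields the same count:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # GPT-4 / ChatGPT vocabulary

prompt = "Summarize the following document ..."
n_tokens = len(enc.encode(prompt))

CONTEXT_LIMIT = 128_000                      # GPT-4-class context window
print(f"{n_tokens} tokens used, {CONTEXT_LIMIT - n_tokens} left in the window")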

Example: Local Inference Pipeline

Text → [Tokenizer] → Token IDs → [LLM] → Output IDs → [Tokenizer] → Text
        ↑ 8M tok/s encode                              ↑ 144M tok/s decode

At 144M tok/s, tokenizer overhead becomes negligible compared to model inference time.

Limitations

  1. Apple Silicon only — Results may differ on x86
  2. English-heavy data — CJK, Arabic, emoji not tested
  3. Vocabulary load time — mojo-tokenizer: ~800ms vs tiktoken: ~65ms
  4. Memory usage — mojo-tokenizer: ~10MB vs tiktoken: ~3MB

Try It

git clone https://github.com/atsentia/mojo-tokenizer
cd mojo-tokenizer

# Run comprehensive benchmark
mojo run bench_comprehensive.mojo

Full methodology: docs/COMPREHENSIVE_BENCHMARK_RESULTS.md

These results are from Apple Silicon. We’d welcome community benchmarks on different platforms.


Appendix: Why Lead With Decoding?

Decoding is simpler than encoding — just look up bytes by token ID. But it’s the operation that happens on every LLM output token, making it latency-critical.

At 144M tok/s:

  • GPT-4’s 128K context decodes in 0.9ms
  • A typical 500-token response decodes in ~3.5µs

This makes tokenizer decoding effectively free compared to model inference time.
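
The arithmetic behind those two figures:

RATE = 144e6        # measured decode throughput, tokens per second

context = 128_000   # GPT-4-class context window, in tokens
response = 500      # a typical response, in tokens

print(f"{context / RATE * 1e3:.2f} ms")   # ~0.89 ms for a full 128K context
print(f"{response / RATE * 1e6:.2f} µs")  # ~3.47 µs for a 500-token response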

Appendix: Raw BPE vs Full Pipeline

You might notice that rs-bpe’s raw BPE (10.0 M/s) beats mojo-tokenizer’s raw BPE (9.2 M/s). We don’t lead with this because:

  1. Raw BPE produces different tokens — without pretokenization, you get valid BPE but not tiktoken-compatible output
  2. Production needs pretokenization — the regex patterns that split text before BPE
  3. Full pipeline is what you’d deploy — mojo-tokenizer’s 8.0 M/s vs tiktoken’s 5.1 M/s is the real comparison

We’re honest about where Rust wins. Our advantage is in decoding and the full encoding pipeline.
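
Concretely, pretokenization means splitting text with a regex before BPE runs on each chunk; merges never cross chunk boundaries, which is what makes the output tiktoken-compatible. A simplified Python sketch (this is not the exact cl100k/o200k pattern, which is considerably more involved):

import regex  # third-party 'regex' module, needed for \p{...} classes

# Simplified pretokenization pattern: contractions, runs of letters,
# runs of digits, punctuation, whitespace. Illustrative only.
SIMPLE_PATTERN = regex.compile(
    r"'(?:s|t|re|ve|m|ll|d)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+"
)

def pretokenize(text: str) -> list[str]:
    # Each chunk is then BPE-encoded independently.
    return SIMPLE_PATTERN.findall(text)

print(pretokenize("Hello, world! It's 2024."))
# ['Hello', ',', ' world', '!', ' It', "'s", ' 2024', '.']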

Most trending models on HuggingFace use BPE tokenization:

Model                                          Tokenizer        Vocab Size
LiquidAI/LFM2.5-1.2B-Instruct                  BPE              64,400
MiniMaxAI/MiniMax-M2.1                         GPT2/BPE         200,000
naver-hyperclovax/HyperCLOVAX-SEED-Think-32B   GPT2/BPE         128,000
LGAI-EXAONE/K-EXAONE-236B-A23B                 BPE              153,600
IQuestLab/IQuest-Coder-V1-40B                  BPE              75,858
tencent/HY-MT1.5-1.8B                          BPE              120,000
Qwen/Qwen2-1.5B                                BPE              151,643
mistralai/Mistral-7B-v0.1                      BPE              32,000
nvidia/nemotron-speech (ASR)                   SentencePiece    —
Lightricks/LTX-Video                           Visual encoder   —
Status: These models use BPE format. HuggingFace BPE loading in mojo-tokenizer is not yet validated — the loader parses tokenizer.json but doesn’t produce matching tokens yet. Use tiktoken/o200k_base for production (verified to match exactly).

Multimodal models (Qwen-VL, LTX-Video) use BPE only for text. Images/videos use vision encoders.


mojo-tokenizer is early-stage software. While benchmarks are rigorous, the API may change. We welcome contributions and bug reports at github.com/atsentia/mojo-tokenizer.

Amund Tveit

AI Engineer & Advisor