Mojo-tokenizer: Fastest AI Token Output → Readable Text on Apple Silicon?


mojo-tokenizer decodes tokens at 144M tokens/sec on Apple Silicon — 3.1x faster than tiktoken and 1.2x faster than rs-bpe.

We benchmarked 3 BPE implementations across 5 diverse text files (23 MB total) using the GPT-4 tokenizer vocabulary. Every claim is backed by reproducible benchmarks with statistical analysis.

The Numbers

Implementation    Language      Decoding   Encoding (full pipeline)
mojo-tokenizer    Mojo          144 M/s    8.0 M/s
rs-bpe            Rust          121 M/s    — (raw BPE only)
tiktoken          Rust/Python   47 M/s     5.1 M/s

20 iterations, 3 warmup. sherlock.txt (607 KB, 143K tokens). Apple Silicon.

Note: tiktoken is implemented in Rust with Python bindings — we’re comparing against Rust-accelerated code, not pure Python.

Why decoding matters: Every LLM inference call decodes output tokens. At 144M tok/s, mojo-tokenizer can decode GPT-4’s entire 128K context window in under 1ms.

Decoding Performance Across Files

File                  Type          mojo-tokenizer   rs-bpe    tiktoken   vs tiktoken
sherlock.txt          Literary      144 M/s          121 M/s   47 M/s     3.1x
war_and_peace.txt     Literary      141 M/s          114 M/s   43 M/s     3.3x
les_miserables.txt    Literary      140 M/s          122 M/s   43 M/s     3.3x
arxiv_abstracts.txt   Scientific    117 M/s          105 M/s   42 M/s     2.8x
wikitext2_train.txt   Encyclopedia  128 M/s          116 M/s   45 M/s     2.8x

Speedup is consistent: 2.8x to 3.3x across all content types.

Encoding Performance (Fair Comparison)

Implementation       Full Pipeline   Raw BPE
mojo-tokenizer       8.0 M/s         9.2 M/s
rs-bpe (Rust)        —               10.0 M/s
tiktoken (Rust/Py)   5.1 M/s         —

Note: rs-bpe’s raw BPE is faster (10.0 M/s). But production tokenizers need pretokenization to match tiktoken output. For the full pipeline, mojo-tokenizer beats tiktoken by 1.6x.

What We Test

Metric       Description
Vocabulary   cl100k_base (100,256 tokens) — GPT-4, ChatGPT
Test data    5 files, 607 KB to 10.7 MB (23 MB total, 5.3M tokens)
Algorithm    O(n) backtracking BPE
Validation   100% exact token match with tiktoken on all files

Supported Formats

mojo-tokenizer currently focuses on OpenAI-style BPE tokenization:

Encoding          Vocab Size   Models                               Status
o200k_base        199,998      gpt-5.2, gpt-oss-120B, gpt-oss-20B   ✓ Verified
HuggingFace BPE   varies       Qwen, Llama, Mistral                 Experimental

"Hello, world!" → [13225, 11, 2375, 0]  (o200k_base)

mojo-tokenizer produces exact token matches with tiktoken on o200k_base. HuggingFace BPE format loading is implemented but not yet validated against HuggingFace tokenizers.
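
If tiktoken is installed, the example above can be cross-checked directly against the reference implementation (the expected IDs are the ones quoted above):

import tiktoken

# Cross-check the example above against tiktoken's o200k_base encoding.
enc = tiktoken.get_encoding("o200k_base")
ids = enc.encode("Hello, world!")
print(ids)                                 # expected per the example: [13225, 11, 2375, 0]
assert enc.decode(ids) == "Hello, world!"  # round-trips back to the original text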

How It Works

FlatTokenStorage: The Decoding Secret

The key to 144M tok/s decoding is FlatTokenStorage — a flat byte array containing all token bytes contiguously:

Traditional storage:
  token_0 → [72, 101, 108, 108, 111]      # "Hello" - separate allocation
  token_1 → [32, 119, 111, 114, 108, 100] # " world" - separate allocation
  ...100K more allocations...

FlatTokenStorage:
  data:    [72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, ...]
  offsets: [0, 5, 11, ...]  # Where each token starts

Decoding becomes a series of memcpy() calls from one contiguous buffer, instead of pointer chasing across ~100K separately allocated token buffers:

# For each token ID, copy its bytes directly
memcpy(dest_ptr, flat_data + offsets[token_id], lengths[token_id])

Source: src/flat_token_storage.mojo
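
For readers outside the Mojo ecosystem, here is a minimal Python sketch of the same layout; names and shapes are illustrative, not the actual src/flat_token_storage.mojo API:

# Illustrative Python sketch of the FlatTokenStorage layout (not the Mojo API).
class FlatTokenStorage:
    def __init__(self, vocab: list[bytes]):
        # One contiguous buffer holding every token's bytes back to back.
        self.data = b"".join(vocab)
        # offsets[i] is where token i starts; offsets[i + 1] is where it ends.
        self.offsets = [0]
        for token in vocab:
            self.offsets.append(self.offsets[-1] + len(token))

    def decode(self, token_ids: list[int]) -> bytes:
        # Every token is a slice of one flat buffer, so decoding is just
        # sequential copies (memcpy calls in the Mojo implementation).
        out = bytearray()
        for tid in token_ids:
            out += self.data[self.offsets[tid]:self.offsets[tid + 1]]
        return bytes(out)

# Toy usage:
storage = FlatTokenStorage([b"Hello", b" world", b"!"])
print(storage.decode([0, 1, 2]))  # b'Hello world!'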

O(n) Backtracking BPE (Encoding)

Ported from rs-bpe. Single forward pass with merge table instead of priority queue:

Traditional BPE: O(n log n) - priority queue for merges
Backtracking BPE: O(n) - single pass with precomputed tables
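
For contrast, here is a minimal Python sketch of the classic merge-loop formulation of BPE encoding (the baseline that the backtracking encoder replaces, not the backtracking algorithm itself):

# Classic merge-loop BPE, shown only to illustrate what the single-pass
# backtracking encoder avoids. The naive scan below is O(n^2); a priority
# queue brings it to O(n log n), the backtracking encoder to O(n).
def classic_bpe_encode(text: bytes, ranks: dict[bytes, int]) -> list[bytes]:
    parts = [bytes([b]) for b in text]  # start from individual bytes
    while True:
        # Find the adjacent pair whose merged bytes have the lowest rank.
        best_rank, best_i = None, None
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_rank, best_i = rank, i
        if best_i is None:
            return parts  # no mergeable pair left
        # Merge the winning pair in place and repeat.
        parts[best_i:best_i + 2] = [parts[best_i] + parts[best_i + 1]]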

PairCache1000: O(1) Merge Lookup

For token pairs where both token IDs are below 1,000 (covering ~80% of merges):

# HashMap: merge_rank = self.merges.get(pair_key)  # O(1) amortized
# Array:   merge_rank = self.pair_cache[t1][t2]    # O(1) guaranteed

The 4 MB 2D array eliminates hashing overhead, yielding a +21% encoding speedup.
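
A hedged Python sketch of the idea (the real table lives inside the Mojo encoder; names here are illustrative):

CACHE_SIZE = 1000   # covers pairs where both token IDs are below 1000
NO_MERGE = -1

class PairCache:
    def __init__(self, merges: dict[tuple[int, int], int]):
        # Dense 2D table for the hot pairs. Stored as 32-bit ints,
        # 1000 * 1000 * 4 bytes is the ~4 MB mentioned above.
        self.table = [[NO_MERGE] * CACHE_SIZE for _ in range(CACHE_SIZE)]
        self.merges = merges  # hash map fallback for the long tail
        for (t1, t2), rank in merges.items():
            if t1 < CACHE_SIZE and t2 < CACHE_SIZE:
                self.table[t1][t2] = rank

    def rank(self, t1: int, t2: int) -> int | None:
        if t1 < CACHE_SIZE and t2 < CACHE_SIZE:
            r = self.table[t1][t2]            # O(1) guaranteed, no hashing
            return None if r == NO_MERGE else r
        return self.merges.get((t1, t2))      # O(1) amortized hash lookup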

Zero-Copy Borrowed References

BacktrackEncoderRef borrows vocabulary instead of copying:

struct BacktrackEncoderRef[origin: Origin]:
    var vocab: UnsafePointer[Vocabulary, origin]  # Borrowed

Eliminates ~105MB copy per encode call. Compiler-enforced lifetime safety.

Statistical Rigor

All measurements include:

  • 20 iterations per file (3 warmup)
  • Standard deviation (CV < 5% indicates stable results)
  • Percentiles (p50, p95, p99)
  • 5 diverse text types (literary, scientific, encyclopedic)

Example (sherlock.txt decoding):
  Mean: 143.7 +/- 4.3 M/s
  p50=144.4, p95=148.0, p99=148.3 M/s
  CV=3.0%
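
The summary statistics can be reproduced from the raw per-iteration throughputs with a few lines of Python (a sketch, not the actual benchmark harness):

import statistics

def summarize(samples: list[float]) -> dict:
    # samples: per-iteration throughput in M tokens/s, warmup excluded
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    q = statistics.quantiles(samples, n=100, method="inclusive")
    return {
        "mean": round(mean, 1),
        "stdev": round(stdev, 1),
        "cv_percent": round(100 * stdev / mean, 1),  # CV < 5% => stable run
        "p50": round(q[49], 1),
        "p95": round(q[94], 1),
        "p99": round(q[98], 1),
    }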

When Tokenization Matters

Cloud APIs (OpenAI, Claude, etc.) handle tokenization internally — you send text, they tokenize behind the scenes. But fast tokenization matters when:

  1. Running models locally — MLX, PyTorch, or llama.cpp require you to tokenize before inference
  2. Context window management — Counting tokens to fit within limits (128K for GPT-4, 200K for Claude)
  3. Cost estimation — Counting tokens before API calls to estimate costs
  4. Token-level operations — Prompt caching, continuation from specific positions
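
For the context-window and cost-estimation cases (items 2 and 3), a token count with tiktoken, the reference implementation benchmarked above, takes a few lines; any tokenizer loading the same vocabulary yields the same count:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # GPT-4 / ChatGPT vocabulary

prompt = "Summarize the following document ..."
n_tokens = len(enc.encode(prompt))

CONTEXT_LIMIT = 128_000                      # GPT-4-class context window
print(f"{n_tokens} tokens used, {CONTEXT_LIMIT - n_tokens} left in the window")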

Example: Local Inference Pipeline

Text → [Tokenizer] → Token IDs → [LLM] → Output IDs → [Tokenizer] → Text
        ↑ 8M tok/s encode                              ↑ 144M tok/s decode

At 144M tok/s, tokenizer overhead becomes negligible compared to model inference time.

Limitations

  1. Apple Silicon only — Results may differ on x86
  2. English-heavy data — CJK, Arabic, emoji not tested
  3. Vocabulary load time — mojo-tokenizer: ~800ms vs tiktoken: ~65ms
  4. Memory usage — mojo-tokenizer: ~10MB vs tiktoken: ~3MB

Try It

git clone https://github.com/atsentia/mojo-tokenizer
cd mojo-tokenizer

# Run comprehensive benchmark
mojo run bench_comprehensive.mojo

Full methodology: docs/COMPREHENSIVE_BENCHMARK_RESULTS.md

These results are from Apple Silicon. We’d welcome community benchmarks on different platforms.


Appendix: Why Lead With Decoding?

Decoding is simpler than encoding — just look up bytes by token ID. But it’s the operation that happens on every LLM output token, making it latency-critical.

At 144M tok/s:

  • GPT-4’s 128K context decodes in 0.9ms
  • A typical 500-token response decodes in ~3.5µs

This makes tokenizer decoding effectively free compared to model inference time.
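
The arithmetic behind those two figures:

RATE = 144e6        # measured decode throughput, tokens per second

context = 128_000   # GPT-4-class context window, in tokens
response = 500      # a typical response, in tokens

print(f"{context / RATE * 1e3:.2f} ms")   # ~0.89 ms for a full 128K context
print(f"{response / RATE * 1e6:.2f} µs")  # ~3.47 µs for a 500-token response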

Appendix: Raw BPE vs Full Pipeline

You might notice that rs-bpe’s raw BPE (10.0 M/s) beats mojo-tokenizer’s raw BPE (9.2 M/s). We don’t lead with this because:

  1. Raw BPE produces different tokens — without pretokenization, you get valid BPE but not tiktoken-compatible output
  2. Production needs pretokenization — the regex patterns that split text before BPE
  3. Full pipeline is what you’d deploy — mojo-tokenizer’s 8.0 M/s vs tiktoken’s 5.1 M/s is the real comparison

We’re honest about where Rust wins. Our advantage is in decoding and the full encoding pipeline.
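
Concretely, pretokenization means splitting text with a regex before BPE runs on each chunk; merges never cross chunk boundaries, which is what makes the output tiktoken-compatible. A simplified Python sketch (this is not the exact cl100k/o200k pattern, which is considerably more involved):

import regex  # third-party 'regex' module, needed for \p{...} classes

# Simplified pretokenization pattern: contractions, runs of letters,
# runs of digits, punctuation, whitespace. Illustrative only.
SIMPLE_PATTERN = regex.compile(
    r"'(?:s|t|re|ve|m|ll|d)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+"
)

def pretokenize(text: str) -> list[str]:
    # Each chunk is then BPE-encoded independently.
    return SIMPLE_PATTERN.findall(text)

print(pretokenize("Hello, world! It's 2024."))
# ['Hello', ',', ' world', '!', ' It', "'s", ' 2024', '.']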

Most trending models on HuggingFace use BPE tokenization:

Model                                          Tokenizer        Vocab Size
LiquidAI/LFM2.5-1.2B-Instruct                  BPE              64,400
MiniMaxAI/MiniMax-M2.1                         GPT2/BPE         200,000
naver-hyperclovax/HyperCLOVAX-SEED-Think-32B   GPT2/BPE         128,000
LGAI-EXAONE/K-EXAONE-236B-A23B                 BPE              153,600
IQuestLab/IQuest-Coder-V1-40B                  BPE              75,858
tencent/HY-MT1.5-1.8B                          BPE              120,000
Qwen/Qwen2-1.5B                                BPE              151,643
mistralai/Mistral-7B-v0.1                      BPE              32,000
nvidia/nemotron-speech (ASR)                   SentencePiece    —
Lightricks/LTX-Video                           Visual encoder   —
Status: These models use BPE format. HuggingFace BPE loading in mojo-tokenizer is not yet validated — the loader parses tokenizer.json but doesn’t produce matching tokens yet. Use tiktoken/o200k_base for production (verified to match exactly).

Multimodal models (Qwen-VL, LTX-Video) use BPE only for text. Images/videos use vision encoders.


mojo-tokenizer is early-stage software. While benchmarks are rigorous, the API may change. We welcome contributions and bug reports at github.com/atsentia/mojo-tokenizer.

Amund Tveit

AI Engineer & Advisor