quicktok: a faster tokenizer (exact and byte-identical with tiktoken) [P]

Been working on this a while! Should be useful for anyone trying to speed up their tokenization workflows.

quicktok is a fast, exact BPE tokenizer written in C++. Token ids are byte-identical to tiktoken's, and encoding runs 2–3.6× faster than bpe-openai (the fastest alternative I know of) and 4–11× faster than tiktoken itself. It ships with cl100k, o200k, GPT-OSS, Llama-3, and Qwen2.5/3.

Approach. It uses the same algorithm as bpe-openai (exact backtracking BPE), but I apply a lot of data-structure engineering to cut memory accesses:

A 2-byte trie is used for the longest-match walk
Dense exactly-keyed caches are used for merge-validity checks
A hand-compiled pretokenizer is used instead of a general regex engine
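
To make the longest-match walk concrete, here is a minimal sketch of the idea in Python. It is not quicktok's code: it uses a plain one-byte-per-step trie rather than the 2-byte layout mentioned above, and the ByteTrie class and the toy vocabulary are made up for illustration. Longest match alone is also not exact BPE; the backtracking encoder still has to check merge validity and back off to a shorter match when a pair is invalid.

```python
class ByteTrie:
    """Byte-level trie mapping token byte strings to token ids (illustrative only)."""

    def __init__(self):
        # Each node is a two-item list: [children dict, token id or None].
        self.root = [{}, None]

    def insert(self, token: bytes, token_id: int) -> None:
        node = self.root
        for b in token:
            node = node[0].setdefault(b, [{}, None])
        node[1] = token_id

    def longest_match(self, text: bytes, start: int = 0):
        """Return (token_id, length) of the longest token starting at `start`,
        or (None, 0) if no token in the trie matches there."""
        node = self.root
        best_id, best_len = None, 0
        for i in range(start, len(text)):
            node = node[0].get(text[i])
            if node is None:
                break  # no longer token shares this prefix
            if node[1] is not None:
                # Remember the deepest node seen so far that ends a token.
                best_id, best_len = node[1], i - start + 1
        return best_id, best_len


# Toy vocabulary, invented for this example (not cl100k_base).
trie = ByteTrie()
for tid, tok in enumerate([b"a", b"ab", b"abc", b"b", b"c", b"bcd"]):
    trie.insert(tok, tid)

token_id, length = trie.longest_match(b"abcd")
```

Each step of the walk is one dictionary lookup per input byte. A layout that consumes two bytes per step would roughly halve the number of node visits, which is presumably where the memory-access savings come from.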

Benchmarks (Apple M1, single thread, cl100k_base, throughput in MB/s; every output was verified token-for-token before timing):

encoder	The Pile	Code	Common Crawl
quicktok (native)	121.7	139.2	71.3
quicktok (Python)	77.9	83.6	49.7
bpe-openai	36.6	38.7	28.9
rs-bpe	30.9	34.7	23.5
tiktoken-rs	15.4	13.8	13.3
tiktoken (Python)	13.6	12.8	12.3
TokenDagger	11.1	11.9	10.7

Ratios on o200k_base are similar. Each encoder is called through its own raw API, and the benchmarks can be reproduced with make bench-compare in the repo.
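
For anyone who wants to sanity-check numbers like these on their own data, the "verify first, then time" procedure can be sketched as below. This is not the repo's benchmark harness: the verify_then_time helper, its parameters, and the stand-in encoder are hypothetical, and it assumes 1 MB = 10^6 bytes of UTF-8 input.

```python
import time
from typing import Callable, List


def verify_then_time(reference: Callable[[str], List[int]],
                     candidate: Callable[[str], List[int]],
                     docs: List[str],
                     repeats: int = 3) -> float:
    """Check `candidate` against `reference` on every document, then
    return the candidate's best-of-`repeats` throughput in MB/s."""
    # Verification pass: refuse to time an encoder whose ids differ.
    for i, doc in enumerate(docs):
        if candidate(doc) != reference(doc):
            raise AssertionError(f"token mismatch on document {i}")

    total_bytes = sum(len(d.encode("utf-8")) for d in docs)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        for doc in docs:
            candidate(doc)
        best = min(best, time.perf_counter() - t0)
    # Guard against a zero reading on trivially small inputs.
    return total_bytes / 1e6 / max(best, 1e-9)


# Stand-in "encoder" so the sketch runs without any tokenizer installed:
# it just returns the UTF-8 byte values.
def byte_encode(text: str) -> List[int]:
    return list(text.encode("utf-8"))


mbps = verify_then_time(byte_encode, byte_encode, ["hello world", "def f(x): return x"])
```

In practice `reference` could be something like `tiktoken.get_encoding("cl100k_base").encode`, with `candidate` being whatever encode function the tokenizer under test exposes. This post does not spell out quicktok's Python entry point, so check the repo for the actual call.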

pip install quicktok-v1

Repo: https://github.com/dmatth1/quicktok

submitted by /u/_casa_nova_