From Machine Learning

quicktok: a faster tokenizer (exact and byte-identical with tiktoken) [P]

Been working on this a while! Should be useful for anyone trying to speed up their tokenization workflows.

quicktok is a fast, exact BPE tokenizer written in C++. Token ids are byte-identical to tiktoken's, and encoding runs 2–3.6× faster than bpe-openai (the fastest alternative I know of) and 4–11× faster than tiktoken itself. It ships with the cl100k, o200k, GPT-OSS, Llama-3, and Qwen2.5/3 vocabularies.

Approach. It uses the same algorithm as bpe-openai (exact backtracking BPE), but I apply a lot of data-structure engineering to cut memory accesses:

  • A 2-byte trie is used for the longest-match walk
  • Dense exactly-keyed caches are used for merge-validity checks
  • A hand-compiled pretokenizer is used instead of a general regex engine
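
To make the first bullet concrete, here is a minimal Python sketch of a longest-match walk over a trie. It is an illustration only, not quicktok's C++ code: it assumes a plain one-byte-per-edge trie rather than the 2-byte layout, and `TrieNode`, `build_trie`, `longest_match`, and `TOY_VOCAB` are hypothetical names with a made-up vocabulary.

```python
from typing import Dict, Optional, Tuple


class TrieNode:
    """One trie node: child edges keyed by byte value, plus an optional token id."""

    __slots__ = ("children", "token_id")

    def __init__(self) -> None:
        self.children: Dict[int, "TrieNode"] = {}
        self.token_id: Optional[int] = None


def build_trie(vocab: Dict[bytes, int]) -> TrieNode:
    """Insert every token's byte string into the trie."""
    root = TrieNode()
    for token, token_id in vocab.items():
        node = root
        for byte in token:
            node = node.children.setdefault(byte, TrieNode())
        node.token_id = token_id
    return root


def longest_match(root: TrieNode, data: bytes, start: int) -> Optional[Tuple[int, int]]:
    """Return (token_id, length) of the longest token beginning at `start`, or None."""
    node = root
    best: Optional[Tuple[int, int]] = None
    for pos in range(start, len(data)):
        nxt = node.children.get(data[pos])
        if nxt is None:
            break  # no token continues with this byte
        node = nxt
        if node.token_id is not None:
            best = (node.token_id, pos + 1 - start)  # remember the longest hit so far
    return best


# Hypothetical toy vocabulary, for illustration only.
TOY_VOCAB = {b"a": 0, b"b": 1, b"ab": 2, b"abc": 3, b"c": 4}
ROOT = build_trie(TOY_VOCAB)

if __name__ == "__main__":
    print(longest_match(ROOT, b"abcab", 0))
    print(longest_match(ROOT, b"abcab", 3))
```

A greedy longest match on its own does not reproduce BPE output exactly. The backtracking step and the merge-validity checks from the second bullet are what make the result exact, and they are not shown here.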

Benchmarks (Apple M1, single thread, MB/s, cl100k_base; every output verified token-for-token before timing):

encoder              The Pile   Code    Common Crawl
quicktok (native)    121.7      139.2   71.3
quicktok (Python)    77.9       83.6    49.7
bpe-openai           36.6       38.7    28.9
rs-bpe               30.9       34.7    23.5
tiktoken-rs          15.4       13.8    13.3
tiktoken (Python)    13.6       12.8    12.3
TokenDagger          11.1       11.9    10.7
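
The headline speedups can be checked against these throughput numbers with a few lines of Python. This is only arithmetic on the figures reported above; `THROUGHPUT` and `speedups` are hypothetical names introduced for the example.

```python
# Throughput in MB/s, copied from the benchmark numbers above
# (columns: The Pile, Code, Common Crawl).
THROUGHPUT = {
    "quicktok (native)": (121.7, 139.2, 71.3),
    "quicktok (Python)": (77.9, 83.6, 49.7),
    "bpe-openai": (36.6, 38.7, 28.9),
    "tiktoken (Python)": (13.6, 12.8, 12.3),
}
CORPORA = ("The Pile", "Code", "Common Crawl")


def speedups(fast: str, slow: str) -> list:
    """Per-corpus throughput ratio of `fast` over `slow`."""
    return [f / s for f, s in zip(THROUGHPUT[fast], THROUGHPUT[slow])]


if __name__ == "__main__":
    pairs = [
        ("quicktok (native)", "bpe-openai"),
        ("quicktok (native)", "tiktoken (Python)"),
        ("quicktok (Python)", "tiktoken (Python)"),
    ]
    for fast, slow in pairs:
        ratios = ", ".join(
            f"{name}: {ratio:.2f}x" for name, ratio in zip(CORPORA, speedups(fast, slow))
        )
        print(f"{fast} vs {slow} -> {ratios}")
```

The native build against bpe-openai gives the 2–3.6× range, and the 4–11× range spans both the native and Python builds measured against tiktoken (Python).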

Ratios on o200k_base are similar. Each encoder is called through its own raw API, and the benchmarks can be reproduced with make bench-compare in the repo.

pip install quicktok-v1

Repo: https://github.com/dmatth1/quicktok

submitted by /u/_casa_nova_