CWT Multilingual BPE Tokenizer Ladder (32k / 64k / 128k / 256k)

Byte-level BPE tokenizers at four vocabulary sizes, trained on a deterministic, reproducible multilingual corpus that includes English: FineWeb v1 (English) plus FineWeb-2 (non-English), the combination the FineWeb-2 authors themselves recommend for full coverage including English. Built for controlled vocabulary-scaling studies in which a 128k–256k vocabulary must be genuinely occupied rather than padded with dead tokens, as tends to happen when the training data is English-only.

Occupancy / fertility (held-out)

vocab      occupancy   dead tokens   fertility (subwords/word)
32,000     0.997       92            2.54
64,000     0.995       330           2.28
128,000    0.983       2,183         2.07
256,000    0.951       12,500        1.91
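
The README does not spell out how these columns are measured, so the following is only a minimal sketch under stated assumptions: occupancy is taken as the fraction of vocabulary ids that appear at least once in held-out text, dead tokens as the ids that never appear, and fertility as tokens per whitespace-separated word. The reported figures are at least consistent with occupancy = 1 - dead / vocab (for example 1 - 12,500 / 256,000 ≈ 0.951). The names `vocab_stats` and `toy_encode` are hypothetical, and the toy encoder exists only so the sketch runs without a tokenizer file.

```python
from typing import Callable, Dict, Iterable, List


def vocab_stats(encode: Callable[[str], List[int]],
                texts: Iterable[str],
                vocab_size: int) -> Dict[str, float]:
    """Occupancy, dead-token count, and fertility over held-out texts."""
    used_ids = set()
    n_tokens = 0
    n_words = 0
    for text in texts:
        ids = encode(text)
        used_ids.update(ids)
        n_tokens += len(ids)
        # Assumption: a "word" is a whitespace-separated chunk. This is a
        # rough proxy for scripts written without spaces (cmn, jpn, tha).
        n_words += len(text.split())
    return {
        "occupancy": len(used_ids) / vocab_size,
        "dead_tokens": vocab_size - len(used_ids),
        "fertility": n_tokens / n_words if n_words else 0.0,
    }


# Hypothetical stand-in encoder: one id per non-space character,
# folded into a vocabulary of 8 ids.
def toy_encode(text: str) -> List[int]:
    return [ord(ch) % 8 for ch in text if not ch.isspace()]


stats = vocab_stats(toy_encode, ["ab cd"], vocab_size=8)
print(stats)
```

With one of the real tokenizers, `encode` could plausibly be the tokenizer's own `encode` method and `vocab_size` its vocabulary length; whether that reproduces the table exactly depends on the held-out set and word definition the author used.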

Corpus

  • English: HuggingFaceFW/fineweb (sample-10BT), rev 9bb295ddab0e
  • Non-English (28 langs): HuggingFaceFW/fineweb-2, rev af9c13333eb9
  • English share: 40%; total 200,000 docs.
  • Languages: arb_Arab, ben_Beng, ces_Latn, cmn_Hani, dan_Latn, deu_Latn, ell_Grek, fas_Arab, fin_Latn, fra_Latn, heb_Hebr, hin_Deva, hun_Latn, ind_Latn, ita_Latn, jpn_Jpan, kor_Hang, nld_Latn, pol_Latn, por_Latn, ron_Latn, rus_Cyrl, spa_Latn, swe_Latn, tha_Thai, tur_Latn, ukr_Cyrl, vie_Latn
  • Reproducible corpus: DaveGabe/cwt-multilingual-pretrain-mix (no RNG; first-N of pinned revisions).
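
To make the "no RNG; first-N" idea concrete, here is a small sketch of how a document budget could be derived and applied. The 40% English share, the 200,000-document total, and the language list come from the bullets above; the even split of the non-English budget across the 28 languages is purely an assumption (the README does not say how that share is divided), and `doc_budget` and `first_n` are hypothetical helper names.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List

LANGS: List[str] = [
    "arb_Arab", "ben_Beng", "ces_Latn", "cmn_Hani", "dan_Latn", "deu_Latn",
    "ell_Grek", "fas_Arab", "fin_Latn", "fra_Latn", "heb_Hebr", "hin_Deva",
    "hun_Latn", "ind_Latn", "ita_Latn", "jpn_Jpan", "kor_Hang", "nld_Latn",
    "pol_Latn", "por_Latn", "ron_Latn", "rus_Cyrl", "spa_Latn", "swe_Latn",
    "tha_Thai", "tur_Latn", "ukr_Cyrl", "vie_Latn",
]


def doc_budget(total_docs: int, english_percent: int,
               langs: List[str]) -> Dict[str, int]:
    """Per-source document counts, computed with integer arithmetic only."""
    n_english = total_docs * english_percent // 100
    # Assumption: the non-English budget is split evenly, with any
    # remainder going to the first languages in list order.
    base, extra = divmod(total_docs - n_english, len(langs))
    budget = {"english": n_english}
    for i, lang in enumerate(langs):
        budget[lang] = base + (1 if i < extra else 0)
    return budget


def first_n(docs: Iterable[str], n: int) -> Iterator[str]:
    """Take the first n documents in source order: no shuffling, no RNG."""
    return islice(docs, n)


budget = doc_budget(200_000, 40, LANGS)
sample = list(first_n((f"doc-{i}" for i in range(10)), 3))
print(budget["english"], sum(budget.values()), sample)
```

Because the selection depends only on source order, pinning each dataset to a fixed revision is what makes the resulting mix reproducible; in practice `docs` would be an iterator over the pinned dataset rather than the placeholder generator used here.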

Use

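# Load the 256k tokenizer from a local copy of its tokenizer.json
from transformers import PreTrainedTokenizerFast
tok = PreTrainedTokenizerFast(tokenizer_file="fineweb_mix_256k/tokenizer.json")
ids = tok.encode("Hello, world!")  # token ids for a sample string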