CWT Multilingual BPE Tokenizer Ladder (32k / 64k / 128k / 256k)
Byte-level BPE tokenizers at four vocabulary sizes, trained on a deterministic, reproducible multilingual corpus that includes English: FineWeb v1 (English) plus FineWeb-2 (non-English), the combination the FineWeb-2 authors themselves recommend for full coverage including English. Built for controlled vocabulary-scaling studies in which a 128k–256k vocabulary must be genuinely occupied (not padded with dead tokens from English-only training data).
Occupancy / fertility (held-out)
| vocab | occupancy | dead tokens | fertility (subwords/word) |
|---|---|---|---|
| 32,000 | 0.997 | 92 | 2.54 |
| 64,000 | 0.995 | 330 | 2.28 |
| 128,000 | 0.983 | 2,183 | 2.07 |
| 256,000 | 0.951 | 12,500 | 1.91 |
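The occupancy column is consistent with occupancy = 1 − dead / vocab (for example, 1 − 12,500 / 256,000 ≈ 0.951). The measurement script is not part of this card, so the following is only a minimal sketch under stated assumptions: a token counts as "dead" if its ID never appears when the held-out text is encoded, and fertility is total subwords divided by total words. The helper names `occupancy_stats` and `fertility` are hypothetical.

```python
from typing import Iterable, Sequence, Tuple


def occupancy_stats(vocab_size: int,
                    encoded_docs: Iterable[Sequence[int]]) -> Tuple[float, int]:
    """Return (occupancy, dead_tokens) for token-ID sequences from held-out text.

    Assumption: a token is "dead" if its ID never occurs in the held-out
    encodings, and every ID lies in [0, vocab_size).
    """
    used = set()
    for ids in encoded_docs:
        used.update(ids)
    dead = vocab_size - len(used)
    return 1.0 - dead / vocab_size, dead


def fertility(total_subwords: int, total_words: int) -> float:
    """Average subwords per word (assumed definition: a simple ratio)."""
    return total_subwords / total_words


# Sanity check against the 256k row of the table above.
print(round(1 - 12_500 / 256_000, 3))  # 0.951
```

How "word" is counted for languages written without spaces (for example cmn_Hani, jpn_Jpan, tha_Thai) is not stated here, so the sketch takes the word count as an input rather than guessing a segmentation rule.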
Corpus
- English: HuggingFaceFW/fineweb (sample-10BT), rev 9bb295ddab0e
- Non-English (28 langs): HuggingFaceFW/fineweb-2, rev af9c13333eb9
- English share: 40%; total 200,000 docs. Languages: arb_Arab, ben_Beng, ces_Latn, cmn_Hani, dan_Latn, deu_Latn, ell_Grek, fas_Arab, fin_Latn, fra_Latn, heb_Hebr, hin_Deva, hun_Latn, ind_Latn, ita_Latn, jpn_Jpan, kor_Hang, nld_Latn, pol_Latn, por_Latn, ron_Latn, rus_Cyrl, spa_Latn, swe_Latn, tha_Thai, tur_Latn, ukr_Cyrl, vie_Latn
- Reproducible corpus: DaveGabe/cwt-multilingual-pretrain-mix (no RNG; first-N of pinned revisions).
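A "no RNG, first-N" mix like the one described above can be sketched as follows. This is an illustration, not the actual build script: the function names `first_n` and `build_mix` are hypothetical, and the even per-language split of the non-English quota (with the remainder going to the first languages in sorted order) is an assumption, since the card states only the 40% English share and the 200,000-document total.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def first_n(docs: Iterable[str], n: int) -> Iterator[str]:
    """Take the first n documents in source order (no shuffling, no RNG)."""
    return islice(docs, n)


def build_mix(english: Iterable[str],
              non_english: Dict[str, Iterable[str]],
              total_docs: int = 200_000,
              english_share: float = 0.40) -> List[str]:
    """Deterministically assemble a mixed corpus from ordered sources."""
    n_en = round(total_docs * english_share)   # 80,000 with the defaults
    n_other = total_docs - n_en                # 120,000 with the defaults
    langs = sorted(non_english)                # fixed, reproducible order
    # ASSUMPTION: split the non-English quota evenly across languages.
    base, extra = divmod(n_other, len(langs))
    mix = list(first_n(english, n_en))
    for i, lang in enumerate(langs):
        quota = base + (1 if i < extra else 0)
        mix.extend(first_n(non_english[lang], quota))
    return mix
```

Because each source is read in order and truncated at a fixed count, rerunning the function against the same pinned revisions yields the same corpus. In practice each iterable could be a streamed dataset loaded at the pinned revision.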
Use
from transformers import PreTrainedTokenizerFast

# Load the 256k tokenizer from its tokenizer.json (a local file path).
tok = PreTrainedTokenizerFast(tokenizer_file="fineweb_mix_256k/tokenizer.json")