Er-13M

Summary

Task: Text-Generation
Total training time: 5 days
Inputs: text
Outputs: text
Params: 12,497,520
Final Loss: 2.404
Important Benchmark Scores:
   1. ARC Easy - 34.89%
   2. BLiMP - 64.96%
   3. HellaSwag - 28.39%
   4. ArithMark-2.0 - 30.88%
Framework: PyTorch, transformers
Authors: Paul Courneya, Jonathon Ly

Description

‘Er’ is a 12.5M-parameter small language model trained on 34.8B tokens from a nine-source dataset. Its name is “Re” reversed: “Re” is the prefix of Re:Zero – Starting Life in Another World, the light novel series that inspired the organization’s name.
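To put the training budget in perspective, the figures quoted in this card can be turned into a tokens-per-parameter ratio and an average throughput. This is a back-of-the-envelope sketch: the 5-day duration is rounded, and the throughput assumes the whole 34.8B tokens were processed in that window.

```python
# Figures copied from this model card; TRAIN_DAYS is a rounded value.
PARAMS = 12_497_520
TRAIN_TOKENS = 34.8e9
TRAIN_DAYS = 5

# How many training tokens each parameter saw.
tokens_per_param = TRAIN_TOKENS / PARAMS

# Average throughput, assuming all tokens were processed within TRAIN_DAYS.
tokens_per_second = TRAIN_TOKENS / (TRAIN_DAYS * 24 * 60 * 60)

print(f"Tokens per parameter: {tokens_per_param:,.0f}")
print(f"Average throughput  : {tokens_per_second:,.0f} tokens/s")
```

The ratio comes out at roughly 2,800 tokens per parameter, which is what "overtraining" refers to in the citation title.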

Model Details

Architecture: Qwen3.5
Hidden Size: 280
Number of Layers: 12
Intermediate Size: 840 (a 3x expansion)
Number of Attention Heads: 8
Number of KV Heads: 2
Head Dim: 35
Vocab Size: 2564
Max Position Embeddings: 384
Total Parameters: 12,497,520
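The listed dimensions can be cross-checked with a few lines of arithmetic. The sketch below only uses numbers from the list above. The embedding estimate assumes a single vocab-by-hidden matrix; whether the output head shares those weights is not stated here.

```python
# Values copied from the Model Details list.
HIDDEN_SIZE = 280
NUM_HEADS = 8
NUM_KV_HEADS = 2
HEAD_DIM = 35
INTERMEDIATE_SIZE = 840
VOCAB_SIZE = 2564
TOTAL_PARAMS = 12_497_520

q_width = NUM_HEADS * HEAD_DIM               # total query width
kv_width = NUM_KV_HEADS * HEAD_DIM           # total key (or value) width
queries_per_kv = NUM_HEADS // NUM_KV_HEADS   # grouped-query attention group size
expansion = INTERMEDIATE_SIZE / HIDDEN_SIZE  # MLP expansion factor

# Assumption: one VOCAB_SIZE x HIDDEN_SIZE embedding matrix.
embedding_params = VOCAB_SIZE * HIDDEN_SIZE
embedding_share = embedding_params / TOTAL_PARAMS

print(f"Query width      : {q_width}")
print(f"KV width         : {kv_width}")
print(f"Query heads / KV : {queries_per_kv}")
print(f"MLP expansion    : {expansion:.1f}x")
print(f"Embedding params : {embedding_params:,} ({embedding_share:.1%} of total)")
```

With a vocabulary this small, the embedding table is a minor share of the parameter budget, which leaves most of the parameters to the transformer layers.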

Training

Dataset

Source	Size (GB)	Share	What it is
FineWeb-edu	35.0	28.2%	Educational-filtered Common Crawl
DCLM-Edu	20.0	16.1%	Educational-filtered webtext
The Pile Deduped	20.0	16.1%	Broad, diverse 22-source dataset
FineWeb-HQ	20.0	16.1%	Knowledge-filtered webtext
FineMath	13.0	10.5%	Math-filtered Common Crawl
Cosmopedia-v2	7.0	5.6%	Synthetic textbooks
Wikipedia	5.0	4.0%	Wikipedia articles
NpSetPython-Edu	3.5	2.8%	Normalized Python code
Misc	0.6	0.5%	LessWrong + HF configs + HF dataset/model cards
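The share column follows directly from the per-source sizes, as the sketch below reproduces. The bytes-per-token figure at the end is an estimate that assumes 1 GB = 1e9 bytes and that the whole mix was tokenized into the 34.8B training tokens; neither is confirmed by this card.

```python
# Per-source sizes in GB, copied from the table above.
SOURCES_GB = {
    "FineWeb-edu": 35.0,
    "DCLM-Edu": 20.0,
    "The Pile Deduped": 20.0,
    "FineWeb-HQ": 20.0,
    "FineMath": 13.0,
    "Cosmopedia-v2": 7.0,
    "Wikipedia": 5.0,
    "NpSetPython-Edu": 3.5,
    "Misc": 0.6,
}
TRAIN_TOKENS = 34.8e9

total_gb = sum(SOURCES_GB.values())

# Percentage share of each source, rounded to one decimal like the table.
shares = {name: round(100 * gb / total_gb, 1) for name, gb in SOURCES_GB.items()}

# Assumption: 1 GB = 1e9 bytes and every byte ended up in the token count.
bytes_per_token = total_gb * 1e9 / TRAIN_TOKENS

for name, share in shares.items():
    print(f"{name:<18} {share:>5.1f}%")
print(f"Total: {total_gb:.1f} GB, about {bytes_per_token:.2f} bytes per token")
```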

Training Details

Maximum Learning Rate: 3e-3
Minimum Learning Rate: 0
Number of Epochs: 1
Sequence Length: 384
Batch Size: 150
Eval Split Ratio: 0.0025
Gradient Accumulation Steps: 2
Gradient Checkpointing: True
Gradient Clipping: 1.0
Torch Compile: True
Torch Compile Mode: max-autotune-no-cudagraphs
AdamW Betas: (0.9, 0.95)
WSD Warmup Ratio: 0.015
WSD Stable Ratio: 0.685
WSD Decay Ratio: 0.30
DType: float16
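The warmup-stable-decay (WSD) ratios above describe a three-phase learning-rate curve. The sketch below is one plausible reading, not the training code itself: `wsd_lr` is a hypothetical helper, the decay is assumed to be linear (the card does not state the decay shape), and the step estimate assumes the batch size of 150 is per micro-batch, before gradient accumulation.

```python
# Values copied from the Training Details list.
MAX_LR = 3e-3
MIN_LR = 0.0
WARMUP_RATIO = 0.015
STABLE_RATIO = 0.685
DECAY_RATIO = 0.30  # remainder of training after warmup and stable phases


def wsd_lr(step: int, total_steps: int) -> float:
    """Learning rate at a 0-indexed step. Assumes linear warmup and linear decay."""
    warmup_steps = round(total_steps * WARMUP_RATIO)
    stable_steps = round(total_steps * STABLE_RATIO)
    decay_start = warmup_steps + stable_steps
    decay_steps = max(total_steps - decay_start, 1)

    if step < warmup_steps:
        # Ramp linearly up to MAX_LR.
        return MAX_LR * (step + 1) / warmup_steps
    if step < decay_start:
        # Hold at the peak.
        return MAX_LR
    # Assumption: linear interpolation from MAX_LR down to MIN_LR.
    progress = min((step - decay_start) / decay_steps, 1.0)
    return MAX_LR + (MIN_LR - MAX_LR) * progress


# Assumption: 150 sequences per micro-batch, 2 accumulation steps per update.
tokens_per_step = 150 * 384 * 2
est_steps = int(34.8e9 // tokens_per_step)

for s in (0, est_steps // 2, int(est_steps * 0.85), est_steps):
    print(f"step {s:>7,}: lr = {wsd_lr(s, est_steps):.6f}")
```

Under these assumptions an optimizer step covers 115,200 tokens, so one epoch takes roughly 302,000 steps, of which the final 30% are spent decaying to zero.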

Final Eval and Train Loss

Train: 2.404
Val: 2.403

Hardware

GPU: NVIDIA RTX 5060 (used for training)
CPU: AMD Ryzen 5 2600 (used for tokenization)

Benchmark Scores

Task	Value
BLiMP	75.94%
ARC Challenge	20.65%
ARC Easy	34.89%
BoolQ	51.80%
HellaSwag	28.39%
PiQA	57.78%
SciQ	59.10%
SWAG	41.60%
Winogrande	49.01%
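Raw accuracies are easier to read next to a random-guess baseline. The sketch below subtracts a uniform-guess rate from each score. The baselines are assumptions based on the usual number of answer options per task (two or four) and do not come from this card; a few ARC questions have a different number of options, so treat the margins as approximate.

```python
# Scores copied from the table above (percent).
SCORES = {
    "BLiMP": 75.94,
    "ARC Challenge": 20.65,
    "ARC Easy": 34.89,
    "BoolQ": 51.80,
    "HellaSwag": 28.39,
    "PiQA": 57.78,
    "SciQ": 59.10,
    "SWAG": 41.60,
    "Winogrande": 49.01,
}

# Assumed uniform-guess accuracy (percent): binary tasks 50, four-way tasks 25.
CHANCE = {
    "BLiMP": 50.0,
    "ARC Challenge": 25.0,
    "ARC Easy": 25.0,
    "BoolQ": 50.0,
    "HellaSwag": 25.0,
    "PiQA": 50.0,
    "SciQ": 25.0,
    "SWAG": 25.0,
    "Winogrande": 50.0,
}

# Percentage points above (positive) or below (negative) the assumed baseline.
margins = {task: round(SCORES[task] - CHANCE[task], 2) for task in SCORES}

for task, margin in sorted(margins.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{task:<14} {SCORES[task]:6.2f}%  ({margin:+.2f} pts vs. chance)")
```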

ArithMark-2.0:

Category	Accuracy
ops = 1	30.08%
ops = 2	35.47%
ops = 3	26.60%
Avg	31.00%

For a comparison with other small language models like this one, go here.

Generation Sample

Prompt : 'Artificial intelligence is'
------------------------------------------------------------
Generated:
 a form of biomedical research that has been fundamentally and intellectually revolutionary in the past decade. The first major advancement in artificial intelligence was the invention of computers, which were based on digital computer science and computational software, and nowadays we’re still working with machines as well as other languages. This is what’s happening in medicine today: this new technology enables us to get more information about how we can better understand human-like behaviour through our own imagination.
Currently, computer scientists have been studying the future of artificial intelligence for nearly 20 years. They are investigating how the world’s people actually look at their bodies and their environment and why they see them and how it works. As a result, they have become increasingly interested in the way we think about the future of the mind and the world around us. Most of these artificial intelligences are not physically active, but are seen in their own right. So,

Use Cases

Educational work and research
Fine-tuning for downstream use
Deployment on edge devices
Or just for fun.

Limitations

Cannot chat, reason, code, or answer questions
Output is almost always factually incorrect
No long-context handling
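Because the model has 384 position embeddings, the prompt and the generated tokens together need to fit in that window. The sketch below shows one way to enforce this before calling `generate`. `fit_to_context` is a hypothetical helper, not part of this repository or of transformers, and it works on plain lists of token ids.

```python
MAX_POSITION_EMBEDDINGS = 384  # from the Model Details list


def fit_to_context(
    prompt_ids: list[int],
    max_new_tokens: int,
    max_positions: int = MAX_POSITION_EMBEDDINGS,
) -> tuple[list[int], int]:
    """Cap generation length and trim the prompt so both fit in the window."""
    # Leave room for at least one prompt token and one generated token.
    max_new_tokens = max(1, min(max_new_tokens, max_positions - 1))
    budget = max_positions - max_new_tokens
    # Keep the most recent tokens; older context is dropped from the left.
    return prompt_ids[-budget:], max_new_tokens


# A 500-token prompt with 256 new tokens keeps only the last 128 prompt tokens.
trimmed, new_tokens = fit_to_context(list(range(500)), 256)
print(len(trimmed), new_tokens)
```

Trimming from the left also drops a leading BOS token when the prompt is too long, so it may need to be re-added depending on how the prompt was built.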

License

Before using, distributing, selling, or modifying this software, you must read the license here.

Inference

#!/usr/bin/env python3

from pathlib import Path

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, PreTrainedTokenizerFast

# Hub repo ID or local directory holding the model and tokenizer files.
MODEL_DIR = "fromziro/Er-13M"
TOKENIZER_PATH = MODEL_DIR

# Default prompt and sampling settings used by generate().
PROMPT = "Artificial intelligence is"
MAX_NEW_TOKENS = 256
TEMPERATURE = 0.7
TOP_P = 0.95
TOP_K = 30
REPETITION_PENALTY = 1.2
DO_SAMPLE = True

device = (
    "cuda" if torch.cuda.is_available() else
    "mps" if torch.backends.mps.is_available() else
    "cpu"
)
print(f"Device : {device}")

def load_tokenizer(path_or_repo: str):
    p = Path(path_or_repo)

    if p.exists() and p.is_file() and p.suffix.lower() == ".json":
        tok = PreTrainedTokenizerFast(tokenizer_file=str(p.resolve()))
    else:
        tok = AutoTokenizer.from_pretrained(path_or_repo, use_fast=True)

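    # Safety net for tokenizers that ship without special tokens. Tokens added
    # here get ids beyond the model's embedding table unless the model is
    # resized, so these branches are not expected to trigger for this model.
    if tok.bos_token is None:
        tok.add_special_tokens({"bos_token": "<|bos|>"})
    if tok.eos_token is None:
        tok.add_special_tokens({"eos_token": "<|eos|>"})
    if tok.unk_token is None:
        tok.add_special_tokens({"unk_token": "<|unk|>"})
    # Reuse EOS for padding; EOS is always set by this point.
    if tok.pad_token is None:
        tok.pad_token = tok.eos_token

    # Left padding keeps generated tokens contiguous with the prompt.
    tok.padding_side = "left"
    return tok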

print("Loading tokenizer...")
tokenizer = load_tokenizer(TOKENIZER_PATH)
print(f"  Vocab size : {len(tokenizer)}")
print(f"  BOS        : {tokenizer.bos_token!r}")
print(f"  EOS        : {tokenizer.eos_token!r}")
print(f"  PAD        : {tokenizer.pad_token!r}  (id={tokenizer.pad_token_id})")

print(f"\nLoading model from {MODEL_DIR} ...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_DIR,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
    low_cpu_mem_usage=True,
)

model.eval()
model.to(device)
model.config.use_cache = False
if hasattr(model, "generation_config") and model.generation_config is not None:
    model.generation_config.use_cache = False

total_params = sum(p.numel() for p in model.parameters())
print(f"  Parameters : {total_params:,}")

def generate(
    prompt: str = PROMPT,
    max_new_tokens: int = MAX_NEW_TOKENS,
    temperature: float = TEMPERATURE,
    top_p: float = TOP_P,
    top_k: int = TOP_K,
    repetition_penalty: float = REPETITION_PENALTY,
    do_sample: bool = DO_SAMPLE,
) -> str:
    bos = tokenizer.bos_token or ""
    full_prompt = bos + prompt

    inputs = tokenizer(
        full_prompt,
        return_tensors="pt",
        add_special_tokens=False,
    ).to(device)

    inputs.pop("token_type_ids", None)

    gen_kwargs = dict(
        max_new_tokens=max_new_tokens,
        do_sample=do_sample,
        repetition_penalty=repetition_penalty,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
        use_cache=False,
    )

    if do_sample:
        gen_kwargs["temperature"] = temperature
        gen_kwargs["top_p"] = top_p
        gen_kwargs["top_k"] = top_k

    with torch.inference_mode():
        output_ids = model.generate(**inputs, **gen_kwargs)

    prompt_len = inputs["input_ids"].shape[-1]
    new_ids = output_ids[0][prompt_len:]
    return tokenizer.decode(new_ids, skip_special_tokens=True)

if __name__ == "__main__":
    print(f"\nPrompt : {PROMPT!r}")
    print("-" * 60)
    output = generate(PROMPT)
    print("Generated:")
    print(output)

Copyright

Copyright (c) 2026 FromZero  
Copyright (c) 2026 Paul Courneya
Copyright (c) 2026 Jonathan LY

Citation

@misc{er13m,
  title     = {Er-13M: A Small Language Model (13M) Achieving a High ArithMark and BLiMP Score Through Massive Overtraining},
  author    = {FromZero},
  year      = {2026},
  url       = {https://huggingface.co/fromziro/Er-13M}
}

