Instructions to use SemplificaAI/EuroLLM-ISO24495-9b-Instruct with libraries, inference providers, notebooks, and local apps.

Libraries

How to use SemplificaAI/EuroLLM-ISO24495-9b-Instruct with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="SemplificaAI/EuroLLM-ISO24495-9b-Instruct")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("SemplificaAI/EuroLLM-ISO24495-9b-Instruct")
model = AutoModelForCausalLM.from_pretrained("SemplificaAI/EuroLLM-ISO24495-9b-Instruct")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use SemplificaAI/EuroLLM-ISO24495-9b-Instruct with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "SemplificaAI/EuroLLM-ISO24495-9b-Instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "SemplificaAI/EuroLLM-ISO24495-9b-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/SemplificaAI/EuroLLM-ISO24495-9b-Instruct

SGLang

How to use SemplificaAI/EuroLLM-ISO24495-9b-Instruct with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "SemplificaAI/EuroLLM-ISO24495-9b-Instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "SemplificaAI/EuroLLM-ISO24495-9b-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "SemplificaAI/EuroLLM-ISO24495-9b-Instruct" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "SemplificaAI/EuroLLM-ISO24495-9b-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use SemplificaAI/EuroLLM-ISO24495-9b-Instruct with Docker Model Runner:
```
docker model run hf.co/SemplificaAI/EuroLLM-ISO24495-9b-Instruct
```

Access to EuroLLM-ISO24495-9b-Instruct (v0.3)

This model is released under CC-BY-NC-4.0 (non-commercial). The form below helps us understand who is using the model and prioritize improvements. Approval is automatic once the form is submitted.

By submitting this form you confirm that (1) your intended use complies with the CC-BY-NC-4.0 license terms (non-commercial), and (2) you have read the Limitations section of the model card. For commercial use, please contact hf@semplifica.ai.

EuroLLM-ISO24495-9b-Instruct (v0.3)

A fine-tuned EuroLLM-9B-Instruct-2512 specialised in ISO 24495-1 (Plain Language) compliance analysis of legal, administrative and technical texts across six European languages: Italian, English, Portuguese, Spanish, French, German.

Given a document, the model emits a structured XML analysis with: a compliance score (0–100), a binary verdict, a list of violation spans with character-level offsets and corrective suggestions, and a prioritised checklist of corrective actions.

Version: v0.3 — trained on about 27,000 task records (v3b dataset, hybrid: v3 base + Tier 2 targeted augmentation + HF human-curated re-import), with verdict balance per language and a 19 % anti-forgetting mix (EuroBlocks instruct conversations). Previous: v0.2 — trained on v3 (about 23,000 task records), see git tag.

What changed in v0.3

Compared to v0.2 (the previous public release):

Dataset: from v3 to v3b (about +13 % task records)

The v3b dataset extends v3 with two complementary slices targeted at the weaknesses identified during the v0.2 evaluation:

Slice 1 — Tier 2 targeted augmentation (1,485 fresh records)

Generated synthetically (gemini-2.5-flash + gemini-3.5-flash recovery) in five batches, each addressing a specific gap:

Batch	Records	Target
A1 `missing_structure`	309	Under-represented violation type (only 3.7 % of spans in v3)
A2 `ambiguous_reference`	386	Under-represented (4.8 % → target ≥ 7 %)
A3 `double_negative`	398	Under-represented (5.8 % → target ≥ 7 %)
B span-dense	231	Documents with 6–15 violations to push median span density from 2 → 3+
C hard-negatives	161	Compliant (conforme) texts that resemble non-compliant ones, to reduce false positives on borderline cases

Slice 2 — Human-curated re-import (1,583 records)

Records built on top of human-curated source documents from 15 public/proprietary datasets, cleaned and normalised, then partially re-annotated with assistance from gemini-3.5-flash under human review:

Dataset	Records	Language
`wivico_fr`	366	FR
`german4all_de`	216	DE
`porsimples_sent`	198	PT
`gem_cochrane`	186	multi
`simpitiki_it`	170	IT
`easier_es`	145	ES
`admin_it`	107	IT
`med_easi`	66	EN
`vikidia_enfr`	64	EN/FR
`service_public`	26	FR
`plaba_en`	22	EN
`boe_xsum`	8	ES
`eur_lex`	5	multi
`agentpublic_travail`	3	FR
`text_complexity_de`	1	DE

This slice introduces real-world stylistic variety, edge-case clauses, and harder negative examples that pure synthetic generation underproduced.

Methodology

Same test set as v0.2 (200 blind samples, v3 split). This choice is intentional: holding the test set fixed isolates the effect of the dataset, so the training corpus is the only variable that changes between v0.2 and v0.3.
Unchanged training stack: 8-bit base + LoRA bf16 (r=64, alpha=128) + Liger kernel + paged AdamW 8-bit + 2 epochs + lr 1.5e-4 + 100-step warmup + sequence length 3072. Same hyperparameters as v0.2.
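As a rough illustration, the stack above can be written down as plain configuration values. The dictionary keys are illustrative names, not the authors' training script, and the comments only indicate which library options they would most plausibly map to (an assumption). LoRA target modules and dropout are not stated in this card and are deliberately left out.

```python
# Hyperparameters copied from the model card; key names are illustrative only.
TRAINING_STACK = {
    "base_quantization": "8-bit",     # plausibly BitsAndBytesConfig(load_in_8bit=True)
    "lora_dtype": "bf16",
    "lora_r": 64,                     # plausibly LoraConfig(r=64, ...)
    "lora_alpha": 128,                # plausibly LoraConfig(lora_alpha=128, ...)
    "optimizer": "paged_adamw_8bit",  # plausibly TrainingArguments(optim="paged_adamw_8bit")
    "liger_kernel": True,
    "num_epochs": 2,
    "learning_rate": 1.5e-4,
    "warmup_steps": 100,
    "max_seq_length": 3072,
}


def lora_scaling(cfg: dict) -> float:
    """Standard LoRA scales the low-rank update by alpha / r."""
    return cfg["lora_alpha"] / cfg["lora_r"]


print(lora_scaling(TRAINING_STACK))  # 2.0
```

With r=64 and alpha=128 the low-rank update is scaled by a factor of 2, assuming the standard alpha / r convention rather than a rank-stabilised variant.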

Dual-metric evaluation: why BERTScore was added

The v0.2 model card flagged checklist_rouge_l = 0.27 as below the acceptable threshold (≥ 0.45). After qualitative inspection, we concluded this was largely a measurement artefact, not a quality problem:

ROUGE-L is a lexical metric: it scores the longest common subsequence (LCS) of tokens shared by the predicted and reference checklists, so it penalises paraphrasing heavily.
Plain-language corrective actions admit many equally valid surface forms of the same intent. For example, "Replace archaic legal formulas with direct expressions" and "Simplify legal language to make it more accessible" are semantically equivalent but share almost no overlapping n-grams.
Multilingual amplification: ROUGE works on raw tokens, so semantic equivalents across IT/EN/FR/DE/ES/PT lexical traditions score even lower than within-language paraphrasing.
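To make the paraphrase penalty concrete, the sketch below computes an LCS-based ROUGE-L F1 for the two example sentences quoted above. It assumes lowercase whitespace tokenisation and an unweighted F1; the exact ROUGE implementation used for the reported numbers is not specified in this card.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L F1 with simple lowercase whitespace tokenisation (an assumption)."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


a = "Replace archaic legal formulas with direct expressions"
b = "Simplify legal language to make it more accessible"
print(round(rouge_l_f1(a, b), 3))  # 0.133
```

The only shared token is "legal", so the two semantically equivalent actions score about 0.13 even within a single language.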

To capture the semantic dimension of checklist quality we added checklist_bertscore_f1 using microsoft/mdeberta-v3-base:

mdeberta-v3-base is a multilingual encoder model (≈ 280 M parameters) covering all six target languages, giving robust cross-lingual embeddings.
BERTScore matches each token to its most similar token on the other side by cosine similarity of contextual embeddings, then combines the resulting precision and recall into their harmonic mean (F1). It rewards semantically equivalent paraphrases that ROUGE penalises.
ROUGE-L is still reported for backwards-compatibility with v0.1-base / v0.2 evaluations and for users who want a purely lexical view.
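The matching step behind BERTScore can be sketched without any model at all. The example below uses toy 2-d vectors as stand-ins for contextual token embeddings (hypothetical values) and omits the optional IDF weighting and baseline rescaling; it illustrates the greedy-matching idea and is not the evaluation code used for this card.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm


def bertscore_f1(pred_emb: list[list[float]], ref_emb: list[list[float]]) -> float:
    """Greedy-matching BERTScore F1 from per-token embedding vectors."""
    # Precision: each predicted token takes its best match in the reference.
    precision = sum(max(cosine(p, r) for r in ref_emb) for p in pred_emb) / len(pred_emb)
    # Recall: each reference token takes its best match in the prediction.
    recall = sum(max(cosine(r, p) for p in pred_emb) for r in ref_emb) / len(ref_emb)
    return 2 * precision * recall / (precision + recall)


# Toy embeddings standing in for real contextual vectors (hypothetical values).
pred = [[1.0, 0.0], [0.0, 1.0]]
ref = [[1.0, 0.0], [0.6, 0.8]]
print(round(bertscore_f1(pred, ref), 3))  # 0.9
```

Because tokens are matched by embedding similarity rather than by identity, a near-synonym still earns partial credit (0.8 here) where a lexical metric would give none.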

Empirically, the v0.2 model has BERTScore F1 = 0.72 against ROUGE-L = 0.27 — confirming that the corrective items are semantically on-target but lexically free-form. The same gap holds for v0.3 (0.72 vs 0.27).

Headline metric changes (200-sample blind test, v3 split)

Metric	v0.2	v0.3	Δ
`score_mae` (lower is better)	2.74	2.705	-1.3 %
`verdict_f1`	0.9577	0.9793	+2.3 %
`verdict_recall`	0.9444	0.9861	+4.4 %
`false_positive_rate` (lower is better)	0.0156	0.0156	=
`span_f1` (IoU ≥ 0.5)	0.3653	0.3885	+6.4 %
`checklist_rouge_l`	0.2655	0.2663	+0.3 %
`checklist_bertscore_f1`	0.7179	0.7203	+0.3 %

Six of the seven metrics improve and one (false_positive_rate) is unchanged; no metric regresses. The largest gains are on the targets of the Tier 2 augmentation: verdict recall (+4.4 %, helped by the C hard-negatives batch) and span F1 (+6.4 %, helped by the A1/A2/A3 under-represented-types batches and the B span-dense batch). Checklist metrics improve only marginally because the v3b slices were not designed to expand the corrective-action vocabulary.
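The Δ column can be recomputed from the two value columns. The short sketch below does so and also labels each metric as improved, unchanged or regressed, taking into account which metrics are lower-is-better; the helper names are illustrative.

```python
# (v0.2, v0.3, lower_is_better) copied from the table above.
METRICS = {
    "score_mae": (2.74, 2.705, True),
    "verdict_f1": (0.9577, 0.9793, False),
    "verdict_recall": (0.9444, 0.9861, False),
    "false_positive_rate": (0.0156, 0.0156, True),
    "span_f1": (0.3653, 0.3885, False),
    "checklist_rouge_l": (0.2655, 0.2663, False),
    "checklist_bertscore_f1": (0.7179, 0.7203, False),
}


def relative_change(old: float, new: float) -> float:
    """Signed relative change in percent, rounded to one decimal."""
    return round((new - old) / old * 100, 1)


def direction(old: float, new: float, lower_is_better: bool) -> str:
    if new == old:
        return "unchanged"
    improved = new < old if lower_is_better else new > old
    return "improved" if improved else "regressed"


for name, (old, new, lower) in METRICS.items():
    print(f"{name:24s} {relative_change(old, new):+5.1f} %  {direction(old, new, lower)}")
```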

Quick start

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

REPO = "SemplificaAI/EuroLLM-ISO24495-9b-Instruct"
REVISION = "v0.3"  # or omit for latest

# Recommended: 8-bit loading → about 9 GB VRAM (instead of about 18 GB in bf16)
bnb = BitsAndBytesConfig(load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained(REPO, revision=REVISION)
model = AutoModelForCausalLM.from_pretrained(
    REPO, revision=REVISION,
    quantization_config=bnb, device_map="auto", torch_dtype=torch.bfloat16,
)
model.eval()

SYSTEM = (
    "You are an expert in plain language according to ISO 24495-1:2023. "
    "Analyze the provided text and produce: (1) a compliance score 0-100, "
    "(2) parts to improve with specific suggestions, "
    "(3) an ordered checklist of corrective actions. "
    "Reply directly without thinking aloud."
)

text = """The Parties hereby acknowledge, in light of the foregoing premises
which form an integral and substantive part of this Agreement, that the
Confidential Information shall not include..."""

messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": f"Analyze this text for ISO 24495-1 plain language compliance:\n\n<TEXT>\n{text}\n</TEXT>"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

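# Fall back to EOS only when no pad token is defined (a pad id of 0 is valid, so avoid `or`)
pad_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=3072, do_sample=False, pad_token_id=pad_id)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))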

Output format

The model emits a single XML block with four fields. The format is identical to v0.2: see the v0.2 model card for the full schema and the canonical violation_type vocabulary (10 categories).
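For downstream use, the XML block usually has to be pulled out of the generated text and parsed. The sketch below shows one way to do that with the standard library. Every tag and attribute name in it (`analysis`, `score`, `verdict`, `violation`, `item`, `type`, `start`, `end`) is a placeholder chosen for illustration, since the real schema lives in the v0.2 model card and is not reproduced here; adapt the names before using it.

```python
import re
import xml.etree.ElementTree as ET

# NOTE: tag and attribute names are placeholders, not the real schema.
SAMPLE = """<analysis>
  <score>42</score>
  <verdict>non_compliant</verdict>
  <violations>
    <violation type="double_negative" start="10" end="31">Use a positive form.</violation>
  </violations>
  <checklist>
    <item>Split long sentences.</item>
  </checklist>
</analysis>"""


def parse_analysis(raw: str, root_tag: str = "analysis") -> dict:
    """Extract the first <root_tag>...</root_tag> block and turn it into a dict."""
    match = re.search(rf"<{root_tag}\b.*?</{root_tag}>", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no XML analysis block found in model output")
    root = ET.fromstring(match.group(0))
    return {
        "score": int(root.findtext("score", default="0")),
        "verdict": root.findtext("verdict", default="").strip(),
        "violations": [
            {**v.attrib, "suggestion": (v.text or "").strip()}
            for v in root.iter("violation")
        ],
        "checklist": [(i.text or "").strip() for i in root.iter("item")],
    }


result = parse_analysis("Some preamble\n" + SAMPLE)
print(result["score"], result["verdict"], len(result["violations"]))
```

`ET.fromstring` raises `ParseError` on malformed XML (for example an unescaped `&`), so production code would likely want to catch that and fall back to a more lenient extraction.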

Examples

Qualitative examples will be added in a follow-up commit. Behaviour on the two reference documents in the v0.2 model card (Italian NDA, English HVAC safety manual) is consistent with v0.2; the upgrade is most visible on dense documents with many violations and on borderline-compliant cases.

Evaluation

Evaluated on the same 200 blind samples as v0.2, drawn from the v3 held-out test split, stratified by (language × doc_type × difficulty × verdict), never seen during training or validation. Using the same test set makes the v0.2 → v0.3 comparison direct.

Metrics

Metric	Prod threshold	Acceptable threshold	v0.3 result	Status
`score_mae`	≤ 8.0	≤ 12.0	2.705	✅ PROD
`verdict_f1`	≥ 0.88	≥ 0.80	0.9793	✅ PROD
`verdict_precision`	—	—	0.9726	(high)
`verdict_recall`	—	—	0.9861	(very high)
`false_positive_rate`	≤ 0.08	≤ 0.15	0.0156	✅ PROD
`span_f1` (IoU char-level ≥ 0.5)	≥ 0.72	≥ 0.62	0.3885	⚠️ below accept (trend ↑)
`checklist_rouge_l` (legacy)	≥ 0.55	≥ 0.45	0.2663	⚠️ below accept (lexical)
`checklist_bertscore_f1` (mdeberta-v3-base)	≥ 0.78	≥ 0.65	0.7203	✅ accept (semantic)
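The Status column follows mechanically from the two thresholds. A minimal sketch of that rule, assuming "PROD" means the production threshold is met, "accept" means only the acceptable threshold is met, and with an illustrative function name:

```python
def status(value: float, prod: float, accept: float, lower_is_better: bool) -> str:
    """Map a metric value to 'PROD', 'accept' or 'below accept'."""
    meets = (lambda v, t: v <= t) if lower_is_better else (lambda v, t: v >= t)
    if meets(value, prod):
        return "PROD"
    if meets(value, accept):
        return "accept"
    return "below accept"


# (value, prod threshold, acceptable threshold, lower_is_better) from the table above.
ROWS = {
    "score_mae": (2.705, 8.0, 12.0, True),
    "verdict_f1": (0.9793, 0.88, 0.80, False),
    "false_positive_rate": (0.0156, 0.08, 0.15, True),
    "span_f1": (0.3885, 0.72, 0.62, False),
    "checklist_rouge_l": (0.2663, 0.55, 0.45, False),
    "checklist_bertscore_f1": (0.7203, 0.78, 0.65, False),
}

for name, row in ROWS.items():
    print(f"{name:24s} {status(*row)}")
```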

Interpretation

Strengths

Verdict F1 now 0.979 (vs 0.958 in v0.2): one of the best improvements of the v0.3 cycle. Both precision (0.97) and recall (0.99) are very high — the model both correctly flags non-compliant texts and almost never misses one.
Score calibration further improved: MAE 2.705 on the 0–100 scale, vs 2.74 in v0.2. Already excellent, marginally better.
False positive rate unchanged at 1.6 %: well below the production cap of 8 %. The C hard-negatives batch did not increase the FPR despite the higher verdict recall — a textbook recall-without-precision-loss outcome.
Span F1 improving in the right direction (+6.4 %, 0.37 → 0.39), driven by the A1/A2/A3 under-represented-type augmentation and the B span-dense batch. Still below the acceptable threshold of 0.62: reaching that target will require manual annotation (planned for v1.0).
Robust semantic checklist quality (BERTScore F1 0.72, above the acceptable threshold of 0.65). Stable vs v0.2 because the v3b augmentation was not targeted at the corrective-action vocabulary.

Measured weaknesses

Span F1 0.39 still below the acceptable threshold (≥ 0.62). The Tier 2 synthetic augmentation pushed the metric in the right direction (+6.4 %), but closing the remaining gap requires real-world annotated samples that no synthetic pipeline replicates well at this stage.
Checklist ROUGE-L 0.27 unchanged. A lexical metric is not the right tool here: the semantic equivalent (BERTScore F1 0.72) shows the corrective items are on-target. We keep ROUGE-L for back-compatibility but recommend reading BERTScore as the primary checklist quality signal.
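To show why span F1 is a demanding metric, here is a minimal sketch of a character-level IoU match at the 0.5 threshold. It assumes half-open (start, end) offsets, greedy one-to-one matching, and ignores the violation type; the card does not spell out these details, so the actual evaluation may differ.

```python
def iou(a: tuple[int, int], b: tuple[int, int]) -> float:
    """Character-level IoU of two half-open spans (start, end)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def span_f1(pred: list[tuple[int, int]], gold: list[tuple[int, int]], thr: float = 0.5) -> float:
    """F1 where a predicted span is a hit if it matches an unused gold span with IoU >= thr."""
    unused = list(gold)
    hits = 0
    for p in pred:
        best = max(unused, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= thr:
            unused.remove(best)
            hits += 1
    if hits == 0:
        return 0.0
    precision, recall = hits / len(pred), hits / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = [(0, 20), (50, 80)]
pred = [(2, 22), (60, 100), (120, 130)]
print(round(span_f1(pred, gold), 2))  # 0.4
```

In this toy case the second prediction overlaps the right passage but with IoU 0.4, so it counts as a miss: a model can locate roughly the right region and still lose the point on boundary placement.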

Robustness on a newer distribution (v4 test split)

To check that v0.3 does not overfit to the v3 distribution, we built a second 200-sample blind set drawn from the test split of the v4 dataset. In this v4 split, 87 % of the records are not present in v3 (different sentence-level chunking, an additional delibera document type, and the human-curated re-imports from public datasets that v3b training exposed only partially). It is the closest available proxy for an out-of-training distribution.

Metric	test v3 (in-distribution)	test v4 (87 % unseen)	test merged (v3 ∪ v4 dedup)	Δ v3 → v4
`score_mae`	2.705	2.86	3.27	+5.7 %
`verdict_f1`	0.9793	0.9789	0.9674	−0.04 %
`verdict_precision`	0.9726	0.9789	0.9674	+0.65 %
`verdict_recall`	0.9861	0.9789	0.9674	−0.73 %
`false_positive_rate`	0.0156	0.0190	0.0278	+0.34 pp
`span_f1`	0.3885	0.3277	0.3303	−15.7 %
`checklist_rouge_l`	0.2663	0.2614	0.2442	−1.8 %
`checklist_bertscore_f1`	0.7203	0.7260	0.7246	+0.8 %

Reading:

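The model holds up well on the v4 distribution: the metrics that meet their production thresholds on v3 (score_mae, verdict_f1, false_positive_rate) still meet them on v4 and on the merged set, and checklist_bertscore_f1 stays above its acceptable threshold. span_f1 and checklist_rouge_l remain below their acceptable thresholds, as on v3. The merged column is the best single estimate of expected production performance because it samples uniformly from the combined v3 ∪ v4 pool.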
The hardest hit is span_f1 (−15.7 %), confirming that span localisation is the weakest dimension on out-of-distribution records (long-tail violation patterns underrepresented in the v3b training corpus).
Binary verdict and semantic checklist quality are stable: verdict_f1 essentially unchanged (−0.04 %); BERTScore F1 marginally up (+0.8 %).

The same evaluation was also run with the previous v0.2 model on the v4 split for comparison: v0.2 reaches an identical verdict_f1 of 0.9789 but a score_mae of 3.78 against 2.86 for v0.3 (about 32 % higher), which suggests that the v0.3 training made score calibration more robust even on distributions the model has only partially seen.

Intended use, Limitations, Training details, Dataset

See the v0.2 model card (most fields unchanged in v0.3). Differences are limited to the training set composition described above and to the additional BERTScore metric.

License & Attribution

See LICENSE (CC-BY-NC-4.0) and ATTRIBUTION.md (Apache 2.0 base model: utter-project/EuroLLM-9B-Instruct-2512). Identical to v0.2.

Citation

@misc{semplifica_iso24495_9b_v03_2026,
  title  = {EuroLLM-ISO24495-9b-Instruct (v0.3): A Fine-Tuned EuroLLM-9B
            for ISO 24495-1 Plain Language Compliance Analysis in Six EU Languages},
  author = {SemplificaAI},
  year   = {2026},
  url    = {https://huggingface.co/SemplificaAI/EuroLLM-ISO24495-9b-Instruct},
  note   = {v0.3},
}

Contact

Commercial use: hf@semplifica.ai
Issues, bugs: Community tab on this HF repository.

Safetensors

Model size: 9B params
Tensor type: BF16

Model tree for SemplificaAI/EuroLLM-ISO24495-9b-Instruct

Base model: utter-project/EuroLLM-9B-2512
Finetuned: utter-project/EuroLLM-9B-Instruct-2512
Finetuned: this model

Evaluation results

Self-reported, on the semplifica.Language v3 test set (blind, 200 samples):

Metric	Value
Score MAE (0–100)	2.705
Verdict F1 (binary)	0.979
Verdict Precision	0.973
Verdict Recall	0.986
False Positive Rate	0.016
Span F1 (IoU ≥ 0.5)	0.389
Checklist ROUGE-L	0.266
Checklist BERTScore F1 (mdeberta-v3-base)	0.720