Access to EuroLLM-ISO24495-9b-Instruct (v0.3)
This model is released under CC-BY-NC-4.0 (non-commercial). The form below helps us understand who is using the model and prioritize improvements. Approval is automatic once the form is submitted.
By submitting this form you confirm that (1) your intended use complies with the CC-BY-NC-4.0 license terms (non-commercial), and (2) you have read the Limitations section of the model card. For commercial use, please contact hf@semplifica.ai.
EuroLLM-ISO24495-9b-Instruct (v0.3)
A fine-tuned EuroLLM-9B-Instruct-2512 specialised in ISO 24495-1 (Plain Language) compliance analysis of legal, administrative and technical texts across six European languages: Italian, English, Portuguese, Spanish, French, German.
Given a document, the model emits a structured XML analysis with: a compliance score (0–100), a binary verdict, a list of violation spans with character-level offsets and corrective suggestions, and a prioritised checklist of corrective actions.
Version: v0.3, trained on about 27,000 task records (v3b dataset, hybrid: v3 base + Tier 2 targeted augmentation + HF human-curated re-import), with verdict balance per language and a 19 % anti-forgetting mix (EuroBlocks instruct conversations). Previous: v0.2, trained on v3 (about 23,000 task records); see git tag.
What changed in v0.3
Compared to v0.2 (the previous public release):
Dataset: from v3 to v3b (about +13 % task records)
The v3b dataset extends v3 with two complementary slices targeted at the weaknesses identified during the v0.2 evaluation:
Slice 1 — Tier 2 targeted augmentation (1,485 fresh records)
Generated synthetically (gemini-2.5-flash + gemini-3.5-flash recovery) on five batches addressing specific gaps:

| Batch | Records | Target |
|---|---|---|
| A1 missing_structure | 309 | Under-represented violation type (only 3.7 % of spans in v3) |
| A2 ambiguous_reference | 386 | Under-represented (4.8 % → target ≥ 7 %) |
| A3 double_negative | 398 | Under-represented (5.8 % → target ≥ 7 %) |
| B span-dense | 231 | Documents with 6–15 violations to push median span density from 2 → 3+ |
| C hard-negatives | 161 | Conforme texts that look like non-conforme, to reduce false positives on borderline cases |
Slice 2 — Human-curated re-import (1,583 records)
Records built on top of human-curated source documents from 15 public/proprietary datasets, cleaned and normalised, then partially re-annotated with assistance from gemini-3.5-flash under human review:

| Dataset | Records | Language |
|---|---|---|
| wivico_fr | 366 | FR |
| german4all_de | 216 | DE |
| porsimples_sent | 198 | PT |
| gem_cochrane | 186 | multi |
| simpitiki_it | 170 | IT |
| easier_es | 145 | ES |
| admin_it | 107 | IT |
| med_easi | 66 | EN |
| vikidia_enfr | 64 | EN/FR |
| service_public | 26 | FR |
| plaba_en | 22 | EN |
| boe_xsum | 8 | ES |
| eur_lex | 5 | multi |
| agentpublic_travail | 3 | FR |
| text_complexity_de | 1 | DE |
This slice introduces real-world stylistic variety, edge-case clauses, and harder negative examples that pure synthetic generation underproduced.
Methodology
- Same test set as v0.2 (200 blind samples, v3 split). This is intentional: holding the test set fixed isolates the dataset effect, so the only variable that changes between v0.2 and v0.3 is the training corpus.
- Training stack unchanged: 8-bit base + LoRA bf16 (r=64, alpha=128) + Liger kernel + paged AdamW 8-bit + 2 epochs + lr 1.5e-4 + 100-step warmup + sequence length 3072. Same hyperparameters as v0.2.
Dual-metric evaluation: why BERTScore was added
The v0.2 model card flagged checklist_rouge_l = 0.27 as below the
acceptable threshold (≥ 0.45). After qualitative inspection, we
concluded this was largely a measurement artefact, not a quality
problem:
- ROUGE-L is a lexical metric: it scores the longest common subsequence (LCS) of tokens shared by the predicted and reference checklists, so it penalises paraphrasing heavily.
- Plain-language corrective actions admit many equally valid surface forms of the same intent. For example, "Replace archaic legal formulas with direct expressions" and "Simplify legal language to make it more accessible" are semantically equivalent but share almost no overlapping n-grams.
- Multilingual amplification: ROUGE works on raw tokens, so semantic equivalents across IT/EN/FR/DE/ES/PT lexical traditions score even lower than within-language paraphrasing.
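The lexical penalty described above can be reproduced on the two example sentences with a minimal ROUGE-L sketch. It assumes lowercased whitespace tokenisation and a plain F1 of LCS precision and recall; the evaluation pipeline used for this card may tokenise, stem, or weight differently.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(prediction, reference):
    """ROUGE-L F1 with whitespace tokenisation (an assumption of this sketch)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


pred = "Replace archaic legal formulas with direct expressions"
ref = "Simplify legal language to make it more accessible"
score = rouge_l_f1(pred, ref)
print(f"ROUGE-L F1 = {score:.3f}")  # only "legal" is shared: 2/15 ≈ 0.133
```

The two sentences express the same corrective intent, yet the only shared token is "legal", so the lexical score stays close to zero.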
To capture the semantic dimension of checklist quality we added
checklist_bertscore_f1 using microsoft/mdeberta-v3-base:
- mdeberta-v3-base is a multilingual encoder model (≈ 280 M parameters) covering all six target languages, giving robust cross-lingual embeddings.
- BERTScore F1 computes cosine similarity between contextual embeddings of predicted vs reference tokens, then takes the precision-recall harmonic mean. It rewards semantically equivalent paraphrases that ROUGE penalises.
- ROUGE-L is still reported for backwards-compatibility with v0.1-base / v0.2 evaluations and for users who want a purely lexical view.
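The matching step behind BERTScore F1 can be sketched on toy vectors. This is an illustration only: the 2-dimensional "embeddings" below are made up, whereas the real metric uses contextual embeddings from the encoder and may add IDF weighting or baseline rescaling, none of which is shown here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def bertscore_f1(pred_emb, ref_emb):
    """Greedy-matching F1 over token embeddings (toy version)."""
    # Precision: each predicted token is matched to its most similar reference token.
    precision = sum(max(cosine(p, r) for r in ref_emb) for p in pred_emb) / len(pred_emb)
    # Recall: each reference token is matched to its most similar predicted token.
    recall = sum(max(cosine(r, p) for p in pred_emb) for r in ref_emb) / len(ref_emb)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy embeddings (hypothetical values): same "meanings", different token order.
pred_tokens = [[1.0, 0.0], [0.0, 1.0]]
ref_tokens = [[0.0, 1.0], [1.0, 0.0]]
same_meaning = bertscore_f1(pred_tokens, ref_tokens)

# Toy embeddings with no similarity at all.
unrelated = bertscore_f1([[1.0, 0.0]], [[0.0, 1.0]])
print(same_meaning, unrelated)
```

Because matching is done per token on embedding similarity rather than on surface form or order, a paraphrase whose tokens land near the reference tokens in embedding space still scores high.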
Empirically, the v0.2 model has BERTScore F1 = 0.72 against ROUGE-L = 0.27 — confirming that the corrective items are semantically on-target but lexically free-form. The same gap holds for v0.3 (0.72 vs 0.27).
Headline metric changes (200-sample blind test, v3 split)
| Metric | v0.2 | v0.3 | Δ |
|---|---|---|---|
| score_mae (lower is better) | 2.74 | 2.705 | -1.3 % |
| verdict_f1 | 0.9577 | 0.9793 | +2.3 % |
| verdict_recall | 0.9444 | 0.9861 | +4.4 % |
| false_positive_rate (lower is better) | 0.0156 | 0.0156 | = |
| span_f1 (IoU ≥ 0.5) | 0.3653 | 0.3885 | +6.4 % |
| checklist_rouge_l | 0.2655 | 0.2663 | +0.3 % |
| checklist_bertscore_f1 | 0.7179 | 0.7203 | +0.3 % |
Six of the seven metrics improve and false_positive_rate is unchanged; no metric regresses. The largest gains are on the targets of the Tier 2 augmentation: verdict recall (+4.4 %, helped by the C hard-negatives batch) and span F1 (+6.4 %, helped by the A1/A2/A3 under-represented-types batches and the B span-dense batch). Checklist metrics improve only marginally because the v3b slices were not designed to expand the corrective-action vocabulary.
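The Δ column is a relative change with respect to v0.2. A short sketch of that arithmetic, using the values from the table (the helper name `relative_change_pct` is introduced here for illustration):

```python
# (v0.2, v0.3) values copied from the headline metrics table (v3 test split).
METRICS = {
    "score_mae": (2.74, 2.705),
    "verdict_f1": (0.9577, 0.9793),
    "verdict_recall": (0.9444, 0.9861),
    "false_positive_rate": (0.0156, 0.0156),
    "span_f1": (0.3653, 0.3885),
    "checklist_rouge_l": (0.2655, 0.2663),
    "checklist_bertscore_f1": (0.7179, 0.7203),
}


def relative_change_pct(old, new):
    """Relative change of `new` with respect to `old`, in percent."""
    return (new - old) / old * 100.0


deltas = {
    name: round(relative_change_pct(old, new), 1)
    for name, (old, new) in METRICS.items()
}

for name, delta in deltas.items():
    print(f"{name:24s} {delta:+.1f} %")
```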
Quick start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

REPO = "SemplificaAI/EuroLLM-ISO24495-9b-Instruct"
REVISION = "v0.3"  # or omit for latest

# Recommended: 8-bit loading → about 9 GB VRAM (instead of about 18 GB in bf16)
bnb = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(REPO, revision=REVISION)
model = AutoModelForCausalLM.from_pretrained(
    REPO, revision=REVISION,
    quantization_config=bnb, device_map="auto", torch_dtype=torch.bfloat16,
)
model.eval()

SYSTEM = (
    "You are an expert in plain language according to ISO 24495-1:2023. "
    "Analyze the provided text and produce: (1) a compliance score 0-100, "
    "(2) parts to improve with specific suggestions, "
    "(3) an ordered checklist of corrective actions. "
    "Reply directly without thinking aloud."
)

text = """The Parties hereby acknowledge, in light of the foregoing premises
which form an integral and substantive part of this Agreement, that the
Confidential Information shall not include..."""

messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": f"Analyze this text for ISO 24495-1 plain language compliance:\n\n<TEXT>\n{text}\n</TEXT>"},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=3072, do_sample=False,
                         pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id)

print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
Output format
The model emits a single XML block with four fields. Format identical to v0.2 — see the v0.2 model card for the full schema and the canonical violation_type vocabulary (10 categories).
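Since the output is a single XML block, it can be loaded with the standard library once it has been cut out of the generated text. The sketch below is generic: every tag and attribute name in the sample (`analysis`, `score`, `verdict`, `violations`, `checklist`, `start`, `end`) is a hypothetical placeholder, not the real schema, which is documented in the v0.2 model card.

```python
import xml.etree.ElementTree as ET

# HYPOTHETICAL sample output: tag and attribute names are placeholders only.
raw_output = """Some text the model might emit before the block.
<analysis>
  <score>42</score>
  <verdict>non-conforme</verdict>
  <violations>
    <violation start="4" end="11">placeholder suggestion</violation>
  </violations>
  <checklist>
    <item>placeholder action</item>
  </checklist>
</analysis>"""


def parse_xml_block(raw):
    """Cut the text from the first '<' to the last '>' and parse it as XML."""
    start = raw.find("<")
    end = raw.rfind(">")
    if start == -1 or end == -1 or end < start:
        raise ValueError("no XML block found in model output")
    return ET.fromstring(raw[start:end + 1])


root = parse_xml_block(raw_output)
top_level_fields = [child.tag for child in root]
print(root.tag, top_level_fields)
```

`ET.fromstring` raises `xml.etree.ElementTree.ParseError` on malformed XML, so callers may want to catch it and handle or retry a truncated generation.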
Examples
Qualitative examples will be added in a follow-up commit. The behaviour on the two reference documents in the v0.2 model card (Italian NDA, English HVAC safety manual) is consistent with v0.2 — the upgrade is most visible on dense documents with many violations and on borderline-conforme cases.
Evaluation
Evaluated on the same 200 blind samples as v0.2, drawn from the v3 held-out test split, stratified by (language × doc_type × difficulty × verdict) and never seen during training or validation. Using the same test set makes the v0.2 and v0.3 results directly comparable.
Metrics
| Metric | Prod threshold | Acceptable threshold | v0.3 result | Status |
|---|---|---|---|---|
| score_mae | ≤ 8.0 | ≤ 12.0 | 2.705 | ✅ PROD |
| verdict_f1 | ≥ 0.88 | ≥ 0.80 | 0.9793 | ✅ PROD |
| verdict_precision | n/a | n/a | 0.9726 | (high) |
| verdict_recall | n/a | n/a | 0.9861 | (very high) |
| false_positive_rate | ≤ 0.08 | ≤ 0.15 | 0.0156 | ✅ PROD |
| span_f1 (IoU char-level ≥ 0.5) | ≥ 0.72 | ≥ 0.62 | 0.3885 | ⚠️ below accept (trend ↑) |
| checklist_rouge_l (legacy) | ≥ 0.55 | ≥ 0.45 | 0.2663 | ⚠️ below accept (lexical) |
| checklist_bertscore_f1 (mdeberta-v3-base) | ≥ 0.78 | ≥ 0.65 | 0.7203 | ✅ accept (semantic) |
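For readers unfamiliar with span-level scoring, the following sketch shows one way a character-level span F1 with an IoU ≥ 0.5 criterion can be computed. The greedy one-to-one matching and the half-open `(start, end)` span convention are assumptions of this sketch; the card does not spell out the exact matching protocol, or whether the violation type must also agree.

```python
def iou(a, b):
    """Intersection-over-union of two half-open character spans (start, end)."""
    intersection = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def span_f1(predicted, gold, threshold=0.5):
    """F1 over spans; a prediction counts if it overlaps an unmatched gold span."""
    unmatched = list(gold)
    true_positives = 0
    for pred in predicted:
        # Greedy choice: the still-unmatched gold span with the highest IoU.
        best = max(unmatched, key=lambda g: iou(pred, g), default=None)
        if best is not None and iou(pred, best) >= threshold:
            unmatched.remove(best)
            true_positives += 1
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example: one prediction overlaps a gold span (IoU 0.8), the other misses.
predicted_spans = [(0, 10), (20, 30)]
gold_spans = [(0, 8), (40, 50)]
result = span_f1(predicted_spans, gold_spans)
print(result)  # 0.5
```

Under this kind of criterion a prediction that finds the right violation but is off by more than half of the combined extent counts as both a false positive and a false negative, which is one reason span F1 is much harder to raise than the verdict metrics.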
Interpretation
Strengths
- Verdict F1 now 0.979 (vs 0.958 in v0.2): one of the best improvements of the v0.3 cycle. Both precision (0.97) and recall (0.99) are very high — the model both correctly flags non-compliant texts and almost never misses one.
- Score calibration further improved: MAE 2.705 on the 0–100 scale, vs 2.74 in v0.2. Already excellent, marginally better.
- False positive rate unchanged at 1.6 %: well below the production cap of 8 %. The C hard-negatives batch did not increase the FPR despite the higher verdict recall — a textbook recall-without-precision-loss outcome.
- Span F1 improving in the right direction (+6.4 %, 0.37 → 0.39), driven by the A1/A2/A3 under-represented-type augmentation and the B span-dense batch. Still below the acceptable threshold of 0.62: reaching that target will require manual annotation (planned for v1.0).
- Robust semantic checklist quality (BERTScore F1 0.72, above the acceptable threshold of 0.65). Stable vs v0.2 because the v3b augmentation was not targeted at the corrective-action vocabulary.
Measured weaknesses
- Span F1 0.39 is still below the acceptable threshold (≥ 0.62). The Tier 2 synthetic augmentation pushed the metric in the right direction (+6.4 %), but closing the remaining gap requires real-world annotated samples that no synthetic pipeline replicates well at this stage.
- Checklist ROUGE-L 0.27 is unchanged. A lexical metric is not the right tool here: the semantic counterpart (BERTScore F1 0.72) shows the corrective items are on-target. We keep ROUGE-L for backwards compatibility but recommend reading BERTScore as the primary checklist quality signal.
Robustness on a newer distribution (v4 test split)
To check that v0.3 does not overfit to the v3 distribution, we built a second 200-sample blind set drawn from the v4 dataset test split. In this v4 split, 87 % of the records are not present in v3 (different sentence-level chunking, an additional delibera document type, and the human-curated re-imports from public datasets that v3b training exposed only partially). It is the closest available proxy for out-of-training-distribution data.
| Metric | test v3 (in-distribution) | test v4 (87 % unseen) | test merged (v3 ∪ v4 dedup) | Δ v3 → v4 |
|---|---|---|---|---|
| score_mae | 2.705 | 2.86 | 3.27 | +5.7 % |
| verdict_f1 | 0.9793 | 0.9789 | 0.9674 | −0.04 % |
| verdict_precision | 0.9726 | 0.9789 | 0.9674 | +0.65 % |
| verdict_recall | 0.9861 | 0.9789 | 0.9674 | −0.73 % |
| false_positive_rate | 0.0156 | 0.0190 | 0.0278 | +0.34 pp |
| span_f1 | 0.3885 | 0.3277 | 0.3303 | −15.7 % |
| checklist_rouge_l | 0.2663 | 0.2614 | 0.2442 | −1.8 % |
| checklist_bertscore_f1 | 0.7203 | 0.7260 | 0.7246 | +0.8 % |
Reading:
- The model holds up well on the v4 distribution: score_mae, verdict_f1 and false_positive_rate stay within the production thresholds documented above, and checklist_bertscore_f1 stays above its acceptable threshold. The merged column is the best single estimate of expected production performance because it samples uniformly from the combined v3 ∪ v4 pool.
- The hardest hit is span_f1 (−15.7 %), confirming that span localisation is the weakest dimension on out-of-distribution records (long-tail violation patterns underrepresented in the v3b training corpus).
- Binary verdict and semantic checklist quality are stable: verdict_f1 is essentially unchanged (−0.04 %); BERTScore F1 is marginally up (+0.8 %).
The same evaluation was also run on the previous v0.2 model on the v4 split for comparison: v0.2 reaches an identical verdict_f1 of 0.9789, but its score_mae is about 32 % higher (3.78 vs 2.86), which suggests that the v0.3 training made score calibration more robust even on distributions it has only partially seen.
Intended use, Limitations, Training details, Dataset
See the v0.2 model card (most fields unchanged in v0.3). Differences are limited to the training set composition described above and to the additional BERTScore metric.
License & Attribution
See LICENSE (CC-BY-NC-4.0) and ATTRIBUTION.md (Apache 2.0 base model:
utter-project/EuroLLM-9B-Instruct-2512). Identical to v0.2.
Citation
```bibtex
@misc{semplifica_iso24495_9b_v03_2026,
  title  = {EuroLLM-ISO24495-9b-Instruct (v0.3): A Fine-Tuned EuroLLM-9B
            for ISO 24495-1 Plain Language Compliance Analysis in Six EU Languages},
  author = {SemplificaAI},
  year   = {2026},
  url    = {https://huggingface.co/SemplificaAI/EuroLLM-ISO24495-9b-Instruct},
  note   = {v0.3},
}
```
Contact
- Commercial use: hf@semplifica.ai
- Issues, bugs: Community tab on this HF repository.