Access to EuroLLM-ISO24495-9b-Instruct (v0.3)

This model is released under CC-BY-NC-4.0 (non-commercial). The form below helps us understand who is using the model and prioritize improvements. Approval is automatic once the form is submitted.

By submitting this form you confirm that (1) your intended use complies with the CC-BY-NC-4.0 license terms (non-commercial), and (2) you have read the Limitations section of the model card. For commercial use, please contact hf@semplifica.ai.


EuroLLM-ISO24495-9b-Instruct (v0.3)

A fine-tuned EuroLLM-9B-Instruct-2512 specialised in ISO 24495-1 (Plain Language) compliance analysis of legal, administrative and technical texts across six European languages: Italian, English, Portuguese, Spanish, French, German.

Given a document, the model emits a structured XML analysis with: a compliance score (0–100), a binary verdict, a list of violation spans with character-level offsets and corrective suggestions, and a prioritised checklist of corrective actions.

Version: v0.3 — trained on about 27,000 task records (v3b dataset, hybrid: v3 base + Tier 2 targeted augmentation + HF human-curated re-import), with verdict balance per language and a 19 % anti-forgetting mix (EuroBlocks instruct conversations). Previous: v0.2 — trained on v3 (about 23,000 task records), see git tag.

What changed in v0.3

Compared to v0.2 (the previous public release):

Dataset: from v3 to v3b (about +13 % task records)

The v3b dataset extends v3 with two complementary slices targeted at the weaknesses identified during the v0.2 evaluation:

Slice 1 — Tier 2 targeted augmentation (1,485 fresh records)

Generated synthetically (gemini-2.5-flash + gemini-3.5-flash recovery) on five batches addressing specific gaps:

| Batch | Records | Target |
|---|---|---|
| A1 missing_structure | 309 | Under-represented violation type (only 3.7 % of spans in v3) |
| A2 ambiguous_reference | 386 | Under-represented (4.8 % → target ≥ 7 %) |
| A3 double_negative | 398 | Under-represented (5.8 % → target ≥ 7 %) |
| B span-dense | 231 | Documents with 6–15 violations to push median span density from 2 → 3+ |
| C hard-negatives | 161 | Conforme texts that look like non-conforme, to reduce false positives on borderline cases |

Slice 2 — Human-curated re-import (1,583 records)

Records built on top of human-curated source documents from 15 public/proprietary datasets, cleaned and normalised, then partially re-annotated with assistance from gemini-3.5-flash under human review:

| Dataset | Records | Language |
|---|---|---|
| wivico_fr | 366 | FR |
| german4all_de | 216 | DE |
| porsimples_sent | 198 | PT |
| gem_cochrane | 186 | multi |
| simpitiki_it | 170 | IT |
| easier_es | 145 | ES |
| admin_it | 107 | IT |
| med_easi | 66 | EN |
| vikidia_enfr | 64 | EN/FR |
| service_public | 26 | FR |
| plaba_en | 22 | EN |
| boe_xsum | 8 | ES |
| eur_lex | 5 | multi |
| agentpublic_travail | 3 | FR |
| text_complexity_de | 1 | DE |

This slice introduces real-world stylistic variety, edge-case clauses, and harder negative examples that pure synthetic generation underproduced.

Methodology

  • Same test set as v0.2 (200 blind samples, v3 split). This is intentional: holding the test set fixed isolates the dataset effect, so the only variable that changes between v0.2 and v0.3 is the training corpus.
  • Training stack unchanged: 8-bit base + bf16 LoRA (r=64, alpha=128) + Liger kernel + paged AdamW 8-bit, 2 epochs, lr 1.5e-4, 100-step warmup, sequence length 3072. Same hyperparameters as v0.2.
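
As a rough illustration, the hyperparameters listed above could be collected as keyword arguments for `peft.LoraConfig`, `transformers.TrainingArguments` and `transformers.BitsAndBytesConfig`. This is a sketch under assumptions, not the authors' training script: `task_type`, `output_dir` and the mapping of "LoRA bf16" to `bf16=True` are not stated in this card.

```python
# Values marked "assumption" are NOT taken from the card.
MAX_SEQ_LEN = 3072  # training sequence length

QUANT_KWARGS = dict(load_in_8bit=True)  # 8-bit base model

LORA_KWARGS = dict(
    r=64,                    # LoRA rank
    lora_alpha=128,          # LoRA scaling numerator
    task_type="CAUSAL_LM",   # assumption: standard causal-LM fine-tuning
)

TRAINING_KWARGS = dict(
    output_dir="out/v0.3",     # assumption: hypothetical path
    num_train_epochs=2,
    learning_rate=1.5e-4,
    warmup_steps=100,
    optim="paged_adamw_8bit",  # paged AdamW 8-bit
    bf16=True,                 # assumption: bf16 mixed precision for the adapters
    use_liger_kernel=True,     # Liger kernel
)

# Effective scaling applied to the LoRA update is alpha / r.
lora_scale = LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"]
print(f"LoRA scale = {lora_scale}")
```

The dictionaries would be unpacked as `BitsAndBytesConfig(**QUANT_KWARGS)`, `LoraConfig(**LORA_KWARGS)` and `TrainingArguments(**TRAINING_KWARGS)`; `use_liger_kernel` needs a recent transformers release with the liger-kernel package installed.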

Dual-metric evaluation: why BERTScore was added

The v0.2 model card flagged checklist_rouge_l = 0.27 as below the acceptable threshold (≥ 0.45). After qualitative inspection, we concluded this was largely a measurement artefact, not a quality problem:

  • ROUGE-L is a lexical metric: it scores the longest common subsequence (LCS) of tokens shared by the predicted and reference checklists, so it penalises paraphrasing heavily.
  • Plain-language corrective actions admit many equally valid surface forms of the same intent. For example, "Replace archaic legal formulas with direct expressions" and "Simplify legal language to make it more accessible" are semantically equivalent but share almost no words.
  • Multilingual amplification: ROUGE works on raw tokens, so semantic equivalents across IT/EN/FR/DE/ES/PT lexical traditions score even lower than within-language paraphrasing.
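
To make the paraphrase penalty concrete, here is a minimal ROUGE-L sketch applied to the two example sentences. It assumes lowercased whitespace tokenisation without stemming, which may differ from the tokenisation used in the actual evaluation.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction, reference):
    """ROUGE-L F1 on lowercased whitespace tokens (assumption: no stemming)."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


a = "Replace archaic legal formulas with direct expressions"
b = "Simplify legal language to make it more accessible"
print(round(rouge_l_f1(a, b), 3))  # 0.133
```

The only shared token is "legal", so the LCS has length 1 and the score is 2/15 even though both sentences express the same corrective action.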

To capture the semantic dimension of checklist quality we added checklist_bertscore_f1 using microsoft/mdeberta-v3-base:

  • mdeberta-v3-base is a multilingual masked-language model (≈ 280 M parameters) covering all six target languages, giving robust cross-lingual embeddings.
  • BERTScore matches each predicted token to its most similar reference token (and vice versa) by cosine similarity of contextual embeddings, then reports the harmonic mean of the resulting precision and recall as F1. It rewards semantically equivalent paraphrases that ROUGE penalises.
  • ROUGE-L is still reported for backwards-compatibility with v0.1-base / v0.2 evaluations and for users who want a purely lexical view.
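
The matching step behind BERTScore can be sketched in a few lines. This is an illustration only: the vectors below are toy 2-D values standing in for contextual embeddings, and the sketch omits the optional idf weighting and baseline rescaling that the reference BERTScore implementation offers.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def bertscore_f1(cand_vecs, ref_vecs):
    """Greedy-matching precision, recall and F1 over token embeddings."""
    # Precision: each candidate token is matched to its closest reference token.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: each reference token is matched to its closest candidate token.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1


# Toy "embeddings" (hypothetical values, not mdeberta-v3-base outputs).
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0]]
p, r, f = bertscore_f1(candidate, reference)
print(p, r, round(f, 4))
```

Because similarity is computed in embedding space rather than on surface tokens, two differently worded checklist items can still score high when their token embeddings are close.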

Empirically, the v0.2 model has BERTScore F1 = 0.72 against ROUGE-L = 0.27 — confirming that the corrective items are semantically on-target but lexically free-form. The same gap holds for v0.3 (0.72 vs 0.27).

Headline metric changes (200-sample blind test, v3 split)

| Metric | v0.2 | v0.3 | Δ |
|---|---|---|---|
| score_mae (lower is better) | 2.74 | 2.705 | -1.3 % |
| verdict_f1 | 0.9577 | 0.9793 | +2.3 % |
| verdict_recall | 0.9444 | 0.9861 | +4.4 % |
| false_positive_rate (lower is better) | 0.0156 | 0.0156 | = |
| span_f1 (IoU ≥ 0.5) | 0.3653 | 0.3885 | +6.4 % |
| checklist_rouge_l | 0.2655 | 0.2663 | +0.3 % |
| checklist_bertscore_f1 | 0.7179 | 0.7203 | +0.3 % |

Six of the seven metrics improve and false_positive_rate is unchanged; no metric regresses. The largest gains are on the targets of the Tier 2 augmentation: verdict recall (+4.4 %, helped by the C hard-negatives batch) and span F1 (+6.4 %, helped by the A1/A2/A3 under-represented-types batches and the B span-dense batch). Checklist metrics improve only marginally because the v3b slices were not designed to expand the corrective-action vocabulary.
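
The Δ column is a relative change with respect to v0.2. A short sketch that recomputes it from the table values, treating score_mae and false_positive_rate as lower-is-better:

```python
# (v0.2, v0.3) values copied from the headline table.
results = {
    "score_mae": (2.74, 2.705),
    "verdict_f1": (0.9577, 0.9793),
    "verdict_recall": (0.9444, 0.9861),
    "false_positive_rate": (0.0156, 0.0156),
    "span_f1": (0.3653, 0.3885),
    "checklist_rouge_l": (0.2655, 0.2663),
    "checklist_bertscore_f1": (0.7179, 0.7203),
}
LOWER_IS_BETTER = {"score_mae", "false_positive_rate"}


def relative_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


for name, (old, new) in results.items():
    if new == old:
        direction = "unchanged"
    else:
        improved = (new < old) if name in LOWER_IS_BETTER else (new > old)
        direction = "better" if improved else "worse"
    print(f"{name:24s} {relative_change(old, new):+.1f} %  {direction}")
```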


Quick start

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

REPO = "SemplificaAI/EuroLLM-ISO24495-9b-Instruct"
REVISION = "v0.3"  # or omit for latest

# Recommended: 8-bit loading → about 9 GB VRAM (instead of about 18 GB in bf16)
bnb = BitsAndBytesConfig(load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained(REPO, revision=REVISION)
model = AutoModelForCausalLM.from_pretrained(
    REPO, revision=REVISION,
    quantization_config=bnb, device_map="auto", torch_dtype=torch.bfloat16,
)
model.eval()

SYSTEM = (
    "You are an expert in plain language according to ISO 24495-1:2023. "
    "Analyze the provided text and produce: (1) a compliance score 0-100, "
    "(2) parts to improve with specific suggestions, "
    "(3) an ordered checklist of corrective actions. "
    "Reply directly without thinking aloud."
)

text = """The Parties hereby acknowledge, in light of the foregoing premises
which form an integral and substantive part of this Agreement, that the
Confidential Information shall not include..."""

messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": f"Analyze this text for ISO 24495-1 plain language compliance:\n\n<TEXT>\n{text}\n</TEXT>"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

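# Fall back to EOS only when no pad token is defined (`or` would also discard a valid id of 0)
pad_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=3072, do_sample=False, pad_token_id=pad_id)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))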

Output format

The model emits a single XML block with four fields. Format identical to v0.2 — see the v0.2 model card for the full schema and the canonical violation_type vocabulary (10 categories).
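
Since the schema itself is documented only in the v0.2 card, the following is a hedged sketch of how the XML block might be extracted and parsed. Every tag and attribute name in it (`analysis`, `score`, `verdict`, `violation`, `item`, `start`, `end`) and the verdict value are hypothetical placeholders, not the model's documented format; adapt them to the real schema.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical output: tag and attribute names are illustrative assumptions,
# NOT the model's documented schema.
raw = """Some preamble
<analysis>
  <score>42</score>
  <verdict>non_conforme</verdict>
  <violations>
    <violation type="double_negative" start="10" end="25">Use a positive form.</violation>
  </violations>
  <checklist>
    <item>Rewrite double negatives as positive statements.</item>
  </checklist>
</analysis>"""


def parse_analysis(text, root_tag="analysis"):
    """Extract the first <root_tag>...</root_tag> block and parse it."""
    match = re.search(rf"<{root_tag}>.*?</{root_tag}>", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no XML block found in model output")
    root = ET.fromstring(match.group(0))
    return {
        "score": int(root.findtext("score")),
        "verdict": root.findtext("verdict"),
        "violations": [
            {"type": v.get("type"), "start": int(v.get("start")),
             "end": int(v.get("end")), "suggestion": (v.text or "").strip()}
            for v in root.iter("violation")
        ],
        "checklist": [(i.text or "").strip() for i in root.iter("item")],
    }


result = parse_analysis(raw)
print(result["score"], result["verdict"], len(result["violations"]))
```

Generated text is not guaranteed to be well-formed XML, so production code should catch `xml.etree.ElementTree.ParseError` and decide how to handle malformed output.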


Examples

Qualitative examples will be added in a follow-up commit. The behaviour on the two reference documents in the v0.2 model card (Italian NDA, English HVAC safety manual) is consistent with v0.2 — the upgrade is most visible on dense documents with many violations and on borderline-conforme cases.


Evaluation

Evaluated on the same 200 blind samples as v0.2, drawn from the v3 held-out test split, stratified by (language × doc_type × difficulty × verdict) and never seen during training or validation. Reusing the same test set makes the v0.2 and v0.3 results directly comparable.
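
A stratified draw of this kind can be sketched as follows. The proportional per-stratum quota, the fixed seed and the toy records are assumptions made for illustration; the card does not describe how the actual split was sampled.

```python
import random
from collections import defaultdict


def stratified_sample(records, keys, n, seed=0):
    """Draw about n records, allocating per-stratum quotas proportionally."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[tuple(rec[k] for k in keys)].append(rec)
    total = len(records)
    sample = []
    for group in strata.values():
        quota = max(1, round(n * len(group) / total))  # at least one per stratum
        sample.extend(rng.sample(group, min(quota, len(group))))
    return sample


# Toy pool (hypothetical values): 2 languages x 2 verdicts, 50 records each.
pool = [
    {"id": f"{lang}-{verdict}-{i}", "language": lang, "doc_type": "contract",
     "difficulty": "medium", "verdict": verdict}
    for lang in ("it", "en")
    for verdict in ("conforme", "non_conforme")
    for i in range(50)
]
KEYS = ("language", "doc_type", "difficulty", "verdict")
test_set = stratified_sample(pool, KEYS, n=20)
print(len(test_set))
```

Because of rounding and the one-record minimum, the returned sample can deviate slightly from `n` when strata are small or uneven.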

Metrics

| Metric | Prod threshold | Acceptable threshold | v0.3 result | Status |
|---|---|---|---|---|
| score_mae | ≤ 8.0 | ≤ 12.0 | 2.705 | PROD |
| verdict_f1 | ≥ 0.88 | ≥ 0.80 | 0.9793 | PROD |
| verdict_precision | | | 0.9726 | (high) |
| verdict_recall | | | 0.9861 | (very high) |
| false_positive_rate | ≤ 0.08 | ≤ 0.15 | 0.0156 | PROD |
| span_f1 (IoU char-level ≥ 0.5) | ≥ 0.72 | ≥ 0.62 | 0.3885 | ⚠️ below accept (trend ↑) |
| checklist_rouge_l (legacy) | ≥ 0.55 | ≥ 0.45 | 0.2663 | ⚠️ below accept (lexical) |
| checklist_bertscore_f1 (mdeberta-v3-base) | ≥ 0.78 | ≥ 0.65 | 0.7203 | accept (semantic) |
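
The Status column follows from comparing each result with its two thresholds. A minimal sketch of that rule, with the numbers copied from the table (the function name and return labels are illustrative):

```python
def status(value, prod, accept, lower_is_better=False):
    """Return 'PROD', 'accept' or 'below accept' for a metric value."""
    if lower_is_better:
        # Flip signs so that "greater is better" logic applies uniformly.
        value, prod, accept = -value, -prod, -accept
    if value >= prod:
        return "PROD"
    if value >= accept:
        return "accept"
    return "below accept"


# (value, prod threshold, acceptable threshold, lower_is_better)
metrics = {
    "score_mae": (2.705, 8.0, 12.0, True),
    "verdict_f1": (0.9793, 0.88, 0.80, False),
    "false_positive_rate": (0.0156, 0.08, 0.15, True),
    "span_f1": (0.3885, 0.72, 0.62, False),
    "checklist_rouge_l": (0.2663, 0.55, 0.45, False),
    "checklist_bertscore_f1": (0.7203, 0.78, 0.65, False),
}
for name, args in metrics.items():
    print(f"{name:24s} {status(*args)}")
```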

Interpretation

Strengths

  • Verdict F1 now 0.979 (vs 0.958 in v0.2): one of the best improvements of the v0.3 cycle. Both precision (0.97) and recall (0.99) are very high — the model both correctly flags non-compliant texts and almost never misses one.
  • Score calibration further improved: MAE 2.705 on the 0–100 scale, vs 2.74 in v0.2. Already excellent, marginally better.
  • False positive rate unchanged at 1.6 %: well below the production cap of 8 %. The C hard-negatives batch did not increase the FPR despite the higher verdict recall — a textbook recall-without-precision-loss outcome.
  • Span F1 improving in the right direction (+6.4 %, 0.37 → 0.39), driven by the A1/A2/A3 under-represented-type augmentation and the B span-dense batch. Still below the acceptable threshold of 0.62: reaching that target will require manual annotation (planned for v1.0).
  • Robust semantic checklist quality (BERTScore F1 0.72, above the acceptable threshold of 0.65). Stable vs v0.2 because the v3b augmentation was not targeted at the corrective-action vocabulary.
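
The verdict figures are mutually consistent. The card does not publish a confusion matrix, but the counts below (an inference from the reported values, assuming 72 positive and 128 negative samples among the 200) reproduce precision, recall, F1 and false positive rate to four decimals:

```python
def verdict_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and false positive rate from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)
    return precision, recall, f1, fpr


# Inferred counts (assumption, not published in the card).
p, r, f1, fpr = verdict_metrics(tp=71, fp=2, fn=1, tn=126)
print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f} fpr={fpr:.4f}")
# precision=0.9726 recall=0.9861 f1=0.9793 fpr=0.0156
```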

Measured weaknesses

  • Span F1 0.39 is still below the acceptable threshold (≥ 0.62). The Tier 2 synthetic augmentation pushed the metric in the right direction (+6.4 %), but closing the remaining gap requires real-world annotated samples, which no synthetic pipeline replicates well at this stage.
  • Checklist ROUGE-L 0.27 unchanged. Lexical metric is not the right tool here — the semantic equivalent (BERTScore F1 0.72) shows the corrective items are on-target. We keep ROUGE-L for back-compatibility but recommend reading BERTScore as the primary checklist quality signal.
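
For readers unfamiliar with the span metric, here is one way a character-level span F1 at IoU ≥ 0.5 can be computed. This is a sketch under assumptions: greedy one-to-one matching, half-open `(start, end)` offsets, and no check on violation type. The card does not specify these details, so the actual evaluation may differ.

```python
def iou(a, b):
    """Intersection over union of two half-open character spans (start, end)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0


def span_f1(predicted, reference, threshold=0.5):
    """F1 where a predicted span counts as a hit if it overlaps an unused reference span."""
    unused = list(reference)
    hits = 0
    for pred in predicted:
        best = max(unused, key=lambda ref: iou(pred, ref), default=None)
        if best is not None and iou(pred, best) >= threshold:
            unused.remove(best)  # each reference span can be matched once
            hits += 1
    if hits == 0:
        return 0.0
    precision, recall = hits / len(predicted), hits / len(reference)
    return 2 * precision * recall / (precision + recall)


pred = [(0, 10), (40, 60), (100, 105)]
ref = [(2, 12), (45, 80)]
print(round(span_f1(pred, ref), 3))  # 0.4
```

In the toy example only the first prediction clears the threshold (IoU 8/12), while the second overlaps its reference with IoU 15/40 and is rejected, which shows how boundary disagreements alone can depress this metric.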

Robustness on a newer distribution (v4 test split)

To check that v0.3 does not overfit to the v3 distribution, we built a second 200-sample blind set drawn from the v4 dataset test split. In this v4 split, 87 % of the records are not present in v3 (different sentence-level chunking, an additional delibera document type, and the human-curated re-imports from public datasets that v3b training exposed only partially). It is the closest available proxy for out-of-training-distribution data.

| Metric | test v3 (in-distribution) | test v4 (87 % unseen) | test merged (v3 ∪ v4 dedup) | Δ v3 → v4 |
|---|---|---|---|---|
| score_mae | 2.705 | 2.86 | 3.27 | +5.7 % |
| verdict_f1 | 0.9793 | 0.9789 | 0.9674 | −0.04 % |
| verdict_precision | 0.9726 | 0.9789 | 0.9674 | +0.65 % |
| verdict_recall | 0.9861 | 0.9789 | 0.9674 | −0.73 % |
| false_positive_rate | 0.0156 | 0.0190 | 0.0278 | +0.34 pp |
| span_f1 | 0.3885 | 0.3277 | 0.3303 | −15.7 % |
| checklist_rouge_l | 0.2663 | 0.2614 | 0.2442 | −1.8 % |
| checklist_bertscore_f1 | 0.7203 | 0.7260 | 0.7246 | +0.8 % |

Reading:

  • The model holds up well on the v4 distribution: score_mae, verdict_f1 and false_positive_rate stay within the production thresholds documented above, and checklist_bertscore_f1 stays within the acceptable one. The merged column is the best single estimate of expected production performance because it samples uniformly from the combined v3 ∪ v4 pool.
  • The hardest hit is span_f1 (−15.7 %), confirming that span localisation is the weakest dimension on out-of-distribution records (long-tail violation patterns underrepresented in the v3b training corpus).
  • Binary verdict and semantic checklist quality are stable: verdict_f1 essentially unchanged (−0.04 %); BERTScore F1 marginally up (+0.8 %).

The same evaluation was also run on the previous v0.2 model on the v4 split for comparison: v0.2 reaches an identical verdict_f1 of 0.9789, but its score_mae is about 32 % higher (3.78 vs 2.86), suggesting that the v0.3 training made score calibration more robust even on distributions it has only partially seen.


Intended use, Limitations, Training details, Dataset

See the v0.2 model card (most fields unchanged in v0.3). Differences are limited to the training set composition described above and to the additional BERTScore metric.


License & Attribution

See LICENSE (CC-BY-NC-4.0) and ATTRIBUTION.md (Apache 2.0 base model: utter-project/EuroLLM-9B-Instruct-2512). Identical to v0.2.


Citation

@misc{semplifica_iso24495_9b_v03_2026,
  title  = {EuroLLM-ISO24495-9b-Instruct (v0.3): A Fine-Tuned EuroLLM-9B
            for ISO 24495-1 Plain Language Compliance Analysis in Six EU Languages},
  author = {SemplificaAI},
  year   = {2026},
  url    = {https://huggingface.co/SemplificaAI/EuroLLM-ISO24495-9b-Instruct},
  note   = {v0.3},
}

Contact

  • Commercial use: hf@semplifica.ai
  • Issues, bugs: Community tab on this HF repository.