Azerbaijani Text Quality Classifier

Regression model that scores the quality of Azerbaijani web text on a continuous 0-3 scale. Built to filter a raw web corpus (OSCAR-derived) before language-model pretraining.

Base model: jhu-clsp/mmBERT-base
Task: regression with a single output, roughly in the range 0 to 3. Higher = cleaner text.
Max length: 4096 tokens

Score scale

3 — clean, coherent Azerbaijani prose
2 — substantial good prose mixed with junk (menus, footers, ads)
1 — mostly junk, little recoverable prose
0 — pure junk: navigation pages, spam, machine translation, non-Azerbaijani text
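Because the model emits a continuous value rather than one of the four labels, a downstream pipeline may want to map a score back onto the scale above. The sketch below is one way to do that, assuming that rounding to the nearest integer and clamping to 0..3 is an acceptable mapping; the function name `score_to_label` and the short label descriptions are illustrative and not part of the released model.

```python
import math

# Short descriptions paraphrased from the score scale above (illustrative only).
LABELS = {
    3: "clean prose",
    2: "good prose mixed with junk",
    1: "mostly junk",
    0: "pure junk",
}

def score_to_label(score: float) -> int:
    """Map a continuous score to the nearest integer label, clamped to 0..3.

    Uses floor(score + 0.5) so that halves always round up, which avoids the
    round-half-to-even behaviour of Python's built-in round().
    """
    nearest = math.floor(score + 0.5)
    return min(3, max(0, nearest))

if __name__ == "__main__":
    for s in (-0.2, 0.49, 1.5, 2.6, 3.4):
        label = score_to_label(s)
        print(f"{s:+.2f} -> {label} ({LABELS[label]})")
```

Clamping matters because a regression head is not guaranteed to stay inside the 0..3 interval.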

Usage

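import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the tokenizer and the single-output regression model from the Hub
tok = AutoTokenizer.from_pretrained("LocalDoc/azerbaijani-text-quality-classifier")
model = AutoModelForSequenceClassification.from_pretrained("LocalDoc/azerbaijani-text-quality-classifier")
model.eval()  # inference mode: disables dropout

text = "..."
# Inputs longer than 4096 tokens are truncated
enc = tok(text, truncation=True, max_length=4096, return_tensors="pt")
with torch.no_grad():
    # logits holds a single value for a single input; reduce it to a Python float
    score = model(**enc).logits.squeeze().item()
print(score)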
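Since the stated purpose is corpus filtering, the single-text call above would typically be wrapped in a batched loop with a cut-off. The following is a minimal sketch under stated assumptions, not the authors' pipeline: the helper names (`batched`, `filter_corpus`, `make_model_scorer`), the batch size, and the threshold of 2.0 are all hypothetical choices. The scorer is passed in as a callable so the filtering logic can be exercised without loading the model.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` items."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def filter_corpus(
    docs: Iterable[str],
    score_fn: Callable[[List[str]], List[float]],
    threshold: float = 2.0,   # assumed cut-off; tune on your own data
    batch_size: int = 16,     # assumed batch size
) -> Iterator[Tuple[str, float]]:
    """Yield (document, score) pairs whose score is at least `threshold`."""
    for batch in batched(docs, batch_size):
        for doc, score in zip(batch, score_fn(batch)):
            if score >= threshold:
                yield doc, score

def make_model_scorer(tok, model, max_length: int = 4096):
    """Build a score_fn from a loaded tokenizer and model (as in the usage code)."""
    import torch  # imported lazily so the filtering helpers work without torch

    def score_fn(texts: List[str]) -> List[float]:
        enc = tok(texts, truncation=True, max_length=max_length,
                  padding=True, return_tensors="pt")
        with torch.no_grad():
            logits = model(**enc).logits  # expected shape: (batch, 1)
        return logits.squeeze(-1).float().tolist()

    return score_fn
```

With `tok` and `model` loaded as shown earlier, `list(filter_corpus(docs, make_model_scorer(tok, model)))` would return the documents that pass the cut-off along with their scores.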

Limitations

Training labels were generated by an LLM (Mistral-Small-24B), not by humans. The reported validation metrics (val-MSE ~0.14, rounded accuracy ~0.83) therefore measure agreement with the LLM labels, not with human judgement. Agreement with a human-annotated test set has not yet been measured. Use the scores with this caveat in mind.
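For readers who want to check the model against their own human annotations, the two reported metrics can be recomputed in a few lines. This sketch assumes that "rounded accuracy" means the share of predictions that equal the reference label after rounding to the nearest integer and clamping to 0..3; that reading is an interpretation, not something the card states. The prediction and label values below are made-up toy data, not results.

```python
import math
from typing import Sequence

def mse(preds: Sequence[float], labels: Sequence[float]) -> float:
    """Mean squared error between predicted scores and reference labels."""
    return sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(labels)

def rounded_accuracy(preds: Sequence[float], labels: Sequence[int],
                     lo: int = 0, hi: int = 3) -> float:
    """Fraction of predictions that match the label after rounding and clamping."""
    hits = 0
    for p, y in zip(preds, labels):
        rounded = min(hi, max(lo, math.floor(p + 0.5)))
        hits += int(rounded == y)
    return hits / len(labels)

if __name__ == "__main__":
    # Toy values for illustration only.
    model_scores = [2.8, 0.3, 1.6, 2.4]
    human_labels = [3, 0, 1, 2]
    print("MSE:", mse(model_scores, human_labels))
    print("Rounded accuracy:", rounded_accuracy(model_scores, human_labels))
```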

Model size: 0.3B params (Safetensors)
Tensor type: BF16
