Azerbaijani Text Quality Classifier

Regression model that scores the quality of Azerbaijani web text on a continuous 0-3 scale. Built to filter a raw web corpus (OSCAR-derived) before language-model pretraining.

  • Base model: jhu-clsp/mmBERT-base
  • Task: regression, single output (roughly 0 to 3). Higher scores mean cleaner text.
  • Max length: 4096 tokens

Score scale

  • 3 - clean, coherent Azerbaijani prose
  • 2 - substantial good prose mixed with junk (menus, footers, ads)
  • 1 - mostly junk, little recoverable prose
  • 0 - pure junk: navigation pages, spam, machine translation, non-Azerbaijani text
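
Because the model emits a continuous value that may fall slightly outside 0..3, it can help to map a raw score back onto the integer scale. The sketch below is one possible way to do that; the helper name score_to_label and the clamp-then-round-half-up rule are assumptions for illustration, not something the model card specifies.

```python
import math

# Short descriptions taken from the score scale in this card.
SCALE = {
    3: "clean, coherent Azerbaijani prose",
    2: "substantial good prose mixed with junk",
    1: "mostly junk, little recoverable prose",
    0: "pure junk",
}


def score_to_label(score: float) -> int:
    """Map a raw regression output to the nearest integer label in 0..3.

    Assumption: clamp to [0, 3] first, then round half up. The card does
    not define an official bucketing rule.
    """
    clamped = min(3.0, max(0.0, score))
    return int(math.floor(clamped + 0.5))


if __name__ == "__main__":
    for raw in (-0.2, 1.49, 2.6, 3.4):
        label = score_to_label(raw)
        print(raw, "->", label, SCALE[label])
```

Rounding half up is used here instead of Python's built-in round, which rounds halves to the nearest even integer and would send 2.5 to 2.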

Usage

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the tokenizer and the single-output regression model from the Hub.
repo = "LocalDoc/azerbaijani-text-quality-classifier"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo)
model.eval()  # disable dropout for inference

text = "..."
# Truncate inputs to the 4096-token maximum length.
enc = tok(text, truncation=True, max_length=4096, return_tensors="pt")
with torch.no_grad():
    # One text, one output: squeeze the logits down to a Python float.
    score = model(**enc).logits.squeeze().item()
print(score)
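
Since the stated purpose is filtering a web corpus, a thresholding step might look like the sketch below. The function name filter_by_quality and the default threshold of 2.0 are hypothetical; the card does not recommend a cutoff, so it should be tuned on your own data. The scorer is passed in as a callable so the filter logic stays independent of how scores are produced.

```python
from typing import Callable, Iterable, Iterator


def filter_by_quality(
    docs: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float = 2.0,  # hypothetical cutoff, not from the model card
) -> Iterator[str]:
    """Yield only the documents whose quality score reaches the threshold."""
    for doc in docs:
        if score_fn(doc) >= threshold:
            yield doc


if __name__ == "__main__":
    # Stand-in scores for illustration only; in practice score_fn would
    # wrap the tokenizer and model call from the usage snippet.
    fake_scores = {"doc a": 2.7, "doc b": 0.4, "doc c": 2.0}
    kept = list(filter_by_quality(fake_scores, fake_scores.__getitem__))
    print(kept)
```

Using a generator keeps memory flat when streaming a large corpus, at the cost of scoring one document at a time; batching the model calls would be a natural next step.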

Limitations

Training labels were generated by an LLM (Mistral-Small-24B), not by humans. Reported validation metrics (val-MSE ~0.14, rounded accuracy ~0.83) measure agreement with the LLM labels, not agreement with human judgement. Agreement with human judgement has not yet been measured against a human-annotated test set. Use with this caveat in mind.
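
To make the two reported metrics concrete, here is a minimal sketch of how they could be computed from predictions and reference labels. It assumes that "rounded accuracy" means the fraction of examples whose prediction, clamped to 0..3 and rounded to the nearest integer, equals the reference label; the card does not spell out the definition. The function names and the sample numbers are made up for illustration and are not results from this model.

```python
import math
from typing import Sequence


def mse(preds: Sequence[float], labels: Sequence[float]) -> float:
    """Mean squared error between predictions and reference labels."""
    return sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(preds)


def rounded_accuracy(preds: Sequence[float], labels: Sequence[int]) -> float:
    """Share of predictions that match the label after clamping and rounding.

    Assumed definition: clamp to [0, 3], round half up, compare to the label.
    """
    hits = 0
    for p, y in zip(preds, labels):
        rounded = int(math.floor(min(3.0, max(0.0, p)) + 0.5))
        hits += rounded == y
    return hits / len(preds)


if __name__ == "__main__":
    # Illustrative values only.
    preds = [2.8, 1.4, 0.2, 2.4]
    labels = [3, 1, 0, 3]
    print(mse(preds, labels), rounded_accuracy(preds, labels))
```

The same two functions could be rerun against a human-annotated test set, once one exists, to measure agreement with human judgement rather than with the LLM labels.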

Model size: 0.3B params (Safetensors, tensor type BF16)