MiqraBERT

A sentence-transformers model finetuned from AlephBERT for detecting parallel passages in the Hebrew Bible. It maps Biblical Hebrew verses to 768-dimensional embeddings where cosine similarity reflects textual parallelism — high scores indicate genuine synoptic parallels, low scores indicate unrelated text.

The name MiqraBERT derives from Hebrew מִקְרָא (miqra, "scripture").

Model Details

  • Developed by: David M. Smiley, University of Notre Dame
  • Model type: Sentence Transformer (BERT encoder + mean pooling)
  • Language: Biblical Hebrew (vocalized, with niqqud)
  • Base model: AlephBERT (via sentence-transformers-alephbert)
  • Finetuned on: T'OMIM — 1,650 Biblical Hebrew verse pairs (Zenodo)
  • Output: 768 dimensions, cosine similarity
  • Max sequence length: 512 tokens
  • License: Apache 2.0
  • Paper: forthcoming

Usage

Sentence Transformers

pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("davidmsmiley/miqrabert")

# 2 Kgs 18:13 and its synoptic parallel Isa 36:1
parallel_a = "וּבְאַרְבַּע עֶשְׂרֵה שָׁנָה לַמֶּלֶךְ חִזְקִיָּהוּ עָלָה סַנְחֵרִיב מֶלֶךְ־אַשּׁוּר עַל כָּל־עָרֵי יְהוּדָה הַבְּצֻרוֺת וַיִּתְפְּשֵׂם"
parallel_b = "וַיְהִי בְּאַרְבַּע עֶשְׂרֵה שָׁנָה לַמֶּלֶךְ חִזְקִיָּהוּ עָלָה סַנְחֵרִיב מֶלֶךְ־אַשּׁוּר עַל־כָּל־עָרֵי יְהוּדָה הַבְּצֻרוֺת וַיִּתְפְּשֵׂם"
unrelated  = "וְהִנֵּה שֶׁבַע שִׁבֳּלִים צְנֻמוֺת דַּקּוֺת שְׁדֻפוֺת קָדִים צֹמְחוֺת אַחֲרֵיהֶם"

embeddings = model.encode([parallel_a, parallel_b, unrelated])
similarities = model.similarity(embeddings, embeddings)
# parallel_a ↔ parallel_b: ~0.99 (near-verbatim parallel)
# parallel_a ↔ unrelated:  ~0.09 (no relationship)

Using Transformers Directly

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("davidmsmiley/miqrabert")
model = AutoModel.from_pretrained("davidmsmiley/miqrabert")

def encode(texts):
    inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        output = model(**inputs)
    mask = inputs["attention_mask"].unsqueeze(-1)
    embeddings = (output.last_hidden_state * mask).sum(1) / mask.sum(1)
    return torch.nn.functional.normalize(embeddings, p=2, dim=1)

# 1 Sam 31:6 // 1 Chr 10:6 — death of Saul (synoptic parallel)
emb = encode([
    "וַיָּמָת שָׁאוּל וּשְׁלֹשֶׁת בָּנָיו וְנֹשֵׂא כֵלָיו גַּם כָּל־אֲנָשָׁיו בַּיּוֺם הַהוּא יַחְדָּו",
    "וַיָּמָת שָׁאוּל וּשְׁלֹשֶׁת בָּנָיו וְכָל־בֵּיתוֺ יַחְדָּו מֵתוּ"
])
similarity = torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=0)

Intended Uses

Use for: measuring semantic similarity between Biblical Hebrew verse pairs; identifying candidate parallel passages across the Hebrew Bible; supporting computational research on inner-biblical allusion and textual reuse.

Not designed for: Modern Hebrew, Rabbinic Hebrew, or Aramaic text. Not optimized for poetic parallelism (see Limitations). Outputs continuous similarity scores — not a binary classifier.
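When a binary parallel / non-parallel decision is needed, the continuous scores can be thresholded. A minimal sketch, using the F1-optimal threshold of 0.53 reported under Evaluation; the scores below are illustrative, not model output:

```python
# Turn continuous cosine similarities into binary parallel/non-parallel
# decisions. The 0.53 threshold is the F1-optimal value from the Evaluation
# section; the input scores here are toy values, not model output.

def classify_pairs(scores, threshold=0.53):
    """Label each similarity score as parallel (True) or not (False)."""
    return [s >= threshold for s in scores]

scores = [0.99, 0.09, 0.61]    # near-verbatim parallel, unrelated, borderline
print(classify_pairs(scores))  # → [True, False, True]
```

The threshold is corpus- and task-dependent; recalibrate it on a held-out set before using it on other text types.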

Training

Data

T'OMIM contains 825 parallel and 825 non-parallel Biblical Hebrew verse pairs. Parallels include 556 narrative pairs from Chronicles // Samuel-Kings and 269 poetic pairs from published parallelism studies (Berlin, Fokkelman, Kugel, Tsumura). Negatives are random pairs sampled from the full Hebrew Bible.

Procedure

Cosine similarity regression via CosineSimilarityLoss (MSE). Both verses pass through a shared encoder, are mean-pooled to 768-dim embeddings, and compared via cosine similarity against target labels (1.0 = parallel, 0.0 = non-parallel). This checkpoint uses a 70/15/15 train-validation-test split (1,155 / 247 / 248 pairs), selected from seven train-split configurations (50%–90%) as the best balance of separation quality and test-set size. Stability was validated across 10 random seeds per configuration (70 models in total).
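The objective can be sketched in plain Python: the squared error between the cosine similarity of two pooled embeddings and the target label. `cosine_similarity_loss` is a stand-in for the sentence-transformers implementation, and the 2-dim vectors are toy stand-ins for 768-dim embeddings:

```python
import math

# Sketch of the CosineSimilarityLoss objective: MSE between the cosine
# similarity of two pooled embeddings and a target label (1.0 = parallel,
# 0.0 = non-parallel). Real training uses the sentence-transformers
# implementation over 768-dim embeddings; these vectors are toy examples.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_similarity_loss(u, v, label):
    return (cosine(u, v) - label) ** 2

# A parallel pair that already scores near 1.0 contributes near-zero loss.
print(cosine_similarity_loss([1.0, 0.0], [1.0, 0.0], label=1.0))  # → 0.0
```

Gradients from this loss push parallel pairs toward cosine 1.0 and non-parallel pairs toward 0.0, which is what produces the separated score distributions reported under Evaluation.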

Hyperparameters

  • Epochs: 2
  • Batch size: 16
  • Learning rate: 5e-05 (linear schedule)
  • Optimizer: AdamW
  • Seed: 42
  • Hardware: NVIDIA T4 GPU (training time ≈ 36 seconds)

Framework Versions

  • Sentence Transformers 5.2.0 / Transformers 4.57.3 / PyTorch 2.9.0+cu126

Evaluation

Test Set Performance

| Metric                         | Score                 |
|--------------------------------|-----------------------|
| Wasserstein Distance           | 0.772 [0.735, 0.809]  |
| Overlap Coefficient            | 0.046                 |
| F1 (threshold = 0.53)          | 0.980                 |
| Precision / Recall             | 0.984 / 0.976         |
| Mean cosine sim (parallel)     | 0.880                 |
| Mean cosine sim (non-parallel) | 0.108                 |

Wasserstein Distance (WD) measures distributional separation between parallel and non-parallel similarity scores; higher is better. Overlap Coefficient (OVL) measures the proportion of ambiguous space where distributions intersect; lower is better. The unfinetuned AlephBERT baseline achieves WD = 0.276 and OVL = 0.240.
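Both metrics can be computed directly from the two sets of similarity scores. A minimal sketch, assuming equal-sized samples for the Wasserstein distance and a fixed 20-bin histogram over [0, 1] for the overlap coefficient; the score lists are toy values:

```python
# Sketch of the two separation metrics over similarity scores in [0, 1].
# For 1-D empirical distributions of equal size, the Wasserstein distance
# reduces to the mean absolute difference of the sorted samples; the overlap
# coefficient is approximated with a coarse histogram.

def wasserstein_1d(a, b):
    assert len(a) == len(b), "this shortcut assumes equal-sized samples"
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

def overlap_coefficient(a, b, bins=20):
    def hist(xs):
        counts = [0] * bins
        for x in xs:
            counts[min(int(x * bins), bins - 1)] += 1
        return [c / len(xs) for c in counts]
    return sum(min(p, q) for p, q in zip(hist(a), hist(b)))

parallel     = [0.95, 0.90, 0.85, 0.88]  # toy parallel-pair scores
non_parallel = [0.10, 0.12, 0.05, 0.15]  # toy non-parallel scores
print(wasserstein_1d(parallel, non_parallel))      # large gap → high WD
print(overlap_coefficient(parallel, non_parallel)) # disjoint → 0.0
```

For production use, `scipy.stats.wasserstein_distance` handles unequal sample sizes; the histogram bin count trades resolution against noise when estimating OVL.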

Retrieval (Recall@k)

Each query verse is searched against all 68,125 verse and half-verse vectors in the Hebrew Bible (BHSA corpus). Recall@k measures how often the true parallel appears in the top-k results.

| Model         | Recall@10 (all) | Recall@10 (narrative) | Recall@10 (poetic) |
|---------------|-----------------|-----------------------|--------------------|
| MiqraBERT-70p | 0.728           | 0.871                 | 0.089              |
| BEREL-70p     | 0.704           | 0.831                 | 0.137              |
| DictaLM-70p   | 0.751           | 0.914                 | 0.024              |
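The Recall@k computation itself can be sketched as follows, with toy 2-dim vectors standing in for the 768-dim verse embeddings and a brute-force ranking in place of a real index:

```python
import math

# Sketch of Recall@k: for each query, rank every corpus vector by cosine
# similarity and check whether the true parallel lands in the top k.
# Toy 2-dim vectors stand in for the 768-dim verse embeddings.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def recall_at_k(queries, corpus, true_ids, k):
    hits = 0
    for q, true_id in zip(queries, true_ids):
        ranked = sorted(range(len(corpus)),
                        key=lambda i: cosine(q, corpus[i]), reverse=True)
        hits += true_id in ranked[:k]
    return hits / len(queries)

corpus  = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
queries = [[1.0, 0.05]]  # nearest corpus vector is index 0
print(recall_at_k(queries, corpus, true_ids=[0], k=1))  # → 1.0
```

At the scale of the full corpus (68,125 vectors), the same logic would normally run over a matrix of normalized embeddings with a single dot product per query rather than this per-pair loop.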

MiqraBERT is selected as the primary model for its balance across metrics: strong narrative recall, stable training, and the smallest parameter footprint (~110M vs. 7.25B for DictaLM).

Limitations

  • Narrative focus: Trained primarily on Chronicles // Samuel-Kings synoptic parallels. Recall@10 for poetic parallelism is only 8.9% — a structural limitation of mean-pooled embeddings for texts with little lexical overlap.
  • Biblical Hebrew only: Not evaluated on Modern Hebrew, Rabbinic Hebrew, unvocalized text, or other Semitic languages.
  • Training scope: May underperform on intertextual relationships not represented in training (allusions, type-scenes, formulaic speech).

Citation

Paper forthcoming. In the meantime, please cite the model directly:

@misc{smiley2025miqrabert,
    title  = {MiqraBERT: Regression-Based Sentence-BERT Finetuning for Biblical Hebrew Parallel Detection},
    author = {Smiley, David M.},
    year   = {2025},
    url    = {https://huggingface.co/davidmsmiley/miqrabert}
}

Upstream Models

@inproceedings{reimers2019sentencebert,
    title     = {Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
    author    = {Reimers, Nils and Gurevych, Iryna},
    booktitle = {Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing},
    year      = {2019}
}

@article{seker2021alephbert,
    title   = {AlephBERT: A Hebrew Large Pre-Trained Language Model to Start-off Your Hebrew NLP Application With},
    author  = {Seker, Amit and Bandel, Elron and Bareket, Dan and Brusilovsky, Idan and Greenfeld, Refael Shaked and Tsarfaty, Reut},
    journal = {arXiv preprint arXiv:2104.04052},
    year    = {2021}
}