MiqraBERT

A sentence-transformers model finetuned from AlephBERT for detecting parallel passages in the Hebrew Bible. It maps Biblical Hebrew verses to 768-dimensional embeddings where cosine similarity reflects textual parallelism — high scores indicate genuine synoptic parallels, low scores indicate unrelated text.

The name MiqraBERT derives from Hebrew מִקְרָא (miqra, "scripture").

Model Details

  • Developed by: David M. Smiley, University of Notre Dame
  • Model type: Sentence Transformer (BERT encoder + mean pooling)
  • Language: Biblical Hebrew (vocalized, with niqqud)
  • Base model: AlephBERT (via sentence-transformers-alephbert)
  • Finetuned on: T'OMIM — 1,650 Biblical Hebrew verse pairs (Zenodo)
  • Output: 768 dimensions, cosine similarity
  • Max sequence length: 512 tokens
  • License: Apache 2.0
  • Paper: forthcoming

Usage

Sentence Transformers

pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("davidmsmiley/miqrabert")

# 2 Kgs 18:13 and its synoptic parallel Isa 36:1
parallel_a = "וּבְאַרְבַּע עֶשְׂרֵה שָׁנָה לַמֶּלֶךְ חִזְקִיָּהוּ עָלָה סַנְחֵרִיב מֶלֶךְ־אַשּׁוּר עַל כָּל־עָרֵי יְהוּדָה הַבְּצֻרוֺת וַיִּתְפְּשֵׂם"
parallel_b = "וַיְהִי בְּאַרְבַּע עֶשְׂרֵה שָׁנָה לַמֶּלֶךְ חִזְקִיָּהוּ עָלָה סַנְחֵרִיב מֶלֶךְ־אַשּׁוּר עַל־כָּל־עָרֵי יְהוּדָה הַבְּצֻרוֺת וַיִּתְפְּשֵׂם"
unrelated  = "וְהִנֵּה שֶׁבַע שִׁבֳּלִים צְנֻמוֺת דַּקּוֺת שְׁדֻפוֺת קָדִים צֹמְחוֺת אַחֲרֵיהֶם"

embeddings = model.encode([parallel_a, parallel_b, unrelated])
similarities = model.similarity(embeddings, embeddings)
# parallel_a ↔ parallel_b: ~0.99 (near-verbatim parallel)
# parallel_a ↔ unrelated:  ~0.09 (no relationship)

Using Transformers Directly

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("davidmsmiley/miqrabert")
model = AutoModel.from_pretrained("davidmsmiley/miqrabert")

def encode(texts):
    inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        output = model(**inputs)
    mask = inputs["attention_mask"].unsqueeze(-1)
    embeddings = (output.last_hidden_state * mask).sum(1) / mask.sum(1)
    return torch.nn.functional.normalize(embeddings, p=2, dim=1)

# 1 Sam 31:6 // 1 Chr 10:6 — death of Saul (synoptic parallel)
emb = encode([
    "וַיָּמָת שָׁאוּל וּשְׁלֹשֶׁת בָּנָיו וְנֹשֵׂא כֵלָיו גַּם כָּל־אֲנָשָׁיו בַּיּוֺם הַהוּא יַחְדָּו",
    "וַיָּמָת שָׁאוּל וּשְׁלֹשֶׁת בָּנָיו וְכָל־בֵּיתוֺ יַחְדָּו מֵתוּ"
])
similarity = torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=0)

Intended Uses

Use for: measuring semantic similarity between Biblical Hebrew verse pairs; identifying candidate parallel passages across the Hebrew Bible; supporting computational research on inner-biblical allusion and textual reuse.

Not designed for: Modern Hebrew, Rabbinic Hebrew, or Aramaic text. Not optimized for poetic parallelism (see Limitations). Outputs continuous similarity scores — not a binary classifier.
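When a binary parallel / non-parallel decision is needed, the continuous scores can be thresholded. A minimal sketch, using the F1-optimal threshold of 0.53 reported under Evaluation; the scores below are illustrative, not model output:

```python
# Turn continuous cosine similarities into binary parallel/non-parallel
# decisions. The 0.53 threshold is the F1-optimal value from the Evaluation
# section; the input scores here are toy values, not model output.

def classify_pairs(scores, threshold=0.53):
    """Label each similarity score as parallel (True) or not (False)."""
    return [s >= threshold for s in scores]

scores = [0.99, 0.09, 0.61]    # near-verbatim parallel, unrelated, borderline
print(classify_pairs(scores))  # → [True, False, True]
```

The threshold is corpus- and task-dependent; recalibrate it on a held-out set before using it on other text types.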

Training

Data

T'OMIM contains 825 parallel and 825 non-parallel Biblical Hebrew verse pairs. Parallels include 556 narrative pairs from Chronicles // Samuel-Kings and 269 poetic pairs from published parallelism studies (Berlin, Fokkelman, Kugel, Tsumura). Negatives are random pairs sampled from the full Hebrew Bible.

Procedure

Cosine similarity regression via CosineSimilarityLoss (MSE). Both verses pass through a shared encoder, are mean-pooled to 768-dim embeddings, and compared via cosine similarity against target labels (1.0 = parallel, 0.0 = non-parallel). This checkpoint uses a 70/15/15 train-validation-test split (1,155 / 247 / 248 pairs), selected from seven train-split configurations (50%–90%) as the best balance of separation quality and test-set size. Stability was validated across 10 random seeds per configuration (70 models in total).
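The objective can be sketched in plain Python: the squared error between the cosine similarity of two pooled embeddings and the target label. `cosine_similarity_loss` is a stand-in for the sentence-transformers implementation, and the 2-dim vectors are toy stand-ins for 768-dim embeddings:

```python
import math

# Sketch of the CosineSimilarityLoss objective: MSE between the cosine
# similarity of two pooled embeddings and a target label (1.0 = parallel,
# 0.0 = non-parallel). Real training uses the sentence-transformers
# implementation over 768-dim embeddings; these vectors are toy examples.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_similarity_loss(u, v, label):
    return (cosine(u, v) - label) ** 2

# A parallel pair that already scores near 1.0 contributes near-zero loss.
print(cosine_similarity_loss([1.0, 0.0], [1.0, 0.0], label=1.0))  # → 0.0
```

Gradients from this loss push parallel pairs toward cosine 1.0 and non-parallel pairs toward 0.0, which is what produces the separated score distributions reported under Evaluation.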

Hyperparameters

  • Epochs: 2
  • Batch size: 16
  • Learning rate: 5e-05 (linear schedule)
  • Optimizer: AdamW
  • Seed: 42
  • Hardware: NVIDIA T4 GPU (training time ≈ 36 seconds)

Framework Versions

  • Sentence Transformers 5.2.0 / Transformers 4.57.3 / PyTorch 2.9.0+cu126

Evaluation

Test Set Performance

| Metric                         | Score                 |
|--------------------------------|-----------------------|
| Wasserstein Distance           | 0.772 [0.735, 0.809]  |
| Overlap Coefficient            | 0.046                 |
| F1 (threshold = 0.53)          | 0.980                 |
| Precision / Recall             | 0.984 / 0.976         |
| Mean cosine sim (parallel)     | 0.880                 |
| Mean cosine sim (non-parallel) | 0.108                 |

Wasserstein Distance (WD) measures distributional separation between parallel and non-parallel similarity scores; higher is better. Overlap Coefficient (OVL) measures the proportion of ambiguous space where distributions intersect; lower is better. The unfinetuned AlephBERT baseline achieves WD = 0.276 and OVL = 0.240.
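Both metrics can be computed directly from the two sets of similarity scores. A minimal sketch, assuming equal-sized samples for the Wasserstein distance and a fixed 20-bin histogram over [0, 1] for the overlap coefficient; the score lists are toy values:

```python
# Sketch of the two separation metrics over similarity scores in [0, 1].
# For 1-D empirical distributions of equal size, the Wasserstein distance
# reduces to the mean absolute difference of the sorted samples; the overlap
# coefficient is approximated with a coarse histogram.

def wasserstein_1d(a, b):
    assert len(a) == len(b), "this shortcut assumes equal-sized samples"
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

def overlap_coefficient(a, b, bins=20):
    def hist(xs):
        counts = [0] * bins
        for x in xs:
            counts[min(int(x * bins), bins - 1)] += 1
        return [c / len(xs) for c in counts]
    return sum(min(p, q) for p, q in zip(hist(a), hist(b)))

parallel     = [0.95, 0.90, 0.85, 0.88]  # toy parallel-pair scores
non_parallel = [0.10, 0.12, 0.05, 0.15]  # toy non-parallel scores
print(wasserstein_1d(parallel, non_parallel))      # large gap → high WD
print(overlap_coefficient(parallel, non_parallel)) # disjoint → 0.0
```

For production use, `scipy.stats.wasserstein_distance` handles unequal sample sizes; the histogram bin count trades resolution against noise when estimating OVL.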

Retrieval (Recall@k)

Each query verse is searched against all 68,125 verse and half-verse vectors in the Hebrew Bible (BHSA corpus). Recall@k measures how often the true parallel appears in the top-k results.

| Model         | Recall@10 (all) | Recall@10 (narrative) | Recall@10 (poetic) |
|---------------|-----------------|-----------------------|--------------------|
| MiqraBERT-70p | 0.728           | 0.871                 | 0.089              |
| BEREL-70p     | 0.704           | 0.831                 | 0.137              |
| DictaLM-70p   | 0.751           | 0.914                 | 0.024              |
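The Recall@k computation itself can be sketched as follows, with toy 2-dim vectors standing in for the 768-dim verse embeddings and a brute-force ranking in place of a real index:

```python
import math

# Sketch of Recall@k: for each query, rank every corpus vector by cosine
# similarity and check whether the true parallel lands in the top k.
# Toy 2-dim vectors stand in for the 768-dim verse embeddings.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def recall_at_k(queries, corpus, true_ids, k):
    hits = 0
    for q, true_id in zip(queries, true_ids):
        ranked = sorted(range(len(corpus)),
                        key=lambda i: cosine(q, corpus[i]), reverse=True)
        hits += true_id in ranked[:k]
    return hits / len(queries)

corpus  = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
queries = [[1.0, 0.05]]  # nearest corpus vector is index 0
print(recall_at_k(queries, corpus, true_ids=[0], k=1))  # → 1.0
```

At the scale of the full corpus (68,125 vectors), the same logic would normally run over a matrix of normalized embeddings with a single dot product per query rather than this per-pair loop.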

MiqraBERT is selected as the primary model for its balance across metrics: strong narrative recall, stable training, and the smallest parameter footprint (~110M vs. 7.25B for DictaLM).

Limitations

  • Narrative focus: Trained primarily on Chronicles // Samuel-Kings synoptic parallels. Recall@10 for poetic parallelism is only 8.9% — a structural limitation of mean-pooled embeddings for texts with little lexical overlap.
  • Biblical Hebrew only: Not evaluated on Modern Hebrew, Rabbinic Hebrew, unvocalized text, or other Semitic languages.
  • Training scope: May underperform on intertextual relationships not represented in training (allusions, type-scenes, formulaic speech).

Citation

Paper forthcoming. In the meantime, please cite the model directly:

@misc{smiley2025miqrabert,
    title  = {MiqraBERT: Regression-Based Sentence-BERT Finetuning for Biblical Hebrew Parallel Detection},
    author = {Smiley, David M.},
    year   = {2025},
    url    = {https://huggingface.co/davidmsmiley/miqrabert}
}

Upstream Models

@inproceedings{reimers2019sentencebert,
    title     = {Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
    author    = {Reimers, Nils and Gurevych, Iryna},
    booktitle = {Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing},
    year      = {2019}
}

@article{seker2021alephbert,
    title   = {AlephBERT: A Hebrew Large Pre-Trained Language Model to Start-off Your Hebrew NLP Application With},
    author  = {Seker, Amit and Bandel, Elron and Bareket, Dan and Brusilovsky, Idan and Greenfeld, Refael Shaked and Tsarfaty, Reut},
    journal = {arXiv preprint arXiv:2104.04052},
    year    = {2021}
}