MiqraBERT
A sentence-transformers model finetuned from AlephBERT for detecting parallel passages in the Hebrew Bible. It maps Biblical Hebrew verses to 768-dimensional embeddings where cosine similarity reflects textual parallelism — high scores indicate genuine synoptic parallels, low scores indicate unrelated text.
The name MiqraBERT derives from Hebrew מִקְרָא (miqra, "scripture").
Model Details
- Developed by: David M. Smiley, University of Notre Dame
- Model type: Sentence Transformer (BERT encoder + mean pooling)
- Language: Biblical Hebrew (vocalized, with niqqud)
- Base model: AlephBERT (via imvladikon/sentence-transformers-alephbert)
- Finetuned on: T'OMIM — 1,650 Biblical Hebrew verse pairs (Zenodo)
- Output: 768 dimensions, cosine similarity
- Max sequence length: 512 tokens
- License: Apache 2.0
- Paper: forthcoming
Usage
Sentence Transformers
pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("davidmsmiley/miqrabert")
# 2 Kgs 18:13 and its synoptic parallel Isa 36:1
parallel_a = "וּבְאַרְבַּע עֶשְׂרֵה שָׁנָה לַמֶּלֶךְ חִזְקִיָּהוּ עָלָה סַנְחֵרִיב מֶלֶךְ־אַשּׁוּר עַל כָּל־עָרֵי יְהוּדָה הַבְּצֻרוֺת וַיִּתְפְּשֵׂם"
parallel_b = "וַיְהִי בְּאַרְבַּע עֶשְׂרֵה שָׁנָה לַמֶּלֶךְ חִזְקִיָּהוּ עָלָה סַנְחֵרִיב מֶלֶךְ־אַשּׁוּר עַל־כָּל־עָרֵי יְהוּדָה הַבְּצֻרוֺת וַיִּתְפְּשֵׂם"
unrelated = "וְהִנֵּה שֶׁבַע שִׁבֳּלִים צְנֻמוֺת דַּקּוֺת שְׁדֻפוֺת קָדִים צֹמְחוֺת אַחֲרֵיהֶם"
embeddings = model.encode([parallel_a, parallel_b, unrelated])
similarities = model.similarity(embeddings, embeddings)
# parallel_a ↔ parallel_b: ~0.99 (near-verbatim parallel)
# parallel_a ↔ unrelated: ~0.09 (no relationship)
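To print the individual scores from the symmetric 3×3 similarity matrix:
print(f"parallel_a <-> parallel_b: {similarities[0][1].item():.2f}")
print(f"parallel_a <-> unrelated:  {similarities[0][2].item():.2f}")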
Using Transformers Directly
import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("davidmsmiley/miqrabert")
model = AutoModel.from_pretrained("davidmsmiley/miqrabert")
def encode(texts):
    """Mean-pool token embeddings over the attention mask, then L2-normalize."""
    inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        output = model(**inputs)
    mask = inputs["attention_mask"].unsqueeze(-1)
    embeddings = (output.last_hidden_state * mask).sum(1) / mask.sum(1)
    return torch.nn.functional.normalize(embeddings, p=2, dim=1)
# 1 Sam 31:6 // 1 Chr 10:6 — death of Saul (synoptic parallel)
emb = encode([
    "וַיָּמָת שָׁאוּל וּשְׁלֹשֶׁת בָּנָיו וְנֹשֵׂא כֵלָיו גַּם כָּל־אֲנָשָׁיו בַּיּוֺם הַהוּא יַחְדָּו",
    "וַיָּמָת שָׁאוּל וּשְׁלֹשֶׁת בָּנָיו וְכָל־בֵּיתוֺ יַחְדָּו מֵתוּ",
])
similarity = torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=0)
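Because encode() returns L2-normalized vectors, the same score is simply the dot product of the two embeddings:
similarity = emb[0] @ emb[1]  # identical to the cosine similarity above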
Intended Uses
Use for: measuring semantic similarity between Biblical Hebrew verse pairs; identifying candidate parallel passages across the Hebrew Bible; supporting computational research on inner-biblical allusion and textual reuse.
Not designed for: Modern Hebrew, Rabbinic Hebrew, or Aramaic text. Not optimized for poetic parallelism (see Limitations). The model outputs continuous similarity scores, not binary parallel/non-parallel labels.
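As a sketch of the candidate-identification use case, the snippet below ranks a tiny corpus against a query with sentence-transformers' semantic_search utility. The two-verse corpus just reuses the examples above; a real run would embed every verse in the Hebrew Bible once and reuse those vectors.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("davidmsmiley/miqrabert")

# Toy corpus: Isa 36:1 and an unrelated verse (reused from the usage example above).
corpus = [
    "וַיְהִי בְּאַרְבַּע עֶשְׂרֵה שָׁנָה לַמֶּלֶךְ חִזְקִיָּהוּ עָלָה סַנְחֵרִיב מֶלֶךְ־אַשּׁוּר עַל־כָּל־עָרֵי יְהוּדָה הַבְּצֻרוֺת וַיִּתְפְּשֵׂם",
    "וְהִנֵּה שֶׁבַע שִׁבֳּלִים צְנֻמוֺת דַּקּוֺת שְׁדֻפוֺת קָדִים צֹמְחוֺת אַחֲרֵיהֶם",
]
# Query: 2 Kgs 18:13
query = "וּבְאַרְבַּע עֶשְׂרֵה שָׁנָה לַמֶּלֶךְ חִזְקִיָּהוּ עָלָה סַנְחֵרִיב מֶלֶךְ־אַשּׁוּר עַל כָּל־עָרֵי יְהוּדָה הַבְּצֻרוֺת וַיִּתְפְּשֵׂם"

corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

# Rank corpus verses by cosine similarity; the Isa 36:1 parallel should rank first.
for hit in util.semantic_search(query_emb, corpus_emb, top_k=2)[0]:
    print(hit["corpus_id"], round(hit["score"], 3))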
Training
Data
T'OMIM contains 825 parallel and 825 non-parallel Biblical Hebrew verse pairs. Parallels include 556 narrative pairs from Chronicles // Samuel-Kings and 269 poetic pairs from published parallelism studies (Berlin, Fokkelman, Kugel, Tsumura). Negatives are random pairs sampled from the full Hebrew Bible.
Procedure
Training minimizes CosineSimilarityLoss, a mean-squared-error regression on cosine similarity: both verses pass through a shared encoder, are mean-pooled to 768-dimensional embeddings, and their cosine similarity is regressed against the target label (1.0 = parallel, 0.0 = non-parallel). This checkpoint uses a 70/15/15 train/validation/test split (1,155 / 247 / 248 pairs), chosen from seven train-fraction configurations (50%–90%) as the best balance of separation quality and test set size. Stability was validated across 10 random seeds (70 models total). A minimal reproduction sketch follows the hyperparameters below.
Hyperparameters
- Epochs: 2
- Batch size: 16
- Learning rate: 5e-05 (linear schedule)
- Optimizer: AdamW
- Seed: 42
- Hardware: NVIDIA T4 GPU (~36 seconds training time)
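The reproduction sketch below plugs these hyperparameters into the classic sentence-transformers fit API, whose defaults (AdamW, linear warmup schedule) match the procedure above. It assumes T'OMIM has been exported to a CSV with columns verse_a, verse_b, and label; the filename and column names are illustrative, not the dataset's actual layout.
import csv
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Start from the same base model used for this checkpoint.
model = SentenceTransformer("imvladikon/sentence-transformers-alephbert")

# Load the training split as (verse_a, verse_b, label) triples.
with open("tomim_train.csv", encoding="utf-8") as f:
    train_examples = [
        InputExample(texts=[r["verse_a"], r["verse_b"]], label=float(r["label"]))
        for r in csv.DictReader(f)
    ]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)  # MSE between cosine similarity and label

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=2,
    optimizer_params={"lr": 5e-5},
)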
Framework Versions
- Sentence Transformers 5.2.0 / Transformers 4.57.3 / PyTorch 2.9.0+cu126
Evaluation
Test Set Performance
| Metric | Score |
|---|---|
| Wasserstein Distance | 0.772 [0.735, 0.809] |
| Overlap Coefficient | 0.046 |
| F1 (threshold = 0.53) | 0.980 |
| Precision / Recall | 0.984 / 0.976 |
| Mean cosine sim (parallel) | 0.880 |
| Mean cosine sim (non-parallel) | 0.108 |
Wasserstein Distance (WD) measures distributional separation between parallel and non-parallel similarity scores; higher is better. Overlap Coefficient (OVL) measures the proportion of ambiguous space where distributions intersect; lower is better. The unfinetuned AlephBERT baseline achieves WD = 0.276 and OVL = 0.240.
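Both metrics can be computed directly from the per-pair similarity scores. A sketch follows, using scipy's wasserstein_distance and a histogram estimate for OVL (one common estimator, not necessarily the exact implementation behind the numbers above), with toy arrays standing in for the real test-set scores:
import numpy as np
from scipy.stats import wasserstein_distance

# Toy stand-ins for the cosine similarities of parallel / non-parallel test pairs.
par_scores = np.array([0.99, 0.91, 0.84, 0.88])
non_scores = np.array([0.09, 0.14, 0.11, 0.05])

wd = wasserstein_distance(par_scores, non_scores)  # higher = better separation

# Overlap coefficient: shared area under the two normalized score histograms.
bins = np.linspace(-1.0, 1.0, 101)
p, _ = np.histogram(par_scores, bins=bins, density=True)
q, _ = np.histogram(non_scores, bins=bins, density=True)
ovl = np.minimum(p, q).sum() * (bins[1] - bins[0])  # lower = better

print(f"WD = {wd:.3f}, OVL = {ovl:.3f}")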
Retrieval (Recall@k)
Each query verse is searched against all 68,125 verse and half-verse vectors in the Hebrew Bible (BHSA corpus). Recall@k measures how often the true parallel appears in the top-k results; a minimal sketch of this computation appears below.
| Model | Recall@10 (all) | Recall@10 (narrative) | Recall@10 (poetic) |
|---|---|---|---|
| MiqraBERT-70p | 0.728 | 0.871 | 0.089 |
| BEREL-70p | 0.704 | 0.831 | 0.137 |
| DictaLM-70p | 0.751 | 0.914 | 0.024 |
MiqraBERT is selected as the primary model for its balance across metrics: strong narrative recall, stable training, and the smallest parameter footprint (~110M vs. 7.25B for DictaLM).
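For reference, Recall@k reduces to checking whether each query's known parallel lands among the top-k nearest corpus vectors. A minimal sketch, with query_emb, corpus_emb, and gold_ids as placeholders for the BHSA verse vectors and gold parallel indices:
import torch

def recall_at_k(query_emb, corpus_emb, gold_ids, k=10):
    # query_emb: (Q, 768), corpus_emb: (N, 768), both L2-normalized;
    # gold_ids[i] is the corpus index of query i's true parallel.
    scores = query_emb @ corpus_emb.T           # cosine similarity via dot product
    topk = scores.topk(k, dim=1).indices        # (Q, k) nearest corpus indices
    hits = (topk == torch.as_tensor(gold_ids).unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()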
Limitations
- Narrative focus: Trained primarily on Chronicles // Samuel-Kings synoptic parallels. Recall@10 for poetic parallelism is only 8.9% — a structural limitation of mean-pooled embeddings for texts with little lexical overlap.
- Biblical Hebrew only: Not evaluated on Modern Hebrew, Rabbinic Hebrew, unvocalized text, or other Semitic languages.
- Training scope: May underperform on intertextual relationships not represented in training (allusions, type-scenes, formulaic speech).
Citation
Paper forthcoming. In the meantime, please cite the model directly:
@misc{smiley2025miqrabert,
  title  = {MiqraBERT: Regression-Based Sentence-BERT Finetuning for Biblical Hebrew Parallel Detection},
  author = {Smiley, David M.},
  year   = {2025},
  url    = {https://huggingface.co/davidmsmiley/miqrabert}
}
Upstream Models
@inproceedings{reimers2019sentencebert,
  title     = {Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
  author    = {Reimers, Nils and Gurevych, Iryna},
  booktitle = {Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing},
  year      = {2019}
}
@article{seker2021alephbert,
  title   = {AlephBERT: A Hebrew Large Pre-Trained Language Model to Start-off Your Hebrew NLP Application With},
  author  = {Seker, Amit and Bandel, Elron and Bareket, Dan and Brusilovsky, Idan and Greenfeld, Refael Shaked and Tsarfaty, Reut},
  journal = {arXiv preprint arXiv:2104.04052},
  year    = {2021}
}