Pre-BEREL: tbd
State-of-the-art language model for Rabbinic Hebrew, released [here] - add link.
This model is the first Hebrew model fully pretrained on pre-segmented Hebrew text. Text passed to the model is therefore expected to be pre-segmented as well; in the sample usage below, the dicta-il/BEREL-seg model performs this preprocessing step. Segmenting the text prior to training is a first step towards integrating morphologically aware tokenization into language models.
Sample usage:
import torch
from transformers import AutoModel, AutoTokenizer, AutoModelForMaskedLM
sentence = 'ืืื ืืฉืื ืืจืืืดื ืืคืืจืืฉื ืขื ืืชืืจื, ืฉืืืืจ ืืืืข ืืืคืืจืกื ืืื ืืขืื ืืขืืื ืฉืืื ืืืงืจื ืืืฆื ืืืื ืคืฉืืื ืืฃ ืขื ืคื ืฉืืืจืฉ ืืืช.'
# First, load in the segmentation model, to preprocess the text
seg_tokenizer = AutoTokenizer.from_pretrained('dicta-il/BEREL-seg')
seg_model = AutoModel.from_pretrained('dicta-il/BEREL-seg', trust_remote_code=True).eval()
segmented_output = seg_model.predict([sentence], seg_tokenizer)[0] # sentence sent as a batch, pick the first one
# we mark the segmented tokens with a special separator, to distinguish them from regular word tokens.
segmented_sentence = ' '.join('ืฃืฃืฃ '.join(segmented_word) for segmented_word in segmented_output[1:-1]) # ignore cls/sep
print(segmented_sentence.replace('ืฃืฃืฃ', '___'))
# ื___ ืื ืืฉืื ื___ ืจืื ืด ื ื___ ืคืืจืืฉื ืขื ื___ ืชืืจื , ืฉืื___ ืืจ ืืืืข ื___ ืืคืืจืกื ื___ ืื ืืขืื ื___ ืขืืื ืฉ___ ืืื ื___ ืืงืจื ืืืฆื ืืืื ืคืฉืืื ืืฃ ืขื ืคื ืฉืื___ ืจืฉ ืืืช .
# we can mask out any word we want - in this case the easiest approach is a simple string replace. We could also have applied the mask in the original sentence, or at any other point in the pipeline.
segmented_sentence = segmented_sentence.replace("ืขืืื", "[MASK]")
# Next, load in the pre-BEREL masked-LM model
tokenizer = AutoTokenizer.from_pretrained('dicta-il/pre-BEREL')
model = AutoModelForMaskedLM.from_pretrained('dicta-il/pre-BEREL').eval()
output = model(tokenizer.encode(segmented_sentence, return_tensors='pt'))
# the [MASK] is the 24th token (including [CLS])
top_5 = torch.topk(output.logits[0, 23, :], 5).indices.tolist()
print('\n'.join(tokenizer.convert_ids_to_tokens(top_5))) # should print ืงืืื / ืคืฉื / ืืช / ืืืื / ืืืจื
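Rather than hard-coding the position of the [MASK] token (the 24th token above), its index can also be located programmatically via the tokenizer's mask_token_id. The snippet below is a minimal sketch under the same setup as the sample above; the helper name fill_mask_top_k is ours, not part of the model's API, and it assumes exactly one [MASK] token in the input.

def fill_mask_top_k(segmented_sentence, model, tokenizer, k=5):
    # encode the pre-segmented sentence and run the masked-LM model
    input_ids = tokenizer.encode(segmented_sentence, return_tensors='pt')
    with torch.no_grad():
        logits = model(input_ids).logits
    # locate the [MASK] position instead of hard-coding its index
    mask_index = (input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0].item()
    # return the k most likely fillers for the masked slot
    top_k = torch.topk(logits[0, mask_index, :], k).indices.tolist()
    return tokenizer.convert_ids_to_tokens(top_k)

print('\n'.join(fill_mask_top_k(segmented_sentence, model, tokenizer)))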
Citation
If you use pre-BEREL in your research, please cite tbd
BibTeX:
tbd
License
This work is licensed under a Creative Commons Attribution 4.0 International License.