Pre-BEREL: tbd

A state-of-the-art language model for Rabbinic Hebrew, released [here] (link to be added).

This model is the first Hebrew model fully pretrained on pre-segmented Hebrew texts. Text fed to the model is expected to be pre-segmented first, using the BEREL-seg segmentation model shown in the sample usage below. Segmenting the text prior to training is a first step toward integrating morphologically aware tokenization into language models.
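
To make the expected input format concrete, here is a toy illustration of the convention (the word and its split are hand-written here for illustration; in practice the segmentation is produced by the BEREL-seg model, as in the sample usage below):

word = 'ื•ื–ื”'                  # "and this": the conjunction prefix ื• is attached to ื–ื”
segments = ['ื•', 'ื–ื”']         # the segmenter splits the prefix off the base word
print('ืฃืฃืฃ '.join(segments))   # ื•ืฃืฃืฃ ื–ื” - segments of a word are joined with a special separator string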

Sample usage:

import torch
from transformers import AutoModel, AutoTokenizer, AutoModelForMaskedLM

sentence = 'ื•ื–ื” ืœืฉื•ืŸ ื”ืจืžื‘ืดืŸ ื‘ืคื™ืจื•ืฉื• ืขืœ ื”ืชื•ืจื”, ืฉื”ื“ื‘ืจ ื™ื“ื•ืข ื•ืžืคื•ืจืกื ืœื›ืœ ื‘ืขืœื™ ื”ืขื™ื•ืŸ ืฉืื™ืŸ ื”ืžืงืจื ื™ื•ืฆื ืžื™ื“ื™ ืคืฉื•ื˜ื• ืืฃ ืขืœ ืคื™ ืฉื”ื“ืจืฉ ืืžืช.'

# First, load the segmentation model to preprocess the text
seg_tokenizer = AutoTokenizer.from_pretrained('dicta-il/BEREL-seg')
seg_model = AutoModel.from_pretrained('dicta-il/BEREL-seg', trust_remote_code=True).eval()

segmented_output = seg_model.predict([sentence], seg_tokenizer)[0] # the sentence is sent as a batch of one; take the first result

# we mark the segmented tokens with a special separator, to distinguish them from regular word tokens.
segmented_sentence = ' '.join('ืฃืฃืฃ '.join(segmented_word) for segmented_word in segmented_output[1:-1]) # ignore cls/sep
print(segmented_sentence.replace('ืฃืฃืฃ', '___'))
# ื•___ ื–ื” ืœืฉื•ืŸ ื”___ ืจืžื‘ ืด ืŸ ื‘___ ืคื™ืจื•ืฉื• ืขืœ ื”___ ืชื•ืจื” , ืฉื”ื“___ ื‘ืจ ื™ื“ื•ืข ื•___ ืžืคื•ืจืกื ืœ___ ื›ืœ ื‘ืขืœื™ ื”___ ืขื™ื•ืŸ ืฉ___ ืื™ืŸ ื”___ ืžืงืจื ื™ื•ืฆื ืžื™ื“ื™ ืคืฉื•ื˜ื• ืืฃ ืขืœ ืคื™ ืฉื”ื“___ ืจืฉ ืืžืช .

# we can mask out any word we want - here the easiest is a simple string replace,
# but the mask could equally be applied to the original sentence or at any other point in the pipeline.
segmented_sentence = segmented_sentence.replace("ืขื™ื•ืŸ", "[MASK]")

# Load the pre-BEREL masked-LM model
tokenizer = AutoTokenizer.from_pretrained('dicta-il/pre-BEREL')
model = AutoModelForMaskedLM.from_pretrained('dicta-il/pre-BEREL').eval()

output = model(tokenizer.encode(segmented_sentence, return_tensors='pt'))
# the [MASK] is the 24th token, i.e. index 23 when counting [CLS] as index 0
top_5 = torch.topk(output.logits[0, 23, :], 5).indices
print('\n'.join(tokenizer.convert_ids_to_tokens(top_5))) # should print ืงื‘ืœื” / ืคืฉื˜ / ื“ืช / ื—ื›ืžื” / ื’ืžืจื
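
For repeated use, the masked-LM step can be wrapped in a small helper that locates every [MASK] instead of hard-coding its position. This is a minimal sketch under our own naming - predict_masked is not part of the released API:

def predict_masked(segmented_sentence, model, tokenizer, top_k=5):
    # encode the pre-segmented sentence and run the masked-LM without tracking gradients
    input_ids = tokenizer.encode(segmented_sentence, return_tensors='pt')
    with torch.no_grad():
        logits = model(input_ids).logits
    # find every [MASK] position instead of hard-coding an index
    positions = (input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    return [tokenizer.convert_ids_to_tokens(torch.topk(logits[0, p], top_k).indices.tolist())
            for p in positions]

print(predict_masked(segmented_sentence, model, tokenizer)) # should reproduce the top-5 list above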

Citation

If you use pre-BEREL in your research, please cite tbd

BibTeX:

tbd

License


This work is licensed under a Creative Commons Attribution 4.0 International License.

