Darsala/english_georgian_corpora
This is an English-to-Georgian neural machine translation model developed as part of a bachelor thesis project. The model uses an encoder-decoder architecture with a pretrained BERT encoder and a randomly initialized decoder.
RichNachos/georgian-corpus-tokenizer-test
Important: This model uses a custom EncoderDecoderTokenizer that is included in the repository. You need to download the repository locally to import it.
import sys
from transformers import EncoderDecoderModel
import torch
import re
from huggingface_hub import snapshot_download
# Download the repo to a local folder
path_to_downloaded = snapshot_download(
    repo_id="Darsala/Georgian-Translation",
    local_dir="./Georgian-Translation",
    local_dir_use_symlinks=False
)
# Add the downloaded folder to Python path so we can import the custom tokenizer
sys.path.append(path_to_downloaded)
from encoder_decoder_tokenizer import EncoderDecoderTokenizer
# Load the model and tokenizer from the downloaded folder
model = EncoderDecoderModel.from_pretrained(path_to_downloaded)
tokenizer = EncoderDecoderTokenizer.from_pretrained(path_to_downloaded)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
def translate(
    text: str,
    num_beams: int = 5,
    max_length: int = 256,
) -> str:
    """
    Translate a single English string to Georgian with the loaded EncoderDecoderModel.
    """
    # Normalize the input: lowercase it and collapse whitespace runs to single spaces
    text = text.lower()
    text = re.sub(r'\s+', ' ', text)
    # Tokenize and move the tensors to the model's device
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        padding="longest"
    ).to(device)
    # Generate the translation with beam search
    generated_ids = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        num_beams=num_beams,
        max_length=max_length,
        early_stopping=True,
    )
    # Decode the best beam back to text
    output = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    print(f"English: {text}")
    print(f"Translated: {output}")
    return output
# Example usage
translation = translate("Hello, how are you?")
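The first two statements of `translate` normalize the input before tokenization: the text is lowercased (presumably to match the uncased BERT base model) and every run of whitespace is collapsed to a single space. The sketch below isolates that step so its effect can be checked without loading the model. The function name `normalize_for_translation` is hypothetical and not part of the repository.

```python
import re


def normalize_for_translation(text: str) -> str:
    """Mirror the preprocessing in translate(): lowercase, then collapse whitespace.

    Hypothetical helper for illustration; the repository does not define it.
    """
    text = text.lower()
    # Tabs, newlines and repeated spaces all become one space.
    # Leading or trailing whitespace is reduced to a single space, not stripped.
    return re.sub(r"\s+", " ", text)


if __name__ == "__main__":
    print(repr(normalize_for_translation("Hello,   how\nare YOU?")))
```

Because the input is lowercased, differences in casing (for example in proper nouns or acronyms) are not visible to the model.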
Try the model in the interactive demo: Georgian Translation Space
@mastersthesis{darsalia2025georgian,
  title={English Translation Quality Assessment and Computer Translation},
  author={Luka Darsalia},
  year={2025},
  school={Tbilisi University},
  note={Bachelor's Thesis - Computer Science}
}
Base model
google-bert/bert-base-uncased