Bandwidth-Efficient and Privacy-Preserving Edge-Cloud Many-to-Many Speech Translation
Paper: [arXiv:2605.28642](https://arxiv.org/abs/2605.28642)
How to use yxdu/ESRT-4B with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="yxdu/ESRT-4B", trust_remote_code=True)

# Load model directly
from transformers import AutoModel

model = AutoModel.from_pretrained("yxdu/ESRT-4B", trust_remote_code=True, dtype="auto")
```

This repository contains the weights for ESRT-4B, as presented in the paper Bandwidth-Efficient and Privacy-Preserving Edge-Cloud Many-to-Many Speech Translation.
ESRT supports many-to-many speech-to-text translation across 45 languages (45 × 44 directions). It uses an edge-cloud split inference architecture to protect voice privacy and reduce bandwidth by transmitting only compressed acoustic features instead of raw audio.
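To make the bandwidth argument concrete, the sketch below compares the bitrate of raw audio with the bitrate of a stream of quantized feature vectors. The raw-audio parameters (16 kHz, 16-bit, mono PCM) are common speech defaults; the feature frame rate, dimension, and bit width are hypothetical placeholders, not ESRT's actual values, so the result only illustrates how the comparison is made.

```python
def pcm_bitrate(sample_rate_hz: int, bits_per_sample: int, channels: int = 1) -> int:
    """Bits per second of uncompressed PCM audio."""
    return sample_rate_hz * bits_per_sample * channels


def feature_bitrate(frames_per_sec: float, feature_dim: int, bits_per_value: int) -> float:
    """Bits per second of a stream of fixed-size feature vectors."""
    return frames_per_sec * feature_dim * bits_per_value


# Raw audio: 16 kHz, 16-bit, mono (a common speech input format).
raw_bps = pcm_bitrate(16_000, 16)

# HYPOTHETICAL feature stream: 12.5 frames/s, 256-dim vectors, 8 bits per value.
# These numbers are placeholders for illustration, not ESRT's configuration.
feat_bps = feature_bitrate(12.5, 256, 8)

print(f"raw audio: {raw_bps / 1000:.1f} kbit/s")
print(f"features:  {feat_bps / 1000:.1f} kbit/s")
print(f"reduction: {raw_bps / feat_bps:.1f}x")
```

With these placeholder numbers the feature stream needs 25.6 kbit/s versus 256.0 kbit/s for raw PCM, a 10x reduction. ESRT's real figure depends on its encoder's frame rate, feature size, and compression, which this card does not specify.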
```bash
# Install uv (if not already installed)
# curl -LsSf https://astral.sh/uv/install.sh | sh
git clone https://github.com/yxduir/ESRT
cd ESRT
uv venv --python 3.10
source .venv/bin/activate
uv pip install -r requirements.txt
# macOS: use requirements_mac.txt instead
# uv pip install -r requirements_mac.txt
```
Note: The GPU setup includes `vllm`. macOS uses a CPU backend with `transformers`.
Download the test data:

```bash
hf download --repo-type dataset yxdu/fleurs_eng_test --local-dir ./fleurs_eng_test
```
Inference runs in two stages: an edge side and a cloud side.
```bash
# Offline inference for performance evaluation.
# 45 × 44 directions in total; this demo covers English -> 44 target languages.
bash run_eng_44.sh
# macOS:
# bash run_test_mac.sh
# Online deployment guide coming soon.
```
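The edge/cloud split can be pictured as a serialize-then-deserialize handoff. The sketch below is not ESRT's code: `edge_stage`, `cloud_stage`, the random stand-in for encoder output, and the symmetric per-tensor int8 quantization are all hypothetical choices. It only shows how an edge device could ship compact features instead of a waveform, and how the cloud side could recover approximate float features before decoding.

```python
import struct

import numpy as np

HEADER_FMT = "<IIf"  # frame count (uint32), feature dim (uint32), scale (float32)


def edge_stage(features: np.ndarray) -> bytes:
    """Quantize float features to int8 and pack them into a byte payload.

    HYPOTHETICAL: symmetric per-tensor int8 quantization is an illustrative
    choice, not the compression scheme used by ESRT.
    """
    features = np.asarray(features, dtype=np.float32)
    max_abs = float(np.max(np.abs(features)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(features / scale), -127, 127).astype(np.int8)
    frames, dim = q.shape
    header = struct.pack(HEADER_FMT, frames, dim, scale)
    return header + q.tobytes()


def cloud_stage(payload: bytes) -> np.ndarray:
    """Unpack the payload and dequantize back to approximate float32 features."""
    header_size = struct.calcsize(HEADER_FMT)
    frames, dim, scale = struct.unpack(HEADER_FMT, payload[:header_size])
    q = np.frombuffer(payload[header_size:], dtype=np.int8).reshape(frames, dim)
    return q.astype(np.float32) * np.float32(scale)


# Stand-in for encoder output: 100 frames of 256-dim features (hypothetical shape).
rng = np.random.default_rng(0)
feats = rng.standard_normal((100, 256)).astype(np.float32)

payload = edge_stage(feats)    # what the edge device would transmit
restored = cloud_stage(payload)  # what the cloud side would decode from

print(len(payload), "bytes sent")  # 12 header bytes + 100 * 256 int8 values
print("max abs error:", float(np.max(np.abs(restored - feats))))
```

Under these assumptions the payload is roughly a quarter of the size of the float32 features (8-bit values instead of 32-bit), and the per-value reconstruction error stays within half a quantization step.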
Note: GPU inference supports only `bf16`.
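The notes on backends (`vllm` on GPU, a CPU `transformers` backend on macOS) and on bf16-only GPU inference can be summarized as a small selection helper. This is a hypothetical sketch: `select_backend` and `detect_cuda` are not part of the ESRT repository, and using float32 on the macOS CPU path is an assumption, since the card does not state the CPU precision.

```python
from typing import NamedTuple


class BackendChoice(NamedTuple):
    engine: str  # inference library to use
    dtype: str   # numeric precision


def select_backend(has_cuda_gpu: bool, system: str) -> BackendChoice:
    """HYPOTHETICAL helper mirroring the setup notes in this model card."""
    if system == "Darwin":
        # macOS: CPU backend with transformers (float32 here is an assumption).
        return BackendChoice(engine="transformers", dtype="float32")
    if has_cuda_gpu:
        # GPU setup uses vllm and only supports bf16 inference.
        return BackendChoice(engine="vllm", dtype="bfloat16")
    raise RuntimeError("No CUDA GPU found and not on macOS; no documented setup applies.")


def detect_cuda() -> bool:
    """Return True if PyTorch is installed and sees a CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())
```

A typical call would be `select_backend(detect_cuda(), platform.system())`, since `platform.system()` returns `"Darwin"` on macOS. Raising on a CPU-only Linux machine is a deliberate choice: the card documents no setup for that case.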
Training code will be open-sourced in a future release.

Validated on the following 45 languages:

| Family | Languages |
|---|---|
| Afro-Asiatic | Arabic, Hebrew |
| Austroasiatic | Khmer, Vietnamese |
| Austronesian | Indonesian, Malay, Tagalog |
| Dravidian | Tamil |
| Indo-European | Bengali, Bulgarian, Catalan, Czech, Danish, Dutch, English, French, German, Greek, Hindi, Croatian, Italian, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Spanish, Swedish, Urdu |
| Japonic | Japanese |
| Koreanic | Korean |
| Kra–Dai | Lao, Thai |
| Sino-Tibetan | Chinese, Burmese, Cantonese |
| Turkic | Azerbaijani, Kazakh, Turkish, Uzbek |
| Uralic | Finnish, Hungarian |
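As a quick check on the direction count, the sketch below lists the 45 languages from the table and enumerates every ordered (source, target) pair of distinct languages. The variable names are illustrative only.

```python
from itertools import permutations

# The 45 validated languages, copied from the table (grouped by family).
LANGUAGES = [
    "Arabic", "Hebrew",
    "Khmer", "Vietnamese",
    "Indonesian", "Malay", "Tagalog",
    "Tamil",
    "Bengali", "Bulgarian", "Catalan", "Czech", "Danish", "Dutch", "English",
    "French", "German", "Greek", "Hindi", "Croatian", "Italian", "Norwegian",
    "Persian", "Polish", "Portuguese", "Romanian", "Russian", "Slovak",
    "Slovenian", "Spanish", "Swedish", "Urdu",
    "Japanese",
    "Korean",
    "Lao", "Thai",
    "Chinese", "Burmese", "Cantonese",
    "Azerbaijani", "Kazakh", "Turkish", "Uzbek",
    "Finnish", "Hungarian",
]

# Every ordered pair of distinct languages is one translation direction.
all_directions = list(permutations(LANGUAGES, 2))

# The English -> 44 demo covers the directions whose source is English.
english_directions = [(src, tgt) for src, tgt in all_directions if src == "English"]

print(len(LANGUAGES), len(all_directions), len(english_directions))  # 45 1980 44
```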
```bibtex
@misc{du2026bandwidthefficientprivacypreservingedgecloudmanytomany,
  title={Bandwidth-Efficient and Privacy-Preserving Edge-Cloud Many-to-Many Speech Translation},
  author={Yexing Du and Kaiyuan Liu and Youcheng Pan and Bo Yang and Ming Liu and Bing Qin and Yang Xiang},
  year={2026},
  eprint={2605.28642},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2605.28642},
}
```