ESRT: Edge-cloud Speech Recognition and Translation

This repository contains the weights for ESRT-4B, as presented in the paper Bandwidth-Efficient and Privacy-Preserving Edge-Cloud Many-to-Many Speech Translation.

ESRT supports many-to-many speech-to-text translation across 45 languages (45 × 44 directions). It uses an edge-cloud split inference architecture to protect voice privacy and reduce bandwidth by transmitting only compressed acoustic features instead of raw audio.
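
The 45 × 44 figure counts ordered source-target pairs in which the two languages differ. As a quick check of that arithmetic, here is a minimal shell sketch; `count_directions` is a hypothetical helper name and is not part of the repository.

```shell
# Number of ordered (source, target) pairs with source != target.
# Hypothetical helper for illustration only.
count_directions() {
  n="$1"
  echo $(( n * (n - 1) ))
}

count_directions 45   # prints 1980
```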

Paper: arXiv:2605.28642
Code: https://github.com/yxduir/esrt

Timeline

2026-05-29: macOS CPU support added.
2026-05-28: ESRT-4B released on Hugging Face with GPU support.

Setup

# Install uv (if not already installed)
# curl -LsSf https://astral.sh/uv/install.sh | sh

git clone https://github.com/yxduir/ESRT
cd ESRT
uv venv --python 3.10
source .venv/bin/activate
uv pip install -r requirements.txt

# On macOS, install the CPU requirements instead:
# uv pip install -r requirements_mac.txt

Note: The GPU setup includes vllm. macOS uses a CPU backend with transformers.
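
The choice between the two requirements files could be scripted, for example as below. This is a sketch under assumptions: `pick_requirements` is a hypothetical helper that is not part of the repository, and it treats any host that does not report `Darwin` from `uname -s` as a GPU host, which may not hold for every setup.

```shell
# Hypothetical helper: choose a requirements file by platform.
# Assumption: anything other than macOS ("Darwin") is a GPU host.
pick_requirements() {
  os="${1:-$(uname -s)}"
  if [ "$os" = "Darwin" ]; then
    echo "requirements_mac.txt"   # CPU backend with transformers
  else
    echo "requirements.txt"       # GPU setup, includes vllm
  fi
}

# Show which file would be used on this machine.
pick_requirements
```

With the virtual environment active, this could be used as `uv pip install -r "$(pick_requirements)"`.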

Test Data

hf download --repo-type dataset yxdu/fleurs_eng_test --local-dir ./fleurs_eng_test
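
Before running inference, it may be worth confirming that the download populated `./fleurs_eng_test`. The check below is a hypothetical pre-flight sketch and is not part of the repository; `check_test_data` is an invented name, and it only tests that the directory exists and is non-empty, not that the files are complete.

```shell
# Hypothetical pre-flight check: succeed only if the dataset directory
# exists and contains at least one entry.
check_test_data() {
  dir="${1:-./fleurs_eng_test}"
  if [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]; then
    echo "ok"
  else
    echo "missing"
    return 1
  fi
}

# Example: check_test_data ./fleurs_eng_test && bash run_eng_44.sh
```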

Inference

Inference runs in two stages: an edge side and a cloud side.

# Offline inference for performance evaluation.
# Of the 45 × 44 directions, this demo covers English to the other 44 languages.
bash run_eng_44.sh

# On macOS:
# bash run_test_mac.sh

# Online deployment guide coming soon.

Note: GPU inference supports only bf16.

Training

Training code will be open-sourced in a future release. It has been validated on the following hardware:

GPU: NVIDIA A100 80GB × 8
NPU: Huawei Ascend 910C 64GB × 8

Supported Languages

Family	Languages
Afro-Asiatic	Arabic, Hebrew
Austroasiatic	Khmer, Vietnamese
Austronesian	Indonesian, Malay, Tagalog
Dravidian	Tamil
Indo-European	Bengali, Bulgarian, Catalan, Czech, Danish, Dutch, English, French, German, Greek, Hindi, Croatian, Italian, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Spanish, Swedish, Urdu
Japonic	Japanese
Koreanic	Korean
Kra–Dai	Lao, Thai
Sino-Tibetan	Chinese, Burmese, Cantonese
Turkic	Azerbaijani, Kazakh, Turkish, Uzbek
Uralic	Finnish, Hungarian

Citation

@misc{du2026bandwidthefficientprivacypreservingedgecloudmanytomany,
      title={Bandwidth-Efficient and Privacy-Preserving Edge-Cloud Many-to-Many Speech Translation}, 
      author={Yexing Du and Kaiyuan Liu and Youcheng Pan and Bo Yang and Ming Liu and Bing Qin and Yang Xiang},
      year={2026},
      eprint={2605.28642},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2605.28642}, 
}
