Matcha-TTS - LiteRT (on-device, FFT-free, GPU)

On-device English text-to-speech for Android via LiteRT CompiledModel. This is the FFT-free TTS lane: Matcha-TTS pairs a conditional flow-matching (CFM) acoustic model with a HiFi-GAN time-domain vocoder, so there is no FFT/iSTFT anywhere in the synthesis path. 22.05 kHz, LJSpeech voice.

Converted from the official matcha_ljspeech + hifigan_T2_v1 checkpoints with litert-torch, re-authored to be ML-Drift-GPU-clean (per-graph tflite-vs-torch corr 1.000000; end-to-end waveform corr ≥0.99). fp16 weights.
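The card does not say which metric "corr" refers to. Assuming it is the Pearson correlation between the flattened TFLite and PyTorch outputs, a minimal sketch of such a check could look like this (`pearson_corr` is a hypothetical helper, not part of the conversion scripts):

```python
import math


def pearson_corr(a, b):
    """Pearson correlation between two equal-length sequences of floats.

    Assumption: this is the metric behind the "corr" figures; the card
    does not name it.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    denom = math.sqrt(var_a * var_b)
    if denom == 0.0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / denom
```

Under this reading, a value near 1.0 means the two outputs agree up to an affine scale and offset, so an absolute-error check is a useful complement.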

Files

| File | Size | In → Out | Delegate (Pixel 8a) |
|---|---|---|---|
| matcha_textenc_fp16.tflite | 15 MB | emb[1,256,192] + mask[1,1,256] → mu[1,80,256], logw[1,1,256] | GPU |
| matcha_decoder_fp16.tflite | 23 MB | x,mu[1,80,512] + t_sin[1,160] + mask[1,1,512] → v[1,80,512] | CPU¹ |
| matcha_vocoder_fp16.tflite | 29 MB | mel[1,80,512] → wav[1,1,131072] | GPU |
| dp_g2p_matcha_fp16.tflite | 26 MB | text[1,96] (char ids) → logits[1,96,64] (IPA) | CPU |
| emb.bin | 0.1 MB | phoneme embedding table (178×192 f32, host lookup) | host |
| g2p_dict.txt.gz | 1.8 MB | 275k-entry espeak-IPA dictionary (primary G2P) | host |
| config.json, g2p_meta.json | - | symbols, shapes, mel stats, G2P tokenizer tables | host |

¹ The CFM decoder runs on the CompiledModel CPU delegate. It converts GPU-clean and is correct on CPU, but the Mali ML Drift GPU delegate mis-fuses the decoder's transformer blocks at large activation magnitude: the same block is correct as a standalone GPU graph (corr 0.984) but collapses to corr 0.006 when fused, which points to a graph-fusion bug rather than a bad op. The text encoder and vocoder run on the GPU; the GPU vocoder dominates wall time, so the pipeline stays realtime (RTF ~0.8) even with the decoder on the CPU.
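The emb.bin row in the table describes a 178×192 float32 table that the host indexes directly. The on-disk layout is not documented here, so the sketch below assumes raw little-endian float32 values in row-major order with no header. That is at least consistent with the listed size, since 178 × 192 × 4 = 136,704 bytes ≈ 0.1 MB. `load_embedding_row` is a hypothetical helper.

```python
import struct

N_SYMBOLS = 178          # rows, from the table
EMB_DIM = 192            # columns, from the table
ROW_BYTES = EMB_DIM * 4  # float32 is 4 bytes


def load_embedding_row(blob, symbol_id):
    """Return the embedding vector for one phoneme id.

    Assumption: blob is raw little-endian float32, row-major, no header.
    """
    if len(blob) != N_SYMBOLS * ROW_BYTES:
        raise ValueError("unexpected table size: %d bytes" % len(blob))
    if not 0 <= symbol_id < N_SYMBOLS:
        raise IndexError("symbol id out of range: %d" % symbol_id)
    offset = symbol_id * ROW_BYTES
    return list(struct.unpack_from("<%df" % EMB_DIM, blob, offset))
```

In use, `blob` would be the full contents of emb.bin read in binary mode, and the rows for all phoneme ids would be stacked and zero-padded to the encoder's `emb[1,256,192]` input.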

Pipeline (host orchestration)

text --G2P(CPU dict+neural)--> phoneme ids
     --host: embed + intersperse + pad-->     text_encoder(GPU) -> mu, logw
     --host: durations + length-regulator-->  mu_y[1,80,T]
     --host: Euler ODE loop (N steps)-->        decoder(CPU) x N -> v
     --host: denormalize-->                     vocoder(GPU)     -> waveform
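The two middle host steps can be sketched in plain Python. This follows the upstream Matcha-TTS reference behaviour as an assumption (durations as the ceiling of exp(logw), fixed-step Euler from t = 0 to t = 1), not the sample app's actual code. `velocity_fn` is a hypothetical stand-in for one invocation of the decoder graph, and tensors are simplified to nested or flat lists rather than the channel-first layout in the table.

```python
import math


def length_regulate(mu, logw, mask, max_frames=512):
    """Expand per-phoneme features into per-frame features.

    mu:   one feature vector (list of floats) per phoneme position
    logw: predicted log-duration per phoneme position
    mask: 1.0 for real phonemes, 0.0 for padding
    Returns (frames, n_valid); frames is zero-padded to max_frames.
    """
    dim = len(mu[0])
    frames = []
    for vec, lw, m in zip(mu, logw, mask):
        n = int(math.ceil(math.exp(lw) * m))  # padded positions get 0 frames
        frames.extend(list(vec) for _ in range(n))
    frames = frames[:max_frames]              # truncate overlong utterances
    n_valid = len(frames)
    frames.extend([0.0] * dim for _ in range(max_frames - n_valid))
    return frames, n_valid


def euler_sample(x0, mu_y, velocity_fn, n_steps):
    """Integrate dx/dt = v(x, mu_y, t) from t=0 to t=1 with fixed Euler steps."""
    dt = 1.0 / n_steps
    x = list(x0)
    t = 0.0
    for _ in range(n_steps):
        v = velocity_fn(x, mu_y, t)           # one decoder call per step
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        t += dt
    return x
```

In the real pipeline `x0` would be Gaussian noise of shape [1,80,512], and the frame mask derived from `n_valid` would be passed to the decoder on every step.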

Shapes are fixed (256 phonemes, 512 mel frames ≈ 5.9 s of audio); a runtime float mask makes padded positions a no-op, so one compiled graph handles any length up to those limits.
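A minimal sketch of the padding side of this, in plain Python. The blank id of 0 and the intersperse pattern follow the upstream Matcha-TTS text front end and are assumptions here. The hop size of 256 is inferred from the vocoder shapes (131072 samples / 512 frames). The sketch pads ids for brevity, whereas the encoder input in this repo is the already-embedded sequence; the mask is the same either way.

```python
MAX_PHONEMES = 256   # text encoder sequence length
MAX_FRAMES = 512     # decoder / vocoder mel length
HOP = 256            # samples per mel frame (131072 / 512)
SAMPLE_RATE = 22050  # Hz


def intersperse(ids, blank=0):
    """Insert a blank id between and around phoneme ids: [a, b] -> [0, a, 0, b, 0]."""
    out = [blank] * (2 * len(ids) + 1)
    out[1::2] = ids
    return out


def pad_with_mask(ids, length=MAX_PHONEMES, pad_id=0):
    """Pad ids to a fixed length and build the matching float mask."""
    if len(ids) > length:
        raise ValueError("sequence longer than the compiled shape")
    n_pad = length - len(ids)
    padded = list(ids) + [pad_id] * n_pad
    mask = [1.0] * len(ids) + [0.0] * n_pad
    return padded, mask


def max_audio_seconds():
    """Longest utterance one pass can produce, in seconds."""
    return MAX_FRAMES * HOP / SAMPLE_RATE
```

`max_audio_seconds()` evaluates to 131072 / 22050, which is where the roughly 5.9 s limit comes from.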

G2P (espeak-free)

Matcha-LJSpeech is trained on espeak en-us IPA, but espeak is GPL. The clean replacement is a 275k-entry espeak-IPA dictionary (from OpenPhonemizer, Clear BSD) as the primary lookup, with DeepPhonemizer (MIT) running on LiteRT CPU for out-of-dictionary words. The output IPA maps 1:1 onto the keithito 178-symbol set.
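The dictionary-first, neural-fallback flow can be sketched as below. The on-disk format of g2p_dict.txt.gz is not described here, so the lexicon is taken as an already-loaded Python dict, and `fallback` is a hypothetical callable standing in for the DeepPhonemizer graph. The word splitting is deliberately naive and drops punctuation and digits, which a real front end would need to handle.

```python
import re


def g2p(text, lexicon, fallback):
    """Convert text to IPA: dictionary lookup first, fallback for unknown words.

    lexicon:  dict mapping lowercase word -> IPA string
    fallback: callable word -> IPA string (stand-in for the neural G2P model)
    """
    out = []
    for word in re.findall(r"[a-z']+", text.lower()):
        ipa = lexicon.get(word)
        if ipa is None:
            ipa = fallback(word)  # out-of-dictionary word
        out.append(ipa)
    return " ".join(out)
```

Keeping the dictionary as the primary path means the neural model only runs for the small share of words the lexicon misses.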

Sample

See the LiteRT compiled_model_api/text_to_speech sample (Matcha-TTS) in google-ai-edge/litert-samples for the full Android app and the conversion scripts.

License

Model: MIT (Matcha-TTS / HiFi-GAN). G2P dict: Clear BSD (OpenPhonemizer) + MIT (DeepPhonemizer).
