SenseVoiceSmall → CoreML (Apple Neural Engine)
CoreML conversion of FunAudioLLM/SenseVoiceSmall for on-device inference on Apple Silicon, intended for FluidInference/FluidAudio (tracks issues #645 / #646).
SenseVoiceSmall is a non-autoregressive multilingual ASR model (~234M params, SANM encoder + single CTC head) covering 50+ languages, with emotion and audio-event tags. One forward pass yields all output tokens.
Files (3-stage pipeline)
| File | Precision | Compute unit | Size | Role |
|---|---|---|---|---|
| SenseVoicePreprocessor.mlmodelc | FLOAT32 | CPU | 3 MB | front-end: waveform → 560-d LFR features |
| SenseVoiceSmall.mlmodelc | FLOAT16 | CPU_AND_NE (ANE) | 447 MB | default encoder+CTC |
| SenseVoiceSmall_int8.mlmodelc | INT8 (weights) | CPU_AND_NE (ANE) | 225 MB | ~half size, accuracy-neutral |
| SenseVoiceSmall_fp32.mlmodelc | FLOAT32 | any | 897 MB | encoder fallback (non-ANE) |
| vocab.json | n/a | n/a | n/a | 25055 SentencePiece tokens (array form) |

int8 is post-training weight quantization (linear_symmetric) and is accuracy-neutral
vs fp16 on the full canonical sets: LibriSpeech test-clean WER 3.22 → 3.25% (2,620 utterances),
AISHELL-1 test CER 3.09 → 3.09% (7,176 utterances), i.e. Δ +0.03 pp / 0.00 pp, with 0 NaN on ANE and peak
RAM 0.54 → 0.32 GB. Pick it for roughly half the on-disk and memory footprint.
Pipeline: waveform → [Preprocessor, fp32/CPU] → features → [encoder+CTC, fp16/ANE] → logits → host greedy-CTC decode.
⚠️ Compute-unit requirement. The FLOAT16 encoder is numerically correct on the Neural Engine but produces NaN on the CPU/GPU fp16 path. Load it with
MLModelConfiguration.computeUnits = .cpuAndNeuralEngine. On hardware without ANE (or under ANE fallback), use SenseVoiceSmall_fp32. The preprocessor must run in fp32 because the power-spectrum and log values exceed the fp16 range.
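The selection rule above can be sketched in Python, assuming the compiled models are loaded through coremltools' `CompiledMLModel` on macOS. `pick_encoder` and `load_encoder` are hypothetical helper names, not part of this repo, and detecting whether an ANE is present is left to the caller.

```python
def pick_encoder(has_ane: bool, int8: bool = False) -> tuple[str, str]:
    """Return (model file, coremltools ComputeUnit name) for the encoder."""
    if not has_ane:
        # The fp16 encoder produces NaN off the ANE, so use the fp32 fallback.
        return "SenseVoiceSmall_fp32.mlmodelc", "ALL"
    name = "SenseVoiceSmall_int8.mlmodelc" if int8 else "SenseVoiceSmall.mlmodelc"
    return name, "CPU_AND_NE"


def load_encoder(model_dir: str, has_ane: bool, int8: bool = False):
    """Load the chosen encoder (hypothetical helper; requires coremltools)."""
    import os

    # Imported lazily so the selection logic works without coremltools installed.
    import coremltools as ct

    name, unit = pick_encoder(has_ane, int8)
    return ct.models.CompiledMLModel(
        os.path.join(model_dir, name),
        compute_units=getattr(ct.ComputeUnit, unit),
    )
```

In Swift the equivalent choice is made once on `MLModelConfiguration.computeUnits` before loading the model.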
I/O
SenseVoicePreprocessor: input waveform [1, N] fp32 (16 kHz, scaled ×32768
like kaldi; flexible length). Output: features [1, T, 560] fp32.
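As a hedged illustration of the ×32768 scaling: raw 16-bit PCM sample values, cast to float, are already in that range, while samples decoded to [-1, 1] need the explicit multiply. The sketch below uses only the standard library; `load_kaldi_scaled` and `scale_unit_floats` are hypothetical names, and the result still has to be wrapped into a [1, N] fp32 array before it is fed to the preprocessor.

```python
import array
import sys
import wave


def load_kaldi_scaled(path: str) -> list[float]:
    """Read a 16 kHz mono 16-bit PCM WAV as floats in the int16 range."""
    with wave.open(path, "rb") as wav:
        fmt = (wav.getframerate(), wav.getnchannels(), wav.getsampwidth())
        if fmt != (16000, 1, 2):
            raise ValueError("expected 16 kHz mono 16-bit PCM")
        pcm = array.array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    # Raw int16 values equal the [-1, 1] float signal times 32768.
    return [float(s) for s in pcm]


def scale_unit_floats(samples: list[float]) -> list[float]:
    """Scale [-1, 1] float samples (e.g. from an audio decoder) by 32768."""
    return [s * 32768.0 for s in samples]
```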
SenseVoiceSmall (encoder+CTC):
| name | shape | dtype | notes |
|---|---|---|---|
| speech | [1, T, 560] | fp32 | preprocessor output; T ∈ enumerated buckets [128,256,512,1024,1800] (pad up) |
| speech_lengths | [1] | int32 | valid frame count (before padding) |
| language | [1] | int32 | embed index; 0 = auto |
| textnorm | [1] | int32 | 15 = no inverse text-norm (woitn), 14 = withitn |

Output: ctc_logits [1, T+4, 25055]. The 4 leading positions are the
language/emotion/event/itn query tokens; the rest are the transcript.
Host pre/post-processing
Pre: handled by SenseVoicePreprocessor (kaldi fbank80 → LFR m=7,n=6 → CMVN,
matching FunASR WavFrontend to max|Δ|≈2e-5). Pad its output up to the smallest
encoder bucket ≥ T.
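The bucket selection and padding step might look like the following minimal sketch. It uses plain Python lists so it runs without dependencies; zero is assumed as the pad value (the source does not state one), and `pick_bucket` and `pad_to_bucket` are hypothetical names.

```python
BUCKETS = (128, 256, 512, 1024, 1800)  # enumerated encoder shapes
FEATURE_DIM = 560


def pick_bucket(num_frames: int) -> int:
    """Smallest enumerated bucket that can hold num_frames feature frames."""
    for bucket in BUCKETS:
        if num_frames <= bucket:
            return bucket
    raise ValueError(f"{num_frames} frames exceed the largest bucket ({BUCKETS[-1]})")


def pad_to_bucket(features: list[list[float]]) -> tuple[list[list[float]], int]:
    """Pad T rows of 560 floats up to the bucket size.

    Returns (padded rows, valid frame count); the count is what goes into
    speech_lengths. Zero padding is an assumption.
    """
    valid = len(features)
    bucket = pick_bucket(valid)
    padding = [[0.0] * FEATURE_DIM for _ in range(bucket - valid)]
    return features + padding, valid
```

Audio longer than the largest bucket needs chunking on the host, which this sketch does not cover.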
Post (decode): greedy CTC over ctc_logits → collapse repeats → drop blank
(id 0) → SentencePiece detokenize → strip <|...|> tags for the clean
transcript. Reference Python is in the repo's decode.py.
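For orientation, the decode steps can be sketched in dependency-free Python. This is not the repo's decode.py: it assumes `vocab` is the vocab.json array indexed by token id, and it approximates SentencePiece detokenization by joining pieces and turning the `▁` word marker into a space. `valid_len` (for example speech_lengths + 4) is a hypothetical parameter for ignoring padded frames.

```python
import re

BLANK_ID = 0
TAG_RE = re.compile(r"<\|[^|]*\|>")  # matches tags such as <|zh|> or <|woitn|>


def greedy_ctc_ids(logits, valid_len=None):
    """Per-frame argmax, collapse consecutive repeats, drop blanks.

    logits: sequence of frames, each a sequence of vocab-sized scores.
    """
    frames = logits if valid_len is None else logits[:valid_len]
    ids, prev = [], None
    for frame in frames:
        best = max(range(len(frame)), key=frame.__getitem__)
        if best != prev and best != BLANK_ID:
            ids.append(best)
        prev = best
    return ids


def decode(logits, vocab, valid_len=None):
    """Return (tagged text, clean transcript)."""
    pieces = "".join(vocab[i] for i in greedy_ctc_ids(logits, valid_len))
    tagged = pieces.replace("\u2581", " ").strip()
    clean = TAG_RE.sub("", tagged).strip()
    return tagged, clean
```

Decoding runs over all T+4 positions, so the four leading query tokens appear as tags in the first return value and are removed in the second.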
language/textnorm are embed indices, mapped on the host:
lid_int_dict = {24884:3, 24885:4, 24888:7, 24892:11, 24896:12, 24992:13} # <|zh|> etc -> embed idx
textnorm_int_dict = {25016:14, 25017:15}
# language not in dict -> 0 (auto)
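A small sketch of how a host might build the two index inputs from human-readable options. The index values match the dictionaries above; the language codes attached to each index (zh, en, yue, ja, ko, nospeech) follow my reading of FunASR's SenseVoice code and should be checked against the installed version. `encoder_side_inputs` is a hypothetical helper.

```python
# Name -> embed index. Indices match lid_int_dict / textnorm_int_dict above;
# the language names are an assumption taken from FunASR's SenseVoice code.
LID_DICT = {"auto": 0, "zh": 3, "en": 4, "yue": 7, "ja": 11, "ko": 12, "nospeech": 13}
TEXTNORM_DICT = {"withitn": 14, "woitn": 15}


def encoder_side_inputs(language: str = "auto", use_itn: bool = False) -> dict:
    """Build the int32 [1]-shaped language and textnorm inputs as plain lists."""
    lang_idx = LID_DICT.get(language, 0)  # unknown language -> 0 (auto)
    norm_idx = TEXTNORM_DICT["withitn" if use_itn else "woitn"]
    return {"language": [lang_idx], "textnorm": [norm_idx]}
```

The lists would be converted to int32 arrays of shape [1] before calling the encoder.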
Verification & benchmarks
Conversion = PyTorch (FunASR) → torch.jit.trace → coremltools (FLOAT16,
EnumeratedShapes, iOS17). Measured on this machine (M-series), FunASR 1.3.9 /
coremltools 8.3.
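The coremltools step of that recipe could look roughly like the sketch below. It is not the conversion script used for this repo: it assumes a traced module that takes the four inputs listed in the I/O section and returns the logits, and `convert_encoder` and `speech_shapes` are hypothetical names.

```python
BUCKETS = (128, 256, 512, 1024, 1800)


def speech_shapes() -> list[tuple[int, int, int]]:
    """One [1, T, 560] shape per enumerated bucket."""
    return [(1, t, 560) for t in BUCKETS]


def convert_encoder(traced_model):
    """Convert a torch.jit-traced encoder+CTC module (assumed 4 inputs)."""
    # Imported lazily: coremltools and numpy are only needed for conversion.
    import numpy as np
    import coremltools as ct

    def index_input(name):
        return ct.TensorType(name=name, shape=(1,), dtype=np.int32)

    return ct.convert(
        traced_model,
        inputs=[
            ct.TensorType(
                name="speech",
                shape=ct.EnumeratedShapes(shapes=speech_shapes()),
                dtype=np.float32,
            ),
            index_input("speech_lengths"),
            index_input("language"),
            index_input("textnorm"),
        ],
        outputs=[ct.TensorType(name="ctc_logits")],
        compute_precision=ct.precision.FLOAT16,
        minimum_deployment_target=ct.target.iOS17,
        convert_to="mlprogram",
    )
```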
End-to-end correctness: on the cached zh sample, the CoreML(ANE) → greedy-CTC pipeline reproduces FunASR am.generate exactly:
<|zh|><|NEUTRAL|><|Speech|><|woitn|>欢迎大家来体验达摩院推出的语音识别模型
Parity (torch ↔ CoreML, ANE): CTC argmax token agreement 100% on real audio.
LibriSpeech test-clean (canonical; matches the official chart): CoreML(ANE) 3.21% WER (torch 3.26%) on n=100, vs the published SenseVoice-Small ~3.1%. This indicates that the full pipeline (front-end + CoreML + decode) reproduces the published result. (For the full 2620-utterance split number, see the repo README.)
FLEURS WER (CoreML ANE vs torch), 100 samples/lang. Conversion is accuracy-neutral:

| lang | torch | CoreML (ANE) | Δ | RTFx |
|---|---|---|---|---|
| en_us (WER) | 9.52% | 9.52% | +0.00pp | 402 |
| cmn_hans_cn (CER) | 9.60% | 9.57% | −0.03pp | 372 |

FLEURS is a harder and different read-speech set than LibriSpeech/AISHELL, so its absolute numbers are not comparable to the official benchmark chart; it is used here only for cross-language CoreML ↔ torch parity.
RTFx (5.55 s clip, by bucket, ANE): 128 → 524, 256 → 274, 512 → 97, 1024 → 36, 1800 → 14.5. (M-series; iPhone ANE not yet measured.)
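To make the RTFx figures concrete, the snippet below converts them to per-call latency for the 5.55 s clip. It assumes the usual definition RTFx = audio duration / processing time, which the text above does not spell out; `latency_ms` is a hypothetical helper.

```python
def latency_ms(audio_seconds: float, rtfx: float) -> float:
    """Processing time in milliseconds, assuming RTFx = audio time / wall time."""
    return audio_seconds / rtfx * 1000.0


# RTFx per encoder bucket for the 5.55 s clip, as reported above.
RTFX_BY_BUCKET = {128: 524, 256: 274, 512: 97, 1024: 36, 1800: 14.5}

for bucket, rtfx in RTFX_BY_BUCKET.items():
    print(f"bucket {bucket:>4}: {latency_ms(5.55, rtfx):7.1f} ms")
```

Under that assumption the same clip costs about 10.6 ms in the 128 bucket and several hundred milliseconds in the 1800 bucket, which is why padding to the smallest bucket that fits matters.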
License & attribution
Weights derive from FunAudioLLM/SenseVoiceSmall; the upstream model license applies. This repo only contains a format conversion (no retraining). See the SenseVoice and FunASR projects.