VoxCPM2 2B - Core AI (on-device, 48 kHz)
OpenBMB VoxCPM2 (2B) converted to Apple Core AI, running fully on-device on iPhone (A19 Pro / iPhone 17 Pro) and Mac, with no network needed at inference time. It is the 2B, 48 kHz successor to VoxCPM-0.5B-CoreAI.
A tokenizer-free diffusion TTS: a MiniCPM4 28-layer text-semantic LM + an 8-layer residual acoustic LM drive a 12-layer LocDiT flow-matching diffusion head, decoded by a 48 kHz AudioVAE. Five Core AI bundles + a few host-side projections.
Use it
▶️ Run it (source) with the Speak runner (GUI + CLI, one app for every text-to-speech model in the catalog):
git clone https://github.com/john-rocky/coreai-kit
open coreai-kit/Examples/Speak/Speak.xcodeproj
# Run, then pick "VoxCPM2 2B" in the model picker
# agents / headless (macOS):
cd coreai-kit/Examples/Speak
swift run speak-cli --model voxcpm2-2b --text "Hello from Core AI." --output hello.wav
💻 Build with it (complete; the glue is kit API, so copy-paste runs):
import CoreAIKit
let speaker = try await KitSpeaker(catalog: "voxcpm2-2b")
let audio = try await speaker.synthesize(text)
// audio.samples: 48 kHz mono PCM in [-1, 1]; play it or write a WAV
The take-home is Examples/Speak/Sources/QuickStart.swift: this exact code as one typed function, no UI. The CLI is an argument shell over it, and
the GUI drives the same KitSpeaker(catalog:) and plays the samples.
Live playback? synthesizeStreaming(_:onChunk:) hands you ~0.5 s chunks as they decode,
so audio starts before the whole clip exists. The WAV container is your app's territory
(the runner ships a 20-line writer).
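Since the WAV container is left to the app, here is a minimal sketch of such a writer. It is not the runner's own writer: the function name `wavData` and the choice of 16-bit integer PCM are assumptions made for illustration. It takes mono Float samples in [-1, 1] and returns a complete RIFF/WAVE file in memory.

```swift
import Foundation

/// Hypothetical helper (not part of CoreAIKit): encodes mono Float samples
/// in [-1, 1] as a 16-bit PCM WAV file held in memory.
func wavData(samples: [Float], sampleRate: Int = 48_000) -> Data {
    let bytesPerSample = 2
    let dataSize = samples.count * bytesPerSample
    var data = Data(capacity: 44 + dataSize)

    // RIFF stores every integer field in little-endian byte order.
    func append32(_ v: UInt32) {
        withUnsafeBytes(of: v.littleEndian) { data.append(contentsOf: $0) }
    }
    func append16(_ v: UInt16) {
        withUnsafeBytes(of: v.littleEndian) { data.append(contentsOf: $0) }
    }

    data.append(contentsOf: Array("RIFF".utf8))
    append32(UInt32(36 + dataSize))                 // file size minus the first 8 bytes
    data.append(contentsOf: Array("WAVE".utf8))

    data.append(contentsOf: Array("fmt ".utf8))
    append32(16)                                    // fmt chunk size for plain PCM
    append16(1)                                     // format tag 1 = integer PCM
    append16(1)                                     // channel count (mono)
    append32(UInt32(sampleRate))
    append32(UInt32(sampleRate * bytesPerSample))   // byte rate
    append16(UInt16(bytesPerSample))                // block align
    append16(16)                                    // bits per sample

    data.append(contentsOf: Array("data".utf8))
    append32(UInt32(dataSize))
    for s in samples {
        // Clamp to [-1, 1]; non-finite samples are written as silence.
        let clamped: Float = s.isFinite ? max(-1, min(1, s)) : 0
        let v = Int16((clamped * 32767).rounded())
        append16(UInt16(bitPattern: v))
    }
    return data
}
```

With the `audio.samples` array from the snippet above, `try wavData(samples: audio.samples).write(to: url)` would save the clip. Writing 32-bit float WAV instead would avoid the quantization step at the cost of a file twice as large.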
Integration checklist
- SPM: https://github.com/john-rocky/coreai-kit → product CoreAIKit
- Info.plist: none needed
- Entitlements: none needed
- First run downloads the model (4.7 GB on Mac, 5.7 GB on iPhone), then it loads from the local cache (Application Support; progress via the downloadProgress callback)
- Measure in Release: Debug is ~3× slower on per-token host work
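The SPM item in the checklist could look like the manifest below. This is a sketch under assumptions: the package and target name `MyTTSApp` is hypothetical, and `branch: "main"` is a placeholder, so pin to whatever tag or revision the repository actually publishes. Only the repository URL and the `CoreAIKit` product name come from the checklist.

```swift
// swift-tools-version:5.9
import PackageDescription

let package = Package(
    name: "MyTTSApp", // hypothetical app name
    dependencies: [
        // Assumption: tracking "main"; replace with a real tag or revision.
        .package(url: "https://github.com/john-rocky/coreai-kit", branch: "main"),
    ],
    targets: [
        .executableTarget(
            name: "MyTTSApp",
            dependencies: [
                // Product name as given in the checklist.
                .product(name: "CoreAIKit", package: "coreai-kit"),
            ]
        ),
    ]
)
```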
What's inside
| dir | contents |
|---|---|
| macos/ | JIT .aimodel bundles (Mac): int8 base/res decode + prefill, fp16 feat_decoder / feat_encoder / vocoder |
| ios/ | AOT .aimodelc bundles (iOS h18p, GPU): same five + the two int8 prefill bundles |
| voxcpm2_host_glue/ | embed table + projections / FSQ-512 / stop-head / fusion (.bin + manifest) |
| tokenizer/ | the VoxCPM2 tokenizer (Llama fast) |
The backbone LMs are weight-only int8 (the size driver); the diffusion + VAE stay fp16 (the continuous-feedback path is quant-sensitive; this is the same split mlx-community uses).
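To illustrate what weight-only int8 means in general, here is a minimal sketch of symmetric per-tensor quantization. The exact scheme used for these bundles (per-channel scales, grouping, and so on) is not stated here, so treat this as a generic example with hypothetical function names, not the conversion pipeline's method.

```swift
/// Symmetric per-tensor int8 quantization: w ≈ scale * q with q in [-127, 127].
/// Generic illustration only; assumes all weights are finite.
func quantizeInt8(_ weights: [Float]) -> (q: [Int8], scale: Float) {
    let maxAbs = weights.map { abs($0) }.max() ?? 0
    // An all-zero (or empty) tensor gets scale 1 to avoid dividing by zero.
    let scale: Float = maxAbs > 0 ? maxAbs / 127 : 1
    let q = weights.map { w -> Int8 in
        let r = (w / scale).rounded()
        return Int8(max(-127, min(127, r)))
    }
    return (q, scale)
}

/// Recovers approximate Float weights from int8 codes and their scale.
func dequantizeInt8(_ q: [Int8], scale: Float) -> [Float] {
    q.map { Float($0) * scale }
}
```

Each weight is stored in one byte instead of two (fp16), and the round-trip error per weight is at most about half a quantization step (`scale / 2`). That small, bounded error is usually tolerable in LM weights, while a path that feeds its own continuous output back into itself can accumulate it.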
On-device numbers (iPhone 17 Pro, int8 + prefill + streaming)
- RTF 1.19, first-audio 0.65 s, 48 kHz, ~4.9 GB resident (increased-memory entitlement).
- Streaming starts after the first ~0.65 s; the 2B is ~4× the 0.5B, so RTF sits just above realtime.
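To make the RTF figure concrete, the sketch below uses the conventional definition (wall-clock generation time divided by audio duration) and a simplified steady-rate model. The function names are hypothetical, and the playback-delay estimate ignores chunk granularity and first-audio latency, so read it as back-of-the-envelope arithmetic rather than a measured guarantee.

```swift
/// Real-time factor: wall-clock generation time divided by audio duration.
func realTimeFactor(generationSeconds: Double, audioSeconds: Double) -> Double {
    generationSeconds / audioSeconds
}

/// Wall-clock time needed to generate a clip at a steady RTF.
func generationTime(audioSeconds: Double, rtf: Double) -> Double {
    audioSeconds * rtf
}

/// Smallest playback start delay that avoids underruns for a clip of the
/// given length, assuming audio is produced at a constant rate of 1/rtf.
func minimumStartDelay(audioSeconds: Double, rtf: Double) -> Double {
    max(0, audioSeconds * (rtf - 1))
}
```

At RTF 1.19, a 10 s clip takes about 11.9 s to generate, so under this model streamed playback would need roughly a 1.9 s head start to play through without gaps. Shorter clips need proportionally less.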
Use it
Runs through coreai-kit VoxCPM2TTS, wired into the
coreai-model-zoo coreai-audio app ("Voice 2B" tab).
Conversion + gates + export scripts: coreai-model-zoo/conversion/voxcpm/ (*_v2.py).
let tts = try await VoxCPM2TTS(paths: .standard(artifactsRoot: root, lm: .int8))
let wav = try await tts.synthesize("On device speech synthesis, running entirely on your iPhone.") // 48 kHz Float PCM
Verification
Reimplemented in exportable Core AI overlays and gated end-to-end against the official model: backbone / feat_decoder / feat_encoder cos 1.0, full chain magspec 0.996; every exported bundle engine-gated cos ≥ 0.9999.
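A cosine-similarity gate of the kind quoted above can be sketched as follows. This is an illustration, not the repository's gate scripts: the function names are hypothetical, and only the 0.9999 threshold comes from the text.

```swift
/// Cosine similarity between two equally sized tensors, flattened to arrays.
/// Accumulates in Double to limit rounding error on long vectors.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Double {
    precondition(a.count == b.count && !a.isEmpty, "inputs must match and be non-empty")
    var dot = 0.0, normA = 0.0, normB = 0.0
    for (x, y) in zip(a, b) {
        let dx = Double(x), dy = Double(y)
        dot += dx * dy
        normA += dx * dx
        normB += dy * dy
    }
    // A zero vector has no direction; report 0 rather than dividing by zero.
    guard normA > 0, normB > 0 else { return 0 }
    return dot / (normA.squareRoot() * normB.squareRoot())
}

/// Hypothetical gate: passes when the candidate output matches the reference
/// to at least the given cosine similarity.
func passesGate(reference: [Float], candidate: [Float], threshold: Double = 0.9999) -> Bool {
    cosineSimilarity(reference, candidate) >= threshold
}
```

Cosine similarity ignores overall scale, so a gate like this catches changes in the direction of an output tensor but not a uniform gain error; a separate magnitude check would be needed for that.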
License
Apache-2.0 (commercial OK), inherited from openbmb/VoxCPM2. Not affiliated with OpenBMB or Apple. Community port.