VoxCPM2 2B: Core AI (on-device, 48 kHz)

OpenBMB VoxCPM2 (2B) converted to Apple Core AI, running fully on-device on iPhone (A19 Pro / iPhone 17 Pro) and Mac, with no network access needed at inference time. It is the 2B, 48 kHz successor to VoxCPM-0.5B-CoreAI.

A tokenizer-free diffusion TTS: a MiniCPM4 28-layer text-semantic LM and an 8-layer residual acoustic LM drive a 12-layer LocDiT flow-matching diffusion head, whose output is decoded by a 48 kHz AudioVAE. The model ships as five Core AI bundles plus a few host-side projections.

Use it

▶️ Run it (source) with the Speak runner (GUI + CLI, one app for every text-to-speech model in the catalog):

git clone https://github.com/john-rocky/coreai-kit
open coreai-kit/Examples/Speak/Speak.xcodeproj
# → Run, then pick "VoxCPM2 2B" in the model picker

# agents / headless (macOS):
cd coreai-kit/Examples/Speak
swift run speak-cli --model voxcpm2-2b --text "Hello from Core AI." --output hello.wav

💻 Build with it: the snippet is complete, the glue is kit API, and it runs as pasted:

import CoreAIKit

let text = "Hello from Core AI."
let speaker = try await KitSpeaker(catalog: "voxcpm2-2b")
let audio = try await speaker.synthesize(text)
// audio.samples: 48 kHz mono PCM in [-1, 1]; play it or write a WAV

The take-home is Examples/Speak/Sources/QuickStart.swift: this exact code as one typed function, with no UI. The CLI is an argument shell over it, and the GUI drives the same KitSpeaker(catalog:) and plays the samples. For live playback, synthesizeStreaming(_:onChunk:) hands you ~0.5 s chunks as they decode, so audio starts before the whole clip exists. Writing the WAV container is left to your app (the runner ships a 20-line writer).
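For the WAV step, here is a minimal sketch of a writer, assuming the samples arrive as a [Float] of mono PCM in [-1, 1] at 48 kHz. The function name wavData is hypothetical and is not the runner's own writer; it emits a plain 16-bit PCM RIFF/WAVE file using only Foundation.

```swift
import Foundation

/// Encodes mono Float samples in [-1, 1] as an in-memory 16-bit PCM WAV file.
/// `wavData` is an illustrative name, not part of coreai-kit.
func wavData(samples: [Float], sampleRate: Int = 48_000) -> Data {
    let channels = 1, bitsPerSample = 16
    let blockAlign = channels * bitsPerSample / 8
    let dataSize = samples.count * blockAlign

    var data = Data()
    // Appends an integer in little-endian byte order, as RIFF requires.
    func put<T: FixedWidthInteger>(_ value: T) {
        withUnsafeBytes(of: value.littleEndian) { data.append(contentsOf: $0) }
    }

    data.append(contentsOf: "RIFF".utf8); put(UInt32(36 + dataSize))
    data.append(contentsOf: "WAVE".utf8)
    data.append(contentsOf: "fmt ".utf8); put(UInt32(16)) // fmt chunk size
    put(UInt16(1))                                        // format 1 = integer PCM
    put(UInt16(channels))
    put(UInt32(sampleRate))
    put(UInt32(sampleRate * blockAlign))                  // byte rate
    put(UInt16(blockAlign))
    put(UInt16(bitsPerSample))
    data.append(contentsOf: "data".utf8); put(UInt32(dataSize))

    for sample in samples {
        // Clamp to [-1, 1], then scale to the Int16 range.
        let clamped = max(-1, min(1, sample))
        put(Int16((clamped * 32767).rounded()))
    }
    return data
}
```

If audio.samples is indeed a [Float] (an assumption here), writing the file is then `try wavData(samples: audio.samples).write(to: url)`. Converting to 16-bit keeps the file small; a float WAV (format 3) would preserve the samples exactly at twice the size.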

Integration checklist

  • SPM: https://github.com/john-rocky/coreai-kit → product CoreAIKit
  • Info.plist: none needed
  • Entitlements: none needed
  • First run downloads the model (4.7 GB on Mac, 5.7 GB on iPhone); after that it loads from the local cache in Application Support. Progress is reported via the downloadProgress callback
  • Measure in Release: Debug is ~3× slower on per-token host work

What's inside

dir                 contents
macos/              JIT .aimodel bundles (Mac): int8 base/res decode + prefill, fp16 feat_decoder / feat_encoder / vocoder
ios/                AOT .aimodelc bundles (iOS h18p, GPU): same five + the two int8 prefill bundles
voxcpm2_host_glue/  embed table + projections / FSQ-512 / stop-head / fusion (.bin + manifest)
tokenizer/          the VoxCPM2 tokenizer (Llama fast)

The backbone LMs are weight-only int8 (they drive the download size); the diffusion head and VAE stay fp16, because the continuous-feedback path is quantization-sensitive. mlx-community uses the same split.
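To make "weight-only int8" concrete, here is a minimal sketch of symmetric per-row quantization: each row of weights is stored as Int8 values plus one Float scale, and is expanded back to floating point at compute time. This is an assumption for illustration only; the actual scheme used by the conversion scripts (granularity, symmetric vs. asymmetric) is not stated here, and the names QuantizedRow, quantizeRow, and dequantizeRow are hypothetical.

```swift
/// One row of weights stored as int8 values plus a single scale (illustrative type).
struct QuantizedRow {
    let values: [Int8]
    let scale: Float
}

/// Symmetric quantization: map the largest magnitude in the row to ±127.
func quantizeRow(_ weights: [Float]) -> QuantizedRow {
    let maxAbs = weights.map { abs($0) }.max() ?? 0
    // An all-zero (or empty) row gets scale 1 to avoid dividing by zero.
    let scale: Float = maxAbs > 0 ? maxAbs / 127 : 1
    let values = weights.map { w -> Int8 in
        let q = (w / scale).rounded()
        return Int8(max(-127, min(127, q)))
    }
    return QuantizedRow(values: values, scale: scale)
}

/// Expands the int8 row back to Float; activations are never quantized.
func dequantizeRow(_ row: QuantizedRow) -> [Float] {
    row.values.map { Float($0) * row.scale }
}
```

Under this scheme each weight costs one byte instead of two (fp16), and the round-trip error per weight is at most half a quantization step, which a language-model backbone typically tolerates better than a feedback loop that re-consumes its own continuous output.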

On-device numbers (iPhone 17 Pro, int8 + prefill + streaming)

  • RTF 1.19, first-audio 0.65 s, 48 kHz, ~4.9 GB resident (increased-memory entitlement).
  • Streaming starts after the first ~0.65 s; the 2B is ~4× the size of the 0.5B, so RTF sits just above 1.0 (real time).
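A small worked example of the RTF figure, assuming the usual convention that RTF is synthesis time divided by audio duration, so values above 1.0 are slower than real time. The function names are illustrative and not part of coreai-kit; the lead-time estimate assumes a steady generation rate and ignores the first-audio latency.

```swift
/// Real-time factor: wall-clock synthesis time divided by the duration of the audio produced.
func realTimeFactor(synthesisSeconds: Double, sampleCount: Int, sampleRate: Double = 48_000) -> Double {
    let audioSeconds = Double(sampleCount) / sampleRate
    return synthesisSeconds / audioSeconds
}

/// Head start needed before playback so a clip of `audioSeconds` never underruns,
/// assuming audio is generated at a constant rate given by `rtf`.
func requiredLeadSeconds(audioSeconds: Double, rtf: Double) -> Double {
    // Generation finishes at rtf * audioSeconds; playback takes audioSeconds.
    max(0, (rtf - 1) * audioSeconds)
}
```

Under these assumptions, a 10 s clip at RTF 1.19 takes about 11.9 s to synthesize, so a head start of roughly 1.9 s keeps streamed playback from underrunning; at RTF 1.0 or below, no head start is needed.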

Use it

Runs through coreai-kit VoxCPM2TTS, wired into the coreai-model-zoo coreai-audio app ("Voice 2B" tab). Conversion + gates + export scripts: coreai-model-zoo/conversion/voxcpm/ (*_v2.py).

let tts = try await VoxCPM2TTS(paths: .standard(artifactsRoot: root, lm: .int8))
let wav = try await tts.synthesize("On device speech synthesis, running entirely on your iPhone.") // 48 kHz Float PCM

Verification

The model is reimplemented as exportable Core AI overlays and gated end-to-end against the official model: backbone / feat_decoder / feat_encoder reach cos 1.0, the full chain reaches magspec 0.996, and every exported bundle passes an engine gate of cos ≥ 0.9999.
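A cosine gate of this kind can be sketched as follows. This is a minimal illustration, assuming both outputs are flattened to equal-length [Float] arrays; the function names are hypothetical and the real gate scripts live in the conversion directory mentioned above.

```swift
/// Cosine similarity between two equal-length vectors, accumulated in Double
/// so long outputs do not lose precision.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Double {
    precondition(a.count == b.count && !a.isEmpty, "vectors must be non-empty and equal length")
    var dot = 0.0, normA = 0.0, normB = 0.0
    for (x, y) in zip(a, b) {
        dot += Double(x) * Double(y)
        normA += Double(x) * Double(x)
        normB += Double(y) * Double(y)
    }
    let denominator = (normA * normB).squareRoot()
    // Treat an all-zero vector as a failed match rather than dividing by zero.
    return denominator == 0 ? 0 : dot / denominator
}

/// True when the candidate output matches the reference at or above the threshold.
func passesGate(reference: [Float], candidate: [Float], threshold: Double = 0.9999) -> Bool {
    cosineSimilarity(reference, candidate) >= threshold
}
```

Cosine similarity ignores overall scale, so a gate like this catches structural divergence between the reference and the exported bundle but would not flag a uniform gain error; pairing it with a magnitude check covers that case.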

License

Apache-2.0 (commercial OK), inherited from openbmb/VoxCPM2. Not affiliated with OpenBMB or Apple. Community port.
