fluxions.ai

Vui: Streaming Conversational Voice Assistant

Pronounced "vooey" (rhymes with Louie) · by fluxions.ai

👉 Full code, install, docs, and the streaming voice assistant: github.com/fluxions-ai/vui

📖 Launch blog post: design notes, demos, and what's next.

Vui is a real-time voice assistant: you speak into your mic, the server transcribes your speech, runs a local LLM, and streams a TTS reply back, all from a single Python server. It is built around Vui Nano, a 300M-parameter speech transformer that runs over the Qwen3-TTS codec and was trained on conversational speech with breaths, laughter, hesitations, and multi-speaker dialogue.

Features

  • Vui Nano (300M): Llama-style decoder + RQ-Transformer head over the Qwen3-TTS-12Hz codec
  • Real-time voice loop: WebRTC + WebSocket pipeline (ASR → LLM → TTS) with a browser UI, VAD-driven turn taking, speculative LLM prefill while you're still speaking, sentence-level TTS chunking with backpressure
  • Barge-in: start talking mid-reply, the model cancels and listens
  • Streaming TTS: ~9× realtime on a 4090, bf16 inference, CUDA graphs
  • OpenAI Realtime API compatible: drop-in ws://…/v1/realtime for clients written against OpenAI's spec (docs/realtime-api.md)
  • One-shot voice-note REST endpoint: POST /v1/voice-note runs the whole ASR → LLM → TTS pipeline in a single HTTP call (audio in, JSON out)
  • Standalone TTS demo: demo.py Gradio playground for the model on its own
  • Voice cloning: upload an audio sample to clone any speaker; 4 fine-tuned presets shipped (maeve, abraham, rhian, harry)
  • SQ / WPS conditioning: bias generation on six speech-quality channels and words-per-second
  • Hot-swap models: pick Ollama LLM and ASR backend live from the UI
  • Pluggable ASR: faster-whisper (GPU) or Moonshine (CPU streaming, ONNX)
  • Pluggable LLM backends: Ollama, vLLM, any OpenAI-compatible endpoint
  • Memories: assistant remembers facts about you across sessions
  • Thoughts stream: parallel LLM routes voice intent to ~10 tools (memory ops, task control, delegation) without a wake-word grammar; pluggable for your own local tools
  • Optional Claude task server: sidecar agent that handles slow/agentic work (Gmail, Calendar, Drive, Slack, web search) via your existing Claude Code MCPs
  • Apple Silicon support: MLX backend (WIP)
  • Mobile-ready: documented cloudflared and Tailscale paths for phone access with mic over HTTPS
  • Docker compose: one file brings up the full stack
  • OpenClaw integration: point OpenClaw's openai realtime provider at Vui for a fully-local voice front-end
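
To make "sentence-level TTS chunking with backpressure" from the feature list concrete, here is a minimal sketch of the general pattern. It is not Vui's implementation: the function names, the regex, and the queue size are all assumptions made for illustration. A producer splits streamed LLM text into sentences and pushes them onto a bounded asyncio.Queue, so it pauses whenever the TTS consumer falls behind.

```python
import asyncio
import re

# Split on whitespace that follows sentence-final punctuation (a deliberately simple rule).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


async def produce_sentences(token_stream, queue):
    """Accumulate streamed text and enqueue one complete sentence at a time."""
    buffer = ""
    async for piece in token_stream:
        buffer += piece
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            await queue.put(sentence)  # waits while the queue is full (backpressure)
        buffer = parts[-1]
    if buffer.strip():
        await queue.put(buffer.strip())
    await queue.put(None)  # sentinel: no more text


async def consume_sentences(queue, synthesize):
    """Pull sentences off the queue and pass each to a (hypothetical) TTS callable."""
    spoken = []
    while (sentence := await queue.get()) is not None:
        spoken.append(await synthesize(sentence))
    return spoken


async def fake_llm():
    # Stand-in for a streaming LLM reply, delivered in arbitrary fragments.
    for piece in ["Hi the", "re. How are", " you today? I am", " fine"]:
        yield piece


async def fake_tts(sentence):
    await asyncio.sleep(0)  # stand-in for synthesis latency
    return sentence


async def run_demo():
    queue = asyncio.Queue(maxsize=2)  # small bound so the producer cannot run far ahead
    producer = asyncio.create_task(produce_sentences(fake_llm(), queue))
    spoken = await consume_sentences(queue, fake_tts)
    await producer
    return spoken


sentences = asyncio.run(run_demo())
```

The bounded queue is the whole backpressure mechanism here: `queue.put` suspends the producer once two sentences are waiting, which keeps the LLM side from buffering an unbounded amount of text ahead of playback.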

Install (one-liner)

curl -fsSL https://install.fluxions.ai | bash

Clones into ~/vui, auto-detects Docker vs. native, installs deps (uv, Ollama, ffmpeg, Claude Code CLI), pulls the model from this repo, and launches the stack on http://localhost:8080.

Full Docker compose / native install, mobile setup, configuration, ASR options, and the Claude task server are all covered in the GitHub README.

TTS demo on its own

git clone https://github.com/fluxions-ai/vui
cd vui
uv sync
python demo.py                                          # Gradio UI: upload your own voice prompt
python demo.py --render --prompt prompts/abraham.wav    # CLI render with a preset voice

The Vui checkpoint and Qwen codec download automatically from this repo on first run.

Preset voices

| Voice | Description |
| --- | --- |
| maeve | Recommended default. Female Irish accent; beautiful, but may be hard for non-UK listeners to follow |
| abraham | British, well-spoken, exciting energy and personality; conscientious, good at emotionally difficult subjects |
| rhian | More traditional British accent, slightly hesitant speaking style |
| harry | British male accent, mumbly |

More personalities coming soon! Got a voice or character you'd like to hear? Open an issue or let us know on Discord.

Python API

from vui.engine import Engine, GenConfig

engine = Engine.from_checkpoint("vui-nano.safetensors")
with engine.new_row() as row:
    audio = row.render(
        "So [breath] the thing about this is, it's not what you'd expect, right?",
        GenConfig(temperature=0.7),
    )

Tip: try turning the repetition penalty off. GenConfig defaults to rep_penalty=1.1 to break long silence/filler loops, but it can flatten prosody and distort natural repetition. Setting it to 0 (anything <= 1.0 disables the penalty path) often gives more natural-sounding output, so it is worth trying if generations sound stilted or over-corrected.
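
For readers unfamiliar with the mechanism, the sketch below shows one common way a repetition penalty is applied to logits before sampling. This is an assumption about the general technique, not a copy of Vui's code; `apply_rep_penalty` is a hypothetical helper. The only behaviour taken from the tip above is that any value <= 1.0 skips the penalty path.

```python
def apply_rep_penalty(logits, history, penalty):
    """Return a copy of `logits` with already-generated token ids made less likely."""
    if penalty <= 1.0:
        # Matches the behaviour described in the tip: 0, 1.0, or anything below is a no-op.
        return list(logits)
    out = list(logits)
    for token_id in set(history):
        # Divide positive logits and multiply negative ones, so both become less likely.
        if out[token_id] > 0:
            out[token_id] /= penalty
        else:
            out[token_id] *= penalty
    return out


logits = [2.0, -1.0, 0.5]
penalised = apply_rep_penalty(logits, history=[0, 1, 0], penalty=2.0)
unchanged = apply_rep_penalty(logits, history=[0, 1, 0], penalty=0)
```

Going by the field name in the tip, disabling the penalty in the API above would look like GenConfig(temperature=0.7, rep_penalty=0).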

For long voice prompts (>15s) you need proper multi-segment chunking: vui.prompt_utils.build_prompt_segments does ASR + forced alignment + sentence-boundary splits at ~10s targets so the model keeps its speaker conditioning across the full reference. The full Python guide, covering chunked prompts, streaming, continuous batching, codes-only decode, and the MLX path, is in docs/python-api.md.
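
The splitting step can be pictured with a small greedy packer. This is only a sketch of the idea under stated assumptions: it takes sentence spans with timestamps as if ASR and forced alignment had already run, and `pack_segments` is a hypothetical helper, not the signature or logic of build_prompt_segments.

```python
def pack_segments(sentences, target_s=10.0):
    """Greedily group (start_s, end_s, text) sentence spans into segments of about target_s seconds."""
    groups, current = [], []
    for start, end, text in sentences:
        # Close the running segment if adding this sentence would overshoot the target.
        if current and end - current[0][0] > target_s:
            groups.append(current)
            current = []
        current.append((start, end, text))
    if current:
        groups.append(current)
    # Collapse each group into a single (start_s, end_s, text) span.
    return [(g[0][0], g[-1][1], " ".join(t for _, _, t in g)) for g in groups]


# Hypothetical aligned sentences from a 21-second reference clip.
aligned = [
    (0.0, 4.0, "First sentence."),
    (4.2, 9.5, "Second sentence."),
    (9.8, 15.0, "Third sentence."),
    (15.5, 21.0, "Fourth sentence."),
]
segments = pack_segments(aligned)
```

Cutting only at sentence boundaries means a single long sentence can still exceed the target; a real implementation would need a fallback for that case.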

Vui Nano

A 300M autoregressive LM over the Qwen3-TTS speech codec, and the first in the Vui model family. The codec and speaker encoder are reused from Alibaba's Qwen3-TTS-12Hz-0.6B-Base.

  • 300M parameters, Llama-style decoder + RQ-Transformer head: 768 dim, 22 layers, 8 heads
  • Codec: Qwen3-TTS-Tokenizer-12Hz, with 16 codebooks of 2048 entries at 12.5 Hz, 24 kHz audio (decoded), pure-PyTorch reimplementation in src/vui/qwen_codec.py
  • Speaker encoder: ECAPA-TDNN from Qwen3-TTS-12Hz-0.6B-Base (8.9M params, 1024-dim), used at training time to embed reference speakers
  • Output: 24 kHz audio, bf16 inference, ~9× realtime streaming on a 4090
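
The figures in this list imply a few derived numbers that are useful when reasoning about throughput. The arithmetic below uses only the values quoted above; the variable names are illustrative, and the realtime factor is the approximate one stated for a 4090.

```python
import math

FRAME_RATE_HZ = 12.5      # codec frames per second of audio
N_CODEBOOKS = 16          # RVQ levels per frame
CODEBOOK_SIZE = 2048      # entries per codebook
SAMPLE_RATE_HZ = 24_000   # decoded audio sample rate
REALTIME_FACTOR = 9       # approximate, as quoted for a 4090

# Codec tokens the model must produce for each second of audio.
tokens_per_audio_second = FRAME_RATE_HZ * N_CODEBOOKS

# Each token indexes one of 2048 entries, i.e. log2(2048) bits.
bits_per_token = math.log2(CODEBOOK_SIZE)
codec_bitrate_bps = tokens_per_audio_second * bits_per_token

# Audio samples covered by a single codec frame.
samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_HZ

# Frames generated per wall-clock second at the quoted realtime factor.
frames_per_wall_second = FRAME_RATE_HZ * REALTIME_FACTOR
```

So each second of speech is 200 codec tokens at roughly 2.2 kbit/s, and every frame stands for 1920 output samples.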

Voices & voice cloning

The model can clone arbitrary voices: upload a sample in the demo UI (or drop a .wav into prompts/) and it will follow that speaker. Cloned voices won't sound as good as the four fine-tuned voices (maeve, abraham, rhian, harry) shipped in prompts/. The released checkpoint has been fine-tuned on those four, so they're the highest-quality output the model can produce. Arbitrary clones work, but expect lower naturalness, more drift, and some bias toward the fine-tuned speakers' prosody.

For best results, the voice-prompt transcript must match the audio word-for-word; aim for 30 seconds or more of clean source audio (the context window is 6 minutes), and remember: garbage in, garbage out. Full guide on voice prompts, supported tags ([breath], [laugh], [sigh] …), punctuation rules, and phonetic spelling for numbers/dates/units: docs/prompting.md.
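
A quick pre-flight check on clip length can catch short prompts before you try to clone from them. The helper below is a hypothetical sketch using only Python's standard wave module, so it handles plain PCM WAV files only; the 30-second threshold is the guideline from the paragraph above, not a limit enforced by Vui.

```python
import os
import tempfile
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_prompt_length(path, min_seconds=30.0):
    """Return (ok, duration), where ok is True if the clip meets the suggested minimum."""
    duration = wav_duration_seconds(path)
    return duration >= min_seconds, duration


# Demo: write two seconds of 16-bit mono silence at 24 kHz, then check it.
demo_path = os.path.join(tempfile.mkdtemp(), "silence.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(24_000)
    wav.writeframes(b"\x00\x00" * 24_000 * 2)

ok, duration = check_prompt_length(demo_path)
```

The two-second demo clip fails the check, as intended. Duration says nothing about how clean the audio is, so this complements listening to the clip rather than replacing it.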

If you need a checkpoint tuned to a specific voice for a legitimate use case (audiobooks, accessibility, game characters, dubbing of consenting performers, internal tooling), get in touch via fluxions.ai. We can train, license, or host one for you.

Hardware

Streaming server and demo.py both run on either:

  • NVIDIA GPU + Linux: 12 GB VRAM for the full stack (TTS + ASR + Ollama LLM; tested on 4090 / H100), dropping to 8 GB if you switch to a moonshine.* (CPU) ASR backend. Requires CUDA 12.x and flash-attn.
  • Apple Silicon Mac: M1/M2/M3/M4, MLX backend (auto-detected, no flash-attn required).

The full breakdown (measured per-component VRAM, ASR latency/VRAM per backend, KV-cache math, and tuning levers) is in docs/memory-budget.md.

Tip: drop n_codebooks for faster TTS on smaller GPUs. The RQ-Transformer head decodes 16 RVQ codebook levels per audio frame by default. Dropping the Codebooks slider in the UI (or n_codebooks in DEFAULT_SETTINGS) to ~10 gives noticeably faster decode and lower VRAM at the cost of some stability: occasional artefacts, more sensitivity to hard prompts. Below 8, quality drops sharply. 0 means "use all 16".
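
Why dropping levels trades quality for speed is easiest to see in a toy residual quantizer. The sketch below is a scalar illustration of the RVQ idea, not the Qwen codec or Vui's decoder: every function here is hypothetical, and real codebooks are learned vectors rather than fixed grids. The one rule taken from the tip is that a setting of 0 means all 16 levels.

```python
TOTAL_CODEBOOKS = 16


def effective_codebooks(n_codebooks):
    """Resolve the setting: 0 means 'use all 16', otherwise clamp to 1..16."""
    if n_codebooks == 0:
        return TOTAL_CODEBOOKS
    return max(1, min(n_codebooks, TOTAL_CODEBOOKS))


def rvq_encode(value, levels):
    """Toy scalar RVQ: each level rounds the remaining residual on a grid half as coarse as the last."""
    codes, residual, step = [], value, 1.0
    for _ in range(levels):
        code = round(residual / step)
        codes.append(code)
        residual -= code * step
        step /= 2
    return codes


def rvq_decode(codes, n_codebooks=0):
    """Sum the contributions of the first n levels; later levels only refine the result."""
    used = min(effective_codebooks(n_codebooks), len(codes))
    return sum(code * (0.5 ** level) for level, code in enumerate(codes[:used]))


codes = rvq_encode(0.7, TOTAL_CODEBOOKS)
coarse = rvq_decode(codes, n_codebooks=10)   # fewer levels: cheaper, less precise
full = rvq_decode(codes, n_codebooks=0)      # 0 resolves to all 16 levels

# Rough share of per-frame head steps kept when decoding 10 of 16 levels.
head_step_ratio = effective_codebooks(10) / TOTAL_CODEBOOKS
```

Each extra level shrinks the worst-case reconstruction error, which is why the early levels carry most of the signal and the later ones mostly add detail. The step ratio is only a rough proxy for the head's work per frame, not a measured speedup.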

Responsible use

Vui generates speech that can sound convincingly human. By using this model (directly, through the streaming server, or through the realtime API), you agree to the following:

We explicitly prohibit:

  • Fraud: generating speech to deceive others for financial gain or to obtain something you would not otherwise be entitled to (scam calls, voice-auth bypass, etc.).
  • Misinformation or deception: fake news, fraudulent calls, deepfakes intended to mislead, synthetic media presented as authentic recordings of real people.
  • Harassment, defamation, or abuse: generating speech that targets, threatens, or harms others, including non-consensual sexual content.
  • Illegal activity: anything unlawful in the jurisdiction where the model is run or its output is distributed.

You are responsible for what you generate. The released checkpoint is fine-tuned to a curated voice set in part to make these misuses harder, but it is not a substitute for your own judgment. If you build a product on top of Vui, build in consent flows, content provenance (e.g. C2PA), and abuse reporting.

We are not responsible for misuse, and we strongly condemn unethical applications of this technology.

Attributions

License

Apache 2.0, which applies to the code in the GitHub repo and the released model weights. The Qwen3-TTS-Tokenizer-12Hz codec and Qwen3-TTS-12Hz-0.6B-Base speaker encoder are © Alibaba and licensed under the terms in their respective Hugging Face repos.

Citation

@software{vui_2026,
  author  = {Coultas Blum, Harry},
  title   = {Vui: Streaming Conversational Text-to-Speech},
  url     = {https://github.com/fluxions-ai/vui},
  version = {1.0.0},
  year    = {2026}
}