TTS Comparison — Local AI Voice Models

English — "The 6th Day of December, Anno 2025"

884 words. Mock-archaic English about a day in Ayutthaya: bicycles, ruins, dogs, food.

Speed scale (RTF = generation time ÷ audio duration): ≤0.1 RTF = 10×+ realtime · ≤1.0 RTF = realtime or better · 1-2 RTF = slower than realtime · 2+ RTF = much slower
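The speed scale can be reproduced from the "Gen time" and "Audio" figures on each card. The following is a minimal sketch, assuming RTF means generation time divided by audio duration; the function names are illustrative and do not come from any TTS library.

```python
def rtf(gen_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: seconds of compute per second of audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return gen_seconds / audio_seconds


def realtime_multiple(rtf_value: float) -> float:
    """How many seconds of audio are produced per second of compute."""
    if rtf_value <= 0:
        raise ValueError("rtf_value must be positive")
    return 1.0 / rtf_value


def speed_band(rtf_value: float) -> str:
    """Map an RTF value onto the four bands of the speed scale."""
    if rtf_value <= 0.1:
        return "10x+ realtime"
    if rtf_value <= 1.0:
        return "realtime or better"
    if rtf_value <= 2.0:
        return "slower than realtime"
    return "much slower"
```

For example, `speed_band(rtf(4.7, 252))` places the Piper Northern card in the top band, while `speed_band(rtf(106, 62))` places the Higgs card in the "slower than realtime" band.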

The audio (8 voices)

Piper — Northern English Male (FASTEST)

Beast (gfx1151)

Engine: Piper en_GB-northern (ONNX medium) · Voice: Preset, Northern English male · Gen time: 4.7s · Audio: 252s · Speed: 53× realtime

VITS architecture. Older but still solid. Northern accent variant.

Piper — Alan

Beast (gfx1151)

Engine: Piper en_GB-alan (ONNX medium) · Voice: Preset, British male · Gen time: 5.5s · Audio: 310s · Speed: 56× realtime

Standard British male voice. Fast and intelligible. Sounds slightly mechanical.

Kokoro — bm_george

Beast (gfx1151)

Engine: Kokoro-82M (ONNX) · Voice: Preset, British male · Gen time: 31s · Audio: 300s · Speed: 10× realtime

Oliver's verdict: "quite good." Default British male — clear and warm.

Kokoro — bm_lewis

Beast (gfx1151)

Engine: Kokoro-82M (ONNX) · Voice: Preset, British male alt · Gen time: 31s · Audio: 307s · Speed: 10× realtime

Alternate British male voice. Same Kokoro engine.

Kokoro — bf_emma

Beast (gfx1151)

Engine: Kokoro-82M (ONNX) · Voice: Preset, British female · Gen time: 27s · Audio: 275s · Speed: 10× realtime

Female variant. Same Kokoro engine.

Qwen3-TTS — Stephen Fry (PAUL'S PICK)

Mac M4 MLX

Engine: Qwen3-TTS-12Hz-1.7B (8-bit MLX) · Voice: Cloned from 25s of Penguin Audiobooks Mythos sample · Gen time: 187s · Audio: 290s · Speed: ~1.5× realtime (RTF ≈ 0.64)

A studio-quality reference clip yields clean output. Voice cloning is Qwen's standout feature: it matches a speaker from a short reference clip.

Qwen3-TTS — Humphries

Mac M4 MLX

Engine: Qwen3-TTS-12Hz-1.7B (8-bit MLX) · Voice: Cloned, posh British male · Gen time: 179s · Audio: 276s · Speed: ~1.5× realtime (RTF ≈ 0.65)

Oliver: "sounds like in a hall or room." Reference clip had studio ambience that got cloned along with the voice.

Higgs Audio v2 — default

Mac CPU (GGUF Q6_K)

Engine: Higgs Audio v2 GGUF Q6_K (BosonAI) · Voice: Higgs default (no clone in this test) · Gen time: 106s · Audio: 62s · Speed: 1.7× slower than realtime

Mood steered via scene-description per chunk. CPU-only path on Mac.


German — "Der 9. Tag des Dezember, Anno 2025"

~400 words. Mock-archaic German about a day in Khao Yai: bicycles, ruins, swimming pool, German Christmas music in tropical heat.

Native vs cloned: the German Piper voices are trained on recordings of native German speakers, so there is no English accent. Qwen-cloned voices carry the original speaker's accent: the cloned Stephen Fry speaks German with an English accent (Alberto flagged this).

Native German voices (recommended)

Qwen3-TTS — Rufus Beck (NEW · NATIVE NARRATOR)

Mac M4 MLX

Engine: Qwen3-TTS-12Hz-1.7B (8-bit MLX) · Voice: Cloned Rufus Beck, German Harry Potter audiobook narrator (Random House Audio, official sample) · Gen time: 74s · Audio: ~125s · Speed: ~1.7× realtime (RTF ≈ 0.59)

Native German via voice cloning. Rufus Beck is the German equivalent of Stephen Fry — narrated all 7 Harry Potter books in German + many other audiobooks. Clean studio source. Compare to Piper Thorsten below for ONNX vs cloned-narrator quality.

Piper — Thorsten (GOLD STANDARD)

Beast (gfx1151)

Engine: Piper de_DE-thorsten (ONNX medium) · Voice: Native German male (Thorsten Müller, the famous thorsten-voice dataset) · Gen time: 2.7s · Audio: 109s · Speed: 40× realtime

The reference quality for free German TTS. Native speaker, studio recording. No accent.

Piper — Ramona

Beast (gfx1151)

Engine: Piper de_DE-ramona (ONNX low) · Voice: Native German female · Gen time: 2.3s · Audio: 127s · Speed: 55× realtime

Friendly female native German. Quality slightly below Thorsten ("low" variant).

Piper — Eva K

Beast (gfx1151)

Engine: Piper de_DE-eva_k (ONNX x-low) · Voice: Native German female alt · Gen time: 2.3s · Audio: 143s · Speed: 62× realtime

Smallest model size. Tinny relative to Thorsten but native German.

Cloned English voices speaking German (English accent expected)

Qwen3-TTS — Stephen Fry

Mac M4 MLX

Engine: Qwen3-TTS-12Hz-1.7B (8-bit MLX) · Voice: Cloned Stephen Fry speaking German · Gen time: 73s · Audio: 126s · Speed: ~1.7× realtime (RTF ≈ 0.58)

An English speaker reading German produces an English accent. Novelty value; not the right choice for actual German content. Native German via Qwen needs a German reference clip, as with the Rufus Beck clone above.

Qwen3-TTS — Humphries

Mac M4 MLX

Engine: Qwen3-TTS-12Hz-1.7B (8-bit MLX) · Voice: Cloned "humphries" (British) speaking German · Gen time: ~70s · Audio: ~140s · Speed: ~2× realtime (RTF ≈ 0.5)

Same issue — British speaker reading German with British inflection.


Emotion Tags — ElevenLabs-style inline expression

Same Dec 6 English text re-rendered with inline emotion tags. Tag format differs per engine. Listen for chuckles, sighs, breathless urgency at the marked points.

Tag formats: Chatterbox uses square brackets [chuckle] · Orpheus uses angle brackets <chuckle> · Higgs uses chunk-level scene descriptions (no inline tags)
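Because the tag syntax differs per engine, one tagged script has to be rewritten for each target. The sketch below assumes the square-bracket script is the master copy and uses only the Orpheus tag names listed on this page; `to_orpheus` and `strip_tags` are hypothetical helpers, not part of any engine's API.

```python
import re

# Angle-bracket tags listed for Orpheus in this comparison.
ORPHEUS_TAGS = {"laugh", "chuckle", "sigh", "cough",
                "sniffle", "groan", "yawn", "gasp"}

# A bracket tag plus any whitespace directly in front of it.
TAG_PATTERN = re.compile(r"\s*\[([a-z]+)\]")


def to_orpheus(text: str) -> str:
    """Rewrite [tag] as <tag>; drop tags that Orpheus does not list."""
    def swap(match: re.Match) -> str:
        tag = match.group(1)
        return f" <{tag}>" if tag in ORPHEUS_TAGS else ""
    return TAG_PATTERN.sub(swap, text).strip()


def strip_tags(text: str) -> str:
    """Remove every inline tag, e.g. for an engine steered per chunk."""
    return TAG_PATTERN.sub("", text).strip()
```

Tags with no Orpheus equivalent, such as `[whispers]` or `[breathless]`, are dropped here rather than mapped to a near match. That is a design choice of this sketch, not documented engine behaviour.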

Chatterbox-Turbo — Stephen Fry (REALTIME)

Mac M4 MPS

Engine: Chatterbox-Turbo (Resemble AI, MIT) · Voice: Cloned Stephen Fry · Tags: [chuckle] [sigh] [whispers] [breathless] [satisfied] · Gen time: 115s · Audio: 120s · Speed: RTF 0.96 (~realtime)

Native bracket-tag support like ElevenLabs v3. Combines voice cloning + emotion. Beat ElevenLabs in blind tests (65.3% vs 24.5%) per Resemble's benchmarks.

Orpheus 3B — dan

Mac CPU Metal

Engine: Orpheus 3B (Canopy Labs, Apache-2.0) via orpheus-cpp · Voice: Preset "dan" (male English, no cloning) · Tags: <chuckle> <sigh> <gasp> · Gen time: 124s · Audio: 71s · Speed: 1.75× slower than realtime

3B-parameter Llama-based architecture. Supports inline emotion tags natively: <laugh> <chuckle> <sigh> <cough> <sniffle> <groan> <yawn> <gasp>. 8 preset voices (no cloning).

Higgs Audio v2 — scene-desc

Mac CPU (GGUF Q6_K)

Engine: Higgs Audio v2 GGUF Q6_K · Voice: Higgs default · Tags: Scene-desc per chunk (no inline tags) · Gen time: 106s · Audio: 62s · Speed: 1.7× slower than realtime

No inline brackets — instead the system prompt sets the mood per text chunk. E.g. chunk 1 system prompt: "narrator with excited energetic tone"; chunk 4: "tension and breathless urgency." Listen for whether mood shifts between segments.

Where the emotion tags landed in the text

Paragraph 1 — Anticipation

[excited] This day did I awake with great anticipation, for our kind hosts at the lodging house had made available unto us two wheeled velocipedes, that we might explore these ancient lands with greater liberty. Diana and I did first repair to the establishment known as Busaba Café, wherein we partook of most delectable waffles adorned with ices of coconut and matcha [chuckle], the latter being a verdant preparation of the Orient that doth refresh the palate most wonderfully.

Paragraph 2 — Heat and oppression

We did venture forth upon our bicycles to survey two magnificent ruins of antiquity, the first containing a most wondrous sight — the head of Buddha himself, grown about and embraced by the roots of a great tree, as if Nature herself did worship at his feet. The heat upon this day was most oppressive, the sun beating down with such fury that the very air registered some thirty-two degrees [sigh], and the perspiration did flow freely from our brows as rivers flow to the sea.

Paragraph 3 — Tender moment

[whispers] Before this mystical threshold sat a small child, a girl of tender years, who did endeavour to sell little fishes of her own crafting; I did press twenty baht into her small hand, though I purchased naught of her wares, for charity is a virtue most pleasing to Providence.

Paragraph 4 — Wild dog chase

Yet this day was not without its terrors, for I myself was pursued by a wild dog for no less than fifty metres along the road [breathless], the animal barking most ferociously at my heels, and this experience did quite diminish our enthusiasm for further cycling adventures.

Paragraph 5 — Resolution

[satisfied] At a halal establishment we did feast most excellently upon seafood and shrimps prepared in a sauce of medium spiciness, and thus we concluded this day of adventures, misadventures, ancient wonders, and the ever-present kindness of strangers, returning to our rest with grateful hearts and the promise of new discoveries upon the morrow [chuckle].

Note: Chatterbox + Orpheus interpret these brackets/angle-brackets natively. Higgs uses chunk-level mood descriptions instead — paragraph 1 prompt was "excited energetic," paragraph 4 was "tension and breathless urgency," paragraph 5 was "warm satisfaction."
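The chunk-level approach described for Higgs can be pictured as pairing each paragraph with its own mood description. This is a rough sketch of that data preparation only: it calls no Higgs API, the `Chunk` type and `build_chunks` helper are hypothetical, and the default mood for paragraphs 2 and 3 is a placeholder because their prompts are not given on this page.

```python
import re
from dataclasses import dataclass
from typing import Dict, List

INLINE_TAG = re.compile(r"\s*\[[a-z]+\]")


@dataclass
class Chunk:
    scene: str  # mood description intended for the chunk's system prompt
    text: str   # paragraph text with inline bracket tags removed


def build_chunks(paragraphs: List[str], moods: Dict[int, str],
                 default_mood: str = "neutral narrator") -> List[Chunk]:
    """Pair each paragraph (1-based index) with a scene description."""
    chunks = []
    for index, paragraph in enumerate(paragraphs, start=1):
        scene = moods.get(index, default_mood)  # placeholder if unknown
        clean = INLINE_TAG.sub("", paragraph).strip()
        chunks.append(Chunk(scene=scene, text=clean))
    return chunks


# Moods quoted in the note above; paragraphs 2 and 3 are not specified.
MOODS = {
    1: "excited energetic",
    4: "tension and breathless urgency",
    5: "warm satisfaction",
}
```

Each resulting `Chunk` would then be rendered separately, with `scene` supplied as the system prompt and `text` as the content to speak.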

Video Avatar — LongCat-Video-Avatar on H100

Talking-head video generation from a still photo + voice clip. 13.6B-parameter model from MeiGen-AI / Meituan. Runs on cloud H100 via Modal (cannot run locally — no GPU on Mac/Beast can handle it).

Inputs

Reference photo (Wikipedia / Berlinale 2024, 2192×2908 resized to 512px)

Driving audio (first 10s of Qwen3-TTS Stephen Fry rendering of Dec 6 text)

Prompt: "A distinguished British gentleman speaking with warmth and enthusiasm"

Output v1.0 — 5.8s (single segment)

Resolution: 544×736 @ 16 fps · Duration: 5.8s · Model: LongCat-Video-Avatar v1.0 (fixed-length, ~5s per run) · Cost: ~$0.40

v1.5 chained — 16.5s talking head

Resolution: 544×736 @ 25 fps · Duration: 16.5s (~2.8× the v1.0 length) · Model: LongCat-Video-Avatar v1.5 + 5 chained segments + distill mode · Audio input: 30s of Qwen3-TTS Stephen Fry rendering of Dec 6 text · Wall clock: ~18 min (5 segments serial) · Cost: ~$1.20

Each "segment" produces ~3.7s of video, using the last few frames of the previous segment as conditioning. This chaining is the "Cross-Chunk Latent Stitching" feature LongCat introduced. Quality drifts slowly across segments: the face stays consistent, but micro-expressions diverge from the source photo.
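The chaining idea can be illustrated with a toy loop. This is only a sketch of the control flow as described here, not LongCat's implementation: `generate_segment` is a hypothetical callable standing in for one model run, frames are represented as strings, and the number of conditioning frames (`tail`) is an assumed parameter.

```python
from typing import Callable, List, Sequence

Frame = str  # stand-in for an image frame; a real pipeline would hold tensors


def chain_segments(
    generate_segment: Callable[[Sequence[Frame], str], List[Frame]],
    reference_frame: Frame,
    audio_chunks: Sequence[str],
    tail: int = 3,
) -> List[Frame]:
    """Run segments serially, conditioning each on the previous one's tail."""
    if tail < 1:
        raise ValueError("tail must be at least 1")
    video: List[Frame] = []
    conditioning: List[Frame] = [reference_frame]  # first run sees the photo
    for chunk in audio_chunks:
        segment = generate_segment(conditioning, chunk)
        if not segment:
            raise ValueError("generator returned no frames")
        video.extend(segment)
        conditioning = segment[-tail:]  # last few frames feed the next run
    return video


def fake_generator(conditioning: Sequence[Frame], audio_chunk: str) -> List[Frame]:
    """Hypothetical stub: returns four labelled frames per audio chunk."""
    return [f"{audio_chunk}-frame{i}" for i in range(4)]
```

Because every segment after the first sees only frames the model itself produced, small deviations can accumulate, which is one plausible reading of the drift noted above. The serial dependency is also why the segments cannot run in parallel.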

What's actually happening

LongCat-Video-Avatar takes a single still image of a face, an audio clip of speech, and a text prompt describing the scene/mood. The 13.6B-parameter diffusion model produces a video where the still face is animated to lip-sync the audio with believable facial expressions, head movement, and micro-gestures.

Why this needs cloud GPU

Beast can't run it: model uses NVIDIA flash-attention-2 kernels (no ROCm port) AND block-sparse attention via Triton (CUDA-only). PyTorch+ROCm is also broken on the AMD gfx1151 chip.
Mac can't run it either: 13.6B params + custom CUDA Triton kernels = no realistic path to Metal/MPS without weeks of porting work.
Cloud H100: per-second billing; the container sleeps when idle at no cost. The $30/month free Modal credit covers ~60-100 single-segment generations (about 75 at ~$0.40 each), or about 25 chained v1.5 runs at ~$1.20 each.

Next experiments

Longer audio (whole Dec 6 passage → ~5 min video) — needs num_segments setting for video continuation
Different starting images (Alberto, Diana, the Rufus Beck-cloned German voice in a German face)
720p instead of 480p (longer gen, ~$1 per run)
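A back-of-envelope estimate for the full-passage experiment can be derived from the v1.5 run reported above (5 segments, 16.5s of video, ~18 min wall clock, ~$1.20). The sketch assumes cost and time scale linearly with segment count, which has not been verified; `estimate_run` is a hypothetical helper.

```python
import math

# Figures from the v1.5 chained run reported on this page.
BASE_SEGMENTS = 5
BASE_VIDEO_SECONDS = 16.5
BASE_COST_USD = 1.20
BASE_WALL_MINUTES = 18.0


def estimate_run(audio_seconds: float) -> dict:
    """Linear extrapolation of segments, cost and wall clock (assumption)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    segments = math.ceil(audio_seconds * BASE_SEGMENTS / BASE_VIDEO_SECONDS)
    return {
        "segments": segments,
        "cost_usd": round(segments * BASE_COST_USD / BASE_SEGMENTS, 2),
        "wall_minutes": round(segments * BASE_WALL_MINUTES / BASE_SEGMENTS, 1),
    }
```

Under these assumptions, `estimate_run(290)` for the 290s Stephen Fry rendering gives 88 segments, so a whole-passage video would take several hours of serial generation and most of the monthly free credit.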

About this comparison

Hardware

Mac M4 (Apple Silicon) — MLX framework for Qwen3-TTS, MPS for Chatterbox, Metal for Orpheus, CPU for Higgs.

Beast (Beelink GTR9 Pro, AMD Ryzen AI MAX+ 395, Radeon 8060S iGPU gfx1151, 128GB unified RAM, Ubuntu 24.04) — ONNX runtime on CPU for Kokoro and Piper. PyTorch+ROCm broken on this AMD chip, so no GPU-accelerated PyTorch TTS on Beast.

Engines tested

Kokoro 82M (ONNX, MIT) — fast, multilingual, preset voices only
Piper (ONNX, MIT) — fastest, native voice library across 30+ languages
Qwen3-TTS 1.7B 8-bit MLX (Apache-2.0) — voice cloning from short reference clip, multilingual
Chatterbox-Turbo (Resemble AI, MIT) — voice cloning + bracket emotion tags
Orpheus 3B (Canopy Labs, Apache-2.0) — 8 preset voices + angle-bracket emotion tags
Higgs Audio v2 GGUF (BosonAI, Apache-2.0) — multilingual + scene-description emotion

Source texts

Dec 6 (English): 884 words of mock-archaic English about Ayutthaya, Thailand — bicycles, ancient Buddha statue in a tree, wild dog chase, Pad Thai.

Dec 9 (German): ~400 words of mock-archaic German about Khao Yai — pool, ruins, Christmas music in tropical heat.

Pattern observations

Reference quality dominates Qwen3-TTS output. Studio audiobook samples (Penguin/Audible) give clean output. With broadcast TV clips, the laugh track and room ambience get cloned along with the voice.
For cloned voices speaking a non-native language: the source speaker's accent transfers. Stephen Fry speaks German with an English accent. For native-sounding output, clone from a native speaker of the target language.
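ONNX models on CPU (Kokoro/Piper) are far faster than Qwen3-TTS on Mac MLX, even though the Mac has GPU acceleration: they run at roughly 10-60× realtime on Beast's CPU. Much smaller models (82M parameters for Kokoro vs 1.7B for Qwen3-TTS) plus better optimisation win.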
Emotion tags genuinely work in Chatterbox and Orpheus — chuckles, sighs, gasps appear at marked points. Higgs's chunk-level scene-desc approach is more abstract but also produces tonal variation.

Recently added

Rufus Beck (German Harry Potter narrator) as a native-German Qwen clone, now in the German section above
LongCat-Video-Avatar (Modal H100), talking-head video generation from a still image + audio clip, now in the Video Avatar section above

Built for Oliver to evaluate. Page auto-deployed via Cloudflare Pages.