TTS Comparison: local AI voice models, Mac M4 + AMD Strix Halo

The same archaic-English passage read by 8 voices across 4 TTS engines. The same German passage read by 6 voices. The same English text with inline emotion tags rendered by 3 emotion-capable engines.

English: "The 6th Day of December, Anno 2025"

884 words. Mock-archaic English about a day in Ayutthaya: bicycles, ruins, dogs, food.

Speed scale (RTF = real-time factor): ≤0.1 RTF = 10×+ realtime · ≤1.0 RTF = realtime or better · 1-2 RTF = slower than realtime · 2+ RTF = much slower
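A minimal sketch of how the speed figures on this page relate to the scale, assuming RTF is generation time divided by audio duration. The function names are illustrative, and assigning an RTF of exactly 2.0 to the "slower than realtime" band is an assumption, since the scale leaves that boundary open.

```python
def rtf(gen_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: seconds of compute per second of audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return gen_seconds / audio_seconds


def speed_bucket(rtf_value: float) -> str:
    """Map an RTF onto the four bands of the speed scale."""
    if rtf_value <= 0.1:
        return "10x+ realtime"
    if rtf_value <= 1.0:
        return "realtime or better"
    if rtf_value <= 2.0:  # assumption: exactly 2.0 counts as the slower band
        return "slower than realtime"
    return "much slower"


# (label, generation seconds, audio seconds) as reported on this page
samples = [
    ("Piper en_GB-northern", 4.7, 252),
    ("Chatterbox-Turbo", 115, 120),
    ("Higgs Audio v2", 106, 62),
]

for label, gen_s, audio_s in samples:
    value = rtf(gen_s, audio_s)
    print(f"{label}: RTF {value:.3f} ({speed_bucket(value)})")
```

The "N× realtime" figures in the cards are the inverse of RTF, so an RTF below 1.0 means the engine generates audio faster than it plays back.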

The audio (8 voices)

Piper: Northern English Male (FASTEST)
Beast (gfx1151)
Engine: Piper en_GB-northern (ONNX medium) · Voice: Preset, Northern English male · Gen time: 4.7s · Audio: 252s · Speed: 53× realtime
VITS architecture. Older but still solid. Northern accent variant.
Piper: Alan
Beast (gfx1151)
Engine: Piper en_GB-alan (ONNX medium) · Voice: Preset, British male · Gen time: 5.5s · Audio: 310s · Speed: 56× realtime
Standard British male voice. Fast and intelligible. Sounds slightly mechanical.
Kokoro: bm_george
Beast (gfx1151)
Engine: Kokoro-82M (ONNX) · Voice: Preset, British male · Gen time: 31s · Audio: 300s · Speed: 10× realtime
Oliver's verdict: "quite good." Default British male, clear and warm.
Kokoro: bm_lewis
Beast (gfx1151)
Engine: Kokoro-82M (ONNX) · Voice: Preset, British male alt · Gen time: 31s · Audio: 307s · Speed: 10× realtime
Alternate British male voice. Same Kokoro engine.
Kokoro: bf_emma
Beast (gfx1151)
Engine: Kokoro-82M (ONNX) · Voice: Preset, British female · Gen time: 27s · Audio: 275s · Speed: 10× realtime
Female variant. Same Kokoro engine.
Qwen3-TTS: Stephen Fry (PAUL'S PICK)
Mac M4 MLX
Engine: Qwen3-TTS-12Hz-1.7B (8-bit MLX) · Voice: Cloned from 25s of the Penguin Audiobooks Mythos sample · Gen time: 187s · Audio: 290s · Speed: 1.5× realtime
A studio-quality reference clip yields clean output. Voice cloning is Qwen's killer feature: it matches any speaker from a short reference clip.
Qwen3-TTS: Humphries
Mac M4 MLX
Engine: Qwen3-TTS-12Hz-1.7B (8-bit MLX) · Voice: Cloned, posh British male · Gen time: 179s · Audio: 276s · Speed: 1.5× realtime
Oliver: "sounds like in a hall or room." The reference clip had studio ambience that was cloned along with the voice.
Higgs Audio v2: default
Mac CPU (GGUF Q6_K)
Engine: Higgs Audio v2 GGUF Q6_K (BosonAI) · Voice: Higgs default (no clone in this test) · Gen time: 106s · Audio: 62s · Speed: 1.7× slower than realtime
Mood is steered by a per-chunk scene description. CPU-only path on the Mac.

German: "Der 9. Tag des Dezember, Anno 2025"

~400 words. Mock-archaic German about a day in Khao Yai: bicycles, ruins, swimming pool, German Christmas music in tropical heat.

Native vs cloned: Piper voices are built from recordings of native German speakers, so they carry no English accent. Qwen-cloned voices carry the original speaker's accent (the Stephen Fry clone speaks German with an English accent; Alberto flagged this).

Native German voices (recommended)

Qwen3-TTS: Rufus Beck (NEW · NATIVE NARRATOR)
Mac M4 MLX
Engine: Qwen3-TTS-12Hz-1.7B (8-bit MLX) · Voice: Cloned Rufus Beck, German Harry Potter audiobook narrator (Random House Audio, official sample) · Gen time: 74s · Audio: ~125s · Speed: ~1.7× realtime
Native German via voice cloning. Rufus Beck is the German equivalent of Stephen Fry: he narrated all 7 Harry Potter books in German, plus many other audiobooks. Clean studio source. Compare with Piper Thorsten below for ONNX vs cloned-narrator quality.
Piper: Thorsten (GOLD STANDARD)
Beast (gfx1151)
Engine: Piper de_DE-thorsten (ONNX medium) · Voice: Native German male (Thorsten Müller, of the well-known thorsten-voice dataset) · Gen time: 2.7s · Audio: 109s · Speed: 40× realtime
The reference quality for free German TTS. Native speaker, studio recording. No accent.
Piper: Ramona
Beast (gfx1151)
Engine: Piper de_DE-ramona (ONNX low) · Voice: Native German female · Gen time: 2.3s · Audio: 127s · Speed: 55× realtime
Friendly female native German. Quality slightly below Thorsten ("low" variant).
Piper: Eva K
Beast (gfx1151)
Engine: Piper de_DE-eva_k (ONNX x-low) · Voice: Native German female alt · Gen time: 2.3s · Audio: 143s · Speed: 62× realtime
Smallest model size. Tinny relative to Thorsten but native German.

Cloned English voices speaking German (English accent expected)

Qwen3-TTS: Stephen Fry
Mac M4 MLX
Engine: Qwen3-TTS-12Hz-1.7B (8-bit MLX) · Voice: Cloned Stephen Fry speaking German · Gen time: 73s · Audio: 126s · Speed: 1.7× realtime
An English speaker speaking German produces an English accent. Novelty value; not the right choice for actual German content. Native German via Qwen needs a German reference clip, such as the Rufus Beck clone above.
Qwen3-TTS: Humphries
Mac M4 MLX
Engine: Qwen3-TTS-12Hz-1.7B (8-bit MLX) · Voice: Cloned "humphries" (British) speaking German · Gen time: ~70s · Audio: ~140s · Speed: ~2× realtime
Same issue: a British speaker reading German with British inflection.

Emotion Tags: ElevenLabs-style inline expression

Same Dec 6 English text re-rendered with inline emotion tags. Tag format differs per engine. Listen for chuckles, sighs, breathless urgency at the marked points.

Tag formats: Chatterbox uses square brackets [chuckle] · Orpheus uses angle brackets <chuckle> · Higgs uses chunk-level scene descriptions (no inline tags)
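A rough sketch of how one bracket-tagged script could be adapted for Orpheus, assuming the script is authored in the Chatterbox square-bracket format. The tag set is the one listed in the Orpheus notes on this page; dropping unsupported tags (rather than remapping them) is an assumption, and the function names are illustrative rather than part of either engine's API.

```python
import re

# Tags Orpheus is described as supporting natively on this page.
ORPHEUS_TAGS = {"laugh", "chuckle", "sigh", "cough", "sniffle", "groan", "yawn", "gasp"}

# Matches square-bracket tags such as [chuckle] or [whispers].
BRACKET_TAG = re.compile(r"\[([a-z]+)\]")


def tidy(text: str) -> str:
    """Collapse runs of whitespace and remove spaces left before punctuation."""
    text = re.sub(r"\s{2,}", " ", text)
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)
    return text.strip()


def to_orpheus(text: str) -> str:
    """Rewrite [tag] as <tag>; drop tags outside the Orpheus set (assumption)."""
    def swap(match: re.Match) -> str:
        name = match.group(1)
        return f"<{name}>" if name in ORPHEUS_TAGS else ""
    return tidy(BRACKET_TAG.sub(swap, text))


line = "[whispers] Before this threshold sat a small child [sigh], a girl of tender years [chuckle]."
print(to_orpheus(line))
```

Chatterbox would receive the script unchanged. Note that `tidy` collapses all whitespace runs, so it is meant to be applied one paragraph at a time.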
Chatterbox-Turbo: Stephen Fry (REALTIME)
Mac M4 MPS
Engine: Chatterbox-Turbo (Resemble AI, MIT) · Voice: Cloned Stephen Fry · Tags: [chuckle] [sigh] [whispers] [breathless] [satisfied] · Gen time: 115s · Audio: 120s · Speed: 0.96 RTF (~realtime)
Native bracket-tag support, like ElevenLabs v3. Combines voice cloning with emotion tags. Preferred over ElevenLabs in blind tests (65.3% vs 24.5%), according to Resemble AI's own benchmarks.
Orpheus 3B: dan
Mac M4 Metal
Engine: Orpheus 3B (Canopy Labs, Apache-2.0) via orpheus-cpp · Voice: Preset "dan" (male English, no cloning) · Tags: <chuckle> <sigh> <gasp> · Gen time: 124s · Audio: 71s · Speed: 1.75× slower than realtime
3B Llama-based architecture. Native emotion tag support via <laugh> <chuckle> <sigh> <cough> <sniffle> <groan> <yawn> <gasp>. 8 preset voices (no cloning).
Higgs Audio v2: scene-desc
Mac CPU (GGUF Q6_K)
Engine: Higgs Audio v2 GGUF Q6_K · Voice: Higgs default · Tags: Scene-desc per chunk (no inline tags) · Gen time: 106s · Audio: 62s · Speed: 1.7× slower than realtime
No inline brackets. Instead, the system prompt sets the mood for each text chunk, e.g. chunk 1 system prompt: "narrator with excited energetic tone"; chunk 4: "tension and breathless urgency." Listen for whether the mood shifts between segments.

Where the emotion tags landed in the text

Paragraph 1: Anticipation

[excited] This day did I awake with great anticipation, for our kind hosts at the lodging house had made available unto us two wheeled velocipedes, that we might explore these ancient lands with greater liberty. Diana and I did first repair to the establishment known as Busaba Café, wherein we partook of most delectable waffles adorned with ices of coconut and matcha [chuckle], the latter being a verdant preparation of the Orient that doth refresh the palate most wonderfully.

Paragraph 2: Heat and oppression

We did venture forth upon our bicycles to survey two magnificent ruins of antiquity, the first containing a most wondrous sight: the head of Buddha himself, grown about and embraced by the roots of a great tree, as if Nature herself did worship at his feet. The heat upon this day was most oppressive, the sun beating down with such fury that the very air registered some thirty-two degrees [sigh], and the perspiration did flow freely from our brows as rivers flow to the sea.

Paragraph 3: Tender moment

[whispers] Before this mystical threshold sat a small child, a girl of tender years, who did endeavour to sell little fishes of her own crafting; I did press twenty baht into her small hand, though I purchased naught of her wares, for charity is a virtue most pleasing to Providence.

Paragraph 4: Wild dog chase

Yet this day was not without its terrors, for I myself was pursued by a wild dog for no less than fifty metres along the road [breathless], the animal barking most ferociously at my heels, and this experience did quite diminish our enthusiasm for further cycling adventures.

Paragraph 5: Resolution

[satisfied] At a halal establishment we did feast most excellently upon seafood and shrimps prepared in a sauce of medium spiciness, and thus we concluded this day of adventures, misadventures, ancient wonders, and the ever-present kindness of strangers, returning to our rest with grateful hearts and the promise of new discoveries upon the morrow [chuckle].

Note: Chatterbox and Orpheus interpret these tags natively (square brackets and angle brackets respectively). Higgs uses chunk-level mood descriptions instead: the paragraph 1 prompt was "excited energetic," paragraph 4 was "tension and breathless urgency," and paragraph 5 was "warm satisfaction."
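The chunk-level approach can be sketched as follows: split the script into paragraphs, strip the inline tags, and pair each paragraph with a scene description. The three moods are the ones quoted in the note above; the fallback mood for undocumented paragraphs and the `Chunk` structure are hypothetical, and no actual Higgs call is shown.

```python
import re
from dataclasses import dataclass

# Mood prompts quoted on this page, keyed by paragraph number.
MOODS = {
    1: "excited energetic",
    4: "tension and breathless urgency",
    5: "warm satisfaction",
}
# Hypothetical fallback: the page does not state the prompts for paragraphs 2 and 3.
DEFAULT_MOOD = "calm narrator"


@dataclass
class Chunk:
    index: int  # 1-based paragraph number
    scene: str  # mood text intended for the per-chunk system prompt
    text: str   # paragraph with inline tags removed


def build_chunks(script: str) -> list[Chunk]:
    """Split on blank lines and attach one scene description per paragraph."""
    paragraphs = [p.strip() for p in script.split("\n\n") if p.strip()]
    chunks = []
    for i, para in enumerate(paragraphs, start=1):
        # Remove [tag] markers together with any whitespace in front of them.
        clean = re.sub(r"\s*\[[a-z]+\]", "", para).strip()
        chunks.append(Chunk(i, MOODS.get(i, DEFAULT_MOOD), clean))
    return chunks


script = "[excited] This day did I awake.\n\nThe heat was most oppressive [sigh], truly."
for chunk in build_chunks(script):
    print(chunk.index, repr(chunk.scene), chunk.text)
```

Each `Chunk` would then be rendered separately, with `scene` supplied as that chunk's system prompt.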

Video Avatar: LongCat-Video-Avatar on H100

Talking-head video generation from a still photo + voice clip. 13.6B-parameter model from MeiGen-AI / Meituan. Runs on a cloud H100 via Modal; it cannot run locally because neither the Mac nor Beast has a GPU that can handle it.

Inputs

Reference photo (Wikipedia / Berlinale 2024, 2192×2908 resized to 512px)

Driving audio (first 10s of Qwen3-TTS Stephen Fry rendering of Dec 6 text)

Prompt: "A distinguished British gentleman speaking with warmth and enthusiasm"

Output

Resolution: 544×736 @ 16 fps · Duration: 5.8s · GPU: NVIDIA H100 80GB via Modal serverless · Gen time: ~5 min wall clock (incl. cold start) · Cost: ~$0.30-0.50 per run
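As a back-of-envelope check on these figures, the sketch below derives the approximate frame count and the hourly rate implied by the per-run cost. It assumes the whole ~5 minute wall clock is billed, which may overstate billed time if cold start is charged differently; the function names are illustrative.

```python
def frame_count(duration_s: float, fps: float) -> int:
    """Approximate number of frames in a clip of the given length."""
    return round(duration_s * fps)


def implied_hourly_rate(cost_per_run: float, minutes_per_run: float) -> float:
    """Hourly rate implied by a per-run cost, assuming every minute is billed."""
    return cost_per_run * 60 / minutes_per_run


frames = frame_count(5.8, 16)
low = implied_hourly_rate(0.30, 5)
high = implied_hourly_rate(0.50, 5)
print(f"about {frames} frames, implied rate ${low:.2f}-${high:.2f} per hour")
```

These are estimates derived from the rounded numbers above, not measured billing data.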

What's actually happening

LongCat-Video-Avatar takes a single still image of a face, an audio clip of speech, and a text prompt describing the scene/mood. The 13.6B-parameter diffusion model produces a video where the still face is animated to lip-sync the audio with believable facial expressions, head movement, and micro-gestures.

Why this needs cloud GPU

Next experiments

About this comparison

Hardware

Mac M4 (Apple Silicon): MLX framework for Qwen3-TTS, MPS for Chatterbox, Metal for Orpheus, CPU for Higgs.

Beast (Beelink GTR9 Pro, AMD Ryzen AI MAX+ 395, Radeon 8060S iGPU gfx1151, 128GB unified RAM, Ubuntu 24.04): ONNX Runtime on CPU for Kokoro and Piper. PyTorch + ROCm is broken on this AMD chip, so there is no GPU-accelerated PyTorch TTS on Beast.

Engines tested

Source texts

Dec 6 (English): 884 words of mock-archaic English about Ayutthaya, Thailand (bicycles, ancient Buddha statue in a tree, wild dog chase, Pad Thai).

Dec 9 (German): ~400 words of mock-archaic German about Khao Yai (pool, ruins, Christmas music in tropical heat).

Pattern observations

Coming soon

Built for Oliver to evaluate. Page auto-deployed via Cloudflare Pages.