TTS Comparison: local AI voice models, Mac M4 + AMD Strix Halo
The same mock-archaic English passage read by 8 different voices across 4 TTS engines. Same German passage read by 7. The same English text with inline emotion tags rendered by 3 emotion-capable engines.
English: "The 6th Day of December, Anno 2025"
884 words. Mock-archaic English about a day in Ayutthaya: bicycles, ruins, dogs, food.
Speed scale:
≤0.1 RTF: 10×+ realtime ·
≤1.0 RTF: realtime or better ·
1-2 RTF: slower than realtime ·
2+ RTF: much slower
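The speed figures on each card follow from two measurements, generation time and audio duration. A minimal sketch of that arithmetic, assuming RTF means generation time divided by audio duration (the usual convention). The function names are illustrative and not taken from the test harness, and the handling of exactly 2.0 RTF is an assumption, since the scale leaves that boundary open.

```python
def rtf(gen_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: seconds of compute per second of audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return gen_seconds / audio_seconds


def speed_band(value: float) -> str:
    """Map an RTF value onto the four bands of the speed scale."""
    if value <= 0.1:
        return "10x+ realtime"
    if value <= 1.0:
        return "realtime or better"
    if value <= 2.0:  # assumption: exactly 2.0 still counts as "slower"
        return "slower than realtime"
    return "much slower"


if __name__ == "__main__":
    # Gen time and audio duration as listed on two of the cards below.
    samples = [("Piper en_GB-northern", 4.7, 252), ("Higgs Audio v2", 106, 62)]
    for name, gen, audio in samples:
        value = rtf(gen, audio)
        print(f"{name}: RTF {value:.3f} ({speed_band(value)})")
```

An RTF below 1.0 can be read as "1/RTF times realtime", which is how a 4.7s render of 252s of audio comes out at roughly 53× realtime.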
The audio (8 voices)
Piper: Northern English Male (FASTEST)
Beast (gfx1151)
Engine: Piper en_GB-northern (ONNX medium) · Voice: Preset, Northern English male · Gen time: 4.7s · Audio: 252s · Speed: 53× realtime
VITS architecture. Older but still solid. Northern accent variant.
Piper: Alan
Beast (gfx1151)
Engine: Piper en_GB-alan (ONNX medium) · Voice: Preset, British male · Gen time: 5.5s · Audio: 310s · Speed: 56× realtime
Standard British male voice. Fast and intelligible. Sounds slightly mechanical.
Kokoro: bm_george
Beast (gfx1151)
Engine: Kokoro-82M (ONNX) · Voice: Preset, British male · Gen time: 31s · Audio: 300s · Speed: 10× realtime
Oliver's verdict: "quite good." Default British male: clear and warm.
Kokoro: bm_lewis
Beast (gfx1151)
Engine: Kokoro-82M (ONNX) · Voice: Preset, British male alt · Gen time: 31s · Audio: 307s · Speed: 10× realtime
Alternate British male voice. Same Kokoro engine.
Kokoro: bf_emma
Beast (gfx1151)
Engine: Kokoro-82M (ONNX) · Voice: Preset, British female · Gen time: 27s · Audio: 275s · Speed: 10× realtime
Female variant. Same Kokoro engine.
Qwen3-TTS: Stephen Fry (PAUL'S PICK)
Mac M4 MLX
Engine: Qwen3-TTS-12Hz-1.7B (8-bit MLX) · Voice: Cloned from 25s of Penguin Audiobooks Mythos sample · Gen time: 187s · Audio: 290s · Speed: 1.5× faster than realtime
Studio-quality reference clip = clean output. Voice cloning is Qwen's killer feature: it can match any speaker from a short reference.
Qwen3-TTS: Humphries
Mac M4 MLX
Engine: Qwen3-TTS-12Hz-1.7B (8-bit MLX) · Voice: Cloned, posh British male · Gen time: 179s · Audio: 276s · Speed: 1.5× faster than realtime
Oliver: "sounds like in a hall or room." Reference clip had studio ambience that got cloned along with the voice.
Higgs Audio v2: default
Mac CPU (GGUF Q6_K)
Engine: Higgs Audio v2 GGUF Q6_K (BosonAI) · Voice: Higgs default (no clone in this test) · Gen time: 106s · Audio: 62s · Speed: 1.7× slower than realtime
Mood steered via scene-description per chunk. CPU-only path on Mac.
German: "Der 9. Tag des Dezember, Anno 2025"
~400 words. Mock-archaic German about a day in Khao Yai: bicycles, ruins, swimming pool, German Christmas music in tropical heat.
Native vs cloned: Piper voices are recorded by native German speakers, so there is no English accent. Qwen-cloned voices carry the original speaker's accent (Stephen Fry speaks German with an English accent; Alberto flagged this).
Native German voices (recommended)
Qwen3-TTS: Rufus Beck (NEW · NATIVE NARRATOR)
Mac M4 MLX
Engine: Qwen3-TTS-12Hz-1.7B (8-bit MLX) · Voice: Cloned Rufus Beck, German Harry Potter audiobook narrator (Random House Audio, official sample) · Gen time: 74s · Audio: ~125s · Speed: ~1.7× faster than realtime
Native German via voice cloning. Rufus Beck is the German equivalent of Stephen Fry: he narrated all 7 Harry Potter books in German, plus many other audiobooks. Clean studio source. Compare to Piper Thorsten below for ONNX vs cloned-narrator quality.
Piper: Thorsten (GOLD STANDARD)
Beast (gfx1151)
Engine: Piper de_DE-thorsten (ONNX medium) · Voice: Native German male (Thorsten Müller, the famous thorsten-voice dataset) · Gen time: 2.7s · Audio: 109s · Speed: 40× realtime
The reference quality for free German TTS. Native speaker, studio recording. No accent.
Piper: Ramona
Beast (gfx1151)
Engine: Piper de_DE-ramona (ONNX low) · Voice: Native German female · Gen time: 2.3s · Audio: 127s · Speed: 55× realtime
Piper: eva_k
Beast (gfx1151)
Engine: Piper de_DE-eva_k (ONNX x-low) · Voice: Native German female alt · Gen time: 2.3s · Audio: 143s · Speed: 62× realtime
Smallest model size. Tinny relative to Thorsten but native German.
Cloned English voices speaking German (English accent expected)
Qwen3-TTS: Stephen Fry
Mac M4 MLX
Engine: Qwen3-TTS-12Hz-1.7B (8-bit MLX) · Voice: Cloned Stephen Fry speaking German · Gen time: 73s · Audio: 126s · Speed: 1.7× faster than realtime
English speaker speaking German = English accent. Novelty value; not the right choice for actual German content. Native German via Qwen needs a German reference clip, such as the Rufus Beck clone above.
Same Dec 6 English text re-rendered with inline emotion tags. Tag format differs per engine. Listen for chuckles, sighs, breathless urgency at the marked points.
Tag formats: Chatterbox uses square brackets [chuckle] · Orpheus uses angle brackets <chuckle> · Higgs uses chunk-level scene descriptions (no inline tags)
Chatterbox-Turbo: Stephen Fry (REALTIME)
Mac M4 MPS
Engine: Chatterbox-Turbo (Resemble AI, MIT) · Voice: Cloned Stephen Fry · Tags: [chuckle] [sigh] [whispers] [breathless] [satisfied] · Gen time: 115s · Audio: 120s · Speed: 0.96 RTF (~realtime)
Native bracket-tag support like ElevenLabs v3. Combines voice cloning + emotion. Beat ElevenLabs in blind tests (65.3% vs 24.5%) per Resemble's benchmarks.
Orpheus 3B: dan
Mac CPU Metal
Engine: Orpheus 3B (Canopy Labs, Apache-2.0) via orpheus-cpp · Voice: Preset "dan" (male English, no cloning) · Tags: <chuckle> <sigh> <gasp> · Gen time: 124s · Audio: 71s · Speed: 1.75× slower than realtime
3B llama-based architecture. Native emotion tag support via <laugh> <chuckle> <sigh> <cough> <sniffle> <groan> <yawn> <gasp>. 8 preset voices (no cloning).
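Because the two inline formats differ only in their delimiters and in which tags are recognised, one tagged script can in principle be adapted for both engines. The sketch below is one possible policy, not necessarily the one used for these renders: it rewrites a square-bracket tag as an angle-bracket tag when the tag is in the Orpheus list above and drops it otherwise. The function name and the drop-unsupported rule are assumptions; a mapping (for example from [breathless] to <gasp>) would be an equally reasonable choice.

```python
import re

# Orpheus tags as listed on this page; anything else has no direct equivalent.
ORPHEUS_TAGS = {"laugh", "chuckle", "sigh", "cough", "sniffle", "groan", "yawn", "gasp"}

_BRACKET_TAG = re.compile(r"\[([a-z]+)\]")


def brackets_to_orpheus(text: str) -> str:
    """Rewrite [tag] as <tag> when Orpheus lists the tag, otherwise drop it."""

    def swap(match: re.Match) -> str:
        tag = match.group(1)
        return f"<{tag}>" if tag in ORPHEUS_TAGS else ""

    converted = _BRACKET_TAG.sub(swap, text)
    # Dropping a tag can leave a space before punctuation or a doubled space.
    converted = re.sub(r"\s+([,.;:!?])", r"\1", converted)
    return re.sub(r" {2,}", " ", converted).strip()


if __name__ == "__main__":
    print(brackets_to_orpheus("along the road [breathless], the animal barking"))
    print(brackets_to_orpheus("upon the morrow [chuckle]."))
```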
Higgs Audio v2: scene-desc
Mac CPU (GGUF Q6_K)
Engine: Higgs Audio v2 GGUF Q6_K · Voice: Higgs default · Tags: Scene-desc per chunk (no inline tags) · Gen time: 106s · Audio: 62s · Speed: 1.7× slower than realtime
No inline brackets. Instead, the system prompt sets the mood per text chunk. E.g. chunk 1 system prompt: "narrator with excited energetic tone"; chunk 4: "tension and breathless urgency." Listen for whether mood shifts between segments.
We did venture forth upon our bicycles to survey two magnificent ruins of antiquity, the first containing a most wondrous sight: the head of Buddha himself, grown about and embraced by the roots of a great tree, as if Nature herself did worship at his feet. The heat upon this day was most oppressive, the sun beating down with such fury that the very air registered some thirty-two degrees [sigh], and the perspiration did flow freely from our brows as rivers flow to the sea.
Paragraph 3: Tender moment
[whispers] Before this mystical threshold sat a small child, a girl of tender years, who did endeavour to sell little fishes of her own crafting; I did press twenty baht into her small hand, though I purchased naught of her wares, for charity is a virtue most pleasing to Providence.
Paragraph 4: Wild dog chase
Yet this day was not without its terrors, for I myself was pursued by a wild dog for no less than fifty metres along the road [breathless], the animal barking most ferociously at my heels, and this experience did quite diminish our enthusiasm for further cycling adventures.
Paragraph 5: Resolution
[satisfied] At a halal establishment we did feast most excellently upon seafood and shrimps prepared in a sauce of medium spiciness, and thus we concluded this day of adventures, misadventures, ancient wonders, and the ever-present kindness of strangers, returning to our rest with grateful hearts and the promise of new discoveries upon the morrow [chuckle].
Note: Chatterbox + Orpheus interpret these brackets/angle-brackets natively. Higgs uses chunk-level mood descriptions instead: the paragraph 1 prompt was "excited energetic," paragraph 4 was "tension and breathless urgency," and paragraph 5 was "warm satisfaction."
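The chunk-level approach can be pictured as pairing each paragraph with its own mood description and sending the text with inline tags removed. The sketch below only builds that plan; the actual call into Higgs is left out because its interface is not described here. The `Chunk` type, the function names, and the "neutral narrator" fallback are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    scene: str  # mood description used as the system prompt for this chunk
    text: str   # plain text to speak, with inline tags removed


# Matches a [tag] or <tag> together with any whitespace in front of it.
_INLINE_TAG = re.compile(r"\s*(\[[a-z]+\]|<[a-z]+>)")


def strip_inline_tags(text: str) -> str:
    """Remove bracket and angle-bracket emotion tags from a paragraph."""
    return _INLINE_TAG.sub("", text).strip()


def plan_chunks(paragraphs, scenes, default_scene="neutral narrator"):
    """Pair each paragraph (numbered from 1) with its scene description."""
    return [
        Chunk(scene=scenes.get(number, default_scene), text=strip_inline_tags(body))
        for number, body in enumerate(paragraphs, start=1)
    ]


if __name__ == "__main__":
    scenes = {1: "excited energetic", 4: "tension and breathless urgency", 5: "warm satisfaction"}
    demo = ["We did venture forth [sigh], and rested.", "[whispers] Before this threshold sat a child."]
    for chunk in plan_chunks(demo, scenes):
        print(f"{chunk.scene!r}: {chunk.text}")
```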
Video Avatar: LongCat-Video-Avatar on H100
Talking-head video generation from a still photo + voice clip. 13.6B-parameter model from MeiGen-AI / Meituan. Runs on cloud H100 via Modal (cannot run locally: no GPU on Mac/Beast can handle it).
Inputs
Reference photo (Wikipedia / Berlinale 2024, 2192×2908 resized to 512px)
Driving audio (first 10s of Qwen3-TTS Stephen Fry rendering of Dec 6 text)
Prompt: "A distinguished British gentleman speaking with warmth and enthusiasm"
Output
Resolution: 544×736 @ 16 fps · Duration: 5.8s · GPU: NVIDIA H100 80GB via Modal serverless · Gen time: ~5 min wall clock (incl. cold start) · Cost: ~$0.30-0.50 per run
What's actually happening
LongCat-Video-Avatar takes a single still image of a face, an audio clip of speech, and a text prompt describing the scene/mood. The 13.6B-parameter diffusion model produces a video where the still face is animated to lip-sync the audio with believable facial expressions, head movement, and micro-gestures.
Why this needs cloud GPU
Beast can't run it: model uses NVIDIA flash-attention-2 kernels (no ROCm port) AND block-sparse attention via Triton (CUDA-only). PyTorch+ROCm is also broken on the AMD gfx1151 chip.
Mac can't run it either: 13.6B params + custom CUDA Triton kernels = no realistic path to Metal/MPS without weeks of porting work.
Longer audio (whole Dec 6 passage, ~5 min of video): needs the num_segments setting for video continuation
Different starting images (Alberto, Diana, the Rufus Beck-cloned German voice on a German face)
720p instead of 480p (longer gen, ~$1 per run)
About this comparison
Hardware
Mac M4 (Apple Silicon): MLX framework for Qwen3-TTS, MPS for Chatterbox, Metal for Orpheus, CPU for Higgs.
Beast (Beelink GTR9 Pro, AMD Ryzen AI MAX+ 395, Radeon 8060S iGPU gfx1151, 128GB unified RAM, Ubuntu 24.04): ONNX runtime on CPU for Kokoro and Piper. PyTorch+ROCm is broken on this AMD chip, so no GPU-accelerated PyTorch TTS on Beast.
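For the PyTorch-based engines, the split above comes down to which backend PyTorch reports as usable on each machine. A minimal sketch of such a check, assuming a stock PyTorch install; `pick_device` is an illustrative helper, not part of any engine listed here, and it falls back to CPU when PyTorch is missing or no accelerator is reported.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        # No PyTorch at all: only CPU paths (e.g. ONNX runtime) remain.
        return "cpu"

    # ROCm builds of PyTorch also report through the torch.cuda namespace.
    if torch.cuda.is_available():
        return "cuda"

    # Apple Silicon GPUs are exposed through the MPS backend.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"

    return "cpu"


if __name__ == "__main__":
    print(f"Selected device: {pick_device()}")
```

On a machine where the GPU stack is not working, as described for Beast, a check like this would be expected to end up on the CPU branch.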
Engines tested
Kokoro 82M (ONNX, MIT): fast, multilingual, preset voices only
Piper (ONNX, MIT): fastest, native voice library across 30+ languages
Qwen3-TTS 1.7B 8-bit MLX (Apache-2.0): voice cloning from short reference clip, multilingual
Dec 6 (English): 884 words of mock-archaic English about Ayutthaya, Thailand: bicycles, ancient Buddha statue in a tree, wild dog chase, Pad Thai.
Dec 9 (German): ~400 words of mock-archaic German about Khao Yai: pool, ruins, Christmas music in tropical heat.
Pattern observations
Reference quality dominates Qwen3-TTS output. Studio audiobook samples (Penguin/Audible) give clean output. With broadcast TV clips, the laugh track and room ambience get cloned along with the voice.
For cloned voices speaking a non-native language: the source speaker's accent transfers. Stephen Fry speaks German with an English accent. For native-sounding output, clone from a native speaker of the target language.
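ONNX models on CPU (Kokoro/Piper) are far faster than Qwen3-TTS on Mac MLX, even though the Mac has GPU acceleration: the English passage took roughly 5 to 31 s to generate with Piper and Kokoro, versus about 180 s with Qwen3-TTS. Smaller architectures + better optimisation win.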
Emotion tags genuinely work in Chatterbox and Orpheus: chuckles, sighs, gasps appear at marked points. Higgs's chunk-level scene-desc approach is more abstract but also produces tonal variation.
Coming soon
Rufus Beck (German Harry Potter narrator) as a native-German Qwen clone
LongCat-Video-Avatar (Modal H100): talking head video generation from a still image + audio clip
Built for Oliver to evaluate. Page auto-deployed via Cloudflare Pages.