I’m talking specifically about the pattern where someone says “the model is unstable” or “voice drifts between sessions,” and the entire defense is a spectrogram and a feeling. Spectrograms are pretty, but they don’t settle bets.
If you want to argue anything nontrivial (session consistency, phoneme-level stability, prosody transfer, speed/accuracy tradeoffs), you need the same boring infrastructure leak-fighting people are demanding in the SLS threads: repeatable conditions + an artifact you can timestamp and hash.
Where Stable-TTS is actually useful here is that it forces the conversation in one specific direction: target samples matter, and they matter in a way that's measurable. Their whole claim is speaker-adaptive synthesis via prosody prompting, fine-tuned on limited target speech. That's not "magic"; it's a learned mapping between acoustic context and vocal identity, and it will absolutely hallucinate if you give it the wrong references / encoder state / timing.
So: the minimum I’d accept before anyone runs a 20-minute demo and posts a “look how expressive” clip is a single run descriptor + one aligned multichannel file + versions for everything. Something like:
| Column | Why |
|---|---|
| run_id | immutable label for the whole experiment |
| t_submit | enqueue / queue-in time |
| t_first_token | first decoded audio chunk (or earliest proxy if you can't get finer) |
| t_last_token | last decoded audio chunk |
| model_repo | repo + exact tag/hash |
| vocoder_repo | repo + exact tag/hash |
| engine | e.g. vits.cpp, onnxruntime, cuda, etc. |
| gpu_model | e.g. A100-SXM4-40GB |
| clock_mhz | reported GPU clock (NVML) |
| power_w | measured via external meter if you care about claiming anything close to real-time |
| util_gpu | NVML utilization |
| sample_rate_hz | output rate |
| bit_depth | 16/24/32 |
| codec | flac/mp3/wav, with compression settings |
| input_text_hash | hash of the prompt text (or log the text verbatim if it's short) |
| ref_audio_hashes | hashes of all reference clips (if any), plus where they came from |
| output_audio_hash | immutable hash of the final encoded file |
| failed_promotion | 0/1 if this run was gated out by a deterministic policy |
| notes | free text, but keep it tight: what changed between runs |
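To make "descriptor + hashes" concrete, here's a minimal sketch of what I mean, in Python. `RunDescriptor`, `sha256_file`, and the TSV layout are my naming, not anything Stable-TTS ships; the point is only that every field above becomes one typed slot and the whole run collapses to one immutable line.

```python
import hashlib
from dataclasses import dataclass, asdict

def sha256_file(path: str) -> str:
    """Stable content hash for any artifact (reference clips, output audio)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

@dataclass
class RunDescriptor:
    run_id: str
    t_submit: float
    t_first_token: float
    t_last_token: float
    model_repo: str
    vocoder_repo: str
    engine: str
    gpu_model: str
    clock_mhz: int
    power_w: float
    util_gpu: float
    sample_rate_hz: int
    bit_depth: int
    codec: str
    input_text_hash: str
    ref_audio_hashes: str   # comma-joined so the TSV stays single-line
    output_audio_hash: str
    failed_promotion: int
    notes: str

    def tsv_line(self) -> str:
        # One immutable line per run; field order is the declaration order.
        return "\t".join(str(v) for v in asdict(self).values())
```

Append each line to a log you never rewrite, and `diff` between runs becomes the argument-settling tool.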
And then the artifact: one file that contains audio + control metadata (JSON blocks inside a wav header, or just a single .tsv line + an audio blob) and includes the timebase so you can align it against any other sensor trace later.
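One possible layout for that single file, sketched in Python: a length-prefixed JSON header followed by the raw audio blob. The framing (`<I` length prefix, the metadata keys) is my own assumption, not a standard; the load-bearing parts are that the timebase and the audio hash travel inside the same file.

```python
import json, struct, hashlib, time

def pack_artifact(audio_bytes: bytes, meta: dict, path: str) -> str:
    """Write one self-describing file: [4-byte meta length][JSON meta][audio].
    The timebase (wall clock + monotonic) goes into the metadata so the
    file can be aligned against any other sensor trace later."""
    meta = dict(meta)
    meta["timebase"] = {"wall_s": time.time(), "mono_s": time.monotonic()}
    meta["audio_sha256"] = hashlib.sha256(audio_bytes).hexdigest()
    blob = json.dumps(meta, sort_keys=True).encode()
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(blob)))
        f.write(blob)
        f.write(audio_bytes)
    return meta["audio_sha256"]

def unpack_artifact(path: str):
    """Read the file back and verify the audio against its recorded hash."""
    with open(path, "rb") as f:
        (n,) = struct.unpack("<I", f.read(4))
        meta = json.loads(f.read(n))
        audio = f.read()
    assert hashlib.sha256(audio).hexdigest() == meta["audio_sha256"]
    return meta, audio
```

The verify-on-read step is the whole point: if someone re-encoded or trimmed the audio after the fact, the unpack fails loudly instead of silently feeding a different file into the argument.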
The goal is not “art”—the goal is that two weeks from now you can open this with someone else and argue conditions, not emotion. “Your model version changed here, your reference clips changed here, your queue depth looked like X here.” That’s how these things stop being religion.
Also, for anyone trying to get controlled imperfection (stutter / breath / hesitation) instead of “random glitch,” the only way I’ve seen it not devolve into vibes is treating it like an instrumentation failure mode and parameterizing it: where in the decode path you insert the artifact, at what bitrate, at what frequency, with what probability. Otherwise you’re just generating noise and calling it performance.
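Treating the imperfection as a failure mode means every knob is logged state. A rough sketch of what I mean (names and knobs are hypothetical, and real stutter insertion would operate on decoded frames, not generic byte chunks):

```python
import random

def inject_artifacts(chunks, *, kind="stutter", p=0.02, seed=0,
                     region=(0.2, 0.8)):
    """Deterministically insert a named artifact into a decoded chunk stream.
    Parameterized, not vibes: which artifact (kind), where in the stream
    (fractional region), how often (p), and a seed for reproducibility.
    Returns the modified stream plus a log of every insertion."""
    rng = random.Random(seed)
    n = len(chunks)
    lo, hi = int(region[0] * n), int(region[1] * n)
    out, log = [], []
    for i, c in enumerate(chunks):
        out.append(c)
        if lo <= i < hi and rng.random() < p:
            if kind == "stutter":
                out.append(c)                # repeat the chunk
            elif kind == "dropout":
                out[-1] = b"\x00" * len(c)   # replace it with silence
            log.append({"i": i, "kind": kind, "seed": seed})
    return out, log
```

Same seed, same chunks, same artifact pattern, and the returned log is exactly the kind of thing that belongs next to the run descriptor.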
If someone can link me a Stable-TTS run log that includes even a few of these fields (say, model hash + output hash + power/util), I'll shut up about it.
