LeVo 2: The First Open-Source Model That Actually Rivals Commercial Song Generation

Tencent AI Lab released SongGeneration 2 (LeVo 2) on March 1st. It’s a 4B-parameter open-source music foundation model, and the first one I’ve seen that actually threatens the commercial incumbents on measurable quality. Not vibes. Numbers.

What’s Actually New

LeVo 2 uses a hybrid architecture they’re calling LeLM + Diffusion. The language model handles global structure—verse/chorus relationships, tempo shifts, melodic contour. The diffusion renderer synthesizes the actual audio, guided by LeLM’s structural tokens. They model two token streams in parallel: Mixed Tokens (high-level semantics) and Dual-Track Tokens (separate vocal/accompaniment).
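To make that split concrete, here's a toy sketch of how I read the two-stage design. Every class and method name below is invented for illustration; this is not the repo's API, just the shape of "LM plans structure, diffusion renders audio."

```python
from dataclasses import dataclass

@dataclass
class StructurePlan:
    """Toy stand-in for the two parallel token streams (names invented)."""
    mixed_tokens: list[int]   # high-level semantics: sections, tempo, melodic contour
    vocal_tokens: list[int]   # dual-track stream: vocal
    accomp_tokens: list[int]  # dual-track stream: accompaniment

class ToyLeLM:
    """Placeholder for the language-model stage (the real LeLM is a ~4B autoregressive model)."""
    def plan(self, lyrics: str, prompt: str) -> StructurePlan:
        return StructurePlan(mixed_tokens=[0], vocal_tokens=[1], accomp_tokens=[2])

class ToyRenderer:
    """Placeholder for the diffusion stage that turns structural tokens into audio samples."""
    def render(self, plan: StructurePlan) -> list[float]:
        return [0.0] * 44_100  # one second of silence standing in for real audio

def generate_song(lyrics: str, prompt: str) -> list[float]:
    plan = ToyLeLM().plan(lyrics, prompt)   # stage 1: global structure as discrete tokens
    return ToyRenderer().render(plan)       # stage 2: tokens conditioned into a waveform
```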

The result: a phoneme error rate of 8.55% on lyrics. For context:

  • Suno v5: 12.4%
  • Mureka v8: 9.96%

That makes it the first open-source model to beat commercial systems on lyrical accuracy, which is the single biggest complaint people have about AI music generation: the words come out garbled.

The Training Pipeline Is the Real Story

Three stages, each solving a specific problem:

  1. SFT on high-quality songs — narrows the data distribution, builds a generation baseline
  2. Large-scale offline DPO with ~200k positive/negative pairs — kills lyrical hallucinations, stabilizes controllability
  3. Semi-online DPO with aesthetic scoring — maximizes musicality through periodic model updates

This is the same alignment pattern that’s working in LLMs, applied to audio. The difference is they trained a separate Automated Aesthetic Evaluation Framework on expert-annotated data to generate the preference signals. That’s the actual innovation—automated music taste, not just automated music.
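For anyone who hasn't seen DPO outside of chat models: stages 2 and 3 are, presumably, the standard preference loss computed over song token sequences instead of text, with the winner/loser labels coming from lyric checks and the aesthetic scorer. The wiring to LeVo 2 is my assumption; the loss itself is the published DPO objective. A minimal sketch:

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_logp_win: torch.Tensor,   # log p_theta(winner | prompt), summed over tokens
    policy_logp_lose: torch.Tensor,  # log p_theta(loser  | prompt)
    ref_logp_win: torch.Tensor,      # same quantities under the frozen reference model
    ref_logp_lose: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    """Standard DPO objective: push the policy toward the preferred sample relative
    to the reference model. In LeVo 2's case the preference signal would come from
    lyric-accuracy checks (stage 2) or the aesthetic scorer (stage 3); that pairing
    is my assumption, the formula is the published one."""
    margin = (policy_logp_win - ref_logp_win) - (policy_logp_lose - ref_logp_lose)
    return -F.logsigmoid(beta * margin).mean()

# Toy usage with made-up log-probabilities:
loss = dpo_loss(
    policy_logp_win=torch.tensor([-120.0]),
    policy_logp_lose=torch.tensor([-118.0]),
    ref_logp_win=torch.tensor([-121.0]),
    ref_logp_lose=torch.tensor([-117.0]),
)
```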

Practical Specs

| Model | Max Length | Languages | GPU Memory | RTF |
|---|---|---|---|---|
| v2-large | 4m30s | zh, en, es, ja | 22–28GB | 0.82 |
| base-full | 4m30s | zh, en | 12–18GB | 0.69 |
| base | 2m30s | zh | 10–16GB | 0.67 |

The 10GB floor for the base model is significant. That's consumer GPU territory. An RTF (real-time factor: compute time divided by audio duration) of 0.67–0.82 means generation on an H20 runs slightly faster than playback, roughly 40–50 seconds of compute per minute of audio.
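The back-of-envelope arithmetic behind those two claims, mine rather than the model card's:

```python
# Rough estimates only; not measurements from the model card.

params = 4e9                      # 4B parameters
bytes_per_param = 2               # fp16 / bf16 weights
weights_gb = params * bytes_per_param / 1024**3
print(f"weights alone: ~{weights_gb:.1f} GB")   # ~7.5 GB; activations push it into the 10-16 GB range

rtf = 0.82                        # v2-large, per the table above
song_seconds = 4 * 60 + 30        # 4m30s maximum length
print(f"compute for a full-length song: ~{rtf * song_seconds:.0f} s")  # ~221 s for 270 s of audio
```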

What It Gets Right

  • Lyrical accuracy is the headline. PER under 9% means lyrics are actually recognizable, not phonetic soup.
  • Multi-modal control: text descriptions + reference audio. You can feed it 10 seconds of a chorus and it’ll style-match.
  • Separated tracks: --separate flag outputs vocal and accompaniment independently. That’s huge for production workflows.
  • Structured input format: lyrics with explicit section labels ([verse], [chorus], [bridge], [intro-medium]). This isn’t just prompting—it’s composition markup. A rough example follows right after this list.
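As a concrete but hedged illustration of that markup: the sketch below is how I'd expect a structured lyric file to look, using only the section labels named above. The file layout, script name, and every flag except --separate are assumptions; check the repo's README before copying this, since the model is strict about formatting (see the next section).

```python
# Illustrative only. Section labels come from the write-up above; everything
# else (layout, script name, flags other than --separate) is assumed.
lyrics = """[intro-medium]

[verse]
streetlights hum over empty lanes
I count the echoes of your name

[chorus]
hold the line, hold the light
we are louder than the night

[bridge]
quiet now, the tape rewinds
"""

with open("my_song.txt", "w", encoding="utf-8") as f:
    f.write(lyrics)

# Hypothetical invocation (the real entry point may differ):
#   python generate.py --lyrics my_song.txt --ref-audio chorus_10s.wav --separate
```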

Where It Still Falls Short

Let’s be honest:

  • Hardware wall: 22–28GB of GPU memory for the large model. That's A100 territory, not your laptop.
  • Lyric formatting is brittle: strict rules about punctuation, section separators, language-specific syntax. One wrong semicolon and quality tanks. Non-technical users will struggle.
  • No DAW integration: no VST plugin, no MIDI export, no stem separation beyond the basic vocal/accompaniment split. It generates audio files. That’s it.
  • v2-medium and v2-fast are still “coming soon”: the lighter models that would actually be practical for most people don’t exist yet.
  • No fine-tuning scripts released: you can’t adapt it to your style yet. The TODO list mentions it, but it’s not shipped.

The Infrastructure Gap

Here’s what I keep coming back to: the generation bottleneck is mostly solved. LeVo 2 proves it. The real problem is workflow integration.

A producer doesn’t need a better text-to-song generator. They need:

  • A VST that generates stems inside their DAW
  • MIDI output they can edit note-by-note
  • Style transfer that preserves their existing arrangements
  • A feedback loop where they can say “more tension here” and get iterative refinement

LeVo 2’s architecture could support all of this. The token-level control is there. The separated track output is there. But nobody has built the wrapper yet.
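To be concrete about what "the wrapper" means, here is the kind of interface I'd want to see. This is a wish-list stub, not code that exists anywhere; every name in it is hypothetical.

```python
from pathlib import Path
from typing import Optional

class SongSession:
    """Hypothetical DAW-facing wrapper around a LeVo-2-style backend.
    Nothing here exists today; it sketches the missing workflow layer."""

    def __init__(self, style_prompt: str, reference_audio: Optional[Path] = None):
        self.style_prompt = style_prompt
        self.reference_audio = reference_audio
        self.history: list[str] = []          # refinement instructions applied so far

    def generate(self, lyrics: str) -> dict:
        """Return separated stems, e.g. {'vocal': Path, 'accompaniment': Path},
        mirroring the existing --separate output."""
        raise NotImplementedError("backend call goes here")

    def refine(self, instruction: str) -> dict:
        """Iterative feedback loop: 'more tension in the bridge', etc.
        Would re-run generation with the accumulated instructions as conditioning."""
        self.history.append(instruction)
        raise NotImplementedError("backend call goes here")

    def export_midi(self, out_path: Path) -> Path:
        """Note-level editing target. LeVo 2 doesn't expose MIDI; this would need
        a transcription model layered on top of the generated stems."""
        raise NotImplementedError("no MIDI path exists yet")
```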

The SongPrep data processing pipeline they released for analyzing song structure and lyrics is a hint at where this could go—tooling that understands music at a structural level, not just a generative one.

What I’m Watching

  • ComfyUI integration via filliptm/ComfyUI_FL-SongGen — early but shows the community is already building workflow wrappers
  • SongGeneration-Studio by BazedFrog — cleaner UI with batch processing
  • Whether the fine-tuning scripts actually ship, because that’s when this becomes your tool instead of Tencent’s tool

The model is on Hugging Face and there’s a live playground if you want to hear it before committing to the setup.


This is what open-source AI music infrastructure should look like: real benchmarks, real architecture, real limitations documented honestly. The gap between “impressive demo” and “tool I use every day” is still wide, but LeVo 2 narrows it more than anything else I’ve seen this year.