AudioCALM — Continuous Autoregressive Language Modeling for Universal Audio Generation

Abstract

Unifying speech, sound, and music generation in one model is hindered by tradeoffs between fidelity, end-to-end training, in-context conditioning, and variable-length synthesis that no current paradigm fully resolves. To address this challenge, we present AudioCALM, a universal audio generation framework that extends autoregressive (AR) next-token prediction from discrete tokens to continuous audio latents: a thin flow-matching head replaces the softmax to predict rectified-flow velocities at each position, and a block-causal AR-Flow attention pattern produces arbitrary-length output. Joint training of multiple audio generation tasks faces an asymmetric text–audio mismatch: speech transcripts align to specific time spans and demand tight, time-aligned attention, whereas sound and music captions describe only overall semantics and rely on diffuse, holistic attention; mixing the two disproportionately degrades sound and music generation. We address this asymmetry at two levels: a data reformulation strategy that unifies all three tasks under a single description-style conditioning interface, and a novel architecture, Asymmetric Mixture-of-Modality-Experts (A-MoME), which adds a dedicated residual expert for speech while sound and music share the backbone, incurring no inference overhead on non-speech inputs. Experimental results demonstrate that AudioCALM matches modality-specific state-of-the-art systems and outperforms prior unified baselines on speech, sound, and music generation benchmarks.

1 Method

A single causal Transformer autoregresses over fixed-size blocks of continuous audio latents. A flow-matching head replaces the softmax; a block-causal AR-Flow attention pattern enables streaming arbitrary-length output; an asymmetric speech-only residual expert captures the tighter alignment that speech alone demands.

**Figure 1.** AudioCALM architecture overview. A causal Transformer autoregresses over blocks of continuous audio latents; a flow-matching head and a stop head are attached to its hidden states, and each block is produced by iterative denoising with KV-cache reuse. The block-causal AR-Flow mask is causal across blocks and bidirectional within each block. A-MoME adds a deterministically routed speech-only FFN to the shared backbone.
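The generation procedure in the Figure 1 caption (one block at a time, iterative denoising per block, a stop head deciding when to end) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `generate_blocks`, `toy_velocity`, and `toy_stop` are hypothetical names, the plain Euler integrator and the noise-at-t=0 convention are assumptions, and a Python list of finished blocks stands in for the KV cache.

```python
import random

def generate_blocks(velocity_fn, stop_fn, block_len=4, steps=8, max_blocks=16, seed=0):
    """Generate latent blocks one at a time until the stop function fires."""
    rng = random.Random(seed)
    context = []            # finished blocks; stands in for the reused KV cache
    dt = 1.0 / steps
    for _ in range(max_blocks):
        # Assumption: every block starts from Gaussian noise at t = 0.
        x = [rng.gauss(0.0, 1.0) for _ in range(block_len)]
        for k in range(steps):
            t = k * dt
            v = velocity_fn(x, t, context)             # hypothetical flow-matching head
            x = [xi + dt * vi for xi, vi in zip(x, v)]  # one Euler step
        context.append(x)
        if stop_fn(context):                           # hypothetical stop head
            break
    return context

# Toy stand-ins so the sketch runs without a trained model.
def toy_velocity(x, t, context):
    # Push every latent toward a constant equal to the current block index.
    target = float(len(context))
    return [(target - xi) / (1.0 - t) for xi in x]

def toy_stop(context):
    # Stop after three blocks.
    return len(context) >= 3

blocks = generate_blocks(toy_velocity, toy_stop)
print(len(blocks), [round(b[0], 6) for b in blocks])
```

Because the output length is decided by the stop function rather than fixed in advance, the same loop covers short and long clips; in a real model the context would be cached keys and values rather than raw blocks.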

i Continuous AR LM

A flow-matching head replaces the softmax to predict rectified-flow velocities over VAE latents — keeping the LM interface while removing the codec bottleneck.
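As a minimal sketch of what "predicting rectified-flow velocities" means as a training target, the snippet below builds one training pair for a single latent vector. It assumes the common straight-line rectified-flow convention (noise at t = 0, data at t = 1) and a mean-squared-error loss; the page does not state which convention or loss weighting AudioCALM uses, and `rectified_flow_pair` and `velocity_mse` are hypothetical helper names.

```python
import random

def rectified_flow_pair(x1, rng):
    """Return (x_t, t, v_target) for one clean latent vector x1."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]                  # noise endpoint
    t = rng.random()                                        # time in [0, 1)
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]   # point on the straight path
    v_target = [b - a for a, b in zip(x0, x1)]              # velocity is constant along the path
    return x_t, t, v_target

def velocity_mse(v_pred, v_target):
    """Mean squared error between predicted and target velocities."""
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(v_target)

rng = random.Random(0)
x1 = [0.5, -1.0, 2.0]                                       # made-up latent values
x_t, t, v_target = rectified_flow_pair(x1, rng)
# Following the target velocity for the remaining time (1 - t) lands on x1.
recovered = [a + (1.0 - t) * v for a, v in zip(x_t, v_target)]
```

In this view the head regresses a continuous vector at each position instead of producing logits over a codebook, which is why no discrete codec is needed.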

ii Block-Causal AR-Flow

Causal across blocks, bidirectional within each block. Streams arbitrary-length audio under a single mask shared by training and inference.
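One way to picture the mask is as a boolean matrix over audio positions. The sketch below is an assumption-laden illustration: `block_causal_mask` is a hypothetical helper, it covers audio positions only, and how the text or prompt prefix is masked is not specified on this page.

```python
def block_causal_mask(num_positions, block_size):
    """mask[i][j] is True when position i may attend to position j."""
    return [
        # Attend to every position whose block index is not later than your own:
        # full attention inside the block, causal attention across blocks.
        [(j // block_size) <= (i // block_size) for j in range(num_positions)]
        for i in range(num_positions)
    ]

mask = block_causal_mask(num_positions=6, block_size=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With a block size of 1 this reduces to the ordinary causal mask, and with a single block covering the whole sequence it becomes fully bidirectional.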

iii A-MoME

One residual FFN added on speech positions only — capacity follows the asymmetric text–audio mismatch with zero overhead on non-speech inputs.
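The routing idea can be sketched in a few lines. The following is a toy illustration, not the paper's layer: `amome_ffn`, `shared_ffn`, and `speech_ffn` are hypothetical names, the per-position speech flag is assumed to come from the known task type, and adding the expert output to the shared FFN output is an assumption about where the residual is applied.

```python
def amome_ffn(hidden, is_speech, shared_ffn, speech_ffn):
    """Apply the shared FFN everywhere and add the speech expert on speech positions."""
    out = []
    for h, speech in zip(hidden, is_speech):
        y = shared_ffn(h)
        if speech:
            # Deterministic routing: no learned gate, the flag alone decides.
            y = [a + b for a, b in zip(y, speech_ffn(h))]
        out.append(y)
    return out

# Toy stand-ins for the two feed-forward networks.
def toy_shared(h):
    return [2.0 * v for v in h]

def toy_speech(h):
    return [1.0 for _ in h]

hidden = [[1.0, 2.0], [3.0, 4.0]]
result = amome_ffn(hidden, [True, False], toy_shared, toy_speech)
print(result)
```

Because the speech branch is never evaluated when the flag is false, sound and music inputs pay only for the shared path, which matches the zero-overhead claim above.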

2 Zero-Shot Text-to-Speech Comparisons on LibriTTS

A reference utterance is prepended as the speaker prompt. AudioCALM is compared quantitatively against modality-specific TTS systems (F5-TTS, CosyVoice 3.0) and unified baselines (UniAudio, UniMoE-Audio, UniFlow-Audio, Ming-omni); full numbers are in the paper and summarized in the Results section below. The audio samples feature the strongest baseline in each group, CosyVoice 3.0 (modality-specific) and Ming-omni (unified), alongside Ground Truth and AudioCALM.

3 Text-to-Sound Comparisons on AudioCaps

Held-out captions covering everyday acoustic events. AudioCALM is compared quantitatively against AudioLDM 2-Large, TangoFlux, Stable Audio Open, and the unified baselines (UniAudio, UniFlow-Audio, Ming-omni); full numbers are in the paper and summarized in the Results section below. The audio samples feature the strongest baseline in each group, TangoFlux (modality-specific) and Ming-omni (unified), alongside Ground Truth and AudioCALM.

4 Text-to-Music Comparisons on Song-Describer

Held-out music captions spanning multiple genres. AudioCALM is compared quantitatively against MusicGen-Large, Stable Audio Open, and the unified baselines (UniAudio, UniMoE-Audio, UniFlow-Audio, Ming-omni); full numbers are in the paper and summarized in the Results section below. The audio samples feature the strongest baseline in each group, Stable Audio Open (modality-specific) and UniMoE-Audio (unified), alongside Ground Truth and AudioCALM.

5 Results

Best per column in bold, second-best underlined. AudioCALM is a single model evaluated jointly across all three domains.

Metrics. WER (word error rate, lower is better) and SIM (cosine similarity between speaker embeddings of the generation and the reference, higher is better) measure speech intelligibility and speaker fidelity, respectively. FAD (Fréchet audio distance, lower is better) measures distributional fidelity. CLAP (text–audio cosine similarity, higher is better) measures prompt adherence. MOS is the subjective overall-quality score for speech (1–5, higher is better); MOS-Q and MOS-T are the subjective quality and text-relevance scores for sound and music (1–5, higher is better).
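For readers unfamiliar with these metrics, the sketch below shows the two simplest ones in their textbook form: WER as word-level edit distance divided by reference length, and the cosine similarity that underlies both SIM and CLAP once embeddings are available. It is illustrative only; the evaluation pipeline, text normalization, and embedding models used for the tables are not described on this page, and `wer` and `cosine` are hypothetical helper names.

```python
import math

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words.

    Assumes a non-empty reference and whitespace tokenization.
    """
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))        # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                # deletion
                cur[j - 1] + 1,             # insertion
                prev[j - 1] + (r != h),     # substitution (free when words match)
            ))
        prev = cur
    return prev[-1] / len(ref)

def cosine(u, v):
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

score = wer("the cat sat on the mat", "the cat sat on mat")   # one deleted word out of six
```

FAD and the MOS scores are not sketched here: FAD requires an embedding model and a Fréchet distance between feature distributions, and MOS comes from human listening tests.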

Text-to-speech

Model	Benchmark A			Benchmark B
Model	WER ↓	SIM ↑	MOS ↑	WER ↓	SIM ↑	MOS ↑
Modality-specific baselines
F5-TTS	0.033	0.616	3.85	0.018	0.648	3.78
CosyVoice 3.0	0.022	0.697	3.96	0.015	0.695	3.88
Unified baselines
UniAudio	0.120	0.265	3.30	0.113	0.363	3.22
UniMoE-Audio	0.078	0.361	3.52	0.019	0.573	3.72
UniFlow-Audio	0.032	0.570	3.50	0.058	0.573	3.45
Ming-omni	0.025	0.553	3.82	0.013	0.633	3.80
AudioCALM (ours)	0.020	0.668	4.02	0.011	0.672	3.95

Text-to-sound and text-to-music

Model	Text-to-Sound				Text-to-Music
Model	FAD ↓	CLAP ↑	MOS-Q ↑	MOS-T ↑	FAD ↓	CLAP ↑	MOS-Q ↑	MOS-T ↑
Modality-specific baselines
AudioLDM 2-Large	5.36	0.22	3.25	3.10	—	—	—	—
TangoFlux	2.70	0.36	3.82	3.85	—	—	—	—
Stable Audio Open	4.13	0.25	3.65	3.45	2.23	0.32	3.95	3.85
MusicGen-Large	—	—	—	—	5.28	0.19	3.65	3.45
Unified baselines
UniAudio	6.64	0.13	3.20	2.95	11.25	0.06	2.80	2.65
UniMoE-Audio	—	—	—	—	3.71	0.22	3.80	3.60
UniFlow-Audio	4.22	0.35	3.62	3.80	6.39	0.15	3.45	3.25
Ming-omni	2.46	0.27	3.85	3.60	7.98	0.07	3.25	2.92
AudioCALM (ours)	1.95	0.37	3.98	3.95	2.02	0.36	3.99	3.92

Ethics

AudioCALM supports zero-shot voice cloning and high-fidelity sound and music synthesis. Deployment requires provenance and consent safeguards against voice-cloning misuse and synthetic-media risks. All audio samples on this page are released solely for the purpose of academic peer review.
