UniSonate: Unified speech and music generation with natural language instruction

Abstract

Generative audio modeling has largely been fragmented into specialized tasks: text-to-speech (TTS), text-to-music (TTM), and text-to-audio (TTA), each operating under heterogeneous control paradigms. Unifying these modalities remains a fundamental challenge due to the intrinsic dissonance between structured semantic representations (speech/music) and unstructured acoustic textures (sound effects). In this paper, we introduce UniSonate, a unified flow-matching framework capable of synthesizing speech, music, and sound effects through a standardized, reference-free natural language instruction interface. To reconcile structural disparities, we propose a novel dynamic token injection mechanism that projects unstructured environmental sounds into a structured temporal latent space, enabling precise duration control within a phoneme-driven Multimodal Diffusion Transformer (MM-DiT). Coupled with a multi-stage curriculum learning strategy, this approach effectively mitigates cross-modal optimization conflicts. Extensive experiments demonstrate that UniSonate achieves state-of-the-art performance in instruction-based TTS (WER 1.47%) and TTM (SongEval Coherence 3.18), while maintaining competitive fidelity in TTA. Crucially, we observe positive transfer, where joint training on diverse audio data significantly enhances structural coherence and prosodic expressiveness compared to single-task baselines.

Experiment

This page presents samples from the paper "UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions".

None of the test samples or speakers appear in the training or validation sets.

During inference, we employ a straightforward yet effective method to estimate total duration. Since speech rate strongly correlates with both age and emotion, we compute the average duration per phoneme for different age-emotion combinations (e.g., youth-happy) from the training set. At inference time, we perform keyword matching on the instruction text and calculate the total duration by multiplying the number of phonemes by the corresponding average duration. When no keywords match, we default to the global average phoneme duration.
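
The duration heuristic described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the keyword lists, the per-phoneme averages in `AVG_PHONEME_SEC`, the global average, and the function names are all hypothetical placeholders, since the page does not give the actual values or matching rules.

```python
from typing import Dict, Optional, Tuple

# Hypothetical average seconds per phoneme for (age, emotion) combinations.
# In the described method these would be computed from the training set.
AVG_PHONEME_SEC: Dict[Tuple[str, str], float] = {
    ("youth", "happy"): 0.08,
    ("elderly", "sad"): 0.11,
}

# Hypothetical global fallback, used when no combination matches.
GLOBAL_AVG_PHONEME_SEC = 0.09

# Hypothetical keyword lists for matching labels in the instruction text.
AGE_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "youth": ("youth", "young"),
    "elderly": ("elderly",),
}
EMOTION_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "happy": ("happy", "joyful"),
    "sad": ("sad",),
}


def match_label(instruction: str, keywords: Dict[str, Tuple[str, ...]]) -> Optional[str]:
    """Return the first label whose keyword occurs in the instruction, else None."""
    text = instruction.lower()
    for label, words in keywords.items():
        if any(word in text for word in words):
            return label
    return None


def estimate_duration(instruction: str, num_phonemes: int) -> float:
    """Estimate total duration in seconds as phoneme count times average duration."""
    age = match_label(instruction, AGE_KEYWORDS)
    emotion = match_label(instruction, EMOTION_KEYWORDS)
    # Assumption: a partial match (only age or only emotion) also falls back
    # to the global average, because the table is keyed on the full pair.
    per_phoneme = AVG_PHONEME_SEC.get((age, emotion), GLOBAL_AVG_PHONEME_SEC)
    return num_phonemes * per_phoneme
```

For example, `estimate_duration("You'll hear a young adult woman speaking Mandarin in a happy and casual manner.", 20)` looks up the hypothetical youth-happy entry, while an instruction with no matching keywords uses the global average.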

For CosyVoice2, which lacks text-based control over timbre attributes (gender, age), we provide reference audio that matches the timbre description during inference and map the instruction to its supported short-form control text (e.g., "Please speak very happy").

Note that DiffRhythm+ does not support music synthesis under 90 seconds; therefore, we generate longer sequences and truncate them to create test samples, which may introduce evaluation bias.

Comparing Model Capabilities

Model Architecture

[New Supplementary] TTM - Male English Vocals

Additional samples demonstrating male vocal performance in English, as requested by reviewers.

Instruct Description	Lyrics	UniSonate (Ours)
This classical piece features a male singer in his 30s with a warm, resonant tone. The instrumentation blends piano, strings, and a choir to create a serene and reflective atmosphere. The tempo is moderate with a steady, flowing rhythm in the key of Bb major, evoking a sense of tranquility and introspection, as if transporting the listener to a peaceful, contemplative moment.	As I look in your eyes I hold on to your body
A classical choral piece performed by a mixed choir ranging from young adults to middle-aged singers. The harmonious voices blend seamlessly with gentle harp and string accompaniment. The tempo is moderate with a flowing melody in F major. The music evokes a sense of tranquility, reverence, and solemn reflection, creating a soothing and spiritually uplifting ambiance.	May your days be merry and bright
This rock track is led by a male singer in his 20s with a smooth, melodic tone. The instrumentation includes electric guitar, bass guitar, and drums, set to a moderate tempo of around 85 BPM in F major. The gentle strumming of the guitar combined with the vocals creates a dreamy and reflective atmosphere, evoking feelings of nostalgia and introspection.	You never know what temporal days may bring Laugh love live free and sing When life is in discord
A pop song featuring a male singer in his 20s with a smooth, melodic voice. The track is built on acoustic and electric guitars along with soft, ambient synths. With a moderate tempo of around 100 BPM in D major, the music feels bright and uplifting yet serene. It creates a warm, inviting atmosphere that evokes nostalgia and comfort, perfect for a moment of quiet introspection.	Would take your breath away Id write it all Even more in love with me youd fall
An energetic electronic dance music (EDM) track performed by a male vocalist. His tone is passionate and lively, driven by synthesizers, drum machines, and electronic beats. Set at a tempo of 128 BPM in Ab major, the pulsating beats and catchy melody create an uplifting atmosphere filled with excitement, euphoria, and celebration.	No I don’t even know if I’m alive Oh, oh, oh without you now
A smooth and soulful pop ballad performed by a male singer in his 30s. The arrangement features piano, synthesizer, and drums, moving at a moderate tempo with a steady beat in Bb major. The soft piano and gentle synth textures create a dreamy, ethereal atmosphere, while the singer's delivery adds a touch of melancholy, evoking deep nostalgia and quiet reflection on a past love.	Once I was seven years old my momma told me

TTS (Text-to-Speech) Task

Instruct Description	Text	UniSonate (Ours)	InstructAudio	CosyVoice2
A young adult female speaks with a neutral emotion. Her delivery is clear and she has a Mandarin accent.	李大钊在敌人的法庭上慷慨陈词毫不畏惧。
The speaker is an elderly woman who uses a measured, neutral tone of voice with an American accent.	I have come to understand. That you were the best daughter you could be.
You'll hear a young adult woman speaking Mandarin in a happy and casual manner.	你写板报的速度可真快。
With a playful style and a happy emotion, a middle-aged American woman is speaking.	There it is. My baby.
An aggressive and angry tone characterizes this audio from a young adult female speaker with a Mandarin accent.	你这种慷他人之慨的做法，让人嗤之以鼻。
A middle-aged woman speaks angrily and aggressively, marked by an American accent.	You stole something from me like a petty little thief.
The audio features a young adult female who sounds sad; she speaks softly with a Mandarin accent.	我的故事过程很美，而结局却满是悲伤。
This recording contains a middle-aged man speaking American English. His emotional state is sad, but his delivery is aggressive.	He didn't die from an asthma attack. He died because she killed him here.
In this audio, we hear a female child speaking with a curious tone and a playful style, using a Mandarin accent.	大人还玩跷跷板？
The speaker is an elderly male who speaks seriously and authoritatively in a Mandarin accent.	我的练习本怎么突然没了？
A female child expresses curiosity in a playful manner, and her speech is delivered with a Mandarin accent.	因此官方此次对 E二升级版融入了更多实用型的功能。
The audio features a single speaker, identified as speaker zero, who is an elderly woman. She speaks in a neutral tone with a measured, deliberate style and has an American accent.	I have come to understand. That you were the best daughter you could be.
In this audio, a middle-aged male speaker can be heard. His tone is happy and his casual speaking style is delivered with an American accent.	What? That's good. Well good luck with that. This is hilarious.

TTM (Text-to-Music) Task

Instruct Description	Lyrics	UniSonate (Ours)	InstructAudio	DiffRhythm+	ACE-Step
This audio features a male singer in his 20s performing a pop song. His soft, melodic tone carries a touch of melancholy and introspection. The instrumentation includes piano, synthesizer, and electric guitar, all set to a moderate tempo of 120 BPM in the key of C major. Together, these elements create a dreamy and reflective atmosphere, evoking a sense of nostalgia and longing.	紧紧拥抱唯一的你无可救药的坚定。
A female vocalist in her late teens or early twenties performs this electronic pop piece with a soft and melodic delivery. The track incorporates synthesizers, drum machines, and a female vocal sample, driven by a moderate 85 BPM rhythm in A minor. A repetitive and catchy synth riff anchors the melody, contributing to a dreamy, nostalgic mood that feels both introspective and gently melancholic, with subtle warmth and sweetness.	Like you to see the things I hide.
With a smooth and melodic vocal tone, a male singer in his 30s performs this pop song. The arrangement features piano and strings accompanying a moderate 120 BPM tempo in F major. The overall mood is reflective and dreamy, conveying a strong sense of nostalgia and longing through its gentle melody.	他们已几万年好像天经地义不容改变。
This electronic pop track is sung by a female vocalist in her twenties, whose tone is both smooth and melodic. Built around synthesizers, a drum machine, and a female vocal sample, the music moves at 100 BPM in the key of E♭ major. The result is a dreamy, ethereal soundscape that feels tranquil and introspective—like floating through a serene, otherworldly environment filled with warmth and emotion.	I was your treasure treasure treasure I was your treasure.
A vibrant and lively pop track is brought to life by a female singer in her late teens to early twenties, whose bright and energetic tone radiates youthful excitement. The instrumentation, featuring a synthesizer, drum machine, and bass guitar, creates a modern and upbeat sound perfect for dancing. This catchy atmosphere of joy and celebration is propelled by a tempo of 128 BPM in the key of D minor.	Wild child Hey hey hey hey Gonna be your one only one only one wild child Hey hey hey hey Wild child.
This pop piece evokes a deep sense of nostalgia and melancholy, delivered by a male singer in his 20s with a warm, soothing tone. The instrumentation includes guitar, bass, drums, and a keyboard, all supporting a moderate tempo of around 130 BPM in D minor. The gentle strumming of the guitar and the overall arrangement create a reflective and deeply emotional atmosphere.	真心被摧残心痛像洪水泛滥。
A female vocalist in her twenties performs with a soft and melodic tone in this warm, nostalgic pop song. The gentle strumming of the guitar, combined with the bass and drum, evokes a sense of longing and reflection, as if the singer is reminiscing about past memories. The overall mood is soothing and intimate, carried by a moderate tempo and the harmonious key of G major.	想来想去真正怨差麦当作你真正性格。
Energetic and passionate vocals from a female singer in her 20s define this optimistic pop track. The music features guitar, bass, and drums, driving forward at 134 BPM in the key of C major. The overall feeling is one of youthful energy and urgency, much like a warm summer breeze filled with excitement.	My head is blown away away away away away.
Smooth, melodic vocals from a female singer in her twenties guide this electronic pop track on a dreamy, ethereal journey. The instrumentation, built on synthesizers, a drum machine, and a female vocal sample, creates a sense of floating through a futuristic soundscape at 100 BPM in E♭ major. The overall atmosphere is one of tranquility and introspection, feeling both weightless and emotionally warm.	I was your treasure treasure treasure I was your treasure.

TTA (Text-to-Audio) Task

Instruct Description	UniSonate (Ours)	Stable Audio
Underwater bubbles rise as divers breathe and fish swim silently above the sandy seabed.
Birds chirp while water flows gently, and two people talk quietly in a lush forest setting.
The audio features a melancholic blend of new-age theme music that evokes a sense of introspection and longing.
Crickets chirped steadily while faint mechanical noises hummed, accompanied by relaxed conversation in the dim outdoor ambiance.
Ceramic tray clinks softly as hands assemble it on a workbench; faint background sounds in a quiet indoor space.
Ducks quacked softly while birds chirped, leaves rustled, and distant human voices blended in a tranquil outdoor setting.
A dog’s footsteps and faint jingling of its collar accompany the sound of quiet footsteps on a dimly lit street.
The audio contains an emotional, reverb-heavy track with cinematic strings, simple piano and guitar melodies, and synth pad chords. It's relaxing, calming, and passionate.

Dialogue-TTS Task

Instruct Description	Text	UniSonate (Ours)	InstructAudio
This audio features two speakers with distinct vocal characteristics. The first speaker is a young adult female who speaks with a neutral emotional tone. Her delivery is conversational in style, and she speaks Mandarin with a discernible accent. Meanwhile, the second speaker is a young adult male whose speech is also emotionally neutral. He speaks in a casual manner using a standard American accent.	[speaker id 0] 你觉得这个类型跟你平时的行为一致吗？[speaker id 1] Well, that fits nicely.
This audio features two distinct speakers. The first speaker is a young adult female. Her speech conveys a sad emotion and is delivered in a hesitant manner. She speaks with a discernible Mandarin accent. The second speaker is a young adult male who maintains a neutral emotional tone throughout his delivery. His style of speech is notably clear, and he uses a standard American accent.	[speaker id 0] 我曾经试图放下这些，但每次还是不由自主的比较，心里感觉自己永远不够好。[speaker id 1] This is really the paradox of modern society.
This audio features two distinct speakers. The first speaker is a young adult female. Her delivery is measured in style, and she speaks with a calm emotional tone. A discernible Mandarin accent is present in her speech. We also hear a young adult male. His emotional tone is neutral throughout, and he articulates his words in a clear style. He speaks with a standard American accent.	[speaker id 0] 毕竟，有些情感是需要时间去沉淀的。[speaker id 1] Yeah, maybe I should.
This audio recording features two distinct speakers. The first speaker is a young adult male. He speaks in a neutral emotional tone and maintains a casual style of delivery. His accent is standard American. We also hear a young adult female. Her speech carries a calm emotional quality and is delivered in a very natural manner. She speaks with a clear Mandarin accent.	[speaker id 0] It's still waiting until the last minute. [speaker id 1] 我也是啊，尤其是任务太多的时候。
This audio features a dialogue between two speakers. The first speaker is a young adult female who speaks with an angry and aggressive tone in a Mandarin accent. In contrast, the second speaker is a young adult male whose delivery is calm and measured, also speaking with a Mandarin accent.	[speaker id 0] 我先坐下了，哪里都有你。[speaker id 1] 我刚才只是起来帮老人让个座。
A young adult woman delivers her lines in an angry, aggressive style, marked by a Mandarin accent. Her speech is countered by a young adult man who maintains a calm and measured tone, also speaking with a Mandarin accent.	[speaker id 0] 根本不顾及我的需求。[speaker id 1] 我知道你不喜欢，但我也没有别的办法。空调一直开，真的很浪费。
The audio clip contains a conversation where a young adult man speaks in a neutral and clear manner with a Mandarin accent. He is followed by a young adult woman who responds with a calm and measured delivery in the same accent.	[speaker id 0] 主动加个好友，至少能让更多人认识你。[speaker id 1] 我宁愿慢慢认识。
A young adult female speaks in a calm and clear tone, using a Mandarin accent. She is joined by a young adult male who contributes with a neutral and clear speaking style, also in Mandarin.	[speaker id 0] 感觉这样的活动本身就很有意义。[speaker id 1] 对，而且学到了很多动手技巧。