InstructAudio: Unified speech and music generation with natural language instruction

Abstract

Text-to-speech (TTS) and text-to-music (TTM) models face significant limitations in instruction-based control. TTS systems usually depend on reference audio for timbre, offer only limited text-level attribute control, and rarely support dialogue generation. TTM systems are constrained by input conditioning that depends on expert-annotated knowledge. The high heterogeneity of these control inputs makes TTM difficult to model jointly with speech synthesis. Although the two tasks share common acoustic modeling characteristics, they have long been developed independently, leaving open the challenge of unifying them through natural language instructions. We introduce InstructAudio, a unified framework that enables instruction-based control (via natural language descriptions) of acoustic attributes, including timbre (gender, age), paralinguistic attributes (emotion, style, accent), and musical attributes (genre, instrument, rhythm, atmosphere). It supports expressive speech, music, and dialogue generation in English and Chinese. The model employs joint and single diffusion transformer layers with a standardized instruction-phoneme input format and is trained on 50K hours of speech and 20K hours of music data, enabling multi-task learning and cross-modal alignment. Fig. 1 compares performance with mainstream TTS and TTM models, showing that InstructAudio achieves the best results on most metrics. To the best of our knowledge, InstructAudio is the first instruction-controlled framework unifying speech and music generation.

Experiment

This page shows the samples in the paper "InstructAudio: Unified speech and music generation with natural language instruction".

None of the test samples or speakers appear in the training or validation sets.

During inference, we employ a straightforward yet effective method to estimate total duration. Since speech rate strongly correlates with both age and emotion, we compute the average duration per phoneme for different age-emotion combinations (e.g., youth-happy) from the training set. At inference time, we perform keyword matching on the instruction text and calculate the total duration by multiplying the number of phonemes by the corresponding average duration. When no keywords match, we default to the global average phoneme duration.
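The duration heuristic described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the keyword lists, the label names, and every numeric value are hypothetical placeholders, since the page does not publish the actual statistics computed from the training set.

```python
import re

# Hypothetical average seconds per phoneme for (age, emotion) pairs.
# Real values would be computed from the training set.
AVG_PHONEME_DURATION = {
    ("youth", "happy"): 0.075,
    ("elderly", "neutral"): 0.110,
}
# Hypothetical global average, used when no pair matches.
GLOBAL_AVG_PHONEME_DURATION = 0.090

# Hypothetical keyword lists used to detect age and emotion.
AGE_KEYWORDS = {
    "youth": {"young", "youth"},
    "elderly": {"elderly", "old"},
}
EMOTION_KEYWORDS = {
    "happy": {"happy", "joyful"},
    "neutral": {"neutral"},
}


def match_label(instruction, keyword_table):
    """Return the first label whose keywords occur as whole words."""
    words = set(re.findall(r"[a-z]+", instruction.lower()))
    for label, keywords in keyword_table.items():
        if words & keywords:
            return label
    return None


def estimate_total_duration(instruction, num_phonemes):
    """Estimate total duration in seconds from an instruction and a phoneme count."""
    age = match_label(instruction, AGE_KEYWORDS)
    emotion = match_label(instruction, EMOTION_KEYWORDS)
    # Fall back to the global average when the pair is unknown.
    per_phoneme = AVG_PHONEME_DURATION.get(
        (age, emotion), GLOBAL_AVG_PHONEME_DURATION
    )
    return num_phonemes * per_phoneme
```

With these placeholder numbers, an instruction mentioning "young" and "happy" uses the youth-happy average, while an instruction with no matching keywords falls back to the global average.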

CosyVoice2 lacks text-based control of timbre attributes (gender, age). During inference, we therefore provide reference audio that matches the timbre description and map the instruction to its supported short-form control text (e.g., "Please speak very happy").

Note that DiffRhythm+ does not support music synthesis under 90 seconds; therefore, we generate longer sequences and truncate them to create test samples, which may introduce evaluation bias.
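The truncation step mentioned above amounts to keeping a prefix of the generated waveform. The sketch below assumes a mono waveform stored as a Python sequence of samples; the function name, the sample rate, and the target length are illustrative assumptions, as the page does not state the values used.

```python
def truncate_to_duration(samples, sample_rate, target_seconds):
    """Keep only the first target_seconds of a mono waveform.

    samples: sequence of audio samples (hypothetical representation).
    sample_rate: samples per second, e.g. 44100 (assumed, not from the page).
    target_seconds: desired clip length in seconds.
    """
    max_samples = int(round(sample_rate * target_seconds))
    # Slicing leaves inputs shorter than the target unchanged.
    return samples[:max_samples]
```

Because the clip is cut from a longer generation, it may end mid-phrase, which is one way the evaluation bias noted above could arise.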

Comparing Model Capabilities

Model Architecture

TTS (Text-to-Speech) Task

Instruct Description	Text	InstructAudio	CosyVoice2
A young adult female speaks with a neutral emotion. Her delivery is clear and she has a Mandarin accent.	李大钊在敌人的法庭上慷慨陈词毫不畏惧。
The speaker is an elderly woman who uses a measured, neutral tone of voice with an American accent.	I have come to understand. That you were the best daughter you could be.
You'll hear a young adult woman speaking Mandarin in a happy and casual manner.	你写板报的速度可真快。
With a playful style and a happy emotion, a middle-aged American woman is speaking.	There it is. My baby.
An aggressive and angry tone characterizes this audio from a young adult female speaker with a Mandarin accent.	你这种慷他人之慨的做法，让人嗤之以鼻。
A middle-aged woman speaks angrily and aggressively, marked by an American accent.	You stole something from me like a petty little thief.
The audio features a young adult female who sounds sad; she speaks softly with a Mandarin accent.	我的故事过程很美，而结局却满是悲伤。
This recording contains a middle-aged man speaking American English. His emotional state is sad, but his delivery is aggressive.	He didn't die from an asthma attack. He died because she killed him here.
In this audio, we hear a female child speaking with a curious tone and a playful style, using a Mandarin accent.	大人还玩跷跷板？
The speaker is an elderly male who speaks seriously and authoritatively in a Mandarin accent.	我的练习本怎么突然没了？
A female child expresses curiosity in a playful manner, and her speech is delivered with a Mandarin accent.	因此官方此次对 E二升级版融入了更多实用型的功能。
The audio features a single speaker, identified as speaker zero, who is an elderly woman. She speaks in a neutral tone with a measured, deliberate style and has an American accent.	I have come to understand. That you were the best daughter you could be.
In this audio, a middle-aged male speaker can be heard. His tone is happy and his casual speaking style is delivered with an American accent.	What? That's good. Well good luck with that. This is hilarious.

TTM (Text-to-Music) Task

Instruct Description	Lyrics	InstructAudio	DiffRhythm+	ACE-Step
This audio features a male singer in his 20s performing a pop song. His soft, melodic tone carries a touch of melancholy and introspection. The instrumentation includes piano, synthesizer, and electric guitar, all set to a moderate tempo of 120 BPM in the key of C major. Together, these elements create a dreamy and reflective atmosphere, evoking a sense of nostalgia and longing.	紧紧拥抱唯一的你无可救药的坚定。
A female vocalist in her late teens or early twenties performs this electronic pop piece with a soft and melodic delivery. The track incorporates synthesizers, drum machines, and a female vocal sample, driven by a moderate 85 BPM rhythm in A minor. A repetitive and catchy synth riff anchors the melody, contributing to a dreamy, nostalgic mood that feels both introspective and gently melancholic, with subtle warmth and sweetness.	Like you to see the things I hide.
With a smooth and melodic vocal tone, a male singer in his 30s performs this pop song. The arrangement features piano and strings accompanying a moderate 120 BPM tempo in F major. The overall mood is reflective and dreamy, conveying a strong sense of nostalgia and longing through its gentle melody.	他们已几万年好像天经地义不容改变。
This electronic pop track is sung by a female vocalist in her twenties, whose tone is both smooth and melodic. Built around synthesizers, a drum machine, and a female vocal sample, the music moves at 100 BPM in the key of E♭ major. The result is a dreamy, ethereal soundscape that feels tranquil and introspective, like floating through a serene, otherworldly environment filled with warmth and emotion.	I was your treasure treasure treasure I was your treasure.
A vibrant and lively pop track is brought to life by a female singer in her late teens to early twenties, whose bright and energetic tone radiates youthful excitement. The instrumentation, featuring a synthesizer, drum machine, and bass guitar, creates a modern and upbeat sound perfect for dancing. This catchy atmosphere of joy and celebration is propelled by a tempo of 128 BPM in the key of D minor.	Wild child Hey hey hey hey Gonna be your one only one only one wild child Hey hey hey hey Wild child.
With a smooth and melodic voice, a male vocalist in his 20s performs this warm and inviting pop song. The music combines a piano, synthesizer, and drum machine to craft a soothing and uplifting atmosphere, reminiscent of a gentle breeze on a sunny day. Set at a moderate 110 BPM in A major, the track is perfect for a relaxing or romantic moment.	Love you love you love you Love you 我的宝贝无法停。
This pop piece evokes a deep sense of nostalgia and melancholy, delivered by a male singer in his 20s with a warm, soothing tone. The instrumentation includes guitar, bass, drums, and a keyboard, all supporting a moderate tempo of around 130 BPM in D minor. The gentle strumming of the guitar and the overall arrangement create a reflective and deeply emotional atmosphere.	真心被摧残心痛像洪水泛滥。
A female vocalist in her twenties performs with a soft and melodic tone in this warm, nostalgic pop song. The gentle strumming of the guitar, combined with the bass and drum, evokes a sense of longing and reflection, as if the singer is reminiscing about past memories. The overall mood is soothing and intimate, carried by a moderate tempo and the harmonious key of G major.	想来想去真正怨差麦当作你真正性格。
Energetic and passionate vocals from a female singer in her 20s define this optimistic pop track. The music features guitar, bass, and drums, driving forward at 134 BPM in the key of C major. The overall feeling is one of youthful energy and urgency, much like a warm summer breeze filled with excitement.	My head is blown away away away away away.
Smooth, melodic vocals from a female singer in her twenties guide this electronic pop track on a dreamy, ethereal journey. The instrumentation, built on synthesizers, a drum machine, and a female vocal sample, creates a sense of floating through a futuristic soundscape at 100 BPM in E♭ major. The overall atmosphere is one of tranquility and introspection, feeling both weightless and emotionally warm.	I was your treasure treasure treasure I was your treasure.

Dialogue-TTS Task

Instruct Description	Text	InstructAudio
This audio features two speakers with distinct vocal characteristics. The first speaker is a young adult female who speaks with a neutral emotional tone. Her delivery is conversational in style, and she speaks Mandarin with a discernible accent. Meanwhile, the second speaker is a young adult male whose speech is also emotionally neutral. He speaks in a casual manner using a standard American accent.	[speaker id 0] 你觉得这个类型跟你平时的行为一致吗？[speaker id 1] Well, that fits nicely.
This audio features two distinct speakers. The first speaker is a young adult female. Her speech conveys a sad emotion and is delivered in a hesitant manner. She speaks with a discernible Mandarin accent. The second speaker is a young adult male who maintains a neutral emotional tone throughout his delivery. His style of speech is notably clear, and he uses a standard American accent.	[speaker id 0] 我曾经试图放下这些，但每次还是不由自主的比较，心里感觉自己永远不够好。[speaker id 1] This is really the paradox of modern society.
This audio features two distinct speakers. The first speaker is a young adult female. Her delivery is measured in style, and she speaks with a calm emotional tone. A discernible Mandarin accent is present in her speech. We also hear a young adult male. His emotional tone is neutral throughout, and he articulates his words in a clear style. He speaks with a standard American accent.	[speaker id 0] 毕竟，有些情感是需要时间去沉淀的。[speaker id 1] Yeah, maybe I should.
This audio recording features two distinct speakers. The first speaker is a young adult male. He speaks in a neutral emotional tone and maintains a casual style of delivery. His accent is standard American. We also hear a young adult female. Her speech carries a calm emotional quality and is delivered in a very natural manner. She speaks with a clear Mandarin accent.	[speaker id 0] It's still waiting until the last minute. [speaker id 1] 我也是啊，尤其是任务太多的时候。
This audio features a dialogue between two speakers. The first speaker is a young adult female who speaks with an angry and aggressive tone in a Mandarin accent. In contrast, the second speaker is a young adult male whose delivery is calm and measured, also speaking with a Mandarin accent.	[speaker id 0] 我先坐下了，哪里都有你。[speaker id 1] 我刚才只是起来帮老人让个座。
A young adult woman delivers her lines in an angry, aggressive style, marked by a Mandarin accent. Her speech is countered by a young adult man who maintains a calm and measured tone, also speaking with a Mandarin accent.	[speaker id 0] 根本不顾及我的需求。[speaker id 1] 我知道你不喜欢，但我也没有别的办法。空调一直开，真的很浪费。
The audio clip contains a conversation where a young adult man speaks in a neutral and clear manner with a Mandarin accent. He is followed by a young adult woman who responds with a calm and measured delivery in the same accent.	[speaker id 0] 主动加个好友，至少能让更多人认识你。[speaker id 1] 我宁愿慢慢认识。
A young adult female speaks in a calm and clear tone, using a Mandarin accent. She is joined by a young adult male who contributes with a neutral and clear speaking style, also in Mandarin.	[speaker id 0] 感觉这样的活动本身就很有意义。[speaker id 1] 对，而且学到了很多动手技巧。
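The dialogue inputs in the table above mark each turn with a `[speaker id N]` tag. As a rough illustration of that text format, the sketch below splits such a string into (speaker, utterance) pairs. The function name and the parsing approach are assumptions for illustration only; the page does not describe how the model's front end processes these tags.

```python
import re

# Matches tags such as "[speaker id 0]" and captures the numeric id.
SPEAKER_TAG = re.compile(r"\[speaker id (\d+)\]")


def split_dialogue(text):
    """Split a tagged dialogue string into (speaker_id, utterance) pairs."""
    # With one capturing group, re.split returns
    # [prefix, id_0, utterance_0, id_1, utterance_1, ...].
    parts = SPEAKER_TAG.split(text)
    turns = []
    for i in range(1, len(parts), 2):
        utterance = parts[i + 1].strip()
        if utterance:  # skip tags that carry no text
            turns.append((int(parts[i]), utterance))
    return turns
```

For example, `split_dialogue("[speaker id 0] 毕竟，有些情感是需要时间去沉淀的。[speaker id 1] Yeah, maybe I should.")` yields two turns, one per speaker, in their original order.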