SecoustiCodec: Cross-Modal Aligned Streaming Single-Codebook Speech Codec

Chunyu Qiang, Haoyu Wang, Cheng Gong, Tianrui Wang, Ruibo Fu, Tao Wang, Ruilong Chen, Jiangyan Yi, Zhengqi Wen, Chen Zhang, Longbiao Wang, Jianwu Dang, Jianhua Tao

Tianjin University, Tianjin, China
Kuaishou Technology, Beijing, China
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Tsinghua University, Beijing, China
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Guangdong, China

Abstract

Speech codecs serve as a crucial bridge in unifying speech and text language models. Existing codec methods face several challenges in semantic encoding, such as residual paralinguistic information (e.g., timbre, emotion), insufficient semantic completeness, limited reconstruction capability, and lack of support for streaming. To address these challenges, we propose SecoustiCodec, a cross-modal aligned low-bitrate streaming speech codec that disentangles semantic and paralinguistic information in a single-codebook space. To ensure semantic completeness and reconstruction fidelity, paralinguistic encoding is introduced to bridge the information gap between semantic and acoustic encoding. A semantic-only efficient quantization method based on VAE (Variational Autoencoder) and FSQ (Finite Scalar Quantization) is proposed. This approach alleviates the long-tail distribution problem of tokens while maintaining high codebook utilization. A semantic disentanglement method based on contrastive learning is proposed, which aligns text and speech in a joint multimodal frame-level space, effectively removing paralinguistic information from semantic encoding. An acoustic-constrained multi-stage optimization strategy is proposed to ensure robust and stable convergence. As shown in Figure 1, SecoustiCodec achieves state-of-the-art (SOTA) reconstruction quality (PESQ of 1.77/2.58) at 0.27/1 kbps. We have open-sourced SecoustiCodec's demo, code, and model weights.
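The VAE-FSQ quantizer mentioned above can be illustrated with a minimal sketch. The code below is not the paper's implementation; the function names, the tanh bounding, and the per-dimension level counts are illustrative assumptions about how FSQ typically works: bound each latent dimension, round it to a small number of uniform levels, and read the resulting grid point as a single token index from an implicit codebook.

```python
import numpy as np

def fsq_quantize(z, levels):
    """FSQ sketch: bound each latent dim with tanh, then round to one of
    `levels[i]` uniform values per dimension (odd level counts assumed
    for simplicity). Returns values rescaled into [-1, 1]."""
    z = np.asarray(z, dtype=np.float64)
    half = (np.asarray(levels, dtype=np.float64) - 1.0) / 2.0
    bounded = np.tanh(z) * half          # each dim now in (-half, half)
    return np.round(bounded) / half      # nearest grid point, in [-1, 1]

def fsq_code_index(q, levels):
    """Map a quantized vector to a single integer token (implicit codebook
    of size prod(levels)) via a mixed-radix encoding of per-dim levels."""
    half = (np.asarray(levels, dtype=np.float64) - 1.0) / 2.0
    digits = np.round(np.asarray(q) * half + half).astype(int)
    index, base = 0, 1
    for d, n_levels in zip(digits, levels):
        index += int(d) * base
        base *= int(n_levels)
    return index
```

With `levels = [3, 3]` the implicit codebook has 3 × 3 = 9 entries, and every latent vector maps to exactly one of the indices 0–8, which is the single-codebook property the paper relies on.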

Links: arXiv Paper · GitHub Repository · Hugging Face

Experiment

This page shows the samples in the paper "SecoustiCodec: Cross-Modal Aligned Streaming Single-Codebook Speech Codec".

For the labeled text-speech paired data, we integrate our internal dataset with the AISHELL-3 dataset and the LibriTTS dataset, resulting in a combined total of 1,000 hours of recordings from 3,000 speakers.

None of the test samples or speakers appear in the training or validation sets.

To ensure fairness in all objective metric evaluations, and because the comparison methods use different sampling rates, the audio generated by SecoustiCodec and all comparison models is downsampled to 16 kHz.

Model Architecture

SecoustiCodec includes three modeling processes: (a) Acoustic Modeling, (b) Semantic Modeling, and (c) Paralinguistic Modeling. Modules outlined in red operate in a streaming manner, while those in blue are non-streaming. Phoneme embeddings (P) are extracted from text, and target semantic embeddings (S), acoustic embeddings (A), and paralinguistic embeddings (G) are extracted from speech. (P) and (S) are used to construct the token-acoustic contrastive loss, which learns frame-level (dis)similarity between a batch of speech and text pairs. During inference, the Acoustic Projection is not required; instead, the semantic and paralinguistic embeddings are used to predict the acoustic embedding. The mean values (μ and μ̂) from the VAE structure are directly used as inputs during inference, bypassing stochastic sampling.
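The token-acoustic contrastive loss described in the caption can be sketched as a frame-level InfoNCE objective. This is an illustrative sketch, not the paper's code: it assumes the phoneme-side and speech-side embeddings are already time-aligned frame by frame, and the temperature `tau` is an assumed hyperparameter. Matching frames act as positives; all other frames act as negatives, in both the text-to-speech and speech-to-text directions.

```python
import numpy as np

def token_acoustic_contrastive_loss(P, S, tau=0.07):
    """Frame-level InfoNCE sketch. P: (T, D) phoneme embeddings,
    S: (T, D) speech embeddings, assumed aligned so row i of P matches
    row i of S. Returns the symmetric cross-entropy over similarities."""
    P = P / np.linalg.norm(P, axis=1, keepdims=True)   # L2-normalize rows
    S = S / np.linalg.norm(S, axis=1, keepdims=True)
    logits = (P @ S.T) / tau                           # (T, T) similarities

    def xent_diag(lg):
        # cross-entropy with the diagonal (matching frame) as the target
        lg = lg - lg.max(axis=1, keepdims=True)        # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # symmetric: text -> speech and speech -> text
    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))
```

When the two modalities agree frame by frame the loss is near zero; misaligned pairs are penalized, which is what pushes paralinguistic information out of the semantic encoding.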

Inference

Inference Process
As illustrated in (a), speech reconstruction from semantic encoding uses semantic embeddings that are discrete values from a single codebook. Notably, in TTS or voice-based dialogue tasks, the target speaker's timbre is often fixed. Therefore, predefined speech segments are used to extract paralinguistic embeddings, ensuring consistency in the synthesized speech's paralinguistic information. In the speech reconstruction task, a segment from the middle of the input speech is used to extract the paralinguistic embedding. Both semantic and paralinguistic encoding utilize a VAE structure, where the means (μ and μ̂) are directly used as inputs during inference, bypassing stochastic sampling.
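The "bypassing stochastic sampling" step can be sketched as follows. This is a minimal sketch of the standard VAE reparameterization trick, not the released code: sample μ + σ·ε during training, but return the mean μ directly at inference so the encoding is deterministic.

```python
import numpy as np

def vae_latent(mu, logvar, training, rng=None):
    """VAE latent sketch: reparameterized sampling during training,
    deterministic mean at inference (stochastic sampling bypassed)."""
    mu = np.asarray(mu, dtype=np.float64)
    if not training:
        return mu                      # inference: use the mean directly
    if rng is None:
        rng = np.random.default_rng(0)
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * np.asarray(logvar)) * eps
```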

Speech Reconstruction Task

Columns: Text | Ground Truth | SecoustiCodec (1 kbps) | SecoustiCodec (0.27 kbps) | WavTokenizer (0.5 kbps) | TAAE (0.6 kbps) | VQ-CTAP (0.33 kbps) | BigCodec (1 kbps) | Encodec (1.5 kbps) | FACodec (1.6 kbps) | SpeechTokenizer (1 kbps) | MimiCodec (0.55 kbps)
无法对此作出评论,客人有事尽管吩咐!
北约外长磋商波黑局势,他也一跃成为千万富翁。
钱要花在刀刃儿上,月光般柔和的羽毛。
鳄鱼屿恰恰位处翔安琼头。
法瑞尔与新欢艾格尼斯,翁凯兰防乳腺癌宣传照。
鹅蛋找鸭蛋比大小,鸭蛋说:我不和你比,我跟鸡蛋比。
无法对此作出评论,客人有事尽管吩咐!
以为中餐就是酸甜肉左宗棠鸡烤鸭馄饨炒面饺子之类的。
大雨旁沱,电闪雷呜,令人耳聩目弦。
有蘑菇炒青菜、油豆腐烧肉、回锅肉、丝瓜文葛汤、酸菜鱼。
以为中餐就是甜酸肉左宗棠鸡烤鸭馄炖炒面饺子之类的。
这是老师第一次表扬我,我非常地开心。
烧花鸭,烧雏鸡儿,这城市交通有问题。
唉你怎么越长越窝囊。
几天用一张,用完就扔,高压铁塔下的低矮棚屋。
法院准许合同解除,欢声笑语洒满村庄。
几天用一张,用完就扔,高压铁塔下的低矮棚屋。
熊熊火焰从教堂两座钟楼间窜出,塔尖随后轰然倒塌。
为了试验那个号称离子等距喷雾柔顺衣料和高频紫外线杀菌的功能。
卡迪姿,美丽爱上你!
追悔莫及,悔之晚矣。
"And yet a third time he sighs," said Smee. Then at last he spoke passionately. "The game's up," he cried, "those boys have found a mother." Affrighted though she was, Wendy swelled with pride. "O evil day!" cried Starkey. "What's a mother?" asked the ignorant Smee.
An ear as big as a man! I looked still more attentively—and actually there did move under the ear something that was pitiably small and poor and slim.
We entered together; Catherine was there, making herself useful in preparing some vegetables for the approaching meal. She looked more sulky and less spirited than when I had seen her first. She hardly raised her eyes to notice me, and continued her employment with the same disregard to common forms of politeness as before; never returning my bow and good morning by the slightest acknowledgment. "She does not seem so amiable," I thought, "as Mrs. Dean would persuade me to believe."
But Sir Hugh was altogether of a different opinion. Though he had already asked his sister-in-law to Clavering, when the idea had first come up, he was glad that she had declined the visit. Her coming might be very well if she accepted Archie; but he did not want to be troubled with any renewal of his responsibility respecting her if, as was more probable, she should reject him. The world still looked askance at Lady Ongar, and Hugh did not wish to take up the armor of a paladin in her favor.
He was thankful, at any rate, that Louis in this two years' interval had finally transferred his heart elsewhere. I told you.
For since natural affection rests upon natural unity, the angel naturally loves less what is less one with him. Consequently, he loves more what is numerically one with himself than what is one only generically or specifically. But it is natural for him to have a like love for another as for himself, in this respect, that as he loves self in wishing well to self, so he loves another in wishing well to him.
Frederick did love her then—he must love her, or why had he come? Her thoughts wouldn't go on. Her mind stammered. She couldn't think. She could only see and feel. She didn't know how it had happened. It was a miracle.

Voice Conversion (VC) Demos

Source Speech
Text: 她把鞋子拎在手上光着脚丫故意踩在水洼里。
Prompt (Target Speaker) SecoustiCodec
Source Speech
Text: 我身上分文没有
Prompt (Target Speaker) SecoustiCodec
Source Speech
Text: 她把他那件整洁的上装的衣扣统统扣上
Prompt (Target Speaker) SecoustiCodec

Codebook Utilization

VAE-FSQ achieves the highest codebook utilization, with nearly all tokens activated, significantly surpassing VQ-VAE and SimVQ. Additionally, VAE-FSQ minimizes the long-tail effect, with most token frequencies below 0.2%, which aids language model training in TTS and voice dialogue tasks.
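The utilization and long-tail statistics quoted above can be computed from a stream of token IDs as in this sketch (the 0.2% threshold comes from the figure's description; the function name and the rest are illustrative):

```python
from collections import Counter

def codebook_stats(tokens, codebook_size, tail_threshold=0.002):
    """Return (utilization, n_heavy): the fraction of codebook entries
    that are ever used, and the number of tokens whose relative usage
    frequency exceeds `tail_threshold` (0.2% by default)."""
    counts = Counter(tokens)
    utilization = len(counts) / codebook_size
    total = len(tokens)
    n_heavy = sum(1 for c in counts.values() if c / total > tail_threshold)
    return utilization, n_heavy
```

High utilization with few tokens above the threshold indicates a flat usage distribution, which is the property that eases downstream language-model training.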

Spectrograms, F0, and Energy of Synthesized Speech

Spectrograms, F0, and energy of speech generated by SecoustiCodec using the same semantic encoding combined with paralinguistic embeddings from different speakers (equivalent to a voice conversion task). As shown in the figure, the synthesized speech aligns closely with the paralinguistic reference speech across the frequency, F0, and energy dimensions. This is attributed to the disentanglement of paralinguistic information from the semantic encoding and to the paralinguistic modeling capability of SecoustiCodec's paralinguistic encoder.