
Style-Label-Free: Cross-Speaker Style Transfer by Quantized VAE and Speaker-wise Normalization in Speech Synthesis

Abstract

Cross-speaker style transfer in speech synthesis aims to transfer a style from a source speaker to synthesised speech in a target speaker's timbre. Most previous approaches rely on data with style labels, but manually annotated labels are expensive and not always reliable. To address this problem, we propose Style-Label-Free, a cross-speaker style transfer method that transfers style from a source speaker to a target speaker without style labels. First, a reference encoder based on a quantized variational autoencoder (Q-VAE) and a style bottleneck is designed to extract discrete style representations. Second, a speaker-wise batch normalization layer is proposed to reduce source speaker leakage. To improve the style extraction ability of the reference encoder, a style-invariant and contrastive data augmentation method is proposed. Experimental results show that the method outperforms the baseline. We provide a website with audio samples.
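As a rough illustration of the two components named in the abstract, the sketch below shows a nearest-neighbour codebook lookup (in the style of vector-quantized autoencoders) and a per-speaker standardization of style embeddings. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names `quantize` and `speaker_wise_normalize`, the squared-L2 distance, and the use of plain per-speaker mean and variance in place of a trained batch normalization layer are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from statistics import fmean, pstdev


def quantize(vector, codebook):
    """Return (index, entry) of the codebook entry nearest to `vector`.

    Assumption: nearest neighbour under squared L2 distance; the paper's
    Q-VAE may quantize differently.
    """
    def sq_dist(entry):
        return sum((v - e) ** 2 for v, e in zip(vector, entry))

    index = min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))
    return index, codebook[index]


def speaker_wise_normalize(embeddings, speaker_ids, eps=1e-5):
    """Standardize each dimension using statistics of the same speaker only.

    Assumption: a simplified stand-in for speaker-wise batch normalization,
    without learned scale/shift parameters or running statistics.
    """
    # Group row indices by speaker so statistics never mix speakers.
    rows_by_speaker = defaultdict(list)
    for row, speaker in enumerate(speaker_ids):
        rows_by_speaker[speaker].append(row)

    normalized = [None] * len(embeddings)
    for rows in rows_by_speaker.values():
        # Transpose this speaker's rows to get one tuple of values per dimension.
        dims = list(zip(*(embeddings[r] for r in rows)))
        means = [fmean(d) for d in dims]
        stds = [pstdev(d) for d in dims]
        for r in rows:
            normalized[r] = [
                (x - m) / (s ** 2 + eps) ** 0.5
                for x, m, s in zip(embeddings[r], means, stds)
            ]
    return normalized
```

Normalizing within each speaker removes speaker-specific offset and scale from the embeddings, which is one intuitive reading of how such a layer could reduce source speaker leakage.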


Model Architecture

Overall Architecture


Style Transfer

Style 1

Target Speaker Reference Baseline Proposed
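Target text 1 "Then they came to the restaurant called Victory. The restaurant stood at the foot of a stone bridge; its roof was lower than the bridge and overgrown with weeds, which jutted out from the eaves like eyebrows on a face."
Target text 2 "The restaurant appeared to have no door: the door and windows ran together, separated only by two wooden slats, and Xu Sanguan and the others walked in through what should have been a window at the side."
Target text 3 "They sat down at a table by the window. Outside was the small river that ran through the town, and a few green vegetable leaves floated past on its surface."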

Style 2

Target Speaker Reference Baseline Proposed
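Target text 1 "Then they came to the restaurant called Victory. The restaurant stood at the foot of a stone bridge; its roof was lower than the bridge and overgrown with weeds, which jutted out from the eaves like eyebrows on a face."
Target text 2 "The restaurant appeared to have no door: the door and windows ran together, separated only by two wooden slats, and Xu Sanguan and the others walked in through what should have been a window at the side."
Target text 3 "They sat down at a table by the window. Outside was the small river that ran through the town, and a few green vegetable leaves floated past on its surface."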

Style 3

Target Speaker Reference Baseline Proposed
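Target text 1 "Then they came to the restaurant called Victory. The restaurant stood at the foot of a stone bridge; its roof was lower than the bridge and overgrown with weeds, which jutted out from the eaves like eyebrows on a face."
Target text 2 "The restaurant appeared to have no door: the door and windows ran together, separated only by two wooden slats, and Xu Sanguan and the others walked in through what should have been a window at the side."
Target text 3 "They sat down at a table by the window. Outside was the small river that ran through the town, and a few green vegetable leaves floated past on its surface."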

Style 4

Target Speaker Reference Baseline Proposed
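Target text 1 "Then they came to the restaurant called Victory. The restaurant stood at the foot of a stone bridge; its roof was lower than the bridge and overgrown with weeds, which jutted out from the eaves like eyebrows on a face."
Target text 2 "The restaurant appeared to have no door: the door and windows ran together, separated only by two wooden slats, and Xu Sanguan and the others walked in through what should have been a window at the side."
Target text 3 "They sat down at a table by the window. Outside was the small river that ran through the town, and a few green vegetable leaves floated past on its surface."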

Style 5

Target Speaker Reference Baseline Proposed
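Target text 1 "Then they came to the restaurant called Victory. The restaurant stood at the foot of a stone bridge; its roof was lower than the bridge and overgrown with weeds, which jutted out from the eaves like eyebrows on a face."
Target text 2 "The restaurant appeared to have no door: the door and windows ran together, separated only by two wooden slats, and Xu Sanguan and the others walked in through what should have been a window at the side."
Target text 3 "They sat down at a table by the window. Outside was the small river that ran through the town, and a few green vegetable leaves floated past on its surface."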

Single-dimensional control of global style embedding


Target Speaker
Reference
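Target text: "Then they came to the restaurant called Victory. The restaurant stood at the foot of a stone bridge; its roof was lower than the bridge and overgrown with weeds, which jutted out from the eaves like eyebrows on a face."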

Mel-spectrograms: original, dimension 7, dimension 9, dimension 10
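One plausible way to read single-dimensional control is that a single coordinate of the global style embedding is changed while the others are held fixed, and speech is then synthesised from the modified embedding. The sketch below illustrates only that manipulation. It is hedged throughout: the helper `shift_dimension`, the embedding size of 16, the offset of 1.5, and the zero-based indexing of dimensions 7, 9 and 10 are assumptions for illustration and are not stated on this page.

```python
def shift_dimension(embedding, dim, delta):
    """Return a copy of `embedding` with `delta` added to coordinate `dim`.

    Hypothetical helper: the page does not state how a dimension is varied.
    """
    if not 0 <= dim < len(embedding):
        raise IndexError(f"dimension {dim} is out of range for size {len(embedding)}")
    shifted = list(embedding)  # copy so the original embedding is untouched
    shifted[dim] += delta
    return shifted


# Assumed 16-dimensional style embedding taken from a reference utterance.
style = [0.0] * 16

# One modified embedding per controlled dimension; all other coordinates are unchanged.
variants = {dim: shift_dimension(style, dim, 1.5) for dim in (7, 9, 10)}
```

Each entry of `variants` would then replace the original style embedding at synthesis time, so that any audible difference can be attributed to the one coordinate that was changed.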