ArtSpeech

Paper Introduction

We devise ArtSpeech, an articulatory representation-based text-to-speech (TTS) model: an explainable and effective network for humanlike speech synthesis that revisits the sound production system. Current deep TTS models learn the acoustic-text mapping in a fully parametric manner, ignoring the explicit physical significance of articulatory movement. ArtSpeech, by contrast, leverages articulatory representations to perform adaptive TTS, explicitly describing the voice tone and speaking prosody of different speakers. Specifically, energy, F0, and vocal tract variables (TVs) represent the airflow forced by the articulatory organs, the degree of tension in the vocal folds of the larynx, and the coordinated movements between different organs, respectively. We also design a multidimensional style mapping network that extracts speaking styles from the articulatory representations; guided by these styles, variation predictors predict the final mel spectrogram output. To validate the effectiveness of our approach, we conducted comprehensive experiments and analyses on widely recognized speech corpora such as LJSpeech and LibriTTS, which show a promising improvement in similarity between the generated speech and the target speaker’s voice and prosody. The code is available at ArtSpeech GitHub Repository.

Contents

Single Speaker (LJSpeech)
Zero-Shot Speaker Adaptation (LibriTTS)
Style Combination and Interpolation
Ablation Study
Celebrity Speech

Single Speaker (LJSpeech)

Text	Ground Truth	ArtSpeech	VITS	StyleTTS	StyleTTS2	FastSpeech2
On each lobe of the bi-lobed leaf of Venus flytrap.
Refuted by abundant evidence, and having no foundation whatever in truth.
Who had been greatly upset by her experience, was able to view a lineup of four men handcuffed together at the police station.

Zero-Shot Speaker Adaptation (LibriTTS)

Text	Speaker Prompt	Ground Truth	ArtSpeech	StyleTTS	StyleTTS2	VALL-E-X	YourTTS
The condition is that I will be permitted to make Luther talk American, 'streamline' him, so to speak-because you will never get people, whether in or outside the Lutheran Church, actually to read Luther unless we make him talk as he would talk today to Americans.
"I think I must show you my Patchwork Girl," said Margolotte, laughing at the boy's astonishment, "for she is rather difficult to explain."
I liked Naomi Colebrook at first sight; liked her pleasant smile; liked her hearty shake of the hand when we were presented to each other.
Wylder was laughing rather redly, with the upper part of his face very surly, I thought.
All that I am doing is to use its logical tenability as a help in the analysis of what occurs when we remember.

Style Combination and Interpolation

ArtSpeech employs a variety of independent articulatory style vectors to decouple the speaking styles of the target speaker. This enables the creation of unique speaking styles, distinct from the reference audio, by freely combining or interpolating these style vectors.
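As a rough illustration of what combining and interpolating independent style vectors can look like, here is a minimal sketch in plain Python. It is not the ArtSpeech implementation: the `StyleSet` container, its field names, the helper functions, and the choice of linear interpolation are all assumptions made for this example, and the toy vectors are invented values rather than model outputs.

```python
from dataclasses import dataclass, replace
from typing import Iterable, List

Vector = List[float]


@dataclass(frozen=True)
class StyleSet:
    """Hypothetical container: one independent style vector per factor."""
    pitch: Vector
    energy: Vector
    mel_tv: Vector  # mel-spectrogram and vocal tract variable (TV) style


def lerp(a: Vector, b: Vector, t: float) -> Vector:
    """Linear interpolation (an assumption): t=0 gives a, t=1 gives b."""
    if len(a) != len(b):
        raise ValueError("style vectors must have the same length")
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def combine(base: StyleSet, donor: StyleSet, fields: Iterable[str]) -> StyleSet:
    """Take the named style vectors from donor and the rest from base."""
    return replace(base, **{name: getattr(donor, name) for name in fields})


def interpolate(a: StyleSet, b: StyleSet, t: float) -> StyleSet:
    """Blend every style vector; t is the fraction of speaker B."""
    return StyleSet(
        pitch=lerp(a.pitch, b.pitch, t),
        energy=lerp(a.energy, b.energy, t),
        mel_tv=lerp(a.mel_tv, b.mel_tv, t),
    )


# Toy two-dimensional style vectors (illustrative values only).
speaker_a = StyleSet(pitch=[1.0, 0.0], energy=[0.0, 2.0], mel_tv=[4.0, 4.0])
speaker_b = StyleSet(pitch=[3.0, 2.0], energy=[2.0, 0.0], mel_tv=[0.0, 8.0])

# "Pitch Style of A + Remaining Style of B"
pitch_a_rest_b = combine(speaker_b, speaker_a, ["pitch"])

# 50% of speaker B
halfway = interpolate(speaker_a, speaker_b, 0.5)
```

Because each factor is kept in its own vector, swapping one field corresponds to the "Style Combination" samples below, while sweeping `t` from 0.0 to 1.0 corresponds to the "Style Interpolation" samples.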

Example 1

Speaker A (reference)	Speaker A (synthesized)	Speaker B (reference)	Speaker B (synthesized)

Style Combination

Pitch Style of A + Remaining Style of B	Energy Style of A + Remaining Style of B	Mel and TVs Style of A + Remaining Style of B

Style Interpolation

The percentage of speaker B	0%	10%	20%	30%	40%
Synthesized speech

50%	60%	70%	80%	90%	100%

Example 2

Speaker A (reference)	Speaker A (synthesized)	Speaker B (reference)	Speaker B (synthesized)

Style Combination

Pitch Style of A + Remaining Style of B	Energy Style of A + Remaining Style of B	Mel and TVs Style of A + Remaining Style of B

Style Interpolation

The percentage of speaker B	0%	10%	20%	30%	40%
Synthesized speech

50%	60%	70%	80%	90%	100%

Ablation Study

Text	ArtSpeech	w/o multi-style	w/o articulatory encoder	w/o duration encoder	w/o training step 4	w/o TVs	w/o reg loss
Later on he had devoted himself to the personal investigation of the prisons of the United States.
The preference given to the Pentonville system destroyed all hopes of a complete reformation of Newgate.
He was a tall, slender man, with a long face and iron-gray hair.

Celebrity Speech

We fine-tune the model on a small amount of speech data from the target speaker to achieve high-similarity adaptive synthesis on in-the-wild data.

Joseph Robinette Biden

Reference Speech:

Synthesized Speech:

"I know; but that renders your uncle a most agreeable companion and gossip," declared Dr Pipt.	The impressions of footsteps were numerous, but they all appeared like those of men who had wandered about the spot, without any design to quit it.	My friend did not appear to be depressed by his failure, but shrugged his shoulders in half humorous resignation.

Donald John Trump

Reference Speech:

Synthesized Speech:

From the under surface of the clouds there are continual emissions of lurid light; electric matter is in continual evolution from their component molecules; the gaseous elements of the air need to be slaked with moisture; for innumerable columns of water rush upwards into the air and fall back again in white foam.	She held her tongue, but from that time she told everybody that I was an impostor.	"I will show you what a good job I did," and she went to a tall cupboard and threw open the doors.