Adaptive Text-to-Speech Synthesis With Articulatory Representations
Paper Introduction
We present ArtSpeech, a text-to-speech (TTS) model based on articulatory representations. By revisiting the sound production system, it provides an explainable and effective network for humanlike speech synthesis. Current deep TTS models learn the acoustic-text mapping in a fully parametric manner, ignoring the explicit physical significance of articulatory movement. ArtSpeech, by contrast, leverages articulatory representations to perform adaptive TTS, clearly describing the voice tone and speaking prosody of different speakers. Specifically, energy, F0, and vocal tract variables represent the airflow forced by the articulatory organs, the degree of tension in the vocal folds of the larynx, and the coordinated movements between different organs, respectively. We also design a multidimensional style mapping network to extract speaking styles from the articulatory representations; guided by these styles, variation predictors predict the final mel spectrogram output. To validate the effectiveness of our approach, we conducted comprehensive experiments and analyses on widely recognized speech corpora such as LJSpeech and LibriTTS, yielding promising improvements in similarity between the generated results and the target speaker’s voice and prosody. The code is available at ArtSpeech GitHub Repository.
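To make two of the articulatory quantities named above concrete, here is a minimal sketch of computing frame-level energy and F0 from a waveform. It is an illustration under stated assumptions, not the extraction pipeline used by ArtSpeech: energy is taken as root-mean-square amplitude, F0 is estimated from the autocorrelation peak, and the function names, search range, and synthetic test tone are all hypothetical. Vocal tract variables are not covered here, since they normally require a dedicated estimation model.

```python
import math

def frame_energy(frame):
    """Root-mean-square amplitude of one frame (assumed energy definition)."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def frame_f0(frame, sample_rate, fmin=60.0, fmax=400.0):
    """Estimate F0 in Hz from the autocorrelation peak within [fmin, fmax].

    Returns 0.0 when no positive peak is found (treated as unvoiced).
    """
    lag_min = int(sample_rate / fmax)
    lag_max = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        # Unnormalized autocorrelation at this lag.
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_score, best_lag = score, lag
    return sample_rate / best_lag if best_lag else 0.0

# Synthetic check: a 200 Hz sine with amplitude 0.5, sampled at 8 kHz.
sample_rate = 8000
tone = [0.5 * math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(400)]
energy = frame_energy(tone)
f0 = frame_f0(tone, sample_rate)
```

On real speech, a robust pitch tracker would usually replace the plain autocorrelation search, which is kept here only because it is short and dependency-free.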
The condition is that I will be permitted to make Luther talk American, 'streamline' him, so to speak, because you will never get people, whether in or outside the Lutheran Church, actually to read Luther unless we make him talk as he would talk today to Americans.
"I think I must show you my Patchwork Girl," said Margolotte, laughing at the boy's astonishment, "for she is rather difficult to explain."
I liked Naomi Colebrook at first sight; liked her pleasant smile; liked her hearty shake of the hand when we were presented to each other.
Wylder was laughing rather redly, with the upper part of his face very surly, I thought.
All that I am doing is to use its logical tenability as a help in the analysis of what occurs when we remember.
ArtSpeech employs a variety of independent articulatory style vectors to decouple the speaking styles of the target speaker. This enables the creation of unique speaking styles, distinct from the reference audio, by freely combining or interpolating these style vectors.
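The combination and interpolation of style vectors described above can be sketched as follows. This is a minimal illustration, assuming each speaker's style is a dictionary of independent vectors, one per articulatory component; the component names (`energy`, `f0`, `tract`), the vector sizes, and the `mix_styles` helper are hypothetical and are not taken from the ArtSpeech code.

```python
from typing import Dict, List

Style = Dict[str, List[float]]

def lerp(a: List[float], b: List[float], t: float) -> List[float]:
    """Linear interpolation between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("style vectors must have the same length")
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]

def mix_styles(style_a: Style, style_b: Style, weights: Dict[str, float]) -> Style:
    """Blend two styles component by component.

    A weight of 0 keeps speaker A's vector, 1 takes speaker B's vector,
    and values in between interpolate. Missing weights default to 0.
    """
    return {
        name: lerp(style_a[name], style_b[name], weights.get(name, 0.0))
        for name in style_a
    }

# Hypothetical two-dimensional style vectors for two reference speakers.
speaker_a = {"energy": [0.0, 2.0], "f0": [1.0, 1.0], "tract": [4.0, 0.0]}
speaker_b = {"energy": [2.0, 0.0], "f0": [3.0, 5.0], "tract": [0.0, 4.0]}

# Keep A's energy, take B's F0 style, and average the vocal tract styles.
mixed = mix_styles(speaker_a, speaker_b, {"f0": 1.0, "tract": 0.5})
```

Because each component has its own weight, the resulting style need not match either reference recording, which is the sense in which decoupled vectors allow new speaking styles.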
The model is fine-tuned on a small amount of speech data from the target speaker to achieve high-similarity adaptive synthesis on in-the-wild data.
Joseph Robinette Biden
Reference Speech:
Synthesized Speech:
"I know; but that renders your uncle a most agreeable companion and gossip," declared Dr Pipt.
The impressions of footsteps were numerous, but they all appeared like those of men who had wandered about the spot, without any design to quit it.
My friend did not appear to be depressed by his failure, but shrugged his shoulders in half humorous resignation.
Donald John Trump
Reference Speech:
Synthesized Speech:
From the under surface of the clouds there are continual emissions of lurid light; electric matter is in continual evolution from their component molecules; the gaseous elements of the air need to be slaked with moisture; for innumerable columns of water rush upwards into the air and fall back again in white foam.
She held her tongue, but from that time she told everybody that I was an impostor.
"I will show you what a good job I did," and she went to a tall cupboard and threw open the doors.