Skip to main content
Login | Suomeksi | På svenska | In English

Browsing by Author "Törö, Tuukka"

Sort by: Order: Results:

  • Törö, Tuukka (2022)
    In recent years, advances in deep learning have made it possible to develop neural speech synthesizers that not only generate near natural speech but also enable us to control its acoustic features. This means it is possible to synthesize expressive speech with different speaking styles that fit a given context. One way to achieve this control is by adding a reference encoder on the synthesizer that works as a bottleneck modeling a prosody related latent space. The aim of this study was to analyze how the latent space of a reference encoder models diverse and realistic speaking styles, and what correlation there is between the phonetic features of encoded utterances and their latent space representations. Another aim was to analyze how the synthesizer output could be controlled in terms of speaking styles. The model used in the study was a Tacotron 2 speech synthesizer with a reference encoder that was trained with read speech uttered in various styles by one female speaker. The latent space was analyzed with principal component analysis on the reference encoder outputs for all of the utterances in order to extract salient features that differentiate the styles. Basing on the assumption that there are acoustic correlates to speaking styles, a possible connection between the principal components and measured acoustic features of the encoded utterances was investigated. For the synthesizer output, two evaluations were conducted: an objective evaluation assessing acoustic features and a subjective evaluation assessing appropriateness of synthesized speech in regard to the uttered sentence. The results showed that the reference encoder modeled stylistic differences well, but the styles were complex with major internal variation within the styles. The principal component analysis disentangled the acoustic features somewhat and a statistical analysis showed a correlation between the latent space and prosodic features. The objective evaluation suggested that the synthesizer did not produce all of the acoustic features of the styles, but the subjective evaluation showed that it did enough to affect judgments of appropriateness, i.e., speech synthesized in an informal style was deemed more appropriate than formal style for informal style sentences and vice versa.