Browsing by study line "Phonetics"
Now showing items 1-11 of 11
-
(2022)In recent years, advances in deep learning have made it possible to develop neural speech synthesizers that not only generate near-natural speech but also enable us to control its acoustic features. This means it is possible to synthesize expressive speech with different speaking styles that fit a given context. One way to achieve this control is by adding a reference encoder to the synthesizer that works as a bottleneck modeling a prosody-related latent space. The aim of this study was to analyze how the latent space of a reference encoder models diverse and realistic speaking styles, and what correlation there is between the phonetic features of encoded utterances and their latent-space representations. Another aim was to analyze how the synthesizer output could be controlled in terms of speaking styles. The model used in the study was a Tacotron 2 speech synthesizer with a reference encoder that was trained on read speech uttered in various styles by one female speaker. The latent space was analyzed with principal component analysis on the reference encoder outputs for all of the utterances in order to extract salient features that differentiate the styles. Based on the assumption that there are acoustic correlates to speaking styles, a possible connection between the principal components and measured acoustic features of the encoded utterances was investigated. For the synthesizer output, two evaluations were conducted: an objective evaluation assessing acoustic features and a subjective evaluation assessing the appropriateness of synthesized speech with regard to the uttered sentence. The results showed that the reference encoder modeled stylistic differences well, but the styles were complex, with major internal variation within each style. The principal component analysis disentangled the acoustic features somewhat, and a statistical analysis showed a correlation between the latent space and prosodic features. The objective evaluation suggested that the synthesizer did not reproduce all of the acoustic features of the styles, but the subjective evaluation showed that it did so well enough to affect judgments of appropriateness, i.e., speech synthesized in an informal style was deemed more appropriate than the formal style for informal sentences and vice versa.
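A minimal sketch of the latent-space analysis described above, assuming the reference-encoder outputs have been exported as one embedding per utterance and that an acoustic measure (here, mean f0) is available for the same utterances; the file names and the choice of Pearson correlation are illustrative assumptions, not the thesis's exact procedure.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.decomposition import PCA

# Hypothetical inputs: one reference-encoder embedding per utterance,
# plus an acoustic measure computed for the same utterances.
embeddings = np.load("reference_encoder_outputs.npy")  # (n_utterances, embedding_dim)
mean_f0 = np.load("mean_f0.npy")                       # (n_utterances,)

# Project the latent space onto its main axes of variation.
pca = PCA(n_components=4)
scores = pca.fit_transform(embeddings)
print("explained variance ratios:", pca.explained_variance_ratio_)

# Check whether each principal component correlates with the prosodic measure.
for pc in range(scores.shape[1]):
    r, p = pearsonr(scores[:, pc], mean_f0)
    print(f"PC{pc + 1} vs. mean f0: r = {r:.2f}, p = {p:.3f}")
```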
-
(2021)This study is a preliminary study to verify how well a Conditioned Convolutional Variational Autoencoder (CCVAE) learns the prosodic characteristics of the interaction between the Lombard effect and different focus conditions. Lombard speech is an adaptation to ambient noise manifested by rising vocal intensity, fundamental frequency, and duration. Focus marks new propositional information and is signalled by making the focused word more prominent in relation to others. A CCVAE was trained on the f0 contours and speech envelopes of a Lombard speech corpus of Finnish utterances. The model's capability to reconstruct the prosodic characteristics was statistically evaluated based on the bottleneck representations alone. The following questions were addressed: the appropriate size of the bottleneck layer for the task, the ability of the bottleneck representations to capture the prosodic characteristics, and the encoding of the bottleneck representations. The study shows promising results. The method can elicit representations that quantify the prosodic effects of the underlying influences and interactions. The study found that even low-dimensional bottlenecks can conceptualise and consistently typologize the prosodic events of interest. However, finding the optimal bottleneck dimension still needs more research. Subsequently, the model's ability to capture the prosodic characteristics was verified by investigating the generated samples. Based on the results, the CCVAE can capture prosodic events. The quality of the reconstruction is positively correlated with the bottleneck dimension. Finally, the encoding of the bottlenecks was examined. The CCVAE encodes the bottleneck representations similarly regardless of the training instance or the bottleneck dimension. The Lombard effect was captured most efficiently, with the focus conditions second.
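A compact PyTorch sketch of a conditional VAE bottleneck of the kind described, not the thesis model itself: dense layers stand in for the convolutional ones, and the contour length, condition encoding, and bottleneck size are assumptions.

```python
import torch
import torch.nn as nn

class ConditionalVAE(nn.Module):
    """Toy conditional VAE: a fixed-length f0 contour is encoded into a small
    bottleneck, conditioned on a one-hot label (e.g., Lombard/plain, focus type)."""

    def __init__(self, contour_len=100, n_conditions=2, bottleneck_dim=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(contour_len + n_conditions, 64), nn.ReLU())
        self.to_mu = nn.Linear(64, bottleneck_dim)
        self.to_logvar = nn.Linear(64, bottleneck_dim)
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck_dim + n_conditions, 64), nn.ReLU(),
            nn.Linear(64, contour_len))

    def forward(self, x, cond):
        h = self.encoder(torch.cat([x, cond], dim=-1))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        recon = self.decoder(torch.cat([z, cond], dim=-1))
        return recon, mu, logvar

def vae_loss(recon, x, mu, logvar):
    # Reconstruction error plus KL divergence of the bottleneck distribution.
    rec = nn.functional.mse_loss(recon, x)
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kld
```

The `mu` vectors here play the role of the bottleneck representations whose dimensionality and encoding a study like this evaluates.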
-
(2022)People reflexively make their voice production more audible in a noisy environment. This phenomenon is called the Lombard effect. The effect causes the speaker to produce Lombard speech, which has been studied from various perspectives for over a century. The acoustics of Lombard speech are characterized by a raised sound pressure level, a raised fundamental frequency, and changes in the partials and other structures of the speech spectrum. In addition, vowel durations tend to increase in Lombard speech, and hyperarticulation may occur in extreme noise conditions. The communicative aspect of the speaking situation is central to the emergence of the phenomenon. The aim of this thesis was to study speech production in a conversational setting in which one of the interlocutors is exposed to noise and therefore produces Lombard speech while the other communicates in silence, without the direct effects of background noise, and to determine whether the acoustics or intelligibility of the speech differ in such an asymmetric situation compared to a symmetric situation in which both speakers share the same acoustic environment. For the study, two pairs of Finnish-speaking interlocutors (four participants in total, all female) solved sudoku-based tasks in pairs in three background-noise conditions: (1) in silence, (2) with both partners in background noise (symmetric), and (3) with only one partner in background noise (asymmetric). The background noise, played to the participants at a sound pressure level of 75 dB, was cocktail-party noise containing so-called babble, in which several speakers talk over one another. The conversations were recorded, and a total of 453 target syllables were collected from them; the mean sound pressure level was analyzed for all of these, and the mean fundamental frequency for 417 of them. The sound pressure level and fundamental frequency values were normalized, and statistical tests comparing means and variances were applied to the values. As expected, all speakers raised their sound pressure level and fundamental frequency when moving from the quiet conversation condition to the symmetric background-noise condition, in which both conversation partners produced Lombard speech. Participants who were themselves in silence in the asymmetric condition and communicated with a partner in noise raised both their sound pressure level and their fundamental frequency in the asymmetric condition compared to the quiet condition. Moreover, one of these speakers raised both her sound pressure level and her fundamental frequency to nearly the level of her own Lombard speech as measured in the symmetric condition. Speakers who were exposed to noise in the asymmetric condition used, on average, a lower sound pressure level in the asymmetric than in the symmetric condition, even though they produced Lombard speech in both. No misheard target syllables were observed in the asymmetric condition; the participants who were in silence in that condition managed to raise their voices to the level needed to communicate the crucial information to the partner in noise. This study showed that when the acoustic environments of two conversation partners differ, neither speaker produces exactly the type of speech that would suit their own current acoustic environment; instead, speech production is also indirectly affected by the partner's acoustic environment. The study also showed that while the communicative nature of the speaking situation can amplify the effects of the Lombard effect, it can also attenuate them. Future studies should collect more data and subject it to more extensive analysis.
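A hedged sketch of the kind of per-speaker normalization and mean/variance comparison the summary describes, assuming a table with one row per target syllable; the column names and the specific tests (Welch's t-test for means, Levene's test for variances) are assumptions.

```python
import pandas as pd
from scipy import stats

# Hypothetical table: one row per target syllable, with the speaker, the noise
# condition ("quiet", "symmetric", "asymmetric"), and the measured values.
df = pd.read_csv("target_syllables.csv")

# Normalize within speaker (z-scores) so speakers with different baselines are comparable.
df["spl_z"] = df.groupby("speaker")["spl_db"].transform(lambda x: (x - x.mean()) / x.std())

quiet = df.loc[df["condition"] == "quiet", "spl_z"]
asym = df.loc[df["condition"] == "asymmetric", "spl_z"]

# Compare means (Welch's t-test) and variances (Levene's test) between two conditions.
print(stats.ttest_ind(quiet, asym, equal_var=False))
print(stats.levene(quiet, asym))
```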
-
(2021)It is common for speech to occur in closed spaces. Hence, room acoustics have a significant role in speech communication. Previous studies have found effects of reverberation on speech production, but research in this field is still scarce. Adverse room acoustics have been observed to expose occupational speakers, such as teachers, to voice disorders. Thus, it is crucial to study what the room-acoustic requirements for economical speaking are. The purpose of this study is to examine which speech-acoustic traits change when the speaker is exposed to reverberation, and how. In the present study, two different approaches are taken: variation of the reverberation time and removal of the reverberation. The changes in speech are compared to the Lombard sign (the rise of speech level in a noisy environment). Additionally, differences related to gender and prosody are examined in connection with the present topic. In this study, a speech production experiment was conducted together with acoustic and statistical analyses. Eleven Finnish-speaking volunteers (six females and five males) participated in the experiment, in which 150 short sentences were recorded from each participant. The sentences were produced in five different room-acoustic conditions. In four out of five, digitally simulated reverberation with varying reverberation times was played back over headphones worn by the participant. The fifth condition was (nearly) anechoic. From the recorded sentences, speech rate, creak ratio, and harmonics-to-noise ratio were measured, along with the mean, maximum, and movement of intensity and pitch. The measurements were then assessed with various statistical methods. The results of the study show a significant decrease in speech rate caused by an increasing reverberation time. Additionally, speech rate was highest in the anechoic condition. Moreover, the creak ratio decreased greatly when the reverberation time increased to more than one second, especially in male speakers and end-weighted sentences. Additionally, monotonousness was higher in the reverberated conditions than in the anechoic condition. However, substantial speaker-dependent differences in the effects of reverberation on speech were found. Moreover, sentence weight was found to influence speech more fundamentally than reverberation. The results suggest that rooms with average reverberation times, rather than particularly long or short ones, seem the most beneficial for speaking. This observation corresponds to previous studies. Further research in the field is required to extract the knowledge needed in the acoustical design of spaces, including classrooms. Designing speaker-friendly spaces helps to preserve occupational speakers' voices throughout their careers.
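A sketch of extracting some of the listed measures (mean and maximum pitch, mean intensity, harmonics-to-noise ratio) with the parselmouth interface to Praat; the thesis does not report its exact analysis settings here, so the commands and parameter values below are assumptions.

```python
import parselmouth
from parselmouth.praat import call

snd = parselmouth.Sound("sentence_01.wav")   # hypothetical recorded sentence

# Mean and maximum pitch over voiced frames only.
pitch = snd.to_pitch()
f0 = pitch.selected_array["frequency"]
f0 = f0[f0 > 0]                              # drop unvoiced frames (f0 == 0)
print("mean f0:", f0.mean(), "max f0:", f0.max())

# Mean of the intensity contour (dB).
intensity = snd.to_intensity()
print("mean intensity:", intensity.values.mean())

# Harmonics-to-noise ratio via Praat's Harmonicity object (assumed default parameters).
harmonicity = call(snd, "To Harmonicity (cc)", 0.01, 75, 0.1, 1.0)
print("mean HNR:", call(harmonicity, "Get mean", 0, 0))
```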
-
(2023)Smiling is fundamentally human but a more complex phenomenon than might appear at first glance. Studies in the field of language sciences have explored smiling in the context of speech and found that speaking while smiling has perceivable effects on the voice, and this phenomenon is commonly known as “smiling voice”. Although this phenomenon is widely recognised, there is no clear consensus on the precise acoustic characteristics that cue listeners to the presence of a smile. This study aims to investigate whether listeners can identify smiling voice based only on audio stimuli, what prosodic cues or characteristics they might be using to do so, and whether those cues can be extracted and used to replicate smiling voice using speech synthesis. Another aim of this study is to determine whether the level of perceived smiliness can be controlled in synthetic speech. These issues are addressed with the objectives of adding to the understanding of smiling voice in the field of phonetics and exploring the potential of speech synthesis technology for producing expressive speech. A corpus of Finnish speech was used to conduct a preliminary listening experiment where participants compared neutral and positive utterances in a questionnaire and indicated whether the speaker was smiling in the latter. Utterances that were identified as smiley were analysed acoustically to detect prosodic differences between neutral and smiley speech. Based on the results, formant frequencies F2 and F3 and centre of gravity were selected as prosodic cues to control smiling in speech synthesis. The speech synthesiser was a Tacotron 2 system, including a reference encoder, which was already trained on the speech corpus used. Synthesis evaluation was conducted with a second questionnaire where participants listened to the synthesised utterances and indicated how strongly the speaker was smiling. The results of the first questionnaire showed that listeners were able to distinguish neutral and smiley speech, and subsequent acoustic analyses indicated significant effects of smiling on fundamental frequency, formant frequencies, and centre of gravity. Speech synthesis evaluation results further indicated that F2, F3, and centre of gravity can be used to control the level of perceived smiling at least to the extent of a binary distinction. However, the evaluation showed that more sophisticated control of the level of smiling voice was not achieved.
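A sketch of measuring the selected cues (F2, F3, centre of gravity) with parselmouth, assuming a single utterance file and a midpoint formant measurement; the actual measurement points and analysis settings of the thesis are not specified here.

```python
import parselmouth
from parselmouth.praat import call

snd = parselmouth.Sound("utterance.wav")     # hypothetical utterance file

# Formant frequencies (Burg method) measured at the temporal midpoint.
formants = snd.to_formant_burg()
t_mid = 0.5 * (snd.xmin + snd.xmax)
f2 = formants.get_value_at_time(2, t_mid)
f3 = formants.get_value_at_time(3, t_mid)

# Spectral centre of gravity of the whole utterance (power = 2.0, Praat's default).
spectrum = snd.to_spectrum()
cog = call(spectrum, "Get centre of gravity", 2.0)
print(f"F2 = {f2:.0f} Hz, F3 = {f3:.0f} Hz, CoG = {cog:.0f} Hz")
```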
-
Helsinkiläisen [s]-äänteen foneettinen tarkastelu : Onko "stadilaisella ässällä" foneettista pohjaa? (2022)The so-called Helsinki [s] ("stadilainen ässä") is a sociolinguistic phenomenon dating back to the 19th century, according to which Finnish speakers from Helsinki have a sharper, more hissing [s] than speakers living elsewhere in Finland. Despite the phenomenon's long history, no previous phonetic research has examined whether the [s] produced by Helsinki speakers is, when inspected acoustically, sharper than the ordinary Finnish [s]. The aim of this master's thesis is to examine the [s] sounds produced by Helsinki speakers with acoustic methods. The thesis investigates which factors affect the sharpness of the long [s], whether there is a difference in sharpness between male and female speakers, and whether the [s] sounds produced by Helsinki speakers are sharper than usual when measured acoustically. The material was the subcorpus collected in 2013 of the Longitudinal Corpus of Spoken Helsinki Finnish (Helsingin puhekielen pitkittäiskorpus) provided by the Language Bank of Finland (Kielipankki). There were 13 speakers, and 622 long [s] sounds extracted from their speech were analyzed. The centre of gravity (COG) was measured for each long [s]; it describes the frequency region in which the energy of the [s] is concentrated on average. This value can be taken to reflect the sharpness of the [s], since in a sharp [s] the energy is concentrated at high frequencies, and in a less sharp, that is, more retracted [s], at lower frequencies. The thesis examined the effects of speaker characteristics (gender, age, educational background) and of the properties of the vowel preceding the long [s] (frontness/backness, degree of closeness/openness, and roundedness/unroundedness) on the COG value of the [s]. In addition, the subset of long [s] sounds with particularly high COG values (above 6000 Hz) was examined to determine which factors lay behind these particularly high values. The COG value measured from the long [s] was affected at least at a statistically significant level by the speaker's gender, age, and educational background, as well as by the frontness/backness, closeness/openness, and roundedness/unroundedness of the preceding vowel. The long [s] sounds with the highest COG values were found in middle-aged female speakers with a vocational school background when the preceding vowel was front, unrounded, and close or close-mid. The COG of the [s] sounds produced by women was higher than that of the men, which supports the view that the Helsinki [s] is above all a feature of women's speech. On the basis of this study it cannot be concluded that the [s] sounds produced by Helsinki speakers are sharper than the average Finnish [s], since the material contained long [s] sounds of many kinds, judged both by COG value and by auditory assessment. However, the material also included several [s] sounds with very high COG values that were auditorily judged to be very sharp, which in turn suggests both that COG is a valid measure for examining the sharpness of [s] and, above all, that the Helsinki [s] may indeed have a phonetic basis.
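A minimal sketch of computing a spectral centre of gravity for one extracted long [s] segment directly from the waveform; the file name and the full-band FFT computation are assumptions (the thesis presumably relied on Praat's implementation), but the power-weighted-mean definition is the same.

```python
import numpy as np
from scipy.io import wavfile

# Hypothetical mono recording of one extracted long [s] segment.
rate, signal = wavfile.read("s_segment.wav")
signal = signal.astype(np.float64)

# Power spectrum of the segment.
power = np.abs(np.fft.rfft(signal)) ** 2
freqs = np.fft.rfftfreq(len(signal), d=1.0 / rate)

# Centre of gravity = power-weighted mean frequency; a sharper [s] gives a higher value.
cog = np.sum(freqs * power) / np.sum(power)
print(f"centre of gravity: {cog:.0f} Hz")
```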
-
(2022)Speaking fluently is an important goal for second language (L2) learners. In L2 research, fluency is often studied by measuring temporal features in speech. These features include speed (rate of speech), breakdown (use of silent and filled pauses), and repair (self-corrections and repetitions) phenomena. Fluent speakers generally have a higher rate of speech and fewer hesitations and interruptions than beginner language learners. In this thesis, phonetic fluency of high school students’ L2 Finnish speech is studied in relation to human ratings of fluency and overall proficiency. The topic is essential for the development of automated assessment of L2 speech, as phonetic fluency measures can be used for predicting a speaker’s fluency and proficiency level automatically. Although the effect of different fluency measures on perceived fluency level has been widely studied during the last decades, research on phonetic fluency in Finnish as L2 is still limited. Phonetic fluency in high school students’ speech in L2 Finnish has not been studied before. The speech samples and ratings used in this thesis are a part of a larger dataset collected in the DigiTala research project. The analyzed data contained spontaneous speech samples in L2 Finnish from 53 high school students of different language backgrounds. All samples were assessed by expert raters for fluency and overall proficiency. The speech samples were annotated by marking intervals containing silent pauses, filled pauses, corrections and repetitions, and individual words. Several phonetic fluency measures were calculated for each sample from the durations of the annotated intervals. The contribution of phonetic fluency measures to human ratings of fluency and proficiency was studied using simple and multiple linear regression models. Speech rate was found to be the strongest predictor for both fluency and proficiency ratings in simple linear regression. Articulation rate, portion of long silent pauses, mean duration of long silent pauses, mean duration of breaks between utterances, and rate of short silent pauses per minute were also statistically significant predictors of both fluency and proficiency ratings. Multiple linear regression models improved the simple models for both fluency and proficiency: for fluency, a model with a combination of articulation rate and the portion of long silent pauses performed the best, and for proficiency, a model with a combination of speech rate and mean duration of short silent pauses. Perceived fluency level is often affected by a combination of different phonetic fluency measures, and it seems that human raters ground their assessments on this combination, although some phonetic fluency measures might be more important on their own than others. The findings of this thesis expand previous knowledge on phonetic fluency in L2 Finnish and can benefit both language learners and teachers, as well as developers of automatic assessment of L2 speech.
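A sketch of deriving two of the fluency measures from annotation-based interval durations and relating them to the human ratings with simple and multiple linear regression; the column names and the exact predictor combinations are assumptions.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical per-sample table derived from the annotated intervals.
df = pd.read_csv("fluency_measures.csv")
# assumed columns: n_syllables, total_time, phonation_time,
#                  long_pause_portion, fluency_rating

# Two of the measures described above.
df["speech_rate"] = df["n_syllables"] / df["total_time"]            # pauses included
df["articulation_rate"] = df["n_syllables"] / df["phonation_time"]  # pauses excluded

# Simple and multiple linear regression against the human fluency ratings.
simple = smf.ols("fluency_rating ~ speech_rate", data=df).fit()
multiple = smf.ols("fluency_rating ~ articulation_rate + long_pause_portion", data=df).fit()
print(simple.rsquared, multiple.rsquared)
```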
-
(2021)This thesis investigates the interaction between lexical tones and pitch reset in Akan, a Kwa language with about 8.1 million native speakers in Ghana (Eberhard et al., 2020). Experimental studies on Akan prosody are limited, although the language has a large number of first and second language speakers. This study seeks to increase our knowledge of the tone-intonation structure of the Akan language. In an earlier study on Akan complex declarative sentences, pitch reset occurred at the beginning of the content word that followed the clausal marker of an embedded clause (Kügler, 2016). Following a pilot study, a hypothesis was formed for the present study that pitch reset in complex declarative utterances in Akan also occurs within the clausal marker of the dependent clause and not only in the following content word. Focusing on the Asante Twi dialect, controlled material consisting of 64 complex sentences was created. Five native speakers of Asante Twi were recorded as they produced the 64 sentences and an additional 32 complex sentences used as fillers. The mean f0 values of the syllables of the subordinate conjunction and of the syllables of the words before and after the conjunction were extracted and analysed in R; the statistical analysis was based on a linear mixed model. As expected, a reset in the pitch contour consistently occurred within the subordinate conjunction, contrasting with the earlier study. The conjunction was phrased prosodically with the dependent clause to signal the syntactic relationship between the two. The degree of pitch register reset also depended on the tonal structure; the reset was larger when the initial tone of the conjunction was High and smaller when the conjunction began with a Low tone. Thus, the results show that lexical tones interact to determine the f0 contour of Akan utterances and that the intonational contour of utterances is complex in the Akan language.
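The analysis above was run in R; the following is a hedged Python equivalent of a linear mixed model over syllable-level mean f0, assuming a long-format table with position (relative to the conjunction), the conjunction's initial tone, and speaker as columns. The table layout and the fixed/random-effect structure are assumptions, not the thesis's exact model.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical syllable-level data: mean f0 per syllable, its position relative
# to the subordinate conjunction, the conjunction's initial tone, and the speaker.
df = pd.read_csv("akan_syllable_f0.csv")

# Linear mixed model: fixed effects for position and initial tone (plus their
# interaction), with a random intercept per speaker.
model = smf.mixedlm("mean_f0 ~ position * initial_tone", data=df, groups=df["speaker"])
result = model.fit()
print(result.summary())
```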
-
(2023)This thesis investigated a sound-space phenomenon related to sound-symbolic associations between the vowel sounds [i] and [æ] and the spatial meanings up and down. This vowel-height congruency effect was investigated with two experiments utilizing speeded choice reaction time (CRT) tasks. In Experiment 1, participants were required to vocalize [i] or [æ] while being presented with visual stimuli moving either up or down. The task was indirect, so that the phenomenon under investigation was masked by instructing the vocalizations to be produced according to the distance of movement rather than location. Due to this masking, the sound-magnitude effect, which typically associates high (close) vowels with small distances and low (open) vowels with large distances, was also investigated in this thesis. In Experiment 2, participants produced responses according to the location of the visual stimulus (up/down) or according to the aurally presented vowels [i] and [æ], while being presented with both stimuli simultaneously. In both experiments, reaction time (RT) measures were analyzed. In Experiment 1, acoustic characteristics (fundamental frequency F0 and formants F1 and F2) of the vocalizations were also analyzed. The results showed that there is a sound-symbolic association between the vowel [i] and the spatial meaning up, based on the stimulus-response congruency observed in the reaction time measures. The sound-magnitude effect was also found to be robust in these experiments. The sound-space association between [æ] and the spatial meaning down was not found to be significant. The sound-space effect also emerged only in the experiment requiring vocalizations, and not in the experiment requiring manual responses. The sound-space effect was present in the reaction time measures, and not in the acoustic characteristics of the vocalizations. It was concluded that the vowel-height congruency effect can be robustly observed (i.e., in relation to both vocal responses) only when the experimental task requires intentional and task-relevant processing of the concepts up and down. It was also estimated that the sound-space effect related to the vowel sounds [i] and [æ] and the spatial meanings up/down may not be as strong as, for instance, the sound-magnitude effect. Regarding the possible underlying mechanisms of sound-symbolic associations, some evidence supporting embodiment-based articulatory views on sound symbolism was found. In addition, the intrinsic vowel pitch (IVP) phenomenon was replicated in this thesis, and it was demonstrated that intrinsic pitch is an important core property of vowel sounds that also influences sound-symbolic associations.
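A sketch of the kind of congruency comparison implied by the reaction-time analysis, assuming a trial-level table and defining congruence as [i]-up / [æ]-down; a paired t-test over per-participant means stands in for whatever statistical model the thesis actually used.

```python
import pandas as pd
from scipy import stats

# Hypothetical trial-level data from the choice reaction time task.
df = pd.read_csv("crt_trials.csv")   # columns: participant, vowel, direction, rt_ms

# A trial is congruent if [i] is paired with "up" or [ae] with "down".
df["congruent"] = ((df["vowel"] == "i") & (df["direction"] == "up")) | \
                  ((df["vowel"] == "ae") & (df["direction"] == "down"))

# Per-participant mean RTs for congruent vs. incongruent trials, compared pairwise.
means = df.groupby(["participant", "congruent"])["rt_ms"].mean().unstack()
print(stats.ttest_rel(means[True], means[False]))
```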
-
(2024)Speech, as the evolutionary pinnacle of human communication, is not defined by its content alone but gains its full meaning through prosody. Prosody plays a vital role in conveying not only the words spoken but also the underlying emotions and intentions of the speaker. While certain aspects of prosody and its relation to emotion have been studied, the concept of sincerity within speech remains a complex and active area of research, spanning linguistic, ethical, and philosophical dimensions. This thesis explored the perception of sincerity in speech through the manipulation of prosodic features using neural network-based speech synthesis. The primary research question concerned the impact of modifying prosody on the perceived sincerity of synthetic speech. Three prosodic features (speaking rate, f0 mean, and f0 standard deviation) were evaluated to refine the analysis. Commissive utterances, which depend on sincerity, form the basis of the research material. Data from 40 commissive utterances were subjected to eight prosodic modifications, and linear regression confirmed the intended effects. A perception experiment involving 115 native Finnish speakers revealed intriguing results. While the categories of prosodic modification showed no significant impact on perceived sincerity, analyzing the individual prosodic feature values uncovered significant correlations. Increased speaking rate and f0 standard deviation correlated positively with perceived sincerity, validating the secondary hypotheses. However, no significant correlation was found for increased f0 mean. The null result in the category-based analysis points to some methodological limitations, possibly obscuring direct conclusions. Nevertheless, the nuanced prosodic features challenged participants' discernment, affecting sincerity evaluation. Because the hypotheses were only partially confirmed owing to practical constraints, future research avenues hold promise for uncovering deeper insights.
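A sketch of the arithmetic behind the three manipulated features, applied to an extracted f0 contour and a duration sequence; the scaling factors and file names are assumptions, and the thesis applied its modifications through a neural synthesizer rather than by post-hoc editing like this.

```python
import numpy as np

# Hypothetical voiced-frame f0 contour (Hz) from one commissive utterance.
f0 = np.load("utterance_f0.npy")

def modify_f0(f0, mean_scale=1.0, sd_scale=1.0):
    """Scale the contour's mean and its variation around the mean independently."""
    mu = f0.mean()
    return (f0 - mu) * sd_scale + mu * mean_scale

raised_mean = modify_f0(f0, mean_scale=1.10)   # +10% f0 mean
wider_range = modify_f0(f0, sd_scale=1.30)     # +30% f0 standard deviation

# A speaking-rate change amounts to scaling durations; shown here only as arithmetic.
durations = np.load("phone_durations.npy")     # hypothetical per-phone durations (s)
faster = durations / 1.15                      # 15% faster speaking rate
```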
-
(2023)This thesis investigates the realisation of tone in dialects of Southern Angami, a language of the Tibeto-Burman family spoken in the state of Nagaland, North-East India. Audio recordings of native speakers are analysed to determine how the tones differ in their pitch movement patterns, accounting for context and dialect variation. The research questions concern the significance of pitch contours and duration in a level-tone system, as well as tone-unit interaction. It was concluded that the fundamental frequency is the main determining factor, and neither pitch contour nor duration has a more prominent effect than pitch value; however, it is possible that duration plays a role in distinguishing tones 2 and 3, and a pitch curve is a consistent feature of tones 1 and 4. No significant difference was found between the tone systems of Jotsoma and Kigwema.
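A small illustrative sketch of comparing mean f0 and duration across the four tone categories with one-way ANOVAs, assuming a syllable-level table; the thesis's actual statistical treatment is not described in the summary, so this stands in only as an example of the comparison.

```python
import pandas as pd
from scipy import stats

# Hypothetical syllable-level measurements: tone category (1-4), mean f0, duration.
df = pd.read_csv("angami_tones.csv")

# Do mean f0 and duration differ across the four level tones?
f0_groups = [g["mean_f0"].values for _, g in df.groupby("tone")]
dur_groups = [g["duration_ms"].values for _, g in df.groupby("tone")]
print("f0:", stats.f_oneway(*f0_groups))
print("duration:", stats.f_oneway(*dur_groups))
```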