Browsing by master's degree program "Kielellisen diversiteetin ja digitaalisten ihmistieteiden maisteriohjelma"

  • Sholihat, Aliva (2023)
    Statistical learning is a universal cognitive mechanism that allows humans to detect patterns and regularities in their environment, playing a crucial role in various cognitive functions, including language acquisition. This research delved into the relationship between subjective sleep quality, measured using the PSQI (Pittsburgh Sleep Quality Index) questionnaire, and statistical language learning in adults. In two separate studies, participants' performances in statistical language learning were measured: Study 1 (N = 97) and its replication in Study 2 (N = 120). Both studies utilised the two-alternative forced choice (2AFC) recognition task, complemented by a confidence judgement rating. The results showed a significant learning effect above chance in both studies, highlighting adults' capability for statistical language learning. Explicit learning mechanisms significantly contributed to statistical language learning, highlighting the vital role of the declarative memory hippocampal-prefrontal cortex system in adult statistical language learning. Study 1 found that a logarithmic model most suitably represented the relationship between subjective sleep quality and statistical language learning performance. This model showed an initial drop in learning performance as subjective sleep quality declined, but performance stabilised with a further decline in subjective sleep quality. However, this relationship was not statistically significant in Study 2. While this research provides novel insights into the interplay between sleep quality and statistical language learning, future studies should consider subjective and objective sleep measures for a more comprehensive investigation. The research findings have implications for understanding the cognitive mechanisms underpinning language learning and the potential influence of sleep quality on these processes.
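The abstract above does not say which test established learning "above chance". As a minimal sketch, assuming one participant's 2AFC score is compared against guessing (chance = 0.5) with a one-sided exact binomial test, the check could look like this. The function name and the example numbers are hypothetical, not taken from the thesis.

```python
from math import comb


def binomial_p_above_chance(correct: int, trials: int, chance: float = 0.5) -> float:
    """One-sided exact binomial p-value: P(X >= correct) if the participant only guesses.

    Assumption: each 2AFC trial is an independent guess with success
    probability `chance`; this is an illustration, not the thesis's analysis.
    """
    if not 0 <= correct <= trials:
        raise ValueError("correct must lie between 0 and trials")
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )


if __name__ == "__main__":
    # Hypothetical participant: 8 correct answers out of 10 trials.
    print(binomial_p_above_chance(8, 10))
```

A group-level analysis would more typically compare mean accuracy across participants against 0.5; the per-participant version is shown only because it needs nothing beyond the standard library.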
  • Nikula, Ottilia (2023)
    Recent progress in natural language generation tools has raised concerns that the tools are being used to generate neural fake news. Fake news affects our society in many ways: it has been used for monetization schemes and to tip political elections, and it has been shown to have a severe effect on people’s mental health. Accordingly, being able to detect neural fake news and counter its spread is becoming increasingly important. The aim of the thesis is to explore whether there are linguistic features that can help detect neural news. Using Grover, a neural language model, I generate a set of articles based on both real and fake human-written news. I then extract a range of linguistic features, previously found to differ between human-written real and fake news, to investigate whether the same features can be used to detect Grover-written news, whether there are features that can differentiate between Grover-written news articles whose source material is different, and whether, based on these features, Grover-written news is more similar to real or fake news. The data consists of 64 articles, of which 16 are real news sourced from reputable news sites and 16 are fake news articles from the ISOT Fake News Dataset. The other 32 articles are written by Grover, using either the real news or the fake news articles as source text (16 each). A broad range of linguistic features is extracted from the article bodies and titles to capture the style, complexity, and sentiment of the articles. The features measured include punctuation, quotes, syntax tree depths, and emotion counts. The results show that the same features that have been found to differ between real and fake news can, with some limitations, be used to discern Grover Fake News (Grover-written articles based on fake news). However, Grover Real News (Grover-written articles based on real news) cannot reliably be discerned from real news. 
Moreover, while the features measured do not provide a reliable method for discerning Grover Real News and Grover Fake News from each other, there are still noticeable differences between the two groups. Grover Fake News can be differentiated from real news, but the texts can be considered of better quality than fake news. These findings also align with previous research, showcasing that Grover is adept at re-writing misinformation and making it more credible to readers, and that feature extraction alone cannot reliably distinguish neural fake news, but that human evaluation also needs to be considered.
  • Zhixu, Gu (2023)
    Neural machine translation (NMT) has become a mainstream method for the machine translation (MT) task. Despite its remarkable progress, NMT systems still face many challenges when dealing with low-resource scenarios. Common approaches to address the data scarcity problem include exploiting monolingual data or parallel data in other languages. In this thesis, transformer-based NMT models are trained on Finnish-Simplified Chinese, a language pair with limited parallel data, and the models are improved using various techniques such as hyperparameter tuning, transfer learning and back-translation. The best NMT system is an ensemble model that combines different single models. The results of our experiments also show that different hyperparameter settings can cause a performance gap of up to 4 BLEU points. The ensemble model shows a 35% improvement over the baseline model. Overall, the experiments suggest that hyperparameter tuning is crucial for training vanilla NMT models. Back-translation offers more benefits for model improvement than the transfer learning method. The results also show that adding sampling in back-translation does not improve NMT model performance in this low-data setting. The findings may be useful for future research on low-resource NMT, especially the Finnish-Simplified Chinese MT task.
  • Matysek, Ida (2023)
    The linguistic landscape of the Podlasie region in Poland is characterized by the presence of multiple minority languages, particularly local dialects influenced by Belarusian and Ukrainian. Traditionally, Polish, Belarusian, Ukrainian, and Lithuanian have been spoken in the area. Currently, Polish is the majority language and Belarusian has the status of an official supporting language in 5 municipalities. As a result of extended language and culture contact, multiple vernaculars (called here Podlachian Varieties) and a local identity have emerged. This sociolinguistic questionnaire-based study explores the relationship between minority language attitudes and identities found in multilingual young adults (aged 18 to 29) from Podlasie. This study adopts the poststructuralist understanding of identity as fluid, multidimensional, and socially constructed (Hall 1999, Norton 2013). As Anchimbe (2007) underlines, language is an important marker of identity, especially in heterogeneous communities, as individuals and groups need to establish their boundaries to safeguard what they perceive as their distinct characteristics. Attitudes towards a language may determine whether it will head towards extinction or persist in the community. This study approaches the issue of minority language speakers’ attitudes using Communication Accommodation Theory (CAT), developed by Giles. In CAT, individuals adjust their communication styles to either converge with or diverge from others based on their social motivations, underlining either similarities or differences respectively. The analysed material was gathered through an online questionnaire in December 2020. The questionnaire consisted of 23 questions and received 391 responses, out of which 39 were discarded due to irrelevance. 
Two-thirds of the participants believed that Podlachian Varieties are disappearing due to the passing of older generations, a lack of intergenerational language transmission, and the young generation feeling ashamed of the language. Those reasons demonstrate belief in the low perceived status of the language varieties, leading to a converging communication strategy towards the Polish majority, which in turn results in intergenerational language shift and identity accommodation. This confirms the analysis of Barszczewska (2010), who observed an integration process and language shifts in the population. Polish identity holds the dominant position among the group. Belarusian identity was seldom declared (5%). In respect of identity, divergence and assimilation tendencies can be observed. People with local identity strive to diverge from both Polish and Belarusian identities, with the stronger trend seen in diverging from Belarusian. The assimilation trend is seen in native speakers of Belarusian, as nearly half of them identified as Polish and one-third as local. In the light of this study, it is evident that the Varieties are vulnerable and that, if the situation does not change in the near future, their continued existence might be threatened. The ongoing assimilation and language shift pose a great threat to the vitality of Podlachian Varieties, and the rapidly progressing urbanization process will continue to foster the language shift towards Polish.
  • Hyttinen, Saana (2022)
    This thesis explores the language practices, attitudes, and identities of multilingual couples that use English as a lingua franca in the relationship (ELF couples). The goal is to investigate how these couples utilize their multilingual resources and if they report using translanguaging or other language mixing practices. As a part of ELF couples’ language practices, the family language practices of families formed by ELF couples as parents are also addressed. Furthermore, the study aims to find out what kinds of attitudes ELF couples have towards translanguaging, as well as how the use of English as a lingua franca shows in their language identities. Earlier research has shown that translanguaging is an essential part of the use of English as a lingua franca especially in the context of informal social contact and close relationships. However, ELF couples as a target group have been studied little and most of the research so far has been qualitative. The focus in this thesis is quantitative, and the study was conducted using an online questionnaire which received 563 suitable responses. The main findings show that while the primary language used in ELF couples’ conversations is usually English, also the partners’ first languages are used to a varying extent. Translanguaging is present in ELF couples’ language practices also in larger scale, even though varying results regarding this aspect showcase the uniqueness of individual couples’ language practices. Moreover, the couples have positive attitudes towards language mixing in general, and many of them respond to it in a relaxed manner. Regarding ELF couples’ language identities, the data shows that the couples often identify themselves as English-speakers but also multilinguals, both individually and as a couple. 
Consequently, English as a lingua franca seems to have an important role in the relationships, and many of the couples report difficulties in attempts or even unwillingness to change the main language of the relationship to something else than English after having started the relationship using English as a lingua franca. The results also show that language mixing is used much less in the family context when addressing children, and that children seem to be one of the main triggers for more conscious language practices.
  • Alminas, Juozas (2023)
    Adopting the narrative approach of linguistic biographies as the data collection method, this thesis explores the linguistic practices and ideologies of Tibetans living in Finland. Although the presence of many multilingual communities in Finland is known, not many studies on the topic have been done, and there hasn’t been any previous work involving Tibetan speakers. I was curious as to what Tibetans themselves think about their language and the ways to maintain it in an expatriate setting. I came to discover that the present-day linguistic situation and linguistic attitudes can only be understood through the socio-cultural landscape of the consultants’ native Sikkim in India. Through this research I hope to answer two main questions: what are Tibetans’ linguistic ideologies, and how do the consultants’ multilingual practices manifest in daily life? The collected data is based on fieldwork interviews conducted with Tibetan consultants. In line with a more inclusive approach towards linguistic fieldwork, I have tried to present the speakers through their own words, allowing them to speak for themselves. The lives of the consultants have been shaped in the highly multilingual landscape of Sikkim. The linguistic ideologies are deeply rooted within that landscape, but also within Tibetan Buddhism. Consequently, the puristic ideologies and expectations of a good linguistic performance can sometimes overshadow and hinder Tibetan language learning. However, the demands of the present world are beginning to reshape individuals’ identities, whereby linguistic performance is no longer a precondition for linguistic and ethnic belonging. In the second part of the thesis I analyze how the consultants’ linguistic ideologies have been shaped, and which languages have a performative function and in what contexts. 
I go on to discuss the linguistic practices of the consultants and propose the label ‘translanguaging’ as the most adequate to describe their multilingual performance. The results of the study showcase a multilayered and complex linguistic and social landscape in which Tibetans live. I suggest that studies geared towards small-scale multilingualism could offer a deeply holistic approach through which to study such landscapes and situations, which in turn would shed more light on language vitality and usage. The study’s findings suggest that the vitality of the Tibetan language lies in its ability to adapt to the speakers’ world and mix fluidly with other languages. With this work I hope to bring forth the importance of individuals’ ideologies in studying linguistic change and contribute to our understanding of complex multilingual practices.
  • Kylliäinen, Ilmari (2022)
    Automatic question answering and question generation are two closely related natural language processing tasks. They both have been studied for decades, and both have a wide range of uses. While systems that can answer questions formed in natural language can help with all kinds of information needs, automatic question generation can be used, for example, to automatically create reading comprehension tasks and improve the interactivity of virtual assistants. These days, the best results in both question answering and question generation are obtained by utilizing pre-trained neural language models based on the transformer architecture. Such models are typically first pre-trained with raw language data and then fine-tuned for various tasks using task-specific annotated datasets. So far, no models that can answer or generate questions purely in Finnish have been reported. In order to create them using modern transformer-based methods, both a pre-trained language model and a sufficiently big dataset suitable for question answering or question generation fine-tuning are required. Although some suitable models that have been pre-trained with Finnish or multilingual data are already available, a big bottleneck is the lack of annotated data needed for fine-tuning the models. In this thesis, I create the first transformer-based neural network models for Finnish question answering and question generation. I present a method for creating a dataset for fine-tuning pre-trained models for the two tasks. The dataset creation is based on automatic translation of an existing dataset (SQuAD) and automatic normalization of the translated data. Using the created dataset, I fine-tune several pre-trained models to answer and generate questions in Finnish and evaluate their performance. I use monolingual BERT and GPT-2 models as well as a multilingual BERT model. 
The results show that the transformer architecture is well suited also for Finnish question answering and question generation. They also indicate that the synthetically generated dataset can be a useful fine-tuning resource for these tasks. The best results in both tasks are obtained by fine-tuned BERT models which have been pre-trained with only Finnish data. The fine-tuned multilingual BERT models come in close, whereas fine-tuned GPT-2 models are generally found to underperform. The data developed for this thesis will be released to the research community to support future research on question answering and generation, and the models will be released as benchmarks.
  • Raatikainen, Riikka (2022)
    The thesis examines the occurrence of optimism bias in future scenarios concerning climate change. While use of the scenario method can reduce the influence of certain cognitive biases on assessments of the future, other biases can in turn hamper the drafting and evaluation of scenarios. It has been suggested that optimism bias, which occurs in many different contexts, would also appear in connection with the scenario method. The study investigates experimentally whether optimism bias occurs in evaluations of climate-change-themed scenarios, that is, whether participants consider positive scenarios more likely than others. In addition, it examines whether scenario optimism is associated with optimism bias measured in another context, as well as with other variables. To address the research questions, a questionnaire was compiled and sent to the email lists of student subject associations at the University of Helsinki. The survey received 182 responses. Participants were presented with four scenarios, ranging from positive to negative, which dealt with the survival and population size of the Saimaa ringed seal 50 years from now. Participants were asked to rank the scenarios by likelihood, and on this basis an optimism level was calculated for each respondent. On average, respondents were pessimistic in their assessments, and this optimism score fell below the value regarded as neutral. Thus, no optimism bias appeared in the scenario assessments. Optimism bias was also measured by having participants estimate the likelihood of various life events happening to themselves compared with others. In these questions optimism bias did occur, as respondents on average considered themselves more likely than others to experience positive events and less likely than others to experience negative ones. The amount of life-event optimism also correlated positively with scenario optimism. The questionnaire was also used to examine the association of other variables with possible optimism bias in the scenario assessments. 
The level of general optimism was measured with an existing questionnaire, but it did not correlate with scenario optimism. Attitude towards climate change, in turn, correlated negatively with scenario optimism; that is, those who take climate change seriously assessed the scenarios more pessimistically. Respondents’ age, gender, and amount of knowledge about the Saimaa ringed seal did not affect the scenario assessments. The absence of optimism bias in the scenario assessments was a surprising result whose exact cause cannot be determined. It may be due either to the bias-reducing effect of the scenario method or to the negative associations evoked by climate change, the topic of the scenarios. There is therefore a need for further research with a more representative sample and a research design that would separate the effects of the scenario method and the climate change topic from each other. From the standpoint of using scenarios, however, the absence of optimism bias can be seen as a good thing.
  • Salmi, Vili (2023)
    In this master’s thesis I aim to describe the outcomes of Swedish teaching in an everyday environment, namely shopping centres. In addition, I aim to describe the debate concerning the status of Swedish as a for the most part compulsory school subject, as well as the phenomenon itself. My topic is best captured by the word “pakkoruotsi” (‘mandatory Swedish’), since the debate on the topic is partly a cause of the weak learning outcomes themselves, but above all the recurrence and almost perpetual topicality of the subject was what originally prompted my study. I aim to describe the topic with the versatile and multidimensional approach it deserves, in contrast to merely opposing or defending it. My own contribution to the topic is a questionnaire survey conducted in shopping centres in Lahti and Helsinki, in which I aim to chart shopping-centre employees’ perceptions of how employers value Swedish skills, the likelihood that customer service staff will at least try to serve a Swedish-speaking customer in Swedish, and the need to use Swedish in customer service work. In addition, I wanted to learn about the respondents’ consumption of Swedish-language entertainment and their beliefs regarding the mandatory Swedish question.
  • Koivusalo, Liisa (2022)
    Speaking fluently is an important goal for second language (L2) learners. In L2 research, fluency is often studied by measuring temporal features in speech. These features include speed (rate of speech), breakdown (use of silent and filled pauses), and repair (self-corrections and repetitions) phenomena. Fluent speakers generally have a higher rate of speech and fewer hesitations and interruptions than beginner language learners. In this thesis, phonetic fluency of high school students’ L2 Finnish speech is studied in relation to human ratings of fluency and overall proficiency. The topic is essential for the development of automated assessment of L2 speech, as phonetic fluency measures can be used for predicting a speaker’s fluency and proficiency level automatically. Although the effect of different fluency measures on perceived fluency level has been widely studied during the last decades, research on phonetic fluency in Finnish as L2 is still limited. Phonetic fluency in high school students’ speech in L2 Finnish has not been studied before. The speech samples and ratings used in this thesis are a part of a larger dataset collected in the DigiTala research project. The analyzed data contained spontaneous speech samples in L2 Finnish from 53 high school students of different language backgrounds. All samples were assessed by expert raters for fluency and overall proficiency. The speech samples were annotated by marking intervals containing silent pauses, filled pauses, corrections and repetitions, and individual words. Several phonetic fluency measures were calculated for each sample from the durations of the annotated intervals. The contribution of phonetic fluency measures to human ratings of fluency and proficiency was studied using simple and multiple linear regression models. Speech rate was found to be the strongest predictor for both fluency and proficiency ratings in simple linear regression. 
Articulation rate, portion of long silent pauses, mean duration of long silent pauses, mean duration of breaks between utterances, and rate of short silent pauses per minute were also statistically significant predictors of both fluency and proficiency ratings. Multiple linear regression models improved the simple models for both fluency and proficiency: for fluency, a model with a combination of articulation rate and the portion of long silent pauses performed the best, and for proficiency, a model with a combination of speech rate and mean duration of short silent pauses. Perceived fluency level is often affected by a combination of different phonetic fluency measures, and it seems that human raters ground their assessments on this combination, although some phonetic fluency measures might be more important on their own than others. The findings of this thesis expand previous knowledge on phonetic fluency in L2 Finnish and can benefit both language learners and teachers, as well as developers of automatic assessment of L2 speech.
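The abstract above derives fluency measures from the durations of annotated intervals. The following is a minimal sketch of that idea, not the DigiTala project's actual pipeline: the interval labels, the 1.0-second threshold separating long from short silent pauses, and the exact definition of each measure (rates in words per minute, portions relative to total sample duration) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    label: str    # hypothetical labels: "word", "silent_pause", "filled_pause"
    start: float  # seconds
    end: float    # seconds

    @property
    def duration(self) -> float:
        return self.end - self.start


def fluency_measures(intervals: list[Interval], long_pause_s: float = 1.0) -> dict[str, float]:
    """Compute a few temporal fluency measures from annotated intervals.

    Assumption: `long_pause_s` is an illustrative threshold, not the one
    used in the thesis.
    """
    total = max(i.end for i in intervals) - min(i.start for i in intervals)
    words = [i for i in intervals if i.label == "word"]
    silent = [i for i in intervals if i.label == "silent_pause"]
    filled = [i for i in intervals if i.label == "filled_pause"]
    long_silent = [i for i in silent if i.duration >= long_pause_s]
    short_silent = [i for i in silent if i.duration < long_pause_s]

    pause_time = sum(i.duration for i in silent + filled)
    speaking_time = total - pause_time  # time spent actually articulating

    return {
        "speech_rate_wpm": 60 * len(words) / total,
        "articulation_rate_wpm": 60 * len(words) / speaking_time,
        "long_silent_pause_portion": sum(i.duration for i in long_silent) / total,
        "short_silent_pauses_per_min": 60 * len(short_silent) / total,
    }


if __name__ == "__main__":
    sample = [
        Interval("word", 0.0, 0.5),
        Interval("silent_pause", 0.5, 2.0),
        Interval("word", 2.0, 2.5),
        Interval("silent_pause", 2.5, 2.8),
        Interval("word", 2.8, 3.0),
    ]
    for name, value in fluency_measures(sample).items():
        print(f"{name}: {value:.2f}")
```

Speech rate divides by the whole sample duration, whereas articulation rate divides by speaking time only, so the two diverge as pausing increases. That difference is one plausible reason both appear as separate predictors.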
  • Keturi, Joonas (2022)
    The subject of the thesis is the comparison of lexical semantics and phonetics. The thesis investigates with computational methods whether there is significantly more phonetic variance among words that belong to the same semantic domain than among phonetically similar words from other semantic domains. In other words, phonetically very similar words, and especially phonological minimal pairs, would be in separate semantic domains. The method clusters word embedding vectors and distinctive phonological feature vectors from multiple languages; the phonetic and semantic standard deviations are calculated for each cluster, and the mean standard deviations of the cluster sets are compared. In addition to semantic and phonetic clusters, two sets of test clusters are constructed which have the same number and the same size of clusters as the semantic clusters. The first set of test clusters uses the words from the phonetic clusters in order, and the second set is randomly permuted. These different cluster sets are compared by their mean standard deviations and a cluster set similarity index. The results imply that a semantic domain rarely contains phonetically very similar words, and that such words are usually in separate semantic domains.
  • Božović, Dušica (2023)
    The aim of this research was to investigate the teaching of pluricentric languages as heritage languages in Finland, examine how they are perceived, and explore the expectations related to their teaching. Moreover, the study aimed to identify successful approaches in the teaching of pluricentric heritage languages. The motivation for conducting this study was my personal experience of teaching a pluricentric language as a heritage language and the limited coverage of this topic in academic literature. In addition, the lack of attention paid to attitudes in heritage language studies was also noted in the literature. The method used is a direct measures approach. Respondents provided their answers through a questionnaire predominantly including Likert-scale statements. The findings indicate that there is a desire to improve communication among the stakeholders in heritage language teaching. Respondents expressed positive attitudes towards groups with different language varieties and active inclusion of different varieties in class. They believed that all varieties should be treated as equally valid, and teachers should not treat forms of other varieties as mistakes. Studying in a linguistically heterogeneous group was seen as an enriching experience that can contribute to combating prejudices and building solidarity among speakers. The limitations of the study included a small number of respondents and imbalanced material in terms of language. The findings of the study have practical implications for heritage language coordinators and educators in their planning and teaching activities, as well as for policymakers seeking to enhance heritage language education. Additionally, the study advances the academic discourse on heritage language teaching and suggests areas for further research. Heritage language teaching in general requires significant improvement to achieve its aims. 
The study highlights the importance of addressing issues in pluricentric heritage language teaching and implementing strategies that promote positive attitudes towards language varieties and effective communication between coordinators, teachers, and guardians.
  • Hynynen, Jussi-Veikka (2023)
    Using language that is easy to understand when presenting information in a written form is critical for ensuring effective communication. Yet, using language that is too complex or technical for its intended audience is a common pitfall in many domains, such as legal and medical text. Automatic text simplification (ATS) aims to automate the conversion of complex text into a simpler, more easily comprehensible form. This study explores ATS models for English that can be controlled in terms of the readability of the output text. Readability is measured with an automatically calculated readability level that corresponds to a school grade level. The readability-controlled models take a readability level as a parameter and simplify input text to match the reading level of the intended audience corresponding to the parameter value. In total, six readability-controlled sentence simplification models with different control attribute configurations are trained in this study. The models use a pretrained sequence-to-sequence model architecture that is finetuned on a dataset of sentence pairs in regular and simple English. The trained models are evaluated using automatic evaluation metrics and compared to each other and to ATS systems from previous research. Additionally, the simplified sentences produced by the best performing model are evaluated manually to identify errors and the types of text transformations that the model employs to simplify sentences. When the readability level input value is optimized to maximise model performance on validation data, the readability-controlled models surpass systems from previous works in terms of automatic evaluation metrics, suggesting that the addition of readability level as a control attribute results in improved simplification quality. Manual evaluation shows that readability-controlled models are capable of splitting long sentences into multiple shorter sentences to reduce the syntactic complexity of text. 
This finding suggests that readability level metrics can be used to effectively control syntactic complexity in ATS models as a lightweight alternative to previously applied, more computationally demanding methods that rely on dependency parsing. Finally, this study discusses the different types of errors produced by the models, their potential causes and ways to reduce errors in future ATS systems.
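The abstract above does not name the readability formula that maps text to a school grade level. As a hedged illustration only, the sketch below uses the well-known Flesch-Kincaid grade level, 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59. The syllable counter is a crude vowel-group heuristic, and both it and the choice of metric are assumptions rather than the thesis's method.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate: count runs of vowels (assumed heuristic)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level of an English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word and one sentence")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


if __name__ == "__main__":
    simple = "The cat sat. The dog ran."
    complex_ = "Automatic simplification transforms complicated terminology."
    print(flesch_kincaid_grade(simple), flesch_kincaid_grade(complex_))
```

Because the score rises with average sentence length, splitting one long sentence into several short ones lowers it. That is consistent with the sentence-splitting behaviour the manual evaluation reports.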
  • Pöyhönen, Teemu (2023)
    While natural language generation (NLG) and large language models (LLMs) seem to be transforming many industries, video games have yet to be affected. This study investigates the potential of using NLG systems to generate dialogue for non-playable characters (NPCs) in role-playing games (RPGs). For this, dialogue data is extracted from six popular RPGs and is then used to fine-tune Microsoft’s GODEL to create an “RPG chatbot” (RPG-GPT). Motivated by computational creativity frameworks, a survey and an interactive experiment were conducted to evaluate the creativity and the effectiveness of RPG-GPT in generating relevant and engaging responses to player input. Survey respondents rated dialogues on a 5-point agree-disagree Likert scale, with questions related to e.g. the relevance of the NPC answers. Results indicate that RPG-GPT can provide relevant responses, with mean relevance ratings of 3.93 for game dialogue vs. 3.85 for RPG-GPT (p=0.0364). Also, the participants of the interactive experiment reported engagement when interacting with RPG-GPT. Overall, the results suggest that creative NLG has the potential to enhance gaming experiences through task-oriented game dialogue (TOGD) systems. In this framework, creative TOGD systems could solve a common issue where pre-written NPCs are unable to provide the specific information sought by players. Additionally, the study discusses a concept of how players, through their interaction with the NLG models, can expand the lore of a game, which is a new consideration for game designers and developers when implementing such systems. Future work could explore ways to incorporate external knowledge and context to improve the performance of a TOGD system.
  • Pöllänen, Roosa (2022)
    In earlier research, the sociative causative has been considered a subcategory of a prototypical causative and not a category of its own. In the sociative causative the causer both initiates the event and participates in it, unlike in the prototypical causative in which the causer is only the initiator. It has been proposed that the causer can participate in the event either by acting together with the causee, helping the causee, or supervising the causee. The sociative causative can be marked on the predicate by using a specific sociative causative marker or it can be a reading of a prototypical causative construction or a reading of an applicative. The objective of the thesis is twofold. First, the intention is to find out, using a typological sampling method, if there are more languages with a specific sociative causative construction beyond those that are currently known and, second, how these constructions behave. Special attention is paid to the exact semantics of the sociative causation to see if it reflects the semantics proposed in the earlier literature. The contexts in which the prototypical causatives and applicatives can get the sociative reading are also studied. The intention is to find out where the sociative causative aligns in the causative continuum. It has been proposed in the previous literature that the sociative causative is an areal feature of the South American indigenous languages, and 26 languages were previously known to have sociative causative. In addition to these 26 languages, a genealogically balanced sampling method was applied and four languages with sociative causative function were found. Since South America is one of the world’s most linguistically diverse areas the data gathering was limited to the western part of the continent. The 30 languages were analyzed formally and semantically. 
The analysis shows that the sociative causative usually describes the type of causation in which the causer is a co-actor with the causee or helps the causee. The supervision type of sociative causation, however, occurs rarely. The sociative causative tends to be used with intransitive verbs that express motion or physical activity. On the causative continuum it appears to fall in the middle, as previous research has proposed.
  • Myllylä, Ida-Lotta (2023)
    This thesis investigated a sound-space phenomenon: sound-symbolic associations between the vowel sounds [i] and [æ] and the spatial meanings up and down. This vowel-height congruency effect was investigated in two experiments using speeded choice reaction time (CRT) tasks. In Experiment 1, participants were required to vocalize [i] or [æ] while being presented with visual stimuli moving either up or down. The task was indirect: the phenomenon under investigation was masked by instructing participants to vocalize according to the distance of the movement rather than its location. Because of this masking, the sound-magnitude effect, which typically associates high (close) vowels with small distances and low (open) vowels with large distances, was also investigated. In Experiment 2, participants responded according to the location of the visual stimulus (up/down) or according to the aurally presented vowels [i] and [æ], while being presented with both stimuli simultaneously. In both experiments, reaction time (RT) measures were analyzed. In Experiment 1, acoustic characteristics of the vocalizations (fundamental frequency F0 and formants F1 and F2) were also analyzed. The results showed that there is a sound-symbolic association between the vowel [i] and the spatial meaning up, based on the stimulus-response congruency observed in the reaction time measures. The sound-magnitude effect was also found to be robust in these experiments. The sound-space association between [æ] and the spatial meaning down was not found to be significant. The sound-space effect also emerged only in the experiment requiring vocalizations, not in the experiment requiring manual responses. The sound-space effect was present in the reaction time measures but not in the acoustic characteristics of the vocalizations.
It was concluded that the vowel-height congruency effect can be robustly observed (i.e., in relation to both vocal responses) only when the experimental task requires intentional and task-relevant processing of the concepts up and down. It was also estimated that the sound-space effect relating the vowel sounds [i] and [æ] to the spatial meanings up/down may not be as strong as, for instance, the sound-magnitude effect. Regarding the possible mechanisms underlying sound-symbolic associations, some evidence was found supporting embodiment-based articulatory views on sound symbolism. In addition, the intrinsic vowel pitch (IVP) phenomenon was replicated in this thesis, and it was demonstrated that intrinsic pitch is an important core property of vowel sounds that also influences sound-symbolic associations.
  • Peura, Telma (2023)
    In my master's thesis, I use quantitative methods to study how the diversity of novels published in Finnish has developed over the past 50 years. The study is based on metadata from the Kirjasampo database on the collections of Finnish public libraries, and I focus on analysing features external to the text. As indicators of diversity I use the languages of writing, the authors' nationality and gender, and the genre classifications of the novels. I also examine publishers and consider how they, as a group of actors, affect the diversity of literature. Alongside the quantitative analyses, and as is typical of the digital humanities, I provide ample qualitative observations to contextualise the results. I approach literature as an international, dynamic whole in which different literary cultures interact with one another, forming local centres and peripheries in literary space. Using the concept of transnationality, I describe how the globalising literary field cannot be fully delimited into separate literatures but instead develops across boundaries of nationality, language and genre. The results show that, by the indicators I defined, novel literature has begun to grow more diverse since the 1990s. Even so, the field is dominated by Finnish-language literature among domestic works and by Anglo-American and Nordic literature among translations. The examination of publishers suggests that the field contains many actors of different sizes. Especially since the turn of the millennium, the share of small actors and self-published works has grown and challenged publishers' traditional role as the gatekeepers of literature. The study shows how library metadata can be put to use in digital literary research. In its abundance, Kirjasampo proved to be a versatile source of information from which conclusions can be drawn about the broad developmental arcs of Finnish literature.
  • Bedretdin, Ümit (2022)
    This thesis presents the development process of a text classifier based on supervised machine learning from the perspective of media research. The chosen approach makes it possible to harness a media researcher's expert knowledge for wide-ranging computational analysis and the processing of large datasets. The thesis develops a neural-network-based text classifier and uses it to compare how well different classification features extracted from text can model the framing tactics and topics of journalistic texts. The datasets used in the development work were annotated as part of two media research projects. The first studies the ways in which the counter-media outlet MV-lehti reframes mainstream media articles. Its dataset consists of 37,185 MV-lehti articles, from which three different framing tactics have been identified (Toivanen et al. 2021); the classifier is meant to recognise these automatically from the text. The second project focuses on the alcohol policy debate in the mainstream media, for which a dataset of 33,902 articles was collected from the news of Yle, Iltalehti and STT (the ongoing Vallan virrat research project). Here the classifier's task is to identify the articles that contain discussion of alcohol policy. The aim of the thesis is to determine which features of the text are best suited as classification features for each task, and which of them lead to the best classification accuracy. The classification features are sentence-level contextual information extracted from the BERT language model, properties related to article formatting, such as bold text and HTML code, and article-specific topic distributions produced by topic modelling. Preliminary experiments with a classifier using contextual information alone were promising, but its accuracy did not reach the required level. It was therefore necessary to find out whether the classifier's performance improves when different features are combined.
The hypothesis is plausible because, for example, BERT-based embeddings encode the linguistic and distributional information of a sequence a few sentences long, whereas a topic model contains broader structural information. These features would complement each other in an article-level classification task. Combining the contextual information of texts with topic modelling has recently yielded improvements in various text classification benchmarks and applications (Peinelt et al. 2020, Glazkova 2021). In this thesis, however, combining contextual features with topic model information yields only marginal improvements, and only in certain settings. Nevertheless, the developed classifier performs better in many classification tasks than a classifier using contextual features alone. Potential targets for further development are also identified that could lead to still better classification accuracy. The experiments indicate that automatic classification based on frame analysis using neural networks is possible, but the accuracy and interpretability of the classifiers still leave room for improvement, and they are not yet accurate enough for conclusions that require a high degree of certainty.
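As a rough illustration of how sentence-level contextual vectors, a topic distribution and formatting features could be merged into one article-level input, consider the minimal sketch below. The abstract does not say how the features were combined; mean pooling and plain concatenation are assumptions made here, and the function names, the toy two-dimensional "embeddings" and the formatting features (a bold-text count and an HTML flag) are hypothetical.

```python
from typing import List, Sequence


def mean_pool(sentence_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average sentence-level vectors into one article-level vector."""
    if not sentence_vectors:
        raise ValueError("need at least one sentence vector")
    dim = len(sentence_vectors[0])
    count = len(sentence_vectors)
    return [sum(vec[k] for vec in sentence_vectors) / count for k in range(dim)]


def combine_features(
    sentence_vectors: Sequence[Sequence[float]],
    topic_distribution: Sequence[float],
    formatting_features: Sequence[float],
) -> List[float]:
    """Concatenate the three feature groups into one classifier input.

    Assumption: simple concatenation; the thesis may combine them differently.
    """
    return (
        mean_pool(sentence_vectors)
        + list(topic_distribution)
        + list(formatting_features)
    )


# Toy values: two 2-dimensional sentence vectors (real BERT vectors are far
# larger), a 3-topic distribution, and two hypothetical formatting features.
article_vector = combine_features(
    sentence_vectors=[[0.0, 1.0], [1.0, 0.0]],
    topic_distribution=[0.5, 0.25, 0.25],
    formatting_features=[3.0, 1.0],
)
print(article_vector)
```

The resulting vector would then be passed to whatever classifier is being trained; each feature group can be dropped from the concatenation to compare feature sets.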
  • Kajala, Jukka (2023)
    According to Malchukov, Haspelmath and Comrie, a ditransitive construction consists of a ditransitive verb, an agent argument, a recipient-like argument, and a theme argument. Languages code the relations between these arguments by flagging (noun-based marking), by indexing (verb-based marking), or by word order. Typologically, ditransitive constructions can be divided into three alignment types: indirective, secundative and neutral. In indirective alignment the recipient argument is marked differently from the theme and the monotransitive patient; in secundative alignment the theme argument is marked differently from the recipient and the monotransitive patient; in neutral alignment all three arguments are marked in the same way. Swahili, a Bantu language, is a prominent lingua franca spoken in Eastern Africa by approximately 100 million people. Swahili is an agglutinative language with rich verbal morphology. Swahili morphosyntax is based on a noun class system, in which each noun belongs to a certain noun class. Briefly, the Swahili verb cluster is constructed by adding subject and object markers, which are determined by the nouns or persons they refer to, to the verbal root. The Swahili verb cluster permits at most one object marker. Prior studies on Swahili object marking and ditransitive constructions reveal that the patient argument is marked using indexing. Swahili has no case marking, so no flagging is used. In ditransitive constructions the recipient is marked by an object marker on the verb. Because the recipient and patient arguments are marked using the same method, the alignment type of Swahili ditransitive clauses is secundative. In the early grammars and textbooks, the linear order of the two overt ditransitive objects is suggested to be recipient first, theme second.
Later studies suggest that the order might vary. As part of this study, a corpus study using the Helsinki Corpus of Swahili was carried out. Its findings confirm the later studies: the linear order of the two objects shows variation. Syntactically heavier objects seem to prefer the later position.
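The three alignment types defined above can be restated as a small decision rule. The sketch below is only an illustration: the function name and the string labels for marking strategies are hypothetical, and the Swahili call encodes the abstract's description (recipient and monotransitive patient both indexed on the verb, theme without an object marker) rather than any corpus data.

```python
def ditransitive_alignment(recipient: str, theme: str, patient: str) -> str:
    """Classify alignment from how R, T and the monotransitive P are marked."""
    if recipient == theme == patient:
        return "neutral"        # all three arguments marked the same way
    if theme == patient:
        return "indirective"    # R is marked differently from T and P
    if recipient == patient:
        return "secundative"    # T is marked differently from R and P
    return "other"              # none of the three types described above


# Swahili as described in the abstract: R and P take the object marker,
# while the theme of a ditransitive clause is not indexed on the verb.
swahili = ditransitive_alignment(recipient="index", theme="unmarked", patient="index")
print(swahili)
```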
  • Koho, Tiina (2022)
    Text normalisation is the process of converting non-standard written language into a standardised form. Dialects are one example of non-standard language, and they can differ considerably from the standardised general language. In addition, Finnish orthography is largely phonemic, which makes it possible to represent the characteristic features of spoken language in writing. Especially on informal platforms and in colloquial contexts such as social media, Finnish speakers may write words the way they would normally pronounce them when speaking. Data consisting of such non-standard language can also be found for natural language processing purposes, for example on Twitter. However, natural language processing tools designed for conventional standard-language text do not necessarily achieve the desired results when applied to colloquial data, in which case text normalisation can be used as an intermediate step. In the normalisation process, input text that is colloquial or otherwise contains non-standard language is converted into a standardised spelling that natural language processing tools handle better. This thesis builds on earlier research on the normalisation of Finnish dialects. Earlier studies have found that character-based BRNN (Bidirectional Recurrent Neural Network) models achieve good results in normalising Finnish dialects when the input consists of words in chunks of three. This means that the system receives a group of three words at a time as input, and each word is further split into characters separated by spaces. This thesis sought to use the same methods and data as the earlier research so that the results would be comparable.
The data come from the Suomen kielen näytteitä (Samples of Spoken Finnish) corpus maintained by the Institute for the Languages of Finland (Kotimaisten kielten keskus), and normalisation was performed with the open-source library OpenNMT. The results of the experiments carried out in this thesis appear to confirm the findings of the earlier studies, but there are also indications that neural network models might benefit from inputs made up of longer chunks. In addition to the BRNN model, other neural network architectures are also tried, but a comparison of word error rate (WER) values shows that the BRNN model performs the normalisation task better than the other architectures.
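The input format and the evaluation metric described above can be sketched in a few lines of Python. This is a minimal illustration, not the thesis's actual preprocessing: the function names, the "_" word-boundary token and the example words are assumptions, and the WER here is the usual word-level edit distance divided by the reference length.

```python
from typing import List


def to_char_chunks(words: List[str], chunk_size: int = 3,
                   boundary: str = "_") -> List[str]:
    """Group words into chunks and split each word into space-separated characters."""
    chunks = []
    for start in range(0, len(words), chunk_size):
        group = words[start:start + chunk_size]
        # Each word becomes e.g. "m i e"; the boundary token (an assumption)
        # keeps word edges visible to a character-level model.
        spelled = [" ".join(word) for word in group]
        chunks.append(f" {boundary} ".join(spelled))
    return chunks


def word_error_rate(reference: List[str], hypothesis: List[str]) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    # prev[j] is the edit distance between reference[:i-1] and hypothesis[:j].
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(reference)


# Hypothetical dialectal input and its standard-language reference.
dialect = ["mie", "oon", "täälä", "nyt"]
print(to_char_chunks(dialect))

reference = ["minä", "olen", "täällä", "nyt"]
hypothesis = ["minä", "oon", "täällä", "nyt"]  # one word left unnormalised
print(word_error_rate(reference, hypothesis))
```

Changing `chunk_size` gives the longer input chunks that the results hint might be beneficial; the final chunk is simply shorter when the word count is not a multiple of the chunk size.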