Neural Transfer Learning for Truly Low-Resource Natural Language Processing

dc.date.accessioned 2023-06-30T05:18:25Z
dc.date.available 2023-06-30T05:18:25Z
dc.date.issued 2023-06-30
dc.identifier.uri http://hdl.handle.net/123456789/47712
dc.title Neural Transfer Learning for Truly Low-Resource Natural Language Processing en
ethesis.discipline.URI "null"
ethesis.department.URI "null"
ethesis.faculty.URI http://data.hulib.helsinki.fi/id/null
dct.creator Soisalon-Soininen, Eliel
dct.issued 2023 und
dct.abstract The vast majority of the world's languages are low-resource, lacking the data resources required in advanced natural language processing (NLP) based on data-intensive deep learning. Furthermore, annotated training data can be insufficient in some domains even within resource-rich languages. Low-resource NLP is crucial for both the inclusion of language communities in the NLP sphere and the extension of applications over a wider range of domains. The objective of this thesis is to contribute to this long-term goal, especially with regard to truly low-resource languages and domains. We address truly low-resource NLP in the context of two tasks. First, we consider the low-level task of cognate identification, since cognates are useful for the cross-lingual transfer of many lower-level tasks into new languages. Second, we examine the high-level task of document planning, a fundamental task in data-to-text natural language generation (NLG), where many domains are low-resource. Thus, domain-independent document planning supports the transfer of NLG across domains. Following recent encouraging results, we propose neural network models for these tasks, using transfer learning methods in three low-resource scenarios. We divide our high-level objective into three research tasks characterised by different resource conditions. In our first research task, we address cognate identification in endangered Sami languages of the Uralic family, given scarce labelled training data. We propose a Siamese convolutional neural network (S-CNN) and a support vector machine (SVM), which we pre-train on unrelated Indo-European data, since the Sami languages lack high-resource close relatives. We find that S-CNN performs best at direct transfer to Sami, and adapts fast when fine-tuned on a small amount of Sami data. In our second research task, we address a scenario with only unlabelled data to adapt S-CNN from Indo-European to Uralic data.
We propose both discriminative adversarial networks and pre-trained symbol embeddings, finding that adversarial adaptation outperforms an unadapted model, while symbol embeddings are beneficial when languages have disparate orthographies. In our third research task, we address document planning in data-to-text generation of news, in a domain with no annotated training data whatsoever. We propose distant supervision, automatically constructing labelled data from a news corpus, and train a neural model for sentence ordering, a task related to document planning. We examine Siamese, positional, and pointer networks, and find that a variant of S-CNN results in generation with higher human-perceived quality than heuristic baselines. The contributions of this thesis include addressing novel low-resource scenarios in two NLP tasks for which the potential of deep learning has not been fully explored. We propose novel approaches to these tasks using neural models in combination with transfer learning, and our experiments indicate their performance in comparison with baselines. Finally, although we acknowledge that rule-based methods and heuristics might still be superior to deep learning in truly low-resource scenarios, our approaches are more language- and domain-independent, supporting a wider coverage of NLP across languages and domains. en
ethesis.language.URI http://data.hulib.helsinki.fi/id/languages/eng
ethesis.language englanti fi
ethesis.language English en
ethesis.language engelska sv
ethesis.supervisor Toivonen, Hannu
ethesis.supervisor Granroth-Wilding, Mark
dct.identifier.ethesis E-thesisID:56fbaec7-e6af-4d98-96bd-baf35379e722
dct.identifier.urn URN:NBN:fi:hulib-202306303406

Files in this item

Files Size Format View
Soisalon-Soininen_Eliel_dissertation_2023.pdf 1.025Mb PDF