Browsing by Author "Rahman, Dean"


  • Rahman, Dean (2022)
    Finland has comprehensive requirements that procurement by any government organization go through a tendering process in which information about each tender is made available not only to vendors and service providers, but to everyone else in Finland as well. This is accomplished through the website Hilma and should make tenders easy to find. Moreover, variance in domain terminology is not thought to be the problem in Finnish that it is in English. For instance, the last four years of tenders on Hilma never refer to jatkuva parantaminen as toiminnallinen erinomaisuus, whereas “continuous improvement” and “operational excellence” could be used interchangeably in English. And yet, it is considered very difficult for a vendor or service provider to find applicable tenders on Hilma. The first difficulty is not lexical variability, as it might be in English, but the difference in concept paradigms between the private and public sectors in Finland. Whereas a taxi company representative would be looking for tenders about transportation services, a public officer could be posting a tender about social equity for the disabled. The second difficulty is that the Hilma search engine is purely Boolean, with restrictive string match criteria, rather than inviting natural language questions. Finally, the Hilma search engine does not account for Finnish being a highly inflecting and compounding language, where words are usually inflected rather than combined with adpositions, and where the parts of a compound word are affixed together without hyphenation. Many information retrieval approaches would look outside the corpus for query expansion terms. Natural language processing might also offer the potential to look for paraphrases in existing parallel corpora on tenders throughout the European Union rather than in Hilma.
However, this thesis focuses on clustering the tenders posted in Finnish on Hilma, applying the comprehensive workflow of the very recent BERTopic package for Python. All documents in each cluster are concatenated, and the highest TF-IDF-scoring words in the concatenated document are designated as “search extension terms.” If a Hilma user were to enter one of the terms, the user would be invited to perform parallel searches with the remaining terms as well. The first main contribution of this thesis is to use state-of-the-art models and algorithms to represent the corpus, reduce the dimensionality of the representations and hierarchically cluster the representations. Second, this thesis develops analytical metrics for automatic evaluation of the efficacy of the clusterings and for comparisons among model iterations that programmatically remove progressively more of the distractions to clustering that are discovered in the corpus. Finally, this thesis performs case studies on Hilma to demonstrate the remarkable efficacy of the search extension terms in generating tremendous numbers of additional useful matches, addressing paradigm-based differences in terminology, morphovariance and affixation.
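
The "search extension terms" idea in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes the clusters are already given, uses a plain TF-IDF computed over one concatenated pseudo-document per cluster (BERTopic applies its own class-based weighting), and the function names and the tiny Finnish example texts are hypothetical.

```python
import math
from collections import Counter


def search_extension_terms(clusters, top_n=3):
    """Return the top-scoring TF-IDF words per cluster.

    clusters: dict mapping a cluster label to a list of document strings.
    Assumption: whitespace tokenization, no lemmatization or stop-word removal.
    """
    # Concatenate all documents of a cluster into one pseudo-document.
    merged = {
        label: " ".join(docs).lower().split()
        for label, docs in clusters.items()
    }
    n_clusters = len(merged)

    # Document frequency: in how many pseudo-documents does each word occur?
    df = Counter()
    for tokens in merged.values():
        df.update(set(tokens))

    terms = {}
    for label, tokens in merged.items():
        tf = Counter(tokens)
        total = len(tokens)
        scores = {
            word: (count / total) * math.log(n_clusters / df[word])
            for word, count in tf.items()
        }
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        # Words occurring in every cluster score 0 and are dropped.
        terms[label] = [w for w in ranked if scores[w] > 0][:top_n]
    return terms


def suggest_parallel_searches(query, terms):
    """Given one search term, suggest the other terms of its cluster(s)."""
    suggestions = []
    for cluster_terms in terms.values():
        if query in cluster_terms:
            suggestions.extend(t for t in cluster_terms if t != query)
    return suggestions


# Hypothetical toy data, not taken from Hilma.
clusters = {
    "transport": [
        "kuljetuspalvelu taksi hankinta",
        "vammaispalvelu kuljetuspalvelu hankinta",
        "taksi esteettömyys hankinta",
    ],
    "cleaning": [
        "siivouspalvelu hankinta",
        "siivouspalvelu kiinteistö hankinta",
    ],
}

terms = search_extension_terms(clusters)
print(terms)
print(suggest_parallel_searches("taksi", terms))
```

In this toy setting, a user searching for "taksi" would be offered the remaining terms of the same cluster as parallel searches, which is the mechanism the abstract describes for bridging private-sector and public-sector vocabulary.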