
Browsing by master's degree program "Tietojenkäsittelytieteen maisteriohjelma" (Master's Programme in Computer Science)


  • Harviainen, Juha (2021)
    Computing the permanent of a matrix is a famous #P-hard problem with a wide range of applications. The fastest known exact algorithms for the problem require an exponential number of operations, and all known fully polynomial randomized approximation schemes are rather complicated to implement and have impractical time complexities. The most promising recent advances in approximating the permanent are based on rejection sampling and upper bounds for the permanent. In this thesis, we improve the current state of the art by developing the deep rejection sampling method, which combines an exact algorithm with the rejection sampling method. The algorithm precomputes a dynamic programming table that tightens the initial upper bound used by the rejection sampling method; in a sense, the table is used to jump-start the sampling process. We give a high-probability upper bound on the time complexity of the deep rejection sampling method for random (0, 1)-matrices in which each entry is 1 with probability p. For matrices with p < 1/5, our high-probability bound is stronger than in previous work. In addition, we empirically observe that our algorithm outperforms earlier rejection sampling methods, testing it with different parameters against other algorithms on multiple classes of matrices. The improvements in sampling times are especially notable when the ratios of the permanental upper bounds to the exact value of the permanent are large.
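    For context on the exact exponential-time algorithms mentioned above, the following is a minimal sketch of Ryser's formula, a classic O(2^n · n^2) exact method for the permanent. It is included only as an illustration of the baseline such theses build on, not as this thesis's own algorithm.

    ```python
    from itertools import combinations

    def permanent_ryser(A):
        """Exact permanent of an n-by-n matrix via Ryser's formula:
        per(A) = (-1)^n * sum over nonempty column subsets S of
        (-1)^|S| * prod_i sum_{j in S} A[i][j]. O(2^n * n^2) time."""
        n = len(A)
        total = 0
        for r in range(1, n + 1):
            for cols in combinations(range(n), r):
                prod = 1
                for row in A:
                    prod *= sum(row[j] for j in cols)
                total += (-1) ** r * prod
        return (-1) ** n * total

    print(permanent_ryser([[1, 1], [1, 1]]))  # 2
    ```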
  • Kahilakoski, Marko (2022)
    Various Denial of Service (DoS) attacks are a common phenomenon on the Internet. They can consume server resources, congest networks, disrupt services, or even halt systems. Many machine learning approaches attempt to detect and prevent attacks at multiple levels of abstraction. This thesis examines and reports different aspects of creating and using a dataset for machine learning purposes to detect attacks in a web server environment. We describe the problem field, the origins of and reasons behind the attacks, their typical characteristics, and various attack types. We detail ways to mitigate the attacks and provide a review of current benchmark datasets. For the dataset used in this thesis, network traffic was captured in a real-world setting and flow records were labeled. Experiments performed include selecting important features, comparing two supervised learning algorithms, and observing how a classifier model trained on network traffic from a specific date performs in detecting new malicious records over time in the same environment. The model was also tested with a recent benchmark dataset.
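    As a generic illustration of the kind of experiment described above, the sketch below compares two supervised learners on labeled flow records with scikit-learn. The file name, feature columns, and label values are hypothetical; the thesis's actual features and algorithms may differ.

    ```python
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    # Hypothetical labeled flow records: one row per network flow.
    flows = pd.read_csv("labeled_flows.csv")
    X = flows[["duration", "packets", "bytes", "dst_port"]]  # example features
    y = flows["label"]  # e.g. "benign" or "attack"

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y)

    for model in (RandomForestClassifier(n_estimators=100),
                  LogisticRegression(max_iter=1000)):
        model.fit(X_tr, y_tr)
        print(type(model).__name__)
        print(classification_report(y_te, model.predict(X_te)))
    ```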
  • Virtanen, Lasse (2023)
    The multi-armed bandit is a sequential decision-making problem in which an agent chooses actions and receives rewards. The agent faces an explore-exploit dilemma: it has to balance exploring its options to find the optimal actions against exploiting by choosing the empirically best actions. The problem can also be solved by multiple agents who collaborate in a federated learning setting, where agents do not share their raw data samples; instead, small updates containing learned parameters are shared. In this setting, the learning process can happen with a central server that coordinates the agents to learn the global model, or in a fully decentralized fashion where agents communicate with each other to collaborate. The distribution of rewards may be heterogeneous, meaning that the agents face distributions with local biases. Depending on the context, this can be handled by cancelling the biases through averaging, or by personalizing the global model to fit each individual agent's local biases. Another common characteristic of federated multi-armed bandits is preserving privacy: even though only parameter updates are shared, they can be used to infer the original data. To privatize the data, a method known as differential privacy is applied by adding enough random noise to mask the effect of a single contribution. The newest area of interest for federated multi-armed bandits is security. Collaboration between multiple agents means more opportunities for attacks. Achieving robust security means defending against Byzantine attacks that inject arbitrary data into the learning process to degrade model accuracy. This thesis is a literature review that explores how the federated multi-armed bandit problem is solved, and how privacy and security for it are achieved.
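    As a small illustration of two ingredients mentioned above (an explore-exploit rule and differentially private updates), the sketch below shows an epsilon-greedy arm choice and a Laplace-noised mean-reward update. It is a generic sketch under simplified assumptions, not any specific algorithm from the reviewed literature.

    ```python
    import numpy as np

    def private_update(counts, sums, epsilon, reward_range=1.0):
        """Per-arm mean rewards with Laplace noise added before sharing.
        The noise scale follows the standard Laplace mechanism for means
        of values bounded by reward_range (a simplifying assumption)."""
        counts = np.maximum(counts, 1)
        scale = reward_range / (counts * epsilon)  # sensitivity / epsilon
        return sums / counts + np.random.laplace(0.0, scale)

    def choose_arm(global_means, eps=0.1):
        """Epsilon-greedy: explore with probability eps, else exploit."""
        if np.random.rand() < eps:
            return np.random.randint(len(global_means))
        return int(np.argmax(global_means))
    ```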
  • Nygren, Henrik (2024)
    The MOOC Center of the University of Helsinki maintains a learning management system, primarily used in the online courses offered by the Department of Computer Science. The learning management system is being used in more and more courses, leading to a need for additional exercise types. To satisfy this need, we plan to use additional teams of developers to create these exercise types. However, we would like to minimize any negative effects that the new exercise types may have on the overall system, specifically regarding stability and security. In this work, we propose a plugin system for creating new exercise types and implement it in a production system used by real students. The system's plugins are deployed as separate services and use sandboxed IFrames for their user interfaces. Communication with the plugins occurs through HTTP requests and message passing. The designed plugin system fulfilled its aims and worked in its production deployment. Notably, we conclude that it is challenging for plugins to disrupt the host system. This plugin system serves as an example that it is possible to create a plugin system in which the plugins are isolated from the host system.
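    As a rough illustration of host-to-plugin communication over HTTP, the sketch below forwards a grading request to a plugin service. The endpoint path, payload fields, and response shape are hypothetical placeholders; the thesis's actual protocol is not reproduced here.

    ```python
    import json
    import urllib.request

    def grade_with_plugin(plugin_url, exercise_spec, submission):
        """Send a student submission to a plugin service and return its
        response. All names here are illustrative placeholders."""
        payload = json.dumps({
            "exercise": exercise_spec,   # plugin-defined exercise data
            "submission": submission,    # the student's answer
        }).encode("utf-8")
        req = urllib.request.Request(
            f"{plugin_url}/api/grade",   # hypothetical plugin endpoint
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=10) as resp:
            return json.load(resp)  # e.g. {"score": 0.8, "feedback": "..."}
    ```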
  • Liu, Yang (2020)
    Automatic readability assessment is considered a challenging task in NLP due to its high degree of subjectivity. Most prior work in assessing readability has focused on identifying the level of education necessary for comprehension, without considering text quality, i.e., how naturally the text flows from the perspective of a native speaker. In this thesis, we therefore aim to use language models, trained on well-written prose, to measure not only text readability in terms of comprehension but also text quality. We developed two word-level metrics based on the concordance of article text with predictions made using language models to assess text readability and quality. We evaluate both metrics on a set of corpora used for readability assessment or automated essay scoring (AES) by measuring the correlation between scores assigned by our metrics and human raters. According to the experimental results, our metrics are strongly correlated with text quality, achieving correlations of 0.4-0.6 on 7 out of 9 datasets. We demonstrate that GPT-2 surpasses other language models, including a bigram model, an LSTM, and a bidirectional LSTM, on the task of estimating text quality in a zero-shot setting, and that a GPT-2 perplexity-based measure is a reasonable indicator for text quality evaluation.
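    The following is a minimal sketch of the zero-shot GPT-2 perplexity measure mentioned above, using the Hugging Face transformers library. It shows only the basic idea of scoring text with a language model; the thesis's metrics are word-level and more elaborate.

    ```python
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    def perplexity(text):
        """Exponentiated mean token cross-entropy under GPT-2; lower
        values suggest text the model finds more natural."""
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            loss = model(ids, labels=ids).loss
        return torch.exp(loss).item()

    print(perplexity("The quick brown fox jumps over the lazy dog."))
    ```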
  • Holopainen, Markus (2023)
    Context: Over the past years, the development of machine learning (ML) enabled software has risen in popularity. Alongside this trend, new challenges have been identified, such as growing concerns, including ethical concerns, about the use of ML models, as misuse can lead to severe consequences for human beings. To alleviate this problem, more comprehensive model documentation has been suggested, but how can that documentation be made part of a modern, continuous development process? Objective: We design and develop a solution, consisting of a software artefact and its surrounding process, that enables and moderates continuous documentation of ML models. The solution needs to comply with the modern way of working of software development. Method: We apply the design science research methodology to divide the design and development into six separate tasks, i.e., problem identification, objective definition, design and development, demonstration, evaluation, and communication. Results: The solution uses model cards for storing model details. These model cards are tested automatically and manually, forming a quality gate and ensuring the integrity of the documentation. The software artefact is implemented in the form of a GitHub Action. Conclusion: We conclude that the software artefact supports and assures proper model documentation in the form of a model card. The artefact allows for customization by the user, thereby supporting domain-specific use cases.
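    As a rough sketch of the automated half of such a quality gate, the script below fails a CI step (for example, one run from a GitHub Action) when required model-card sections are missing. The section names and file name are hypothetical, not the artefact's actual schema.

    ```python
    import sys

    REQUIRED_SECTIONS = [          # hypothetical model-card sections
        "## Intended Use",
        "## Training Data",
        "## Evaluation",
        "## Ethical Considerations",
    ]

    def check_model_card(path):
        with open(path, encoding="utf-8") as f:
            text = f.read()
        missing = [s for s in REQUIRED_SECTIONS if s not in text]
        for section in missing:
            print(f"model card is missing section: {section}")
        return not missing

    if __name__ == "__main__":
        sys.exit(0 if check_model_card("MODELCARD.md") else 1)
    ```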
  • Alho, Riku (2021)
    Modularity is often used to manage the complexity of monolithic software systems. It reduces maintenance costs by minimizing the entanglement of software code and functionality. Modularity also lowers future development costs by enabling the reuse and stacking of different types of modular functionality and software code for different environments and software engineering problems. Although there are important differences between the problem-solving processes and practices of machine learning system developers and software engineering developers, machine learning system developers have been shown to be able to adopt much from traditional software engineering. A systematic literature review is used to identify 484 studies published in four electronic sources from January 1990 to October 2021. After examination of the papers, statistical and qualitative results are compiled for the 86 selected studies that provide sufficient information regarding the presence of modular operators and comparisons to monolithic solutions. The selected studies addressed a wide range of tasks and domains and reported performance benefits compared to monolithic machine learning and deep learning methods. Nearly two thirds of the studies found that Modular Neural Networks (MNNs) improve task accuracy compared to monolithic solutions. Only 16.3% of the studies reported efficiency values in their comparisons. Over 82.5% of the studies that reported their MNNs' efficiency found benefits in computation time, memory/size, and energy consumption compared to monolithic solutions. The majority of studies were carried out in laboratory environments on singular, focused tasks with static requirements, which may have limited the visibility of modular operators. MNNs show promise for performance and efficiency in machine learning. More comparable studies are needed, especially from industry, that use MNNs under constantly changing requirements and thus apply multiple modular operators.
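    For readers unfamiliar with the term, the sketch below shows one common form of a Modular Neural Network in PyTorch: independent expert modules whose outputs a learned gate combines. The reviewed studies cover many different modular designs; this is only a generic example.

    ```python
    import torch
    import torch.nn as nn

    class ModularNet(nn.Module):
        """Independent expert modules combined by a softmax gate."""
        def __init__(self, in_dim, hidden, out_dim, n_modules=3):
            super().__init__()
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                              nn.Linear(hidden, out_dim))
                for _ in range(n_modules)])
            self.gate = nn.Linear(in_dim, n_modules)

        def forward(self, x):
            weights = torch.softmax(self.gate(x), dim=-1)             # (B, M)
            outs = torch.stack([m(x) for m in self.experts], dim=-1)  # (B, O, M)
            return (outs * weights.unsqueeze(1)).sum(dim=-1)          # (B, O)

    y = ModularNet(16, 32, 4)(torch.randn(8, 16))  # -> shape (8, 4)
    ```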
  • Hautala, Jaana (2023)
    Artificial Intelligence (AI) has revolutionized various domains of software development, promising solutions that can adapt and learn. However, the rise of AI systems has also been accompanied by ethical concerns, primarily related to the unintentional biases these systems can inherit during the development process. This thesis presents a thematic literature review aiming to identify and examine existing methodologies and strategies for preventing bias in iterative AI software development. Methods employed for this review include a formal search strategy using defined inclusion and exclusion criteria, and a systematic process for article sourcing, quality assessment, and data collection. Twenty-nine articles were analyzed, resulting in the identification of eight major themes concerning AI bias mitigation within iterative software development, ranging from bias in data and algorithmic processes to fairness and equity in algorithmic design. Findings indicate that while various approaches for bias mitigation exist, gaps remain. These include the need to adapt strategies to agile or iterative frameworks, resolve the trade-off between effectiveness and fairness, understand the complexities of bias for tailored solutions, and assess the real-world applicability of these techniques. This synthesis of key trends and insights highlights these specific areas as requiring further research.
  • Porkka, Otto (2022)
    Blockchain technologies and cryptocurrencies have gained massive popularity in the past few years. Smart contracts extend the utility of these distributed ledgers to distributed state machines, where anyone can store and run code and then mutually agree on the next state. This opens up a whole new world of possibilities, but also many new security challenges. In this thesis we give an up-to-date survey of smart contract security issues. First we give a brief introduction to blockchains and smart contracts and explain the most common attack types and some mitigations against them. Then we summarize and analyse our findings. We find that many of the attacks could be avoided or at least severely mitigated if coders followed good coding practices and used design patterns that are proven to be good. Another finding is that changing the underlying blockchain technology to counter the issues is usually not the best way forward, as it is hard and troublesome to do and might restrict the usability of contracts too much. Lastly, we find that many new automated security tools are being developed and used, which indicates movement towards more conventional coding, where automated tools like scanners and analysers cover a large set of security issues.
  • Mylläri, Juha (2022)
    Anomaly detection in images is the machine learning task of classifying inputs as normal or anomalous. Anomaly localization is the related task of segmenting input images into normal and anomalous regions. The output of an anomaly localization model is a 2D array, called an anomaly map, of pixel-level anomaly scores. For example, an anomaly localization model trained on images of non-defective industrial products should output high anomaly scores in image regions corresponding to visible defects. In unsupervised anomaly localization the model is trained solely on normal data, i.e. without labelled training observations that contain anomalies. This is often necessary as anomalous observations may be hard to obtain in sufficient quantities and labelling them is time-consuming and costly. Student-teacher feature pyramid matching (STFPM) is a recent and powerful method for unsupervised anomaly detection and localization that uses a pair of convolutional neural networks of identical architecture. In this thesis we propose two methods of augmenting STFPM to produce better segmentations. Our first method, discrepancy scaling, significantly improves the segmentation performance of STFPM by leveraging pre-calculated statistics containing information about the model’s behaviour on normal data. Our second method, student-teacher model assisted segmentation, uses a frozen STFPM model as a feature detector for a segmentation model which is then trained on data with artificially generated anomalies. Using this second method we are able to produce sharper anomaly maps for which it is easier to set a threshold value that produces good segmentations. Finally, we propose the concept of expected goodness of segmentation, a way of assessing the performance of unsupervised anomaly localization models that, in contrast to current metrics, explicitly takes into account the fact that a segmentation threshold needs to be set. Our primary method, discrepancy scaling, improves segmentation AUROC on the MVTec AD dataset over the base model by 13%, measured in the shrinkage of the residual (1.0 − AUROC). On the image-level anomaly detection task, a variant of the discrepancy scaling method improves performance by 12%.
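    The sketch below illustrates the general student-teacher idea behind STFPM: per-layer feature discrepancies, optionally normalized with precomputed statistics in the spirit of discrepancy scaling, combined into a pixel-level anomaly map. The (mu, sigma) arguments are placeholders for such statistics; this is not the thesis's reference implementation.

    ```python
    import torch
    import torch.nn.functional as F

    def anomaly_map(teacher_feats, student_feats, image_size,
                    mu=None, sigma=None):
        """teacher_feats/student_feats: lists of (1, C, H, W) feature maps
        from matching layers of the teacher and student networks."""
        amap = torch.ones(image_size)
        for i, (t, s) in enumerate(zip(teacher_feats, student_feats)):
            t = F.normalize(t, dim=1)            # unit-norm channel vectors
            s = F.normalize(s, dim=1)
            d = 0.5 * ((t - s) ** 2).sum(dim=1)  # (1, H, W) discrepancy
            if mu is not None:                   # scale with normal-data stats
                d = (d - mu[i]) / sigma[i]
            d = F.interpolate(d.unsqueeze(1), size=image_size,
                              mode="bilinear", align_corners=False)
            amap = amap * d.squeeze()            # STFPM multiplies layer maps
        return amap
    ```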
  • Thapa Magar, Purushottam (2021)
    Rapid growth and advancement of next generation sequencing (NGS) technologies have changed the landscape of genomic medicine. Today, clinical laboratories perform DNA sequencing on a regular basis, which is an error-prone process. Erroneous data affects downstream analysis and produces fallacious results. Therefore, external quality assessment (EQA) of laboratories working with NGS data is crucial. Validation of variations such as single nucleotide polymorphisms (SNPs) and InDels (<50 bp) is fairly accurate these days. However, detection and quality assessment of large changes such as copy number variations (CNVs) continues to be a concern. In this work, we aimed to study the feasibility of an automated CNV concordance analysis for laboratory EQA services. We benchmarked variants reported by 25 laboratories against the highly curated gold standard for the son (HG002/NA24385) of the Ashkenazim trio from the Personal Genome Project, published by the Genome in a Bottle Consortium (GIAB). We employed two methods to assess concordance of CNVs: sequence-based comparison with Truvari and an in-house exome-based comparison. For deletion calls of two whole genome sequencing (WGS) submissions, Truvari achieved precision greater than 88% and recall greater than 68%. Conversely, the in-house method's precision and recall peaked at 39% and 7.9%, respectively, for one WGS submission for both deletion and duplication calls. The results indicate that automated CNV concordance analysis of deletion calls for WGS-based callsets might be feasible with Truvari. On the other hand, results for panel-based targeted sequencing showed precision and recall for deletion calls ranging from 0-80% and 0-5.6%, respectively, with Truvari. This suggests that automated concordance analysis of CNVs for targeted sequencing remains a challenge. In conclusion, CNV concordance analysis depends on how the sequence data is generated.
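    As a simplified illustration of CNV concordance, the sketch below matches calls to a truth set by 50% reciprocal overlap and reports precision and recall. Truvari and the thesis's in-house method apply additional type-, size-, and sequence-aware rules; this shows only the core interval-matching idea.

    ```python
    def reciprocal_overlap(a, b):
        """Reciprocal overlap of two (start, end) intervals, in [0, 1]."""
        inter = min(a[1], b[1]) - max(a[0], b[0])
        if inter <= 0:
            return 0.0
        return min(inter / (a[1] - a[0]), inter / (b[1] - b[0]))

    def concordance(calls, truth, threshold=0.5):
        """Precision/recall of CNV calls (one chromosome, one CNV type)
        against a truth set, matched by >= 50% reciprocal overlap."""
        tp = sum(any(reciprocal_overlap(c, t) >= threshold for t in truth)
                 for c in calls)
        found = sum(any(reciprocal_overlap(t, c) >= threshold for c in calls)
                    for t in truth)
        precision = tp / len(calls) if calls else 0.0
        recall = found / len(truth) if truth else 0.0
        return precision, recall
    ```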
  • Lintunen, Milla (2023)
    Fault management in mobile networks is required for detecting, analysing, and fixing problems appearing in the network. When a large problem appears, multiple alarms are generated from the network elements. Traditionally, a Network Operations Center (NOC) processes the reported failures, creates trouble tickets for problems, and performs root cause analysis. However, alarms do not reveal the root cause of a failure, and the correlation of alarms is often complicated to determine. If the network operator can correlate alarms and manage clustered groups of alarms instead of separate ones, it saves costs, preserves the availability of the mobile network, and improves the quality of service. Operators may have several electricity providers, and the network topology is not correlated with the electricity topology. Additionally, network sites and other network elements are not evenly distributed across the network. Hence, we investigate the suitability of density-based clustering methods for detecting mass outages and performing alarm correlation to reduce the number of created trouble tickets. This thesis focuses on assisting root cause analysis and detecting correlated power and transmission failures in the mobile network. We implement a Mass Outage Detection Service and formulate a custom density-based algorithm. Our service performs alarm correlation and creates clusters of possible power and transmission mass outage alarms. We have filed a patent application based on the work done in this thesis. Our results show that we are able to detect mass outages in real time from the data streams. The results also show that the detected clusters reduce the number of created trouble tickets and help reduce the costs of running the network. The number of trouble tickets decreases by 4.7-9.3% for the alarms we process in the service in the tested networks. When we consider only alarms included in the mass outage groups, the reduction is over 75%. Therefore, continuing to use, test, and develop the implemented Mass Outage Detection Service is beneficial for operators and an automated NOC.
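    To illustrate the general density-based idea (the thesis implements a custom algorithm rather than stock DBSCAN), the sketch below clusters alarms by site coordinates and alarm time; alarms sharing a cluster label become one mass-outage candidate. The features and parameters are hypothetical.

    ```python
    import numpy as np
    from sklearn.cluster import DBSCAN

    # Hypothetical alarms: (x_km, y_km, minutes_since_midnight).
    alarms = np.array([
        [0.1, 0.2, 1.0],   # three alarms close in space and time:
        [0.3, 0.1, 2.0],   # a likely mass outage
        [0.2, 0.4, 1.5],
        [9.0, 8.5, 50.0],  # an isolated alarm (label -1 = noise)
    ])
    labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(alarms)
    print(labels)  # e.g. [0 0 0 -1]: one cluster and one unclustered alarm
    ```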
  • Gold, Ayoola (2021)
    The importance of Automatic Speech Recognition (ASR) in today's world cannot be overstated, as ASR systems play a significant role in human-computer interaction. ASR systems have been studied deeply over time, but their maximum potential is yet to be explored for the Finnish language. Development of a traditional ASR system involves a great deal of hand-crafted engineering, which has made this technology difficult and resource-intensive to develop. However, with advancements in the field of neural networks, end-to-end neural network ASR systems can be developed that automatically learn the mapping of audio to its corresponding transcript, reducing the need for hand-crafted engineering. End-to-end neural network ASR systems have been developed commercially by tech giants such as Microsoft, Google, and Amazon. However, these commercial services have limitations, such as data privacy and cost of usage. In this thesis, we explored existing studies on the development of an end-to-end neural network ASR for the Finnish language. One technique successfully utilized in the development of neural network ASR when adequate data is unavailable is transfer learning. This is the approach explored in this thesis for the development of the end-to-end neural network ASR system, and the success of this approach was evaluated. To this end, datasets collected from the Bank of Finland and Kaggle were used to fine-tune the Mozilla DeepSpeech model, a pretrained end-to-end neural network ASR for English. The results obtained by fine-tuning the pretrained English ASR for Finnish showed a word error rate as low as 40% and a character error rate as low as 22%. We therefore concluded that transfer learning is a successful technique for creating an ASR model for a new language from a model pretrained in another language, with little effort, data, and resources.
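    For reference, the word and character error rates quoted above are both derived from edit distance. A minimal sketch:

    ```python
    def levenshtein(a, b):
        """Edit distance between two sequences (classic dynamic program)."""
        prev = list(range(len(b) + 1))
        for i, x in enumerate(a, 1):
            cur = [i]
            for j, y in enumerate(b, 1):
                cur.append(min(prev[j] + 1,              # deletion
                               cur[j - 1] + 1,           # insertion
                               prev[j - 1] + (x != y)))  # substitution
            prev = cur
        return prev[-1]

    def wer(ref, hyp):
        """Word error rate: word-level edits / reference length."""
        ref_words = ref.split()
        return levenshtein(ref_words, hyp.split()) / len(ref_words)

    def cer(ref, hyp):
        """Character error rate: character-level edits / reference length."""
        return levenshtein(ref, hyp) / len(ref)
    ```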
  • Huotala, Aleksi (2021)
    Isomorphic web applications combine the best parts of static Hypertext Markup Language (HTML) pages and single-page applications. An isomorphic web application shares code between the server and the client. However, there is not much existing research on isomorphic web applications. Improving the performance, user experience, and development experience of web applications is a popular research topic in computer science. This thesis studies the benefits and challenges of isomorphism in single-page applications. To study them, a gray literature review and a case study were conducted. The articles used in the gray literature review were collected from four different websites, and a quality assessment process was conducted to ensure the gray literature could be used in this study. The case study was conducted as a developer survey, in which developers familiar with isomorphic web applications were interviewed. The results of the two studies are then compared and the key findings synthesized. The results of this study show that isomorphism in single-page applications brings benefits to both developers and end-users. Isomorphism in single-page applications is challenging to implement and has some downsides, but they mostly affect developers. The performance and search engine optimization of the application are improved. Implementing isomorphism makes it possible to share code between the server and the client, but it increases the complexity of the application. Framework and library compatibility are issues that must be addressed by the developers. The findings of this thesis give developers motivation to implement isomorphism when starting a new project or when transforming existing single-page applications to use isomorphism.
  • Kinnunen, Lauri (2022)
    This thesis is a review of articles on software-assisted floor plan design for architecture. I group the articles into optimization, case-based design, and machine learning, based on their use of prior examples, then look into each category and further classify articles along dimensions relevant to their overall approach. Case-based design was a popular research field in the 1990s and early 2000s, when several large research projects were conducted; since then, research has slowed down. Over the past 20 years, optimization methods for architectural floor plans have been researched extensively using a number of different algorithms and data models. The most popular approach is to use a stochastic optimization method such as a genetic algorithm or simulated annealing. More recently, a number of articles have investigated the possibility of applying machine learning to architectural floor plans. The advent of neural networks and GAN models, in particular, has spurred a great deal of new research. Despite considerable research efforts, assisted floor plan design has not found its way into commercial applications. To aid industry adoption, more work is needed on integrating computational design tools into existing design workflows.
  • Sarapisto, Teemu (2022)
    In this thesis we investigate the feasibility of machine learning methods for estimating the type and the weight of individual food items from images taken of customers' plates at a buffet-style restaurant. The images were collected in collaboration with the University of Turku and Flavoria, a public lunch-line restaurant, where a camera was mounted above the cashier to automatically photograph the foods chosen by the customer when they went to pay. For each image, an existing system of scales at the restaurant provided the weights of the individual food items. We describe suitable model architectures and training setups for the weight estimation and food identification tasks and explain the models' theoretical background. Furthermore, we propose and compare two methods for utilizing a restaurant's daily menu information to improve model performance in both tasks; a generic illustration of the idea follows below. We show that the models perform well in comparison to baseline methods and reach accuracy on par with other similar work. Additionally, as the images were captured automatically, in some of them the food was occluded or blurry, or the image contained sensitive customer information. To address this we present computer vision techniques for preprocessing and filtering the images. We publish the dataset containing the preprocessed images along with the corresponding individual food weights for use in future research. The main results of the project have been published as a peer-reviewed article at the International Conference on Pattern Recognition Systems 2022, where it received the best paper award.
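    One plausible way to use the daily menu, shown purely for illustration (the thesis proposes and compares its own two methods), is to mask the classifier's logits so that only food items on the day's menu can be predicted:

    ```python
    import torch

    def mask_to_menu(logits, menu_indices):
        """Zero out the probability of any food class not on today's
        menu by adding -inf to its logit before the softmax."""
        mask = torch.full_like(logits, float("-inf"))
        mask[..., menu_indices] = 0.0
        return torch.softmax(logits + mask, dim=-1)

    # A 10-class model, but only classes 2, 5, and 7 are on today's menu:
    probs = mask_to_menu(torch.randn(1, 10), [2, 5, 7])
    ```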
  • Garmuyev, Pavel (2022)
    RESTful web APIs have gained significant interest over the past decade, especially among large businesses and organizations. An important part of being able to use these public web APIs is knowing how to access, consume, and integrate them into applications. Since developers are the primary audience doing the integration, it is important to support them throughout their API adoption journey. For this, many of today's companies that are heavily invested in web APIs provide an API developer portal as part of their API management program. However, very little accessible and comprehensive information on how to build and structure API developer portals exists yet. This thesis presents an exploratory multi-case study of three publicly available API developer portals of three different commercial businesses. The objective of the case study was to identify the developer (end-user) oriented features and capabilities present on the selected developer portals, in order to understand the kinds of information and capabilities API developer portals could provide for developers in general. The exploration was split into three key focus areas: developer onboarding, web API documentation, and developer support and engagement. Based on these, three research questions were formulated. The data consisted of field notes describing observations about the portals. These notes were grouped by location and action, and analyzed to identify a key feature or capability as well as any smaller, compounding features and capabilities. The results describe the identified features and capabilities present on the studied API developer portals, and some differences between the portals are noted. The key contribution of this thesis is the results themselves, which can be used as a checklist when building a new API developer portal. The main limitation of this study is that its data collection and analysis processes were subjective and the findings have not been independently validated; such improvements remain for future work.
  • Valentine, Nicolas (2023)
    A case study on the performance impact of refactoring a Node.js component from a monolithic environment into an independent service. The study measured the response time of the blocking part of the component's JavaScript code; the non-blocking part of the code and the network overhead added by the refactoring were excluded from the performance review. A literature review did not uncover prior research on the performance impact of refactoring a Node.js component from a monolith into microservices. Several studies were found that examined the response time and throughput of REST APIs built with Node.js, with comparisons to other programming languages, and one study related to refactoring an application from a monolith into microservices, but none were directly related to the studied case. The response time of the component improved by 46.5% when it was refactored from the monolith into a microservice. It is possible that as a Node.js monolith application grows, it begins to affect the throughput of the event loop, degrading performance-critical components. For the case component, refactoring into an independent service was beneficial, improving the mean response time by 92.6 ms.
  • Porttinen, Peter (2020)
    Computing the edit distance between strings is one of the central problems in both string processing and bioinformatics. Exact solutions to edit distance take time quadratic in the lengths of the input strings. The goal of this thesis is to study a new approach to approximating edit distance. We use a chaining algorithm presented by Mäkinen and Sahlin in "Chaining with overlaps revisited" (CPM 2020), implemented verbatim. Building on the chaining algorithm, our focus is on efficiently finding a good set of anchors for it. We present three approaches to computing the anchors as maximal exact matches: the bi-directional Burrows-Wheeler transform, minimizers, and lastly a hybrid implementation of the two. Using the maximal exact matches as anchors, we can efficiently compute an optimal chaining alignment for the strings. The chaining alignment further allows us to determine all intervals where mismatches occur by looking at which sequences are not in the chain. Using these smaller intervals lets us approximate the edit distance with a high degree of accuracy and a significant speed improvement. The methods described present a way to approximate edit distance with time complexity bounded by the number of maximal exact matches.
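    As a small illustration of one anchor-finding ingredient named above, the sketch below computes window minimizers of a string; shared minimizers between two strings yield candidate anchor pairs for the chaining algorithm. This is a generic textbook version, not the thesis's BDBWT-based or hybrid implementation.

    ```python
    def minimizers(s, k=5, w=10):
        """(position, k-mer) minimizers: the lexicographically smallest
        k-mer in every window of w consecutive k-mers of s."""
        kmers = [s[i:i + k] for i in range(len(s) - k + 1)]
        picked = set()
        for start in range(len(kmers) - w + 1):
            window = range(start, start + w)
            best = min(window, key=lambda i: kmers[i])
            picked.add((best, kmers[best]))
        return sorted(picked)

    # Positions where two strings pick the same k-mer give candidate
    # anchors (i, j, k) for the chaining algorithm.
    ```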