Browsing by Subject "HPC"

Now showing items 1-2 of 2

Methods for investigating the external and internal validity of machine learned signals

Hrin, Adam (2023)

Understanding the behaviour of Machine Learning models is becoming increasingly important as models grow in complexity. This thesis proposes a framework for validating machine learned signals using performance metrics and model explainability tools, applied in the context of Digital Humanities and Social Sciences. The framework allows investigating whether the real-world problem that the model tries to represent is well-defined and whether the model accurately captures the phenomena at hand. Explainability techniques such as SHAP, LIME and gradient-based methods have been used. These produce importance scores for the features on which the model bases its decisions. The cases presented in this thesis are related to research in Computational History and Historical Discourse Analysis with High Performance Computing. The subject of analysis is the large language model BERT, fine-tuned on Eighteenth Century Collections Online (ECCO) documents to classify books into genres. Investigating the performance of the classifier with precision-recall curves suggests that the class signals might be overlapping and not clearly delineated. Further results do not suggest that the noise present in the data, caused by the OCR digitisation process, has significant importance for the model’s decision making. The explainability techniques helped uncover the model’s inner workings by showing that the model gets its signal mostly from the beginnings of samples. In a proxy task, a simpler linear model was trained to perform a projection from keywords to genres and showed inconsistency in the explainability method. Different subsets of data have been investigated, as given by cells of a confusion matrix, the confidence in prediction probability, or additional metadata. Investigating individual samples allows for qualitative analysis as well as more detailed signal understanding.

On Integrating Cloud and High Performance Computing Environments In Machine Learning Operations

Siilasjoki, Niila Johan (2024)

Machine learning operations (MLOps) is a paradigm at the intersection of machine learning (ML), software engineering, and data engineering. It focuses on the development and operations of software engineering by providing principles, components, and workflows that form the MLOps operational support system (OSS) platform. The growing use of ML, together with increasing data size and model complexity, has created a challenge: MLOps OSS platforms require cloud and high-performance computing environments to achieve flexible and efficient scalability for different workflows. Unfortunately, few open-source solutions are user-friendly or viable enough to be utilized by an MLOps OSS platform, which is why this thesis proposes a bridge solution, utilized by a pipeline, to address the problem. We used Design Science Methodology to define the problem, set objectives, design the implementation, demonstrate the implementation, and evaluate the solution. The resulting solutions are an environment bridge called the HTC-HPC bridge and a pipeline called the Cloud-HPC pipeline that uses it. We defined a general model for Cloud-HPC MLOps pipelines and used it to implement the required functions in an infrastructure ecosystem and MLOps OSS platform suited to the use case, using open-source, provided, and self-implemented software. The demonstration and evaluation showed that the HTC-HPC bridge and Cloud-HPC pipeline provide workflow automation that is easy to set up and use, customizable, and scalable, and that can be used for typical ML research workflows. However, they also showed that the bridge needs an improved multi-tenancy design and that the pipeline requires templates for a better user experience. These aspects, alongside testing use case potential and finding real-world use cases, are part of future work.

HELSINGIN YLIOPISTO