Browsing by Subject "clustering"

  • Viitanen, Akke Esa Tapio (2017)
    Active galactic nuclei (AGN) are compact, luminous objects found in the central regions of many galaxies. In the standard paradigm, an AGN is fueled by accretion of matter onto a supermassive black hole (SMBH). The properties of many galaxies and their respective SMBHs are linked, which hints at the importance of AGN as factors in galaxy formation and evolution. The bulk of the matter in the Universe is some form of dark matter, which is still poorly understood. AGN are biased tracers of the underlying dark matter distribution. By comparing the clustering of AGN with that of the dark matter, the bias can be quantified and then linked to a characteristic mass of the dark matter halo hosting the AGN. The advent of high-resolution X-ray telescopes, namely Chandra and XMM-Newton, has made unprecedentedly large samples available for study. With detailed spectroscopic follow-up programs, the study of X-ray selected AGN clustering has received a major boost. The clustering measurements reveal the typical environments that are likely to host AGN and thus shed light on what actually triggers the AGN. In this thesis, the clustering of ∼600 X-ray selected AGN with z < 2.5 (z = 1.19) in the COSMOS (Cosmic Evolution Survey) field surveyed with XMM-Newton (XMM-COSMOS) is studied. The full sample is split into subsamples based on the host galaxy stellar mass $M_*$ and the ratio between the X-ray luminosity and the stellar mass, $L_X/M_*$, which is a proxy for the Eddington ratio. For the full sample the bias is $3.61^{+0.37}_{-0.40}$, which corresponds to a characteristic halo mass of $\log(M_{\mathrm{halo}} / h^{-1} M_\odot) = 13.52^{+0.12}_{-0.16}$, consistent with the overall picture of X-ray selected AGN residing in massive haloes with $12.5 < \log(M_{\mathrm{halo}} / h^{-1} M_\odot) < 13.5$. The low $M_*$ and high $M_*$ samples have biases of $3.53^{+0.58}_{-0.70}$ and $4.13^{+0.85}_{-1.07}$, respectively, and the data do not support a difference in the typical masses of the hosting haloes. For the $L_X/M_*$ subsamples, there is marginal evidence that low $L_X/M_*$ AGN ($\log(M_{\mathrm{halo}} / h^{-1} M_\odot) = 13.52^{+0.22}_{-0.37}$) reside in more massive haloes than high $L_X/M_*$ AGN ($\log(M_{\mathrm{halo}} / h^{-1} M_\odot) = 13.29^{+0.28}_{-0.58}$). One possible explanation is that the environment of the low $L_X/M_*$ AGN reduces the amount of gas available for accretion and thus results in lower accretion rates.
  • Shappo, Viacheslav (2022)
    A primary concern of companies working with many customers is proper customer segmentation, i.e., the division of customers into groups based on their common characteristics. Customer segmentation helps marketing specialists adjust their offers and reach potential customer groups interested in a specific type of product or service. In addition, knowing such customer segments may help in the search for new look-alike customers sharing similar characteristics. The first and most crucial segmentation is splitting the customers into B2B (business to business) and B2C (business to consumers). The next step is to analyze these groups properly and create more thorough, product-specific groups. Nowadays, machine learning plays a vital role in customer segmentation, because classification algorithms can detect more patterns in customer characteristics and create more tailored customer segmentations than a human can. Therefore, using machine learning approaches in customer segmentation may help companies save costs on marketing campaigns and increase their sales by targeting the correct customers. This thesis aims to analyze B2B customers potentially interested in the renewable diesel "Neste MY" and to create a classification model for such segmentation. The first part of the thesis focuses on the theoretical background of customer segmentation and its use in marketing. Firstly, the thesis introduces general information about Neste as a company and discusses the marketing stages that involve the customer segmentation approach. Secondly, the data features used in the study are presented. Then the methodological part of the thesis is introduced, and the performance of three selected algorithms is evaluated on the test data. Finally, the study's findings and future means of improvement are discussed. The main finding of the study is that carefully selected features may significantly improve model performance while saving computational power. Several features are identified as the most crucial customer characteristics, which the marketing department subsequently uses for future customer segmentations.
  • Koivisto, Teemu (2021)
    Programming courses often receive large numbers of program code submissions to exercises, which, because of their volume, are graded and given feedback automatically. Teachers might never review these submissions, thereby losing a valuable source of insight into student programming patterns. This thesis researches how these submissions could be reviewed efficiently using a software system; a prototype, CodeClusters, was developed as an additional contribution of the thesis. CodeClusters' design goals are to allow exploration of the submissions and, specifically, to find higher-level patterns that could be used to provide feedback to students. Its main features are full-text search and an n-gram similarity detection model that can be used to cluster the submissions. Design science research is applied to evaluate CodeClusters' design and to guide the next iteration of the artifact, and qualitative analysis, namely thematic synthesis, is applied to evaluate the problem context as well as the ideas of using software for reviewing and providing clustered feedback. The study method was interviews with teachers who had experience teaching programming courses. Teachers were intrigued by the ability to review submitted student code and to provide more tailored feedback to students. The system, while still a prototype, is considered worth experimenting with in programming courses. A tool for analyzing and exploring submissions seems important for enabling teachers to better understand how students have solved the exercises. Providing additional feedback can be beneficial to students, yet the feedback should be valuable and the students should be incentivized to read it.
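The abstract does not describe how the n-gram similarity model in CodeClusters is built, so the following is only a minimal sketch of the general idea, assuming token trigrams compared with the Jaccard index. The tokenizer regex, the function names, and the sample submissions are all hypothetical and are not taken from the thesis.

```python
import re


def token_ngrams(source: str, n: int = 3) -> set:
    """Split source code into crude tokens and return the set of token n-grams.

    The tokenizer is an assumption for illustration: identifiers, integers,
    and any other single non-space character each count as one token.
    """
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", source)
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard index of the two n-gram sets (1.0 = identical sets, 0.0 = disjoint)."""
    grams_a, grams_b = token_ngrams(a, n), token_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two submissions too short to form any n-gram
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Hypothetical student submissions.
sub_a = "for i in range(10): print(i)"
sub_b = "for j in range(10): print(j)"
sub_c = "while True: pass"

print(ngram_similarity(sub_a, sub_b))  # renamed loop variable: partial overlap
print(ngram_similarity(sub_a, sub_c))  # different structure: no shared trigrams
```

Pairwise scores like these could then feed any standard clustering method; which one the prototype uses is not stated in the abstract.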
  • Steenari, Jussi (2023)
    Ship traffic is a major source of global greenhouse gas emissions, and the pressure on the maritime industry to lower its carbon footprint is constantly growing. One easy way for ships to lower their emissions would be to lower their sailing speed. Global ship traffic has long followed a practice called "sail fast, then wait", in which ships try to reach their destination as fast as possible and then wait at an anchorage near the harbor for a mooring place to become available. This method is easy to execute logistically, but it does not optimize sailing speeds with emissions in mind. An alternative tactic would be to calculate traffic patterns at the destination and use this information to plan the voyage so that the time at anchorage is minimized. This would allow ships to sail at lower speeds without compromising the total length of the journey. To create a model for scheduling arrivals at ports, traffic patterns describing how ships interact with port infrastructure need to be formed. However, port infrastructure information is not widely available in an easy-to-use form, which makes it difficult to develop models capable of predicting traffic patterns. Ship voyage information, by contrast, is readily available from commercial Automatic Identification System (AIS) data. In this thesis, I present a novel implementation that extracts information on the port infrastructure from AIS data using the DBSCAN clustering algorithm. In addition to clustering the AIS data, the implementation uses a novel optimization method to search for optimal hyperparameters for the DBSCAN algorithm. The optimization process evaluates candidate solutions using cluster validity indices (CVI), which are metrics that represent the goodness of a clustering. Different CVIs are compared to narrow down the most effective way to cluster AIS data to find information on port infrastructure.
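The thesis's own optimization method is described only as novel, so the sketch below is not that method. It is a minimal illustration of the underlying idea, assuming a plain grid search over DBSCAN's `eps` and `min_pts` and using the silhouette coefficient as a stand-in cluster validity index. The DBSCAN implementation, the function names, and the synthetic 2D points (standing in for AIS positions) are all assumptions made for illustration.

```python
import math
from itertools import product


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Plain DBSCAN; returns one label per point, with -1 meaning noise."""
    labels = [None] * len(points)  # None = not yet visited
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:  # j is a core point: keep expanding
                queue.extend(j_neighbours)
    return labels


def silhouette(points, labels):
    """Mean silhouette over non-noise points; None if fewer than two clusters."""
    clusters = {}
    for p, label in zip(points, labels):
        if label != -1:
            clusters.setdefault(label, []).append(p)
    if len(clusters) < 2:
        return None
    scores = []
    for label, members in clusters.items():
        for p in members:
            if len(members) == 1:
                scores.append(0.0)  # convention for singleton clusters
                continue
            a = sum(math.dist(p, q) for q in members) / (len(members) - 1)
            b = min(sum(math.dist(p, q) for q in other) / len(other)
                    for k, other in clusters.items() if k != label)
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def best_dbscan_params(points, eps_grid, min_pts_grid):
    """Grid search: return (score, eps, min_pts) with the highest silhouette."""
    best = None
    for eps, min_pts in product(eps_grid, min_pts_grid):
        score = silhouette(points, dbscan(points, eps, min_pts))
        if score is not None and (best is None or score > best[0]):
            best = (score, eps, min_pts)
    return best


# Synthetic stand-in for vessel positions: two tight groups and one outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 50)]
print(best_dbscan_params(points, eps_grid=[0.5, 1.5, 20.0], min_pts_grid=[3]))
```

Parameter settings that yield fewer than two clusters are skipped here because the silhouette is undefined for them; other validity indices handle noise and single-cluster results differently, which is one reason comparing several CVIs matters.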
  • Keturi, Joonas (2022)
    The subject of the thesis is the comparison of lexical semantics and phonetics. The thesis investigates with computational methods whether there is significantly more phonetic variance among words that belong to the same semantic domain than among phonetically similar words from other semantic domains. In other words, phonetically very similar words, and especially phonological minimal pairs, would be in separate semantic domains. The method clusters word embedding vectors and distinctive phonological feature vectors from multiple languages; the phonetic and semantic standard deviations are calculated for each cluster, and the mean standard deviations of the cluster sets are compared. In addition to the semantic and phonetic clusters, two sets of test clusters are constructed that have the same number and the same sizes of clusters as the semantic clusters. The first set of test clusters uses the words from the phonetic clusters in order, and the second set is randomly permuted. These different cluster sets are compared by their mean standard deviations and a cluster set similarity index. The results imply that words in the same semantic domain are rarely phonetically very similar, and that phonetically very similar words are usually in separate semantic domains.
  • Litova, Maria (2023)
    The self-organizing map (SOM) is a form of unsupervised neural network and a method for data analysis that allows reducing the dimensionality of data, exploring the variation and dependencies between variables, and presenting their similarity relations. Being a powerful visualization instrument with a strong disposition for clustering, the self-organizing map can be applied to the analysis of survey data, particularly data collected with questionnaires. This thesis provides an example of dealing with a mixed survey data set of limited size. The self-organizing map algorithm is applied to analyze the data obtained from the faculty well-being project organized at the Faculty of Social Sciences of the University of Helsinki. The set of experiments uses the self-organizing map algorithm to explore a possible clustering structure of the data and to identify the profiles of the survey participants. Each of the three experiments illustrates a different variable encoding approach for the sets of closed background questions and Likert scale questions. The largest number of profiles was obtained from the final experiment. Four out of seven profiles represent clusters of individuals with mainly neutral, negative, or very negative experiences related to well-being at the faculty. The data analysis experiments also illustrate the possible challenges of applying the SOM method to survey data. The existence of categorical variables, the necessity of choosing a set of parameters for the SOM training, and the handling of missing values are discussed as the main challenges of applying the SOM to survey data analysis using the R package “kohonen”.
  • Kramar, Vladimir (2022)
    This work presents a novel concept of categorising failures within test logs using string similarity algorithms. The concept was implemented in the form of a tool that went through three major iterations before its final version: 1) utilising two state-of-the-art log parsing algorithms, 2) manual log parsing of the Pytest testing framework, and 3) parsing of .xml files produced by the Pytest testing framework. The unstructured test logs were automatically converted into a structured format using the three approaches. Then, the structured data was compared using five different string similarity algorithms (Sequence Matcher, Jaccard index, Jaro-Winkler distance, cosine similarity and Levenshtein ratio) to form the clusters. Each approach was implemented and validated across three different data sets. The concept was validated by implementing an open-source Test Failure Analysis (TFA) tool. The validation phase revealed the best implementation approach (approach 3) and the best string similarity algorithm for this task (cosine similarity). Lastly, the tool was deployed into an open-source project’s CI pipeline, and the results of this integration, application and usage are reported. The resulting tool significantly reduces software engineers’ manual, error-prone work by using cosine similarity as a similarity score to form clusters of failures.
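The abstract names cosine similarity as the best-performing score but does not say how the TFA tool vectorises messages or forms clusters. The sketch below is therefore an assumption-laden illustration only: it uses bag-of-words token counts and a greedy, single-pass grouping against each cluster's first message. The threshold of 0.7, the function names, and the sample failure messages are hypothetical.

```python
import math
import re
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the bag-of-words token counts of two strings."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def cluster_failures(messages, threshold=0.7):
    """Greedy grouping: a message joins the first cluster whose first member
    is at least `threshold` similar; otherwise it starts a new cluster."""
    clusters = []
    for msg in messages:
        for cluster in clusters:
            if cosine_similarity(msg, cluster[0]) >= threshold:
                cluster.append(msg)
                break
        else:
            clusters.append([msg])
    return clusters


# Hypothetical failure messages, as they might appear after log parsing.
failures = [
    "AssertionError: expected 200 but got 500 in test_login",
    "TimeoutError: connection to database timed out",
    "AssertionError: expected 200 but got 404 in test_logout",
    "TimeoutError: connection to cache timed out",
]
for group in cluster_failures(failures):
    print(group)
```

A greedy pass like this depends on message order and on the chosen threshold, so it is best read as a demonstration of the similarity score rather than of any particular clustering procedure.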
  • Hyttinen, Miika (2022)
    An industrial classification system is a set of classes meant to describe different areas of business. Finnish companies are required to declare one main industrial class from the TOL 2008 industrial classification system. However, the TOL 2008 system is designed by the Finnish authorities and does not serve the versatile business needs of the private sector. The problem was discovered at Alma Talent Oy, the commissioner of the thesis. This thesis follows the design science approach to create new industrial classifications. To find out what the problem with the TOL 2008 industrial classifications is, qualitative interviews with customers were carried out. The interviews revealed several needs for new industrial classifications: classifications should be 1) more detailed, 2) simpler, 3) updated regularly, 4) multi-class and 5) able to correct wrongly assigned TOL classes. To create new industrial classifications, unsupervised natural language processing techniques (clustering) were tested on Finnish natural language data sets extracted from company websites. The largest data set contained the websites of 805 Finnish companies. The experiment revealed that the interactive clustering method was able to find meaningful clusters for 62% to 76% of samples, depending on the clustering method used. Finally, the clusters found were evaluated against the requirements set by the customer interviews. The number of classes extracted from the data set was significantly lower than the number of distinct TOL 2008 classes in the data set. The results indicate that an industrial classification system created with clustering would contain significantly fewer classes than the TOL 2008 industrial classifications. Also, the system could be updated regularly and could correct wrongly assigned TOL classes. Therefore, interactive clustering was able to satisfy three of the five requirements found in the customer interviews.