Browsing by Subject "unsupervised learning"

  • Mylläri, Juha (2022)
    Anomaly detection in images is the machine learning task of classifying inputs as normal or anomalous. Anomaly localization is the related task of segmenting input images into normal and anomalous regions. The output of an anomaly localization model is a 2D array, called an anomaly map, of pixel-level anomaly scores. For example, an anomaly localization model trained on images of non-defective industrial products should output high anomaly scores in image regions corresponding to visible defects. In unsupervised anomaly localization the model is trained solely on normal data, i.e. without labelled training observations that contain anomalies. This is often necessary as anomalous observations may be hard to obtain in sufficient quantities and labelling them is time-consuming and costly. Student-teacher feature pyramid matching (STFPM) is a recent and powerful method for unsupervised anomaly detection and localization that uses a pair of convolutional neural networks of identical architecture. In this thesis we propose two methods of augmenting STFPM to produce better segmentations. Our first method, discrepancy scaling, significantly improves the segmentation performance of STFPM by leveraging pre-calculated statistics containing information about the model’s behaviour on normal data. Our second method, student-teacher model assisted segmentation, uses a frozen STFPM model as a feature detector for a segmentation model which is then trained on data with artificially generated anomalies. Using this second method we are able to produce sharper anomaly maps for which it is easier to set a threshold value that produces good segmentations. Finally, we propose the concept of expected goodness of segmentation, a way of assessing the performance of unsupervised anomaly localization models that, in contrast to current metrics, explicitly takes into account the fact that a segmentation threshold needs to be set. 
    Our primary method, discrepancy scaling, improves segmentation AUROC on the MVTec AD dataset over the base model by 13%, measured in the shrinkage of the residual (1.0 − AUROC). On the image-level anomaly detection task, a variant of the discrepancy scaling method improves performance by 12%.
  • Lassila, Juuso (2024)
    Calculating sentence similarities is an essential task in natural language processing. It enables similarity search, where the sentence most similar to a query is found among many candidates, and clustering of text by semantic meaning. In addition, the sentence embeddings used to calculate the similarities can serve as input to text classification models. There is much room for improvement in sentence embedding model architectures and training methods, both in terms of accuracy and training efficiency. This thesis experiments with a novel unsupervised training method called Sentence Embeddings via Token Inference (SETI), which is efficient by design, to see if it can compete with other methods in accuracy. Using the same data, we train models with SETI and three existing training methods: TSDAE, QuickThoughts, and generic MLM. We then compare these models on different sentence similarity and downstream classification tasks. Based on our experiments, SETI is comparable to TSDAE and better than generic MLM and QuickThoughts in sentence similarity tasks. However, TSDAE has the highest accuracy in downstream classification tasks, while SETI still beats the generic MLM and QuickThoughts models.
  • Hyttinen, Miika (2022)
    An industrial classification system is a set of classes meant to describe different areas of business. Finnish companies are required to declare one main industrial class from the TOL 2008 industrial classification system. However, the TOL 2008 system is designed by the Finnish authorities and does not serve the versatile business needs of the private sector. The problem was discovered at Alma Talent Oy, the commissioner of the thesis. This thesis follows the design science approach to create new industrial classifications. To find out what the problems with the TOL 2008 industrial classifications are, qualitative interviews with customers were carried out. The interviews revealed several needs for new industrial classifications. According to the customer interviews, classifications should be 1) more detailed, 2) simpler, 3) updated regularly, 4) multi-class and 5) able to correct wrongly assigned TOL classes. To create new industrial classifications, unsupervised natural language processing techniques (clustering) were tested on Finnish natural language data sets extracted from company websites. The largest data set contained websites of 805 Finnish companies. The experiment revealed that the interactive clustering method was able to find meaningful clusters for 62%-76% of samples, depending on the clustering method used. Finally, the found clusters were evaluated based on the requirements set by customer interviews. The number of classes extracted from the data set was significantly lower than the number of distinct TOL 2008 classes in the data set. The results indicate that an industrial classification system created with clustering would contain significantly fewer classes than the TOL 2008 industrial classifications. Also, the system could be updated regularly and it could correct wrongly assigned TOL classes. Therefore, interactive clustering was able to satisfy three of the five requirements found in customer interviews.
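
Note on the first abstract above (Mylläri, 2022): its improvement figures are stated as shrinkage of the residual (1.0 − AUROC), not as a change in AUROC itself. The following minimal Python sketch shows one plausible reading of that measure, the relative reduction of the residual. The function name and the AUROC values are hypothetical and are not taken from the thesis.

```python
def residual_shrinkage(base_auroc: float, new_auroc: float) -> float:
    """Relative reduction of the residual (1 - AUROC) from base to new.

    Assumed interpretation of "shrinkage of the residual"; the thesis
    itself may define the measure differently.
    """
    base_residual = 1.0 - base_auroc
    new_residual = 1.0 - new_auroc
    if base_residual <= 0.0:
        raise ValueError("base AUROC must be below 1.0")
    return (base_residual - new_residual) / base_residual


# Hypothetical values for illustration only (not results from the thesis).
shrinkage = residual_shrinkage(base_auroc=0.90, new_auroc=0.913)
print(f"residual shrinkage: {shrinkage:.1%}")
```

With these illustrative numbers the residual falls from 0.100 to 0.087, a relative reduction of 13%, even though AUROC itself rises by only 0.013. Under this reading, a percentage stated on the residual can look much larger than the corresponding change in AUROC.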