
Browsing by Subject "Classification"


  • Simsek, Burak (2020)
    In this study, a classification scheme is implemented to obtain high-resolution snow cover information from Sentinel-2 data using a very simple Bayesian network (naive Bayes) trained with ground snow measurement data. A performance comparison of Bayesian versus non-Bayesian naive Bayes, different feature sets and different discretization methods is conducted. Results show that Bayesian NB performs best, with up to 0.88 classification accuracy for snow/no-snow classification. Using the most relevant spectral bands rather than all available bands provided improvement in some cases but performed slightly worse in others, hence not giving a clear answer. However, the effect of the discretization method was clear: ChiMerge performed better than equal-width binning, but it was slower to the point that it was not practical to discretize the pixels of a full Sentinel-2 image.
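    The pipeline this abstract describes — discretize band values, then apply naive Bayes — can be sketched as follows. This is an illustrative sketch only, not the thesis code; the band values, bin count and class labels are hypothetical, and equal-width binning stands in for the discretization step.

    ```python
    from collections import Counter, defaultdict
    import math

    def equal_width_bins(values, n_bins):
        """Discretize continuous values into equal-width bins (the simpler,
        faster of the two discretization methods compared in the abstract)."""
        lo, hi = min(values), max(values)
        width = (hi - lo) / n_bins or 1.0
        return [min(int((v - lo) / width), n_bins - 1) for v in values]

    class NaiveBayesClassifier:
        """Categorical naive Bayes over discretized features, Laplace-smoothed."""

        def fit(self, X, y, n_values):
            self.classes = sorted(set(y))
            self.n_values = n_values  # number of discrete values per feature
            self.log_prior = {c: math.log(y.count(c) / len(y)) for c in self.classes}
            self.counts = defaultdict(Counter)  # (class, feature idx) -> value counts
            for row, c in zip(X, y):
                for j, v in enumerate(row):
                    self.counts[(c, j)][v] += 1
            return self

        def predict(self, row):
            def log_posterior(c):
                s = self.log_prior[c]
                for j, v in enumerate(row):
                    cnt = self.counts[(c, j)]
                    s += math.log((cnt[v] + 1) / (sum(cnt.values()) + self.n_values))
                return s
            return max(self.classes, key=log_posterior)

    # Hypothetical reflectances for two spectral bands, binned per band:
    bands = [[0.90, 0.80], [0.85, 0.75], [0.20, 0.10], [0.15, 0.20]]
    labels = ["snow", "snow", "no-snow", "no-snow"]
    columns = list(zip(*bands))
    X = [list(r) for r in zip(*(equal_width_bins(list(c), 2) for c in columns))]
    clf = NaiveBayesClassifier().fit(X, labels, n_values=2)
    print(clf.predict([1, 1]))  # bright in both bands -> snow
    ```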
  • Kinnula, Ville (2021)
    In inductive inference, phenomena from the past are modeled in order to make predictions about the future. The mathematical concept of exchangeability for random sequences provides a justification for the assumption that observations are independently and identically distributed given some underlying parameters estimable from the empirical distribution of the observations. The theory of exchangeability contains basic elements for inductive inference, such as the de Finetti representation theorem for the probability of a general exchangeable sequence, prior probability distributions for the parameters in the representation theorem, and the predictive probabilities, or rule of succession, for new observations from the random sequence under consideration. However, entirely unanticipated observations pose a problem for inductive inference: how can one assign a probability to an event that has never been seen before? This is called the sampling of species problem. Under exchangeability, the number of possible different events t has to be known beforehand in order to assign an equal prior probability 1/t to each event. In the sampling of species problem, an assumption of infinitely many possible events has to be made, leading to the prior probability 1/∞ for each event, which is impossible. Exchangeability is thus inadequate for probability distributions over infinitely many possible events. It turns out that a solution to the sampling of species problem arises from partition exchangeability. Under exchangeability, two random sequences have the same probability of occurring if their observations have identical frequencies; under partition exchangeability, sequences have the same probability of occurring when they share identical frequencies of frequencies. In this thesis, partition exchangeability is introduced as a framework for inductive inference by juxtaposing it with the more familiar exchangeability of random sequences.
    Partition exchangeability has elements parallel to those of exchangeability: the Kingman representation theorem, the Poisson-Dirichlet distribution as the prior probability distribution, and a corresponding rule of succession. The rules of succession are required in the problem of supervised classification to provide product predictive probabilities that are maximized by assigning the test data into pre-defined classes based on training data. A Bayesian construction of supervised classification is discussed in this thesis. In theory, the best classification performance is gained by assigning the class labels to the test data simultaneously, but because of computational complexity, the test data points are often assumed to be i.i.d. with respect to one another. In the case of a known set of possible events, these simultaneous and marginal classifiers converge in their test data predictive probabilities as the amount of training data tends to infinity, justifying the use of the simpler marginal classifier with enough training data. These two classifiers are implemented in this thesis under partition exchangeability, and it is shown in theory and, with a simulation study, in practice that the same asymptotic convergence between the simultaneous and marginal classifiers applies to partition-exchangeable data as well. Finally, a small application to single-cell RNA expression is explored.
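    The rule of succession discussed above can be illustrated with the one-parameter (Ewens) special case of the Poisson-Dirichlet prior — an assumption made for this sketch, since the thesis may work with a more general form. Given observed species counts, the predictive probability of each seen species is proportional to its count, and a previously unseen species gets positive probability θ/(n+θ), which is exactly what the fixed 1/t prior cannot provide:

    ```python
    def ewens_predictive(counts, theta):
        """Rule of succession under the Ewens sampling formula with dispersion
        parameter theta, given counts of each species observed so far.

        Returns (probability for each already-seen species, probability that
        the next observation is a previously unseen species).
        """
        n = sum(counts)
        p_seen = [c / (n + theta) for c in counts]
        p_new = theta / (n + theta)
        return p_seen, p_new

    # After observing species counts [3, 1] with theta = 1.0:
    p_seen, p_new = ewens_predictive([3, 1], theta=1.0)
    print(p_seen, p_new)  # [0.6, 0.2] 0.2 -- the probabilities sum to 1
    ```

    Note that the unseen-species mass θ/(n+θ) shrinks as more data are seen, but never vanishes, which is the key property for the sampling of species problem.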
  • Kukkola, Johanna (2022)
    Can a day be classified into the correct season on the basis of its hourly weather observations using a neural network model, and how accurately can this be done? This is the question this thesis aims to answer. The weather observation data was retrieved from the Finnish Meteorological Institute's website, and it includes the hourly weather observations from the Kumpula observation station from the years 2010–2020. The weather observations used for the classification were cloud amount, air pressure, precipitation amount, relative humidity, snow depth, air temperature, dew-point temperature, horizontal visibility, wind direction, gust speed and wind speed. There are four distinct seasons that can be experienced in Finland. In this thesis the seasons were defined as three-month periods, with winter consisting of December, January and February, spring of March, April and May, summer of June, July and August, and autumn of September, October and November. The days in the weather data were classified into these seasons with a convolutional neural network model. The model included a convolutional layer followed by a fully connected layer, with the width of both layers being 16 nodes. The accuracy of the classification with this model was 0.80. The model performed better than a multinomial logistic regression model, which had an accuracy of 0.75. It can be concluded that the classification task was satisfactorily successful. An interesting finding was that neither model ever confused summer and winter with each other.
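    The convolutional layer described above slides a small kernel over the hourly values of each weather variable. As an illustrative sketch (not the thesis code), the core operation is a valid-mode 1-D convolution followed by a nonlinearity; the temperature values and kernel below are hypothetical:

    ```python
    def conv1d(signal, kernel, bias=0.0):
        """Valid-mode 1-D convolution (cross-correlation, as implemented in
        deep learning libraries): slide the kernel and sum the products."""
        k = len(kernel)
        return [sum(signal[i + j] * kernel[j] for j in range(k)) + bias
                for i in range(len(signal) - k + 1)]

    def relu(values):
        """Nonlinearity typically applied after a convolutional layer."""
        return [max(0.0, v) for v in values]

    # A difference kernel responds to rising hourly temperature:
    temps = [2.0, 3.5, 5.0, 6.0, 5.5, 4.0]
    print(relu(conv1d(temps, [-1.0, 0.0, 1.0])))  # [3.0, 2.5, 0.5, 0.0]
    ```

    In the thesis's setting a full day would be a 24-step sequence per weather variable, and 16 such learned kernels would feed a 16-node fully connected layer that outputs the four season scores.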
  • Viljamaa, Venla (2022)
    In bioinformatics, new genomes are sequenced at an increasing rate. To utilize this data in various bioinformatics problems, it must be annotated first. Genome annotation is a computational problem that has traditionally been approached with statistical methods such as the hidden Markov model (HMM). However, implementing these methods is often time-consuming and requires domain knowledge. Neural network-based approaches have also been developed for the task, but they typically require a large amount of pre-labeled data. Genomes and natural language share many properties, not least the fact that both consist of letters. Genomes also have their own grammar, semantics and context-based meanings, just like phrases in natural language. These similarities motivate the use of natural language processing (NLP) techniques in genome annotation. In recent years, pre-trained Transformer neural networks have been widely used in NLP. This thesis shows that, owing to the linguistic properties of genomic data, the Transformer network architecture is also suitable for gene prediction. The model used in the experiments, DNABERT, is pre-trained on the full human genome. Using task-specific labeled data sets, the model is then trained to classify DNA sequences into genes and non-genes. The main fine-tuning dataset is the genome of the Escherichia coli bacterium, but preliminary experiments are also performed on human chromosome data. The fine-tuned models are evaluated for accuracy, F1-score and Matthews correlation coefficient (MCC). A customized evaluation method is developed, in which the predictions are compared to ground-truth labels at the nucleotide level. Based on this evaluation, the best models achieve 90.15% accuracy and an MCC value of 0.4683 on the Escherichia coli dataset. The model correctly classifies even the minority label, and the execution times are measured in minutes rather than hours. These results suggest that the NLP-based Transformer network is a powerful tool for learning the characteristics of gene and non-gene sequences.
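    Two pieces of the pipeline above are simple enough to sketch: DNABERT represents a DNA sequence as overlapping k-mer tokens, and MCC is computed from the binary confusion matrix. Both functions below are illustrative, not the thesis implementation; the example sequence is made up.

    ```python
    import math

    def kmer_tokenize(seq, k=6):
        """Split a DNA sequence into overlapping k-mers, the token unit used
        by DNABERT (commonly k = 3..6)."""
        seq = seq.upper()
        return [seq[i:i + k] for i in range(len(seq) - k + 1)]

    def mcc(tp, tn, fp, fn):
        """Matthews correlation coefficient from confusion-matrix counts;
        ranges from -1 to 1, and is robust to class imbalance."""
        denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        return (tp * tn - fp * fn) / denom if denom else 0.0

    print(kmer_tokenize("ATGCGA", k=3))  # ['ATG', 'TGC', 'GCG', 'CGA']
    ```

    Reporting MCC alongside accuracy matters here because gene/non-gene labels are imbalanced at the nucleotide level, and a majority-class predictor can score high accuracy while its MCC stays near zero.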