
Browsing by Author "Flinck, Jens"

  • Flinck, Jens (2023)
    This thesis focuses on statistical topics that proved important during a research project on quality control in chemical forensics, including general observations about the goals and challenges a statistician may face when working with a researcher. The research project involved analyzing a dataset whose dimensionality is high relative to its sample size, in order to determine whether parts of the dataset can be considered distinct from the rest. Principal component analysis and Hotelling's T^2 statistic were used to answer this research question. The thesis therefore introduces the ideas behind both procedures, as well as the general idea behind multivariate analysis of variance. Principal component analysis is a procedure for reducing the dimension of a sample, whereas Hotelling's T^2 statistic is a method for multivariate hypothesis testing on a dataset consisting of one or two samples. One way of detecting outliers in a sample transformed with principal component analysis is to use Hotelling's T^2 statistic. However, using the two procedures together violates the theory behind Hotelling's T^2 statistic, so the resulting information is treated as a guideline rather than a hard rule for outlier detection. To determine how the attributes of the transformed sample influence the number of outliers detected by Hotelling's T^2 statistic, the thesis includes a simulation experiment in which a large number of datasets are generated. Each observation in a dataset is the number of outliers, according to Hotelling's T^2 statistic, in a sample that is generated from a specific multivariate normal distribution and transformed with principal component analysis. 
The attributes used to create the transformed samples vary between the datasets, and in some datasets the samples are instead generated from two different multivariate normal distributions. The datasets are compared against each other to find out how specific attributes affect the frequencies of the different outlier counts, and to see how much the datasets differ when part of the sample is generated from a different multivariate normal distribution. The results of the experiment indicate that the only attributes that directly influence the number of outliers are the sample size and the number of principal components used in the principal component analysis. The mean number of outliers divided by the sample size is smaller than the significance level used for the outlier detection and approaches the significance level as the sample size increases, implying that the procedure is consistent and conservative. In addition, when part of the sample is generated from a different multivariate normal distribution than the rest, the frequency of outliers can increase significantly. This indicates that the number of outliers according to Hotelling's T^2 statistic in a sample transformed with principal component analysis can potentially be used to confirm that some part of the sample is distinct from the rest.
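
The procedure the abstract describes (transform a sample with principal component analysis, then count outliers using Hotelling's T^2 statistic) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and is not taken from the thesis: the function names `hotelling_t2_on_pca` and `count_outliers`, the sample dimensions, the choice of two principal components, and in particular the cutoff, which uses the large-sample chi-square quantile with 2 degrees of freedom (closed form -2 ln(alpha)) rather than whatever critical value the thesis uses.

```python
import numpy as np


def hotelling_t2_on_pca(X, k):
    """Hotelling's T^2 of each row of X, computed on the first k PCA scores."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    Xc = X - X.mean(axis=0)  # center each variable
    # Thin SVD: the PCA scores are U * s, and the sample variance of
    # score j is s[j]**2 / (n - 1).
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # T^2_i = sum_j score_ij**2 / var_j, which simplifies to (n - 1) * sum_j U_ij**2.
    return (n - 1) * np.sum(U[:, :k] ** 2, axis=1)


def count_outliers(X, k, cutoff):
    """Number of observations whose T^2 exceeds the given cutoff."""
    return int(np.sum(hotelling_t2_on_pca(X, k) > cutoff))


# Hypothetical settings: more variables than observations, two components.
rng = np.random.default_rng(0)
n, p, k, alpha = 30, 200, 2, 0.05

# Assumed cutoff: upper-alpha quantile of chi-square with 2 degrees of freedom.
cutoff = -2.0 * np.log(alpha)

X = rng.normal(size=(n, p))
t2 = hotelling_t2_on_pca(X, k)
n_out = count_outliers(X, k, cutoff)

# A small simulation in the spirit of the abstract: the proportion of
# flagged observations across repeated samples from one normal distribution.
rates = [count_outliers(rng.normal(size=(n, p)), k, cutoff) / n for _ in range(200)]
mean_rate = float(np.mean(rates))
```

Comparing `mean_rate` with `alpha` for different values of `n` and `k`, or replacing part of `X` with draws from a second distribution, gives a rough, small-scale analogue of the comparisons described above. The result depends on the assumed cutoff, so it should not be read as reproducing the thesis's findings.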