Browsing by Subject "draws from the conditional distribution"

  • Pyry, Silomaa (2024)
    This thesis empirically compares several statistical matching methods applied to Finnish income and consumption data. The aim is to map out matching strategies that Statistics Finland could use for this imputation task and to compare how well each strategy suits the specific datasets. For Statistics Finland, the main purpose of these imputations is to assess consumption behaviour in years when consumption-related data is not explicitly collected. In this thesis I imputed 12 consumption variables as well as their sum using the following matching methods: draws from the conditional distribution, distance hot deck, predictive mean matching, local residual draws and a gradient boosting approach. The donor dataset is a sample of households collected for the 2016 Finnish Household Budget Survey (HBS). The recipient dataset is a sample of households collected for the 2019 Finnish Survey of Income and Living Conditions (EU-SILC). To assess the quality of the imputations, I used numerical and visual assessments of the similarity of the weighted distributions of the consumption variables. The numerical assessments were the Kolmogorov-Smirnov (KS) test statistic and the Hellinger distance (HD), the latter calculated for a categorical transformation of the consumption variables. Additionally, the similarity of the correlation matrices was assessed using the correlation matrix distance. Generally, distance hot deck and predictive mean matching fared relatively well in the imputation tasks. For example, in the imputation of transport-related expenditure, both produced KS test statistics of approximately 0.01-0.02 and an HD of approximately 0.05, whereas the next best-performing method scored 0.04 and 0.09, indicating slightly larger discrepancies.
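As a rough illustration of the two numerical assessments named in the abstract, the sketch below computes a KS statistic between two weighted empirical distributions and a Hellinger distance between two binned (categorical) weighted distributions. The function names, the use of NumPy, and the binning by shared bin edges are assumptions made for this example; the abstract does not state how the thesis performed the categorical transformation or the weighting.

```python
import numpy as np

def weighted_ks(x, wx, y, wy):
    """KS statistic: largest gap between two weighted empirical CDFs."""
    x, wx, y, wy = (np.asarray(a, dtype=float) for a in (x, wx, y, wy))
    grid = np.union1d(x, y)  # sorted unique evaluation points

    def ecdf(values, weights):
        order = np.argsort(values)
        cum = np.cumsum(weights[order]) / weights.sum()
        # Number of observations <= each grid point.
        idx = np.searchsorted(values[order], grid, side="right")
        return np.concatenate(([0.0], cum))[idx]

    return float(np.max(np.abs(ecdf(x, wx) - ecdf(y, wy))))

def hellinger(p, q):
    """Hellinger distance between two non-negative category weight vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

def binned_hellinger(x, wx, y, wy, edges):
    """Assumed categorical transformation: weighted histograms on shared edges."""
    p, _ = np.histogram(x, bins=edges, weights=wx)
    q, _ = np.histogram(y, bins=edges, weights=wy)
    return hellinger(p, q)
```

Both measures lie between 0 and 1, with 0 meaning the two weighted distributions coincide, which is why the smaller scores reported in the abstract indicate closer agreement.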
    Of the two methods, distance hot deck fared notably better than predictive mean matching, particularly in the imputation of semicontinuous consumption variables. For example, for expenditure on alcoholic beverages and tobacco, distance hot deck produced a KS test statistic and an HD of approximately 0.01 and 0.02 respectively, whereas the corresponding scores for predictive mean matching were 0.21 and 0.16. For further application, I recommend considering both predictive mean matching and distance hot deck, depending on the imputation task: predictive mean matching can be applied more easily in different contexts, but in certain kinds of imputation tasks distance hot deck clearly outperforms it. The data should be assessed further; in particular, the results should be validated with additional data.
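To make the difference between the two recommended methods concrete, here is a minimal sketch of each in its simplest textbook form. It assumes unscaled Euclidean distance on the matching variables, a single nearest donor, an ordinary least squares model for the predictive mean, and no survey weights; the function names are hypothetical, and the thesis's actual distance function, donor selection and model are not stated in the abstract.

```python
import numpy as np

def distance_hot_deck(X_don, y_don, X_rec):
    """Give each recipient the observed value of its nearest donor in X."""
    X_don = np.asarray(X_don, dtype=float)
    X_rec = np.asarray(X_rec, dtype=float)
    y_don = np.asarray(y_don, dtype=float)
    # Pairwise Euclidean distances, shape (n_recipients, n_donors).
    dist = np.linalg.norm(X_rec[:, None, :] - X_don[None, :, :], axis=2)
    return y_don[dist.argmin(axis=1)]

def predictive_mean_matching(X_don, y_don, X_rec):
    """Match on predicted means from a linear model fitted to the donors."""
    X_don = np.asarray(X_don, dtype=float)
    X_rec = np.asarray(X_rec, dtype=float)
    y_don = np.asarray(y_don, dtype=float)
    A_don = np.column_stack([np.ones(len(X_don)), X_don])  # add intercept
    A_rec = np.column_stack([np.ones(len(X_rec)), X_rec])
    beta, *_ = np.linalg.lstsq(A_don, y_don, rcond=None)
    pred_don = A_don @ beta
    pred_rec = A_rec @ beta
    # Donor whose predicted mean is closest to the recipient's predicted mean.
    nearest = np.abs(pred_rec[:, None] - pred_don[None, :]).argmin(axis=1)
    return y_don[nearest]
```

In both cases the imputed value is an observed donor value; the methods differ only in how the donor is chosen, directly in the space of the matching variables or through a fitted model's predictions.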