Browsing by Author "Kalaja, Eero"

  • Kalaja, Eero (2020)
    Nowadays the amount of data collected on individuals is massive. Making this data more available to data scientists could be tremendously beneficial in a wide range of fields. Sharing data is not a trivial matter, however, as it may expose individuals to malicious attacks. The concept of differential privacy, first introduced in the seminal work by Cynthia Dwork (2006b), offers solutions for tackling this problem. Applying random noise to the shared statistics protects the individuals while still allowing data analysts to use the data to improve predictions. Input perturbation is a simple way of privatizing data: it adds noise to the entire data set. This thesis studies an output perturbation technique, where the calculations are done with the real data but only the sufficient statistics are released. This method requires a smaller amount of noise, which makes the analysis more accurate. Yu-Xiang Wang (2018) improves the model by introducing the adaptive AdaSSP algorithm, which fixes the instability issues of the previously used Sufficient Statistics Perturbation (SSP) algorithm. In this thesis we verify the results shown by Yu-Xiang Wang (2018) and look into the pre-processing steps more carefully. Yu-Xiang Wang used some unusual normalization methods, especially regarding the sensitivity bounds. We are able to show that these had little effect on the results, and that the AdaSSP algorithm remains superior to the SSP algorithm when combined with more common data standardization methods. A small adjustment to the noise levels is suggested so that the algorithm guarantees the privacy conditions set by the classical Gaussian mechanism. We combine different pre-processing mechanisms with the AdaSSP algorithm and present a comparative analysis of them. The results show that the robust private linear regression of Honkela et al. (2018) significantly improves predictions on half of the data sets used for testing. The combination of the AdaSSP algorithm with robust private linear regression often brings us closer to the non-private solutions.
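
    The idea of releasing only noisy sufficient statistics can be illustrated with a minimal sketch. It is not the thesis code and not the AdaSSP algorithm: it is a plain SSP-style estimate for a one-feature regression y ≈ θ·x without an intercept. The function names (`gaussian_sigma`, `ssp_slope`), the even split of the privacy budget between the two statistics, and the bounds |x| ≤ 1 and |y| ≤ 1 under add/remove-one-record neighbouring data sets are all assumptions made for the illustration. The noise scale uses the standard calibration of the classical Gaussian mechanism, which holds for 0 < ε < 1.

```python
import math
import random


def gaussian_sigma(sensitivity, epsilon, delta):
    """Noise scale of the classical Gaussian mechanism (valid for 0 < epsilon < 1)."""
    if not 0.0 < epsilon < 1.0:
        raise ValueError("the classical calibration assumes 0 < epsilon < 1")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def ssp_slope(xs, ys, epsilon, delta, rng):
    """SSP-style private slope for y ~ theta * x (illustrative sketch only).

    Assumes every |x| <= 1 and |y| <= 1, so adding or removing one record
    changes each of sum(x*x) and sum(x*y) by at most 1.
    """
    # Sufficient statistics computed on the real data.
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))

    # Assumption: the budget is split evenly between the two statistics.
    sigma = gaussian_sigma(1.0, epsilon / 2.0, delta / 2.0)

    # Only the perturbed statistics are used from here on.
    noisy_sxx = sxx + rng.gauss(0.0, sigma)
    noisy_sxy = sxy + rng.gauss(0.0, sigma)

    # With little data noisy_sxx can land near zero or below it, which makes
    # this plain estimate unstable.
    return noisy_sxy / noisy_sxx
```

    With many records the noise is small relative to the statistics and the estimate stays close to the non-private slope. With few records the denominator can be dominated by noise, which is the kind of instability that motivates an adaptive variant.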