
Browsing by study line "General track"

  • Kuivaniemi, Esa (2024)
    Machine Learning (ML) has experienced significant growth, fuelled by the surge in big data. Organizations leverage ML techniques to take advantage of this data. So far, the focus has predominantly been on increasing value by developing ML algorithms. Another option is to optimize resource consumption to reach cost optimality. This thesis contributes to cost optimality by identifying and testing frameworks that enable organizations to make informed decisions on cost-effective cloud infrastructure while designing and developing ML workflows. The two frameworks we introduce to model cost optimality are "Cost Optimal Query Processing in the Cloud" for data pipelines and "PALEO" for ML model training pipelines. The latter focuses on estimating the time needed to train a neural network, while the former is more generic and assesses the cost-optimal cloud setup for query processing. Through the literature review, we show that it is critical to consider both the data and the ML training aspects when designing a cost-optimal ML workflow. Our results indicate that the frameworks provide accurate estimates of the cost-optimal hardware configuration in the cloud for an ML workflow. There are deviations in the details: our chosen version of the Cost Optimal Model does not consider the impact of larger memory. The frameworks also do not provide accurate execution time estimates: PALEO estimates that our accelerated EC2 instance executes the training workload in half the time it actually took. However, the purpose of the study was not to provide accurate execution time or cost estimates; rather, we aimed to see whether the frameworks identify the cost-optimal cloud infrastructure setup among the five EC2 instances that we chose for executing our three different workloads.
  • Suihkonen, Sini (2023)
    The importance of protecting sensitive data from information breaches has increased in recent years because companies and other institutions gather massive datasets about their customers, including personally identifiable information. Differential privacy is one of the state-of-the-art methods for providing provable privacy for these datasets, protecting them from adversarial attacks. This thesis studies and compares existing differentially private random forest (DPRF) algorithms and constructs a version of the DPRF algorithm based on them. Twelve articles from the late 2000s to 2022, each implementing a version of the DPRF algorithm, are included in the review of previous work. The resulting algorithm, called DPRF_thesis, uses a privatized median to split the internal nodes of the decision trees. The class counts of the leaf nodes are produced with the exponential mechanism. The DPRF_thesis algorithm was tested on three binary classification UCI datasets, and its accuracy was mostly comparable with that of the two existing DPRF algorithms it was compared to. ACM Computing Classification System (CCS): Computing methodologies → Machine learning → Machine learning approaches → Classification and regression trees; Security and privacy → Database and storage security → Data anonymization and sanitization
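The exponential mechanism mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the DPRF_thesis implementation: it assumes that a leaf chooses its label using each class's count as the utility, with sensitivity 1. The function names and the example counts are hypothetical.

```python
import math
import random


def exponential_mechanism_probs(utilities, epsilon, sensitivity=1.0):
    """Return selection probabilities proportional to exp(eps * u / (2 * sensitivity))."""
    # Subtracting the maximum utility avoids overflow and does not change the ratios.
    best = max(utilities.values())
    weights = {
        label: math.exp(epsilon * (u - best) / (2.0 * sensitivity))
        for label, u in utilities.items()
    }
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}


def private_leaf_label(class_counts, epsilon, rng=None):
    """Sample a leaf label; classes with higher counts are exponentially more likely."""
    rng = rng or random.Random()
    probs = exponential_mechanism_probs(class_counts, epsilon)
    labels = list(probs)
    return rng.choices(labels, weights=[probs[l] for l in labels], k=1)[0]


if __name__ == "__main__":
    counts = {"positive": 30, "negative": 10}  # hypothetical leaf contents
    print(exponential_mechanism_probs(counts, epsilon=0.1))
    print(private_leaf_label(counts, epsilon=0.1, rng=random.Random(0)))
```

A smaller epsilon pushes the probabilities towards uniform (more privacy, less accuracy), while a larger epsilon makes the true majority class almost certain to be returned.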
  • Lampinen, Sebastian (2022)
    Modeling customer engagement helps a business identify its high-risk and high-potential customers. In a Software-as-a-Service (SaaS) business, these can be defined as the customers with a high potential to churn or to upgrade. Identifying them in time can help the business retain and grow revenue. This thesis uses churn and upgrade prediction classifiers to define a customer engagement score for a SaaS business. The classifiers used and compared in the research were logistic regression, random forest and XGBoost. The classifiers were trained on data from the case company, containing customer data such as user count and feature usage. To tackle class imbalance, the models were also trained with oversampled training data. The hyperparameters of each classifier were optimised using grid search. After training, the performance of the classifiers was evaluated on test data. The XGBoost classifiers outperformed the other classifiers in churn prediction. In predicting customer upgrades, the results were more mixed. Feature importances were also calculated, and the results showed that the importances differ between churn and upgrade prediction.
  • Pyykölä, Sara (2022)
    This thesis concerns non-Lambertian surfaces and their challenges, solutions and study in computer vision. The physical theory for understanding the phenomenon is built first, using the Lambertian reflectance model, which defines Lambertian surfaces as ideally diffuse surfaces whose luminance is isotropic and whose luminous intensity obeys Lambert's cosine law. Of these two assumptions, non-Lambertian surfaces violate at least the cosine law and are consequently specularly reflecting surfaces whose perceived brightness depends on the viewpoint. Non-Lambertian surfaces therefore also violate brightness and colour constancy, which assume that the brightness and colour of the same real-world points stay constant across images. These assumptions are used, for example, in tracking and feature matching, so non-Lambertian surfaces complicate object reconstruction and navigation, among other tasks in computer vision. After formulating the theoretical foundation of the necessary physics and a more general reflectance model called the bidirectional reflectance distribution function (BRDF), a comprehensive literature review of significant studies regarding non-Lambertian surfaces is conducted. The primary topics of the survey are photometric stereo and navigation systems, with other potential fields, such as fusion methods and illumination invariance, also considered. The goal of the survey is to give a detailed and in-depth answer to the following questions: which methods can be used to solve the challenges posed by non-Lambertian surfaces, what the strengths and weaknesses of these methods are, which datasets are used, and what remains to be answered by further research. After the survey, a collected dataset is presented, along with an outline of another dataset to be published in an upcoming paper. Finally, the survey and the study are discussed in general, and conclusions and proposed future steps are presented.
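The two Lambertian assumptions named in the abstract, the cosine law and isotropic luminance, are consistent with each other, as this small sketch shows. It assumes a flat surface patch and viewing angles below 90 degrees; the function names and numeric values are illustrative only.

```python
import math


def lambertian_intensity(peak_intensity, angle_from_normal):
    """Lambert's cosine law: intensity falls off with the cosine of the viewing angle."""
    return peak_intensity * max(0.0, math.cos(angle_from_normal))


def lambertian_luminance(peak_intensity, patch_area, angle_from_normal):
    """Luminance is intensity divided by the projected area of the patch.

    The projected area also shrinks with cos(angle), so the cosine cancels and
    the luminance is the same from every viewpoint (valid for angles < 90 degrees).
    """
    projected_area = patch_area * math.cos(angle_from_normal)
    return lambertian_intensity(peak_intensity, angle_from_normal) / projected_area


if __name__ == "__main__":
    for degrees in (0, 30, 60):
        theta = math.radians(degrees)
        print(degrees,
              lambertian_intensity(100.0, theta),
              lambertian_luminance(100.0, 2.0, theta))
```

For a specular (non-Lambertian) surface the intensity does not follow the cosine law, so this cancellation fails and the perceived brightness changes with the viewpoint.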
  • Pirilä, Pauliina (2024)
    This thesis discusses short-term parking pricing in the context of Finnish shopping centre parking halls. The focus is on one shopping centre located in Helsinki, where parking fees are high and there is a constant need to raise prices. It is therefore important to have a strategy that maximises parking hall income without compromising the customers' interests. If the prices are too high, customers will choose to park elsewhere or reduce their parking in private parking halls. Competition is strong: off-street parking competes against on-street parking and access parking, not to mention other parking halls. The main goal of this thesis is to raise problems with parking pricing and discuss how to find the most beneficial pricing method. To achieve this, data from one Finnish shopping centre parking hall was analysed to discover the average behaviour of the parkers and how raised parking fees affect both the number of parkers and the income of the parking hall. In addition, several pricing strategies from the literature and from real-life examples were discussed and evaluated, and later combined with the analysis results. The results showed some similarities with results from the literature, but there were some surprising outcomes too. Higher average hourly prices appear to be correlated with longer stays, yet parkers who tend to park longer have more inelastic parking habits than those who park for shorter durations. The calculated price elasticity of demand values show that parking is on average more elastic in the analysed parking hall than in other parking halls. This further emphasises the importance of milder price increases, at least for the shorter parking durations. Moreover, there are noticeable but explainable characteristics in parker behaviour. Most parkers prefer to park for under one hour to take advantage of the first parking hour being free. This leads to profit losses in both the shopping centre and parking hall income. Therefore, a dynamic pricing strategy is suggested as one pricing option, since it adjusts the prices automatically based on occupancy rates. Although there are some challenges with this particular method, in the long run it could turn out to be the most beneficial for both the parking hall owners and the parkers. To conclude, choosing a suitable pricing strategy and model for a parking hall is crucial, and the decisions should be based on findings from data.
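
The price elasticity of demand referred to above can be computed in several ways. The sketch below uses the arc (midpoint) formula, which is an assumption here, since the abstract does not state which formula the thesis uses. The prices and parker counts are made up for illustration.

```python
def arc_price_elasticity(price_before, price_after, demand_before, demand_after):
    """Arc (midpoint) price elasticity of demand.

    Percentage changes are measured against the midpoint of the old and new
    values, so the result is the same for a price rise and the reverse price cut.
    """
    demand_change = (demand_after - demand_before) / ((demand_after + demand_before) / 2)
    price_change = (price_after - price_before) / ((price_after + price_before) / 2)
    return demand_change / price_change


def is_elastic(elasticity):
    """Demand is called elastic when it changes proportionally more than the price."""
    return abs(elasticity) > 1.0


if __name__ == "__main__":
    # Hypothetical example: hourly fee raised from 2.0 to 3.0 euros.
    long_stay = arc_price_elasticity(2.0, 3.0, 1000, 800)   # parkers drop by 200
    short_stay = arc_price_elasticity(2.0, 3.0, 1000, 500)  # parkers drop by 500
    print(long_stay, is_elastic(long_stay))
    print(short_stay, is_elastic(short_stay))
```

When demand is elastic, a price increase reduces the number of parkers proportionally more than it raises the fee, which is why milder increases are recommended for the more price-sensitive short stays.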