
Browsing by Subject "Gradient Boosting"


  • Pelttari, Hannu (2020)
    Federated learning is a method to train a machine learning model on multiple remote datasets without the need to gather the data from the remote sites to a central location. In healthcare, gathering the data from different hospitals into a central location can be a difficult and time-consuming task, due to privacy concerns and regulations regarding the use of sensitive data, making federated learning an attractive alternative to more traditional methods. This thesis adapted an existing federated gradient boosting model, developed a new federated random forest model, and applied both to mortality prediction in intensive care units. The results were then compared to the centralized counterparts of the models. The results showed that while the federated models did not perform as well as the centralized models on a similarly sized dataset, the federated random forest model can achieve superior performance when trained on multiple hospitals' data compared to centralized models trained on a single hospital. In scenarios where the centralized models had data from multiple hospitals, the federated models could not perform as well as the centralized models. It was also found that the performance of the centralized models could not be improved with further federated training. In addition to practical advantages such as the possibility of parallel or asynchronous training without modifications to the algorithm, the federated random forest performed better than the federated gradient boosting in all scenarios. The performance of the federated random forest was also found to be more consistent across scenarios than that of federated gradient boosting, which was highly dependent on factors such as the order in which the hospitals were traversed.
  • Hentunen, Saul (2022)
    Statistical price estimates for plots of land are useful for constructing an appraisal-based price index and for allocating the price of a large plot transaction among its individual properties. This study extends earlier research on residential plot prices by examining the prices of commercial and office plots. It investigates whether the prices of business-premises plots differ from those of residential plots, and it assesses the accuracy of the models' price estimates. The study uses the purchase price register of the National Land Survey of Finland (Maanmittauslaitos), which contains information on real estate and plot transactions made in Finland. The study presents the criteria used to restrict the register data as well as the datasets and methods used to supplement it. The models used to compute the plot price estimates are described in detail. Plot prices are modelled with a linear model and with a boosted regression tree model, a machine learning method. The explanatory variables used in the models were selected from the register data with the help of earlier research. From the register data it is possible to compile several factors that can be used to estimate a plot's price per square metre. On the basis of the models, however, it cannot be stated unambiguously that commercial and office plots are inherently more valuable than residential plots. After outlier plot transactions are removed, the boosted regression tree model can estimate residential plot prices to within 15 percent for about one third of the plots. For commercial and office plots, the corresponding accuracy is achieved for about one sixth of the plots. The study recommends modelling plot prices with boosted regression trees instead of a linear model. To improve the accuracy of the model's price estimates, it recommends enlarging the dataset by extending the time span and, in particular, adding more commercial and office plots to the research data. It further recommends a closer study of soil quality factors for the plots in the research data.
  • Saada, Adam (2018)
    Logistic regression has been the most common credit scoring model for several decades. The purpose of a credit scoring model is to distinguish good applicants from bad applicants so that consumer credit can be lent to a person who is likely to repay it. In Finland, households' indebtedness has increased while wage development has stagnated. In addition to mortgages, indebtedness has grown because of the rising number of consumer credit loans. Consumer credit usually takes the form of unsecured loans, which several financial institutions provide quickly and flexibly. Consumer credit is considered to be one of the major causes of default. Systematic risks are still being avoided for now, but the increased number of customers and the fierce competition in the sector can bring new risks that should be anticipated, as insolvent customers cause losses for financial institutions. Developing and deploying new credit scoring models is one of the best ways to hedge against default risks. The prediction accuracy and performance of tree-based credit scoring models have been studied. In many cases, tree-based algorithms have performed better than traditional statistical models such as the aforementioned logistic regression. In this master's thesis, classical logistic regression is compared to these tree-based algorithms. The most well-known tree-based algorithms have been chosen: random forest, discrete AdaBoost, real AdaBoost, LogitBoost, Gentle AdaBoost and Gradient Boosting. These methods use the tree algorithm as the base learner but differ in their iterative processes. The data, gathered from a medium-sized Finnish financial company, consist of customers' personal information and their payment behavior in sales finance. It is important to compare how different models predict insolvency in the light of different test statistics. In this thesis, the best-performing models are logistic regression and the Gradient Boosting algorithm. Based on this research, it is recommended to develop a credit scoring model based on the Gradient Boosting algorithm. This algorithm reveals different explanatory variables than logistic regression does, and these variables can better explain the causes of insolvency. The results are robust and plausible, because the different tests lead to similar conclusions.