
Browsing by department "Department of Computer Science"


  • Santana Vega, Carlos (2018)
    The scope of this project is to provide a set of Bayesian methods for the task of predicting potential energy barriers. Energy barriers are a physical property of atoms that can be used to characterise their molecular dynamics, with applications in quantum-mechanics simulations for the design of new materials. The goal is to replace the currently used artificial neural network (ANN) with a method that, apart from providing accurate predictions, can also assess the predictive certainty of the model. We propose several Bayesian methods and evaluate them on this task, demonstrating that sparse Gaussian processes (SGPs) can provide predictions, together with confidence intervals, at a level of accuracy equivalent to the current ANN and within bounded computational cost.
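    As an illustration of the kind of uncertainty-aware regression this abstract describes, here is a minimal sketch using scikit-learn's exact Gaussian process regressor. The thesis's sparse (inducing-point) approximation and its energy-barrier features are not reproduced; the synthetic data below is purely hypothetical.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Hypothetical stand-in for atomic descriptors and energy barriers.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=50)

# RBF kernel plus a noise term; hyperparameters are fitted by
# maximising the marginal likelihood.
gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(X, y)

# Unlike a plain ANN, the GP returns a predictive standard deviation,
# from which confidence intervals follow directly.
X_new = np.linspace(0, 10, 5).reshape(-1, 1)
mean, std = gp.predict(X_new, return_std=True)
for m, s in zip(mean, std):
    print(f"prediction {m:.3f} +/- {1.96 * s:.3f} (95% interval)")
```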
  • Meiling, Li (2017)
    In scientific research, computer simulation, Internet applications, e-commerce and many other fields, the amount of data is growing at an extremely fast pace. In order to analyze and utilize these large data resources, effective data analysis techniques are needed. The relational database (RDBMS) model has long been the dominant model in database management. However, traditional relational data management technology faces serious obstacles to scalability and has difficulties with big data analysis. Today, cloud databases and NoSQL databases are attracting widespread attention and have become viable alternatives to the relational database. This thesis focuses on benchmarking two multi-model NoSQL databases, ArangoDB and OrientDB, and discusses the use of NoSQL for big data analysis.
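    For a flavour of the kind of micro-benchmark such a comparison involves, here is a minimal sketch timing a document query against ArangoDB via the python-arango driver. The connection details, collection name and query are hypothetical, and the thesis's actual workloads and their OrientDB counterparts are not reproduced.

```python
import time
from arango import ArangoClient  # pip install python-arango

# Hypothetical local ArangoDB instance and collection.
client = ArangoClient(hosts="http://localhost:8529")
db = client.db("_system", username="root", password="passwd")

query = "FOR d IN products FILTER d.price > @p RETURN d"

# Time repeated executions of the same AQL query.
start = time.perf_counter()
for _ in range(100):
    cursor = db.aql.execute(query, bind_vars={"p": 10})
    _ = list(cursor)  # drain the cursor so the work is actually done
elapsed = time.perf_counter() - start
print(f"100 runs in {elapsed:.2f} s ({elapsed * 10:.1f} ms/query)")
```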
  • Tuominen, Pasi (2015)
    Data repositories often contain multiple records that describe the same object. This thesis compares methods for finding such records. The experiments were run on a dataset of 6.4 million bibliographic records. The titles of the works in the dataset were used to compare the methods. Two key characteristics of each method were measured: the number of duplicates found and its ratio to the number of candidate pairs generated. A combination of two methods proved best for deduplicating the dataset. The sorted neighbourhood method found the most actual duplicates, but also the most irrelevant candidates. Suffix array grouping additionally found a set of duplicates that no other method found. Together these two methods found nearly all of the duplicates found by any of the methods compared in the thesis. Fault-tolerant methods based on Levenshtein distance proved inefficient for deduplicating titles.
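    A minimal sketch of the sorted-neighbourhood idea mentioned above: records are sorted by a key derived from the title, and only records within a sliding window are compared as candidate pairs. The key function, window size and sample records are hypothetical, not taken from the thesis.

```python
def sorted_neighbourhood(records, key, window=3):
    """Yield candidate duplicate pairs from a sliding window
    over records sorted by a blocking key."""
    ordered = sorted(records, key=key)
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1 : i + window]:
            yield rec, other

# Hypothetical bibliographic titles; the key is a crude normalisation
# (lowercase, drop commas, sort the words).
titles = ["The Hobbit", "Hobbit, The", "the hobbit", "War and Peace"]
normalise = lambda t: "".join(sorted(t.lower().replace(",", "").split()))

for a, b in sorted_neighbourhood(titles, key=normalise):
    print(a, "<->", b)
```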
  • Toivonen, Mirva (2015)
    Big data creates a variety of business possibilities and helps to gain competitive advantage through prediction, optimization and adaptability. Much big data analysis does not consider the impact of errors or inconsistencies across the different sources from which the data originates, or how frequently the data is acquired. This thesis examines big data quality challenges in the context of business analytics. The intent of the thesis is to improve the understanding of big data quality issues and of testing big data. Most of the quality challenges are related to understanding the data, coping with messy source data and interpreting analytical results. Producing analytics requires subjective decisions along the analysis pipeline, and analytical results may not lead to an objective truth. Errors in big data are not corrected as in traditional data; instead, the focus of testing moves towards process-oriented validation.
  • Ronimus, Tomi (2013)
    Botnets have proven to be a persistent nuisance on the Internet. They are the cause of many security concerns and issues that currently plague the Internet. Mitigating these issues is an important task, and more research is needed in order to win the battle against constantly evolving botnets. In this thesis, botnets are reviewed thoroughly, starting from what botnets are and how they manage to stay operational, and then moving on to explore some of the more promising methods that can be used to detect botnet activity. A more detailed look is taken at DNS-based botnet detection methods, as these methods show great promise and are capable of detecting many different types of botnets. Finally, a review of the DNS-based botnet detection methods is compiled. Some of the best features of botnet detection are gathered to form an overall picture of the characteristics of a good detection method. As botnets evolve over time, botnet detection methods need to keep up with that progress. Gathering the characteristics of a good detection method helps to suggest future directions for improving and developing new botnet detection methods. ACM Computing Classification System (CCS): A.1 [Introductory and Survey], C.2.0 [Computer Communication Networks]
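    One well-known family of DNS-based heuristics, not specific to this thesis, flags algorithmically generated domain names (as used by some botnet command-and-control schemes) by their unusually high character entropy. A minimal, purely illustrative sketch:

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Shannon entropy of the character distribution, in bits."""
    counts = Counter(s)
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# DGA-generated names tend to look random; the threshold is illustrative.
THRESHOLD = 3.5
for domain in ["google.com", "xjw3kqpz7vbn1f.net"]:
    label = domain.split(".")[0]
    flag = "suspicious" if shannon_entropy(label) > THRESHOLD else "ok"
    print(f"{domain}: entropy={shannon_entropy(label):.2f} -> {flag}")
```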
  • Suominen, Kalle (2013)
    Business and operational environments are becoming more and more frenetic, forcing companies and organizations to respond to changes faster. This trend is reflected in software development as well: IT units have to deliver needed features faster in order to bring business benefits sooner. During the last decade, agile methodologies have provided tools to answer this ever-growing demand. Scrum is one of the agile methodologies, and it is widely used. It is said that in large-scale organizations Scrum should be implemented using both bottom-up and top-down approaches. In big organizations software systems are complicated and deeply integrated with each other, meaning that no one team can handle the whole software development process alone. Individual teams want to start using Scrum before the whole organization is ready to support it. This leads to a situation where one team applies agile principles while most of the teams and organizations around it continue with old, established non-agile practices. In such cases the bottom-up approach is the only option. When the top-down part is missing, are the benefits lost as well? The aim of this case study is to find out whether implementing Scrum using only a bottom-up approach brought benefits. In the target unit, which was part of a large organization, Scrum-based practices were implemented to replace an earlier waterfall-based approach. The analyses were based on data collected through a survey and from a requirement management tool that was in use under both the old and the new ways of working. The expression 'Scrum-based practices' is used because not all the finer points of Scrum could be implemented, owing to the surrounding non-agile teams and official non-agile procedures; this was also an obstacle to implementing Scrum as well as would otherwise have been possible. Most of the targets set for the implementation of Scrum-based practices were achieved, and further non-targeted benefits emerged. In this context we can conclude that benefits were gained. The absence of the top-down approach clearly made the implementation more difficult and incomplete; however, it did not prevent benefits from being obtained. The target unit also faced the aforementioned difficulties in using Scrum-based practices while the surrounding units used non-agile processes. The lack of well-established numerical estimates of the requirements' business value lowered the power of Scrum at the company level, because these values were relative and subjective opinions of the business representatives. In backlog prioritization, when most of the items are so-called high-priority ones, there is no way to evaluate which one is more valuable, and prioritization becomes more or less a lottery.
  • Markkanen, Jani (2012)
    B-trees are widely used index structures. This thesis examines concurrency control and recovery for B-trees, particularly from the viewpoint of a database management system. Of the algorithms for the Blink-tree, which provides efficient concurrency control, two are presented: one based on tracking node deletions and one based on completing structure modifications during traversal. The latter is implemented, and its performance is evaluated experimentally. The experimental evaluation shows that in insert and delete operations the cost of concurrency control rises to as much as 94% at the maximum operation rate used in the evaluation. At the same maximum operation rate, concurrency control in the search operation takes less than one percent of the total time. The high concurrency-control cost of insert and delete operations is caused by update operations U-latching the root node. U-latching the root is often an unnecessarily strong measure, since the latch needs to be upgraded to an X-latch for writing in only 0.06% of update operations. To ease congestion at the root of the tree, further development ideas for the algorithm are presented, based on the rarity of the need for a U-latch on the root and on the possibility of restarting the tree traversal from the root.
  • Levitski, Andres (2016)
    With the increase in bandwidth available to internet users, cloud storage services have emerged to offer home users an easy way to share files and to extend the storage space available to them. Most systems offer a limited free storage quota, and combining these resources from multiple providers could be intriguing to cost-oriented users. In this study, we implement a virtual file system that utilizes multiple commercial cloud storage services (Dropbox, Google Drive, Microsoft OneDrive) to store its data. The data is distributed among the different services, and the structure of the data is managed locally by the file system. The file system runs in user space using FUSE and uses the APIs provided by the cloud storage services to access the data. Our goal is to show that it is feasible to combine the free space offered by multiple services into a single, easily accessible storage medium. Building such a system requires design choices in multiple problem areas, ranging from data distribution and performance to data integrity and data security. We show how our file system is designed to address these requirements and then conduct several tests to measure and analyze the level of performance provided by our system in different file system operation scenarios. The results are also compared to the performance of using the distinct cloud storage services directly, without distributing the data. This helps us estimate the overhead or possible gain in performance caused by the distribution of data, and also helps us locate the bottlenecks of the system. Finally, we discuss some ways the system could be improved, based on the test results and on examples from existing distributed file systems.
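    To give a sense of the user-space approach, here is a minimal read-only FUSE file system written with the fusepy bindings. It serves a single in-memory file rather than cloud-backed data; the class, file name and mount point are all hypothetical.

```python
import errno
import stat
import sys
from fuse import FUSE, FuseOSError, Operations  # pip install fusepy

class HelloFS(Operations):
    """Read-only file system exposing one file, /hello.txt."""
    DATA = b"hello from user space\n"

    def getattr(self, path, fh=None):
        if path == "/":
            return {"st_mode": stat.S_IFDIR | 0o755, "st_nlink": 2}
        if path == "/hello.txt":
            return {"st_mode": stat.S_IFREG | 0o444,
                    "st_nlink": 1, "st_size": len(self.DATA)}
        raise FuseOSError(errno.ENOENT)

    def readdir(self, path, fh):
        return [".", "..", "hello.txt"]

    def read(self, path, size, offset, fh):
        # A cloud-backed version would fetch this range via a service API.
        return self.DATA[offset:offset + size]

if __name__ == "__main__":
    # Run e.g. `python hellofs.py /mnt/hello` (mount point is hypothetical).
    FUSE(HelloFS(), sys.argv[1], foreground=True)
```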
  • Osmani, Lirim (2013)
    With the recent advances in efficient virtualization techniques using commodity servers, cloud computing has emerged as a powerful technology able to meet new requirements for supporting a new generation of computing services based on the utility model. However, barriers to widespread adoption still exist, and the dominant platform is yet to be seen in the years to come. Hence the challenge of providing scalable cloud infrastructures requires a continuous exploration of new technologies and techniques. This thesis describes an experimental investigation of integrating two such open-source technologies, OpenStack and GlusterFS, to build our cloud environment. We designed a number of test-case scenarios that help us answer questions about the performance, stability and scalability of the deployed cloud infrastructure. Additionally, work based on this thesis was accepted to the Conference on Computing in High Energy and Nuclear Physics (CHEP2013), and the paper is due to be published.
  • Koolaji, Mohsen (2014)
    Business ecosystems, where services from enterprises across the world are marketed and acquired, demand efficient collaborative project management facilities. In particular, reputation and breach management systems are essential for partner selection and proper project delivery. Reputation systems need to provide measurable scales for the collection of objective and arbitrable information about members of the ecosystem. In addition, how breaches or disputes affect the reputation of collaborating partners, and how such disputes can be resolved (i.e. breach recovery), are interesting questions. Furthermore, the role of business process management (BPM) systems in resolving breach or dispute situations is also an interesting point of study. This thesis proposes a modern, model-driven reputation and breach management system of its own, named the Reputation and Breach Management System (RAB_MS). The purpose of RAB_MS is to improve and refine trust between business partners in business ecosystems. The presented models are based on state-of-the-art techniques of service-oriented architecture (SOA). The models are verified by formal automated verification mechanisms in the YAWL system, to avoid syntactical, structural and semantic errors, as well as interpretation ambiguities. The results of the formal verification ensure that the business processes in the proposed reputation and breach management system meet necessary properties such as soundness and weak soundness. In simpler words, RAB_MS has no deadlocks, livelocks or dead tasks within its business process models.
  • Raatikka, Vilho (University of Helsinki, 2004)
  • Hämäläinen, Heikki (2016)
    This thesis studies the Clojure programming language, a dialect of Lisp designed specifically for concurrent programming. Clojure is tightly integrated with the Java environment, and programs written in it run on the JVM. The thesis reviews the history of Lisp languages, the general challenges of concurrent programming and the basics of the functional programming paradigm. The concurrency features of Java, the JVM and Clojure are also covered. In the analysis part of the thesis, Clojure's and Java's concurrency constructs are compared with respect to, among other things, performance and usability. Of Clojure's concurrency constructs, software transactional memory turned out to be computationally very heavy. Furthermore, the lock-freedom of the concurrency constructs means that certain concurrent programming problems are difficult to implement without resorting to Java's concurrency constructs. Especially with respect to synchronous concurrency constructs, the language has room for improvement. Compared to Java, Clojure's concurrency constructs are somewhat simpler to use; this, however, is largely due to Clojure's dynamic typing and functional core.
  • Kesseli, Henri (2013)
    Embedded systems are everywhere, and the variety of their types and purposes is wide. Yet many of these systems are islands in an age where more and more systems are being connected to the Internet. The ability to connect to the Internet can be taken advantage of in multiple ways; one is to use the resources that cloud computing can offer. Currently, there is no comprehensive overview of how embedded systems could be enhanced by cloud computing. In this thesis we study what cloud-enhanced embedded systems are and what their benefits, risks, typical implementation methods and platforms are. The study is executed as an extended systematic mapping study. It shows that interest in cloud-enhanced embedded systems, both in academia and in practice, has grown significantly in recent years. The most prevalent research area is wireless sensor networks, followed by the more recent research area of the Internet of Things. Most of the technology needed for implementing cloud-enhanced embedded systems is available, but comprehensive development tools such as frameworks or middleware are scarce. The results of the study indicate that existing embedded systems and other non-computing devices would benefit from connectivity and cloud resources. This enables the development of new applications for consumers and industry that would not be possible without cloud resources. As an indication of this, we see several systems developed for consumers, such as remotely controlled thermostats, media players that depend on cloud resources, and network-attached storage systems that integrate cloud access and discovery. The academic literature is full of use cases for cloud-enhanced embedded systems and model implementations. However, the actual integration process as well as specific engineering techniques are rarely explained or scrutinized. Currently, the typical integration process is highly custom to the application. There are few examples of efforts to create specific development tools, more transparent protocols, and open hardware to support the development of ecosystems for cloud-enhanced embedded systems.
  • Linnanvuo, Sami (University of Helsinki, 2006)
    Online content services can greatly benefit from personalisation features that enable delivery of content suited to each user's specific interests. This thesis presents a system that applies text analysis and user modeling techniques in an online news service for the purposes of personalisation and user interest analysis. The system creates a detailed thematic profile for each content item and observes users' actions towards content items to learn their preferences. A handcrafted taxonomy of concepts, or ontology, is used in profile formation to extract relevant concepts from the text. User preference learning is automatic, and there is no need for explicit preference settings or ratings from the user. Learned user profiles are segmented into interest groups using clustering techniques, with the objective of providing a source of information for the service provider. Some theoretical background for the chosen techniques is presented, while the main focus is on finding practical solutions to some current information needs that are not optimally served by traditional techniques.
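    As a sketch of the profile-clustering step described above, the following hypothetical example segments users into interest groups with scikit-learn, representing each user profile as a vector of concept weights. The concept names, user data and cluster count are illustrative, not taken from the thesis.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical user profiles: rows are users, columns are ontology
# concepts, values are learned preference weights.
concepts = ["politics", "sports", "technology", "culture"]
profiles = np.array([
    [0.9, 0.1, 0.3, 0.2],
    [0.8, 0.0, 0.4, 0.1],
    [0.1, 0.9, 0.1, 0.0],
    [0.0, 0.8, 0.2, 0.1],
])

# Segment users into interest groups.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(profiles)
for user, group in enumerate(km.labels_):
    top = concepts[int(np.argmax(km.cluster_centers_[group]))]
    print(f"user {user}: group {group} (dominant interest: {top})")
```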
  • Lv, Guowei (2014)
    This master's thesis discusses two main tasks of computational etymology: first, finding cognates in multilingual text, and second, finding the underlying correspondence rules by aligning cognates. For the first part, I briefly describe two categories of methods for identifying cognates: symbol-based and phonetics-based. For the second part, I describe the Etymon project, in which I have been working. The Etymon project uses a probabilistic method and the Minimum Description Length principle to align cognate sets. The objective of the project is to build a model which can automatically find as much information in the cognates as possible without linguistic knowledge, as well as find the genetic relationships between the languages. I also discuss the experiment that I conducted to explore the uncertainty in the data source.
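    For flavour, here is a minimal pairwise alignment sketch based on edit distance with traceback, the classic dynamic-programming machinery underlying many alignment methods. The thesis's probabilistic MDL model is far more sophisticated, and the word pairs below are hypothetical.

```python
def align(a: str, b: str):
    """Globally align two words by minimum edit distance,
    returning aligned symbol pairs ('-' marks a gap)."""
    n, m = len(a), len(b)
    # dp[i][j] = cost of aligning a[:i] with b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(dp[i-1][j-1] + (a[i-1] != b[j-1]),  # (mis)match
                           dp[i-1][j] + 1,                      # deletion
                           dp[i][j-1] + 1)                      # insertion
    # Trace back through the table to recover the alignment.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i-1][j-1] + (a[i-1] != b[j-1]):
            pairs.append((a[i-1], b[j-1])); i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i-1][j] + 1:
            pairs.append((a[i-1], "-")); i -= 1
        else:
            pairs.append(("-", b[j-1])); j -= 1
    return pairs[::-1]

# Hypothetical cognate-like word pair.
print(align("vesi", "vezi"))
```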
  • Al-Hello, Muhammed (2012)
    The biological cell is a complicated and complex environment in which thousands of entities interact with each other in surprising ways. This integrated device continuously receives internal and external signals in order to perform the processes most vital to the continuation of life. Even though thousands of interactions are catalysed in very small spaces, biologists assert that there are no coincidences or accidental events. On the other hand, fast discoveries in biology and the rapid growth of the data pool make it ever more difficult to construct a concrete perspective that scientifically interprets all observations. Co-operation has therefore become necessary between biologists, mathematicians, physicists and computer engineers. The goal of this virtual collaboration is the pursuit of what is known as modelling biological networks. This thesis aims to compare different computational tools built for modelling biological networks. Additionally, technical themes such as reaction kinetics, which form the backbone of the software functionality, are explained beforehand. Beside the technical issues, the study compares features such as the GUI, the command line, and importing/exporting files.
  • Davoudi, Amin (2018)
    In the Internet age, malware poses a serious threat to information security. Many studies have been conducted on using machine learning for detecting malicious software. Although major breakthroughs have been achieved in this area, the problem has not been completely eradicated. In this thesis, we go through the concept of utilizing machine learning for malware detection and conduct several experiments with two different classifiers (Support Vector Machine and Naive Bayes) to compare their ability to detect malware based on Portable Executable (PE) file format headers. A malware classifier dataset built from header field values of portable executable files was obtained from GitHub and used for the experimental part of the thesis. We conducted 5 different experiments with several different trial settings. Various statistical methods were used to assess the significance of the results. The first and second experiments show that using the SVM and Naive Bayes classification methods on our dataset can yield a high sensitivity rate. In the remaining experiments, we focus on the accuracy rate of both classifiers under different settings. The results show that although there were no big differences in the accuracy rates of the classifiers, the variance of the accuracy rates is greater for Naive Bayes than for SVM. The study investigates the ability of two different methods to classify information in their distinctive ways. It also provides evidence that the learning-based approach offers a means for accurate automated analysis of malware behavior, which helps in the struggle against malicious software.
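    A minimal sketch of the kind of comparison the abstract describes, using scikit-learn with cross-validation. The feature matrix stands in for PE header field values and is synthetic; the dataset, settings and scores are hypothetical, not the thesis's.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for PE header field values (rows = samples,
# columns = header fields) and malware/benign labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(0, 0.5, 200) > 0).astype(int)

models = {
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Naive Bayes": GaussianNB(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    # Compare both the mean accuracy and its variance across folds.
    print(f"{name}: mean={scores.mean():.3f} var={scores.var():.5f}")
```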
  • Palon, Preston (2015)
    A recommender system suggests items that the user of the system is likely to find valuable. Together with the explosion of e-commerce, recommender systems have become a focus of academic research. Within this field, prediction of film ratings is a popular research area and the topic of this thesis. Many websites that sell, rent or stream films allow users to rate the films they have seen. The goal is to accurately predict the ratings a user has not yet given; it would then be possible to recommend films the user may want to see. Different ways of predicting film ratings in recommender systems were compared in this thesis using MovieLens 100K as the dataset. The algorithms were implemented in MATLAB, tested using 5-fold cross-validation, and ranked using mean absolute error as the accuracy metric. In total nine different recommender system designs were tested, including four hybrid systems designed and created for this thesis. The techniques used include user- and item-based collaborative filtering, singular value decomposition, content-based recommendation and a demographic method. Separate tuning data was used to optimise parameters, including the similarity measure used and the best nearest-neighbourhood size. Of the basic methods, item-based collaborative filtering gave the best results, followed by singular value decomposition. User-based collaborative filtering, content-based recommendation and the demographic method performed slightly worse. The overall best results were achieved with a hybrid design that combines baseline predictors with user-based and item-based collaborative filtering. Choosing the best similarity measure and finding ideal values for parameters such as the nearest-neighbourhood size had a significant impact on the results.
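    A minimal sketch of item-based collaborative filtering, the best-performing basic method above, written here in Python rather than the thesis's MATLAB. The tiny ratings matrix, similarity measure (cosine) and neighbourhood size are hypothetical choices for illustration only.

```python
import numpy as np

# Tiny hypothetical ratings matrix (rows = users, cols = films; 0 = unrated).
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 4, 4]], dtype=float)

def item_cosine(R):
    """Cosine similarity between item columns (unrated cells are zero
    and so contribute nothing to the dot products)."""
    norms = np.linalg.norm(R, axis=0)
    norms[norms == 0] = 1.0
    return (R.T @ R) / np.outer(norms, norms)

def predict(R, user, item, k=2):
    """Predict a rating as a similarity-weighted average of the
    user's k most similar rated items."""
    sims = item_cosine(R)[item]
    rated = np.where(R[user] > 0)[0]
    top = rated[np.argsort(sims[rated])[::-1][:k]]
    w = sims[top]
    return float(w @ R[user, top] / w.sum()) if w.sum() else R[R > 0].mean()

print(f"user 0, film 2: predicted rating {predict(R, 0, 2):.2f}")
```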
  • Kuosmanen, Anna (2013)
    A recently developed protocol for sequencing the RNA in a cell in a high-throughput manner, RNA-seq, generates from hundreds of thousands to a few billion short sequence fragments from each RNA sample. Aligning these fragments, or 'reads', to the reference genome in a fast and accurate manner is a challenging task that has been tackled by many researchers over the past five years. In this thesis I review the process of RNA-seq data creation and analysis, and introduce and compare some of the popular alignment software. As part of the thesis, I implemented an alignment software tool based on the novel idea of a limited-range BWT-transformed index. This software, called SpliceAligner, is also introduced in detail. In addition to my own software, I chose Tophat, SpliceMap, MapSplice, SOAPsplice and SHRiMP2 for comparison. I tested the chosen software on simulated data sets with read lengths of 50, 100, 150 and 250 base pairs, as well as on data from a real RNA-seq experiment. I ranked the software based on running time, number of reads mapped and the accuracy of the alignments. I also predicted transcripts from the alignments of the simulated data and measured the correctness of the predictions. With read lengths of 50, 100 and 150 base pairs, speed, alignment accuracy and ease of use make Tophat a solid top choice. MapSplice is comparable in speed and alignment accuracy, and SOAPsplice is only slightly behind, but their user interfaces are much more complicated. However, Tophat slowed down significantly as the read length increased to 250 base pairs, and SOAPsplice failed to run at all with 250-base-pair reads. This leaves MapSplice as the top choice for long reads in most cases. My software SpliceAligner was competitive with the top choices in alignment accuracy, but work remains to be done on its running speed as well as on multiple small optimizations.
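    To illustrate the BWT indexing idea behind SpliceAligner, here is a minimal sketch that builds the Burrows-Wheeler transform of a string via sorted rotations. Real aligners build the transform from a suffix array and add auxiliary rank structures for backward search, none of which is reproduced here; the toy sequence is hypothetical.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler transform via sorted cyclic rotations.
    (Quadratic-memory teaching version; real indexes use suffix arrays.)"""
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

# Hypothetical toy genome fragment; the transform groups characters
# with similar right-contexts together, which is what makes the
# index compressible and searchable.
print(bwt("ACAACG"))
```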
  • Guo, Haipeng (2016)
    Along with the proliferation of smartphones, smartphone context-aware applications are gaining more and more attention from manufacturers and users. With the capability to infer the user's context, e.g., whether the user is in a meeting, driving, running or at home, smartphone applications can react accordingly. However, limiting factors such as limited battery capacity, limited computing power and the inaccuracy of inference caused by inaccurate machine learning models and sensors hinder the large-scale deployment of context-aware applications. In this master's thesis, I develop CompleSense, a cooperative sensing framework designed for Android devices that facilitates the establishment and management of cooperation groups, so that developers can further exploit the potential of cooperative sensing without worrying about the implementation of system monitoring, data throttling, aggregation and synchronization of data streams, and wireless message passing via Wi-Fi. The system adopts Wi-Fi Direct technology for service advertisement and peer discovery. Once a cooperative group is formed, devices can share sensing and computing resources within short range via a Wi-Fi connection. CompleSense allows developers to customize the system based on their own optimization needs, e.g., optimizing the trade-offs of cooperative sensing. System components are loosely coupled to ensure the extensibility, resilience and scalability of the system, so that failure or change of a single component will not affect the remaining parts of the system. Developers can extend the current system by adding customized data processing kernels, machine learning models and optimized sharing schemes. In addition, CompleSense abstracts the controlling logic of sensors, so developers can easily integrate new sensors into the system by following a pre-defined programming interface. The performance of CompleSense is evaluated by carrying out a cooperative audio similarity calculation task with a varying number of clients, which also confirms that CompleSense can feasibly be deployed on lower-tier devices such as the Motorola Moto G.
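    As a flavour of the evaluation workload, here is a hypothetical sketch of one way to compute audio similarity between two clips, using normalised cross-correlation with NumPy. The abstract does not specify the actual kernel CompleSense distributes, so this is purely illustrative.

```python
import numpy as np

def audio_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Peak of the normalised cross-correlation of two mono clips;
    values near 1 indicate one clip is close to a shifted copy of
    the other."""
    a = (a - a.mean()) / (a.std() * len(a))
    b = (b - b.mean()) / b.std()
    return float(np.abs(np.correlate(a, b, mode="full")).max())

# Two hypothetical clips: the second is a delayed copy of the first.
rng = np.random.default_rng(0)
clip = rng.normal(size=4000)
delayed = np.concatenate([np.zeros(500), clip])[:4000]
print(f"similarity: {audio_similarity(clip, delayed):.2f}")
```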