Skip to main content
Login | Suomeksi | På svenska | In English

Hashing-based delayed duplicate detection as an approach to improve the scalability of optimal Bayesian network structure learning with external memory frontier breadth-first branch and bound search

Show simple item record

dc.date.accessioned 2015-05-25T11:30:00Z und
dc.date.accessioned 2017-10-24T12:24:01Z
dc.date.available 2015-05-25T11:30:00Z und
dc.date.available 2017-10-24T12:24:01Z
dc.date.issued 2015-05-25T11:30:00Z
dc.identifier.uri http://radr.hulib.helsinki.fi/handle/10138.1/4729 und
dc.identifier.uri http://hdl.handle.net/10138.1/4729
dc.title Hashing-based delayed duplicate detection as an approach to improve the scalability of optimal Bayesian network structure learning with external memory frontier breadth-first branch and bound search en
ethesis.discipline Computer science en
ethesis.discipline Tietojenkäsittelytiede fi
ethesis.discipline Datavetenskap sv
ethesis.discipline.URI http://data.hulib.helsinki.fi/id/1dcabbeb-f422-4eec-aaff-bb11d7501348
ethesis.department.URI http://data.hulib.helsinki.fi/id/225405e8-3362-4197-a7fd-6e7b79e52d14
ethesis.department Institutionen för datavetenskap sv
ethesis.department Department of Computer Science en
ethesis.department Tietojenkäsittelytieteen laitos fi
ethesis.faculty Matematisk-naturvetenskapliga fakulteten sv
ethesis.faculty Matemaattis-luonnontieteellinen tiedekunta fi
ethesis.faculty Faculty of Science en
ethesis.faculty.URI http://data.hulib.helsinki.fi/id/8d59209f-6614-4edd-9744-1ebdaf1d13ca
ethesis.university.URI http://data.hulib.helsinki.fi/id/50ae46d8-7ba9-4821-877c-c994c78b0d97
ethesis.university Helsingfors universitet sv
ethesis.university University of Helsinki en
ethesis.university Helsingin yliopisto fi
dct.creator Jahnsson, Niklas
dct.issued 2015
dct.language.ISO639-2 eng
dct.abstract Bayesian networks are graphical models used to represent the joint probability distribution for all variables in a data set. A Bayesian network can be constructed by an expert, but learning the network from the data is another option. This thesis is centered around exact score-based Bayesian network structure learning. The idea is to optimize a scoring function measuring the fit of the network to the data, and the solution represents an optimal network structure. The thesis adds to the earlier literature by extending an external memory frontier breadth-first branch and bound algorithm by Malone et al. [MYHB11], which searches the space of candidate networks using dynamic programming in a layered fashion. To detect duplicates during the candidate solution generation, the algorithm uses efficiently both semiconductor and magnetic disk memory. In-memory duplicate detection is performed using a hash table, while a delayed duplicate detection strategy is employed when resorting to the disk. Delayed duplicate detection is designed to work well against long disk latencies, because hash tables are currently still infeasible on disk. Delayed duplicate detection allows the algorithm to scale beyond search spaces of candidate solutions fitting only in the comparatively expensive and limited semiconductor memory at disposal. The sorting-based delayed duplicate detection strategy employed by the original algorithm has been found to be inferior to a hashing-based delayed duplicate strategy in other application domains [Kor08]. This thesis presents an approach to use hashing-based delayed duplicate detection in Bayesian network structure learning and compares it to the sorting-based method. The problem faced in the hashing of candidate solutions to disk is dividing the candidate solutions in a certain stage of the search in an efficient and scalable manner to files on the disk. The division presented in this thesis distributes the candidate solutions into files unevenly, but takes into account the maximum number of candidate solutions that can be held in the semiconductor memory. The method works in theory as long as operating system has free space and inodes to allocate for new files. Although the hashing-based method should in principle be faster than the sorting-based, the benchmarks presented in this thesis for the two methods show mixed results. However, the analysis presented also highlights the fact that by improving the efficiency of, for example, the distribution of hashed candidate solutions to different files, the hashing-based method could be further improved. en
dct.language en
ethesis.language.URI http://data.hulib.helsinki.fi/id/languages/eng
ethesis.language English en
ethesis.language englanti fi
ethesis.language engelska sv
ethesis.thesistype pro gradu-avhandlingar sv
ethesis.thesistype pro gradu -tutkielmat fi
ethesis.thesistype master's thesis en
ethesis.thesistype.URI http://data.hulib.helsinki.fi/id/thesistypes/mastersthesis
dct.identifier.urn URN:NBN:fi-fe2017112251118
dc.type.dcmitype Text

Files in this item

Files Size Format View
engl_malli.pdf 671.9Kb PDF

This item appears in the following Collection(s)

Show simple item record