Hashing-based delayed duplicate detection as an approach to improve the scalability of optimal Bayesian network structure learning with external memory frontier breadth-first branch and bound search

Hashing-based delayed duplicate detection as an approach to improve the scalability of optimal Bayesian network structure learning with external memory frontier breadth-first branch and bound search

dc.date.accessioned	2015-05-25T11:30:00Z	und
dc.date.accessioned	2017-10-24T12:24:01Z
dc.date.available	2015-05-25T11:30:00Z	und
dc.date.available	2017-10-24T12:24:01Z
dc.date.issued	2015-05-25T11:30:00Z
dc.identifier.uri	http://radr.hulib.helsinki.fi/handle/10138.1/4729	und
dc.identifier.uri	http://hdl.handle.net/10138.1/4729
dc.title	Hashing-based delayed duplicate detection as an approach to improve the scalability of optimal Bayesian network structure learning with external memory frontier breadth-first branch and bound search	en
ethesis.discipline	Computer science	en
ethesis.discipline	Tietojenkäsittelytiede	fi
ethesis.discipline	Datavetenskap	sv
ethesis.discipline.URI	http://data.hulib.helsinki.fi/id/1dcabbeb-f422-4eec-aaff-bb11d7501348
ethesis.department.URI	http://data.hulib.helsinki.fi/id/225405e8-3362-4197-a7fd-6e7b79e52d14
ethesis.department	Institutionen för datavetenskap	sv
ethesis.department	Department of Computer Science	en
ethesis.department	Tietojenkäsittelytieteen laitos	fi
ethesis.faculty	Matematisk-naturvetenskapliga fakulteten	sv
ethesis.faculty	Matemaattis-luonnontieteellinen tiedekunta	fi
ethesis.faculty	Faculty of Science	en
ethesis.faculty.URI	http://data.hulib.helsinki.fi/id/8d59209f-6614-4edd-9744-1ebdaf1d13ca
ethesis.university.URI	http://data.hulib.helsinki.fi/id/50ae46d8-7ba9-4821-877c-c994c78b0d97
ethesis.university	Helsingfors universitet	sv
ethesis.university	University of Helsinki	en
ethesis.university	Helsingin yliopisto	fi
dct.creator	Jahnsson, Niklas
dct.issued	2015
dct.language.ISO639-2	eng
dct.abstract	Bayesian networks are graphical models used to represent the joint probability distribution for all variables in a data set. A Bayesian network can be constructed by an expert, but learning the network from the data is another option. This thesis is centered around exact score-based Bayesian network structure learning. The idea is to optimize a scoring function measuring the fit of the network to the data, and the solution represents an optimal network structure. The thesis adds to the earlier literature by extending an external memory frontier breadth-first branch and bound algorithm by Malone et al. [MYHB11], which searches the space of candidate networks using dynamic programming in a layered fashion. To detect duplicates during the candidate solution generation, the algorithm uses efficiently both semiconductor and magnetic disk memory. In-memory duplicate detection is performed using a hash table, while a delayed duplicate detection strategy is employed when resorting to the disk. Delayed duplicate detection is designed to work well against long disk latencies, because hash tables are currently still infeasible on disk. Delayed duplicate detection allows the algorithm to scale beyond search spaces of candidate solutions fitting only in the comparatively expensive and limited semiconductor memory at disposal. The sorting-based delayed duplicate detection strategy employed by the original algorithm has been found to be inferior to a hashing-based delayed duplicate strategy in other application domains [Kor08]. This thesis presents an approach to use hashing-based delayed duplicate detection in Bayesian network structure learning and compares it to the sorting-based method. The problem faced in the hashing of candidate solutions to disk is dividing the candidate solutions in a certain stage of the search in an efficient and scalable manner to files on the disk. The division presented in this thesis distributes the candidate solutions into files unevenly, but takes into account the maximum number of candidate solutions that can be held in the semiconductor memory. The method works in theory as long as operating system has free space and inodes to allocate for new files. Although the hashing-based method should in principle be faster than the sorting-based, the benchmarks presented in this thesis for the two methods show mixed results. However, the analysis presented also highlights the fact that by improving the efficiency of, for example, the distribution of hashed candidate solutions to different files, the hashing-based method could be further improved.	en
dct.language	en
ethesis.language.URI	http://data.hulib.helsinki.fi/id/languages/eng
ethesis.language	English	en
ethesis.language	englanti	fi
ethesis.language	engelska	sv
ethesis.thesistype	pro gradu-avhandlingar	sv
ethesis.thesistype	pro gradu -tutkielmat	fi
ethesis.thesistype	master's thesis	en
ethesis.thesistype.URI	http://data.hulib.helsinki.fi/id/thesistypes/mastersthesis
dct.identifier.urn	URN:NBN:fi-fe2017112251118
dc.type.dcmitype	Text

Files in this item

Files	Size	Format	View
engl_malli.pdf	671.9Kb	PDF

This item appears in the following Collection(s)

Faculty of Science [4203]

Show simple item record

Hashing-based delayed duplicate detection as an approach to improve the scalability of optimal Bayesian network structure learning with external memory frontier breadth-first branch and bound search

Files in this item

This item appears in the following Collection(s)

Yhteystiedot

HELSINGIN YLIOPISTO