Skip to main content
Login | Suomeksi | På svenska | In English

Optimizing MapReduce For Strongly Heterogeneous Environments

Show full item record

Title: Optimizing MapReduce For Strongly Heterogeneous Environments
Author(s): Kåll, Simon
Contributor: University of Helsinki, Faculty of Science, Department of Computer Science
Language: English
Acceptance year: 2014
Abstract:
Over the last ten years MapReduce has emerged as one of the staples of distributed computing both in small and large scale applications. MapReduce has successfully been employed to perform batch parallel computing applications such as web indexing and data mining. Especially Hadoop, an open source implementation of the MapReduce model has become widely adopted and researched. In MapReduce the input data typically consists of a long list of key/- values pairs which has been split up into smaller parts and stored in the cluster performing the computation. The computation consists of two distinct steps, map and reduce. In the map step nodes are assigned input splits, which they process by applying the user supplied map function to each element of the designated part of the list. The result of the Map step is a new intermediate list of key/value pairs which constitutes the input for the reduce step. In the reduce step, a user supplied reduce function is applied to the intermediate data. The reduce function performs a summary operation on the elements in the intermediate data list, the result of which is the output for the MapReduce job. The performance of a MapReduce implementation is closely tied to its scheduler algorithm. The scheduler decides when and on which node the map and reduce tasks of the computation are executed in the cluster. The implementation of the scheduler in Hadoop and other systems relies on the underlying cluster being relatively homogenous with task progressing in a linear fashion. Experience has however shown that this is rarely the case. Differing hardware generations, faults in both hardware and software as well as varying workloads all contribute to make the environment MapReduce runs in far from homogeneous. In this thesis the performance of nodes executing reduce tasks is shown to strongly correlate with the run-time of the MapReduce job. This correlation is utilized to improve performance in an heterogeneous environment though a reduce delay scheduling algorithm. The algorithm schedules reduce tasks based on historic node performance in order to minimize the likelihood of reduce tasks being executed on poorly performing nodes. In the best case scenario the algorithm improves performance under heterogeneity, and even in the worst case minimizes the effect of heterogeneity. This thesis demonstrates how with heterogeneity modeled as a normal distribution of node performance, reduce delay scheduling decreases MapReduce job run-times with up to 30% when compared to a homogeneous model of node performance.


Files in this item

Files Size Format View
thesis.pdf 358.0Kb PDF

This item appears in the following Collection(s)

Show full item record