More Videos...
 

Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture

Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture HDFS (Hadoop Distributed File System) is the primary storage of Hadoop. Even though data locality offered by HDFS is important for Big Data applications, HDFS suffers from huge I/O bottlenecks due to the tri-replicated data blocks and cannot efficiently utilize the available storage devices in an HPC (High Performance Computing) cluster. Moreover, due to the limitation of local storage space, it is challenging to deploy HDFS in HPC environments. In this paper, we present a hybrid design (Triple-H) that can minimize the I/O bottlenecks in HDFS and ensure efficient utilization of the heterogeneous storage devices (e.g. RAM, SSD, and HDD) available on HPC clusters. We also propose effective data placement policies to speed up Triple-H. Our design integrated with parallel file system (e.g. Lustre) can lead to significant storage space savings and guarantee fault-tolerance. Performance evaluations show that Triple-H can improve the write and read throughputs of HDFS by up to 7x and 2x, respectively. The execution times of data generation benchmarks are reduced by up to 3x. Our design also improves the execution time of the Sort benchmark by up to 40% over default HDFS and 54% over Lustre. The alignment phase of the Cloudburst application is accelerated by 19%. Triple-H also benefits the performance of SequenceCount and Grep in PUMA [15] over both default HDFS and Lustre.

Recent Projects

More +