Evaluation of different storage systems for Apache Spark and Apache Hadoop

by T.S. Hubregtsen

Institution: Delft University of Technology
Year: 2015
Keywords: Apache Spark; Apache Hadoop; Big Data; IBM Power Systems; Flash; CAPI Flash
Record ID: 1251741
Full text PDF: http://resolver.tudelft.nl/uuid:df99a960-dc9c-460e-a066-1a0e12f791a1


Big Data systems have been used for several years to solve problems that require scale. A framework takes care of scalability and resiliency issues, and allows the user to focus on the relevant computation, in the form of map and reduce functions. In these Big Data systems, we currently see a shift from the traditional use of the Hard Disk Drive (HDD) towards in-memory computation. In this thesis, we have theoretically evaluated these two generations of Big Data systems, as well as two implementations, Apache Hadoop and Apache Spark, in combination with Flash technology. We have also evaluated the possible use of Flash technology in these Big Data systems by performing two experiments. Our first experiment examined the performance of Apache Spark versus Apache Hadoop for a representative iterative algorithm, and the performance degradation of Apache Spark under memory constraints. We have found that, for the chosen algorithm, Apache Spark performs as well as or better than Apache Hadoop when data has to be loaded from the HDD, as is the case in the initialization phase of a program. For the iterative part of our program, we have seen an overall speedup of 30, and a speedup of 100 for the map and reduce phases. In our second experiment, we evaluated two ways of using Flash in Apache Spark, in particular using the IBM FlashSystem 840 connected to a Power8. We first evaluated Flash technology with a mounted file system, and used this setup to replace the HDD as the default spill location. We have found that this was not valuable, as the possible performance improvements were negligible compared to the overhead generated by data aggregation and system calls. We then shifted our focus to CAPI-connected Flash, and modified Apache Spark to spill intermediate data directly to the FlashSystem using key-value pairs.
In our experiment, while limiting the memory to a fixed amount to force spilling, we were able to remove 70% of the overhead caused by spilling. This was mainly overhead from Operating System (OS) involvement. In future work, we will address the overhead caused by data aggregation, since writing key-value pairs allows us to write smaller amounts of data. We believe that, once this overhead is removed, Big Data systems can benefit from Flash technology, and especially CAPI Flash technology: one can use a system with a limited amount of expensive DRAM and a large Flash backend to solve larger problem sets while maintaining performance equal to that of in-memory computation.