Performance Optimization System for Hadoop and Spark Frameworks
February 19, 2021 ALL4RD
Summary
The optimization of large-scale data processing depends on the technologies and methods used. The MapReduce model,
implemented on Apache Hadoop or Spark, splits large data sets into blocks distributed across several
machines. Data compression reduces data size and the transfer time between disks and memory, but it requires additional
processing. Finding an optimal trade-off is therefore a challenge, as a high compression factor may lighten the
Input/Output load while overloading the processor. The project aims to present a system for selecting compression
tools and tuning the compression factor to reach the best performance in Apache Hadoop and Spark infrastructures,
based on simulation analyses.
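As a hedged illustration of the knobs such a system would tune, the Scala sketch below sets a Spark-internal compression codec and level and enables compressed job output. The codec name and zstd level are placeholder values, not recommendations from the project, and the selection logic that would choose them is assumed.

    import org.apache.spark.sql.SparkSession

    object CompressionConfigSketch {
      def main(args: Array[String]): Unit = {
        // Placeholder values: in the proposed system, the codec and the
        // compression level would be chosen from simulation analyses.
        val codec = "zstd" // alternatives: "lz4", "lzf", "snappy"
        val level = 3      // zstd level; higher = smaller output, more CPU

        val spark = SparkSession.builder()
          .appName("compression-tradeoff-sketch")
          .master("local[*]") // local run for illustration only
          // Codec for Spark-internal data (shuffle files, spills, broadcasts).
          .config("spark.io.compression.codec", codec)
          .config("spark.io.compression.zstd.level", level.toString)
          // Compress shuffle outputs and serialized RDD partitions.
          .config("spark.shuffle.compress", "true")
          .config("spark.rdd.compress", "true")
          .getOrCreate()

        // Hadoop-side settings for compressing job output files on HDFS.
        val hconf = spark.sparkContext.hadoopConfiguration
        hconf.set("mapreduce.output.fileoutputformat.compress", "true")
        hconf.set("mapreduce.output.fileoutputformat.compress.codec",
          "org.apache.hadoop.io.compress.GzipCodec")

        spark.stop()
      }
    }

Analogous MapReduce-side settings (e.g., mapreduce.map.output.compress) exist in Hadoop, so the same selection logic applies to both frameworks.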
Outcomes
In this project, a system for finding the trade-off that yields the best performance in the Apache Hadoop
and Spark frameworks will be developed. The method will be evaluated on diverse applications, including TestDFSIO,
TeraSort, WordCount, LogAnalyzer, and K-means. Using the developed methodology and techniques, it is also planned
to study energy-efficient data transfers in Apache Hadoop and Spark over RDMA-capable networks such as InfiniBand.
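As a minimal sketch of how such an evaluation might be driven, the following Scala code times a WordCount-style job under several Spark codec settings. The codec list, the timing approach, and the input path are illustrative assumptions; the project's actual evaluation would use the benchmark suites named above.

    import org.apache.spark.sql.SparkSession

    object CodecSweepSketch {
      // Time one WordCount-style run under a given Spark codec setting.
      def timeWordCount(codec: String, inputPath: String): Long = {
        val spark = SparkSession.builder()
          .appName(s"wordcount-$codec")
          .master("local[*]") // local run for illustration only
          .config("spark.io.compression.codec", codec)
          .getOrCreate()
        val start = System.nanoTime()
        spark.sparkContext.textFile(inputPath)
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1L))
          .reduceByKey(_ + _) // shuffle stage, affected by the codec
          .count()            // force evaluation of the whole job
        val elapsedMs = (System.nanoTime() - start) / 1000000
        spark.stop()
        elapsedMs
      }

      def main(args: Array[String]): Unit = {
        val inputPath = args(0) // hypothetical path to a text corpus
        for (codec <- Seq("lz4", "snappy", "zstd")) {
          println(s"codec=$codec elapsed=${timeWordCount(codec, inputPath)}ms")
        }
      }
    }

In a real evaluation, each configuration would be run several times on a cluster, with throughput and processor utilization recorded alongside wall-clock time.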
Partners
National Polytechnic University of Armenia (NPUA)
Institute for Informatics and Automation Problems of the National Academy of Sciences of the Republic of Armenia (IIAP)