Introduction to Spark Optimization
Optimizing Spark is fascinating because it shows how hardware and software blend together to deliver real gains. I love the feeling of cutting a job's runtime from one hour down to ten minutes. How? I will share these optimization techniques in a series of articles planned over the next few weeks.
Spark optimization happens at several levels; it is like a team game where each optimization plays its part in the overall performance. The following are the types of optimization needed when working with Spark.
- Cluster-level optimization (AWS) - Read Blog
  - Instance type and size (cores, memory, storage)
  - Bootstrap actions and steps
- Job-level optimization (spark-submit/shell) - Read Blog; see the configuration sketch after this list
  - Executor and driver memory and cores
  - Configuration (garbage collection, overhead, partitions, and parallelism)
- I/O-level optimization (AWS, HDFS, Scala) - Read Blog; see the write sketch after this list
  - Data input and output
  - Partitioning and bucketing
  - Compression
- Code-level optimization (Scala) - Read Blog; see the Scala sketch after this list
  - Avoiding shuffle
  - Filtering
  - Caching
  - Salting
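To give a taste of the job-level knobs before their dedicated article: these settings are usually passed to spark-submit as flags or --conf pairs, but the sketch below applies the equivalent settings through a SparkSession builder so everything stays in Scala. The specific values (4g of executor memory, 200 shuffle partitions, and so on) are illustrative assumptions, not recommendations.

```scala
import org.apache.spark.sql.SparkSession

// A minimal sketch of job-level tuning; the values are placeholders
// that would normally be sized to your cluster and workload.
val spark = SparkSession.builder()
  .appName("tuned-job")
  .config("spark.executor.memory", "4g")             // per-executor heap
  .config("spark.executor.cores", "4")               // concurrent tasks per executor
  .config("spark.driver.memory", "2g")
  .config("spark.executor.memoryOverhead", "512m")   // off-heap overhead per executor
  .config("spark.sql.shuffle.partitions", "200")     // post-shuffle parallelism
  .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC") // garbage collector choice
  .getOrCreate()
```

The same settings map one-to-one onto spark-submit, for example --executor-memory 4g or --conf spark.sql.shuffle.partitions=200.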
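For the I/O level, the sketch below writes a hypothetical sales DataFrame partitioned by date, bucketed by customer ID, and compressed with Snappy. The table name, column names, and bucket count are assumptions for illustration only.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// A minimal sketch, assuming a salesDf with `event_date` and `customer_id`
// columns. Note that bucketing requires saveAsTable (a metastore-backed table).
def writeSales(salesDf: DataFrame): Unit = {
  salesDf.write
    .mode(SaveMode.Overwrite)
    .partitionBy("event_date")       // enables directory-level partition pruning
    .bucketBy(32, "customer_id")     // pre-shuffled buckets speed up later joins
    .sortBy("customer_id")
    .option("compression", "snappy") // splittable, CPU-cheap compression
    .format("parquet")
    .saveAsTable("sales_bucketed")
}
```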
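And as a preview of the code-level techniques, here is a sketch combining three of them: filtering early, caching a DataFrame that is reused, and broadcasting a small table so the join avoids shuffling the large side. The DataFrames and column names are hypothetical; salting, which spreads skewed keys across partitions, is left to its dedicated article.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.broadcast

// A minimal sketch, assuming a large `events` DataFrame and a small
// `countries` lookup table sharing a `country_code` column.
def enrich(events: DataFrame, countries: DataFrame): DataFrame = {
  // Filter early so every downstream stage moves less data.
  val recent = events.filter("event_date >= '2020-01-01'").cache()

  // Broadcasting the small lookup table avoids shuffling the large side.
  val enriched = recent.join(broadcast(countries), Seq("country_code"))

  // `recent` is read a second time here, so the cache() above pays off.
  val counts = recent.groupBy("country_code").count()

  enriched.join(broadcast(counts), Seq("country_code"))
}
```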