Introduction to Spark Optimization
Kicking off a series on Spark optimization, covering the different levels at which a Spark workload can be tuned.
Optimizing Spark is fascinating because it shows how hardware and software blend together to deliver something that feels like a great achievement. I love the feeling whenever I am able to increase performance, bringing a one-hour job down to ten minutes. How? I will share these optimization techniques in the series of articles I have planned for the next few weeks.
Spark optimization happens at several levels; it is like a team game where each optimization plays an important role in the overall performance. The following are the types of optimization needed when working with Spark, each illustrated with a short sketch after the list.
Cluster-level optimization (AWS) - Read Blog (a sample cluster launch is sketched below)
Instance type and size (cores, memory, storage)
Bootstrap actions and steps
Job-level optimization (spark-submit/shell) - Read Blog (a sample spark-submit is sketched below)
Executor and driver memory and cores
Configuration (garbage collection, memory overhead, partitions, and parallelism)
I/O-level optimization (AWS, HDFS, Scala) - Read Blog (a sample write is sketched below)
Data input and output
Partitioning and bucketing
Compression
Code-level optimization (Scala) - Read Blog (a salting sketch is below)
Avoiding shuffle
Filtering
Caching
Salting
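To make the cluster-level knobs concrete, here is a minimal sketch of launching a Spark cluster on AWS EMR with the AWS CLI. The instance type and count, S3 paths, release label, and script names are placeholder assumptions for illustration, not recommendations.

```sh
# Hypothetical example: instance type/count, S3 paths, and the bootstrap
# script are assumed values chosen only to show where each knob lives.
aws emr create-cluster \
  --name "spark-optimization-demo" \
  --release-label emr-6.10.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --bootstrap-actions Path=s3://my-bucket/bootstrap/install-deps.sh \
  --steps Type=Spark,Name="ETL step",ActionOnFailure=CONTINUE,Args=[--class,com.example.Main,s3://my-bucket/jars/etl.jar]
```

Choosing the instance family (compute-, memory-, or storage-optimized) against the job's profile is exactly the kind of decision the cluster-level article will dig into.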
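At the job level, most of the settings above are passed to spark-submit. A minimal sketch, with memory, core, and partition values chosen purely to illustrate the flags rather than as tuned recommendations:

```sh
# Hypothetical example: the class name, jar path, and all sizing values
# are assumptions; tune them against your own workload.
spark-submit \
  --class com.example.Main \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 4g \
  --executor-memory 8g \
  --executor-cores 4 \
  --num-executors 10 \
  --conf spark.executor.memoryOverhead=1g \
  --conf spark.sql.shuffle.partitions=200 \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC" \
  s3://my-bucket/jars/etl.jar
```

Executor memory, cores, overhead, parallelism, and the garbage collector all interact, which is why getting these flags right is a series topic of its own.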
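On the I/O side, partitioning, bucketing, and compression are mostly DataFrameWriter choices. A minimal Scala sketch, assuming a hypothetical events dataset with event_date and user_id columns:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("io-optimization-sketch")
  .getOrCreate()

// Hypothetical input: the path and the event_date/user_id columns are assumed.
val events = spark.read.parquet("s3://my-bucket/raw/events/")

// Partition on a low-cardinality column, bucket on a frequent join key,
// and compress the output. bucketBy requires a metastore-backed table,
// hence saveAsTable rather than a plain save.
events.write
  .partitionBy("event_date")
  .bucketBy(16, "user_id")
  .sortBy("user_id")
  .option("compression", "snappy")
  .format("parquet")
  .saveAsTable("events_bucketed")
```

Downstream reads that filter on event_date or join on user_id can then skip files and avoid shuffles, which is where the I/O-level wins come from.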
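At the code level, salting is the least obvious of these techniques, so here is a minimal Scala sketch of salting a skewed aggregation key. The column names, input path, and salt factor are illustrative assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .appName("salting-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical skewed dataset: a few customer_id values dominate the rows.
val orders = spark.read.parquet("s3://my-bucket/raw/orders/")

val saltBuckets = 8 // illustrative; size it to the observed skew

// Stage 1: append a random salt so one hot key spreads across many partitions.
val partial = orders
  .withColumn("salt", (rand() * saltBuckets).cast("int"))
  .groupBy($"customer_id", $"salt")
  .agg(sum($"amount").as("partial_amount"))

// Stage 2: drop the salt and combine the partial aggregates per key.
val totals = partial
  .groupBy($"customer_id")
  .agg(sum($"partial_amount").as("total_amount"))

totals.show()
```

Shuffle avoidance, early filtering, and caching each deserve the same treatment, and they will get their own examples in the code-level article.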