Introduction to Spark Optimization
Kicking off a series on Spark optimization, covering the different levels at which a Spark workload can be tuned.
Optimizing Spark is fascinating because it shows how hardware and software blend together to deliver something that feels like a great achievement. I love the feeling whenever I am able to increase performance, bringing a one-hour job down to ten minutes. How? I will share these optimization techniques in the series of articles I have planned for the next few weeks.
Spark optimization happens at several levels; it is like a team game where each optimization plays an important role in the overall performance. The following are the types of optimization needed when working with Spark, each illustrated with a short sketch after the list.
Cluster-level optimization (AWS) - Read Blog (a sample cluster launch is sketched below)
Instance type and size (cores, memory, storage)
Bootstrap actions and steps
Job-level optimization (spark-submit/shell) - Read Blog (a sample spark-submit is sketched below)
Executor and driver memory and cores
Configuration (garbage collection, memory overhead, partitions, and parallelism)
I/O-level optimization (AWS, HDFS, Scala) - Read Blog (a sample write is sketched below)
Data input and output
Partitioning and bucketing
Compression
Code-level optimization (Scala) - Read Blog (a salting sketch is below)
Avoiding shuffle
Filtering
Caching
Salting
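To make the cluster-level knobs concrete, here is a minimal sketch of launching a Spark cluster on AWS EMR with the AWS CLI. The instance type and count, S3 paths, release label, and script names are placeholder assumptions for illustration, not recommendations.

```sh
# Hypothetical example: instance type/count, S3 paths, and the bootstrap
# script are assumed values chosen only to show where each knob lives.
aws emr create-cluster \
  --name "spark-optimization-demo" \
  --release-label emr-6.10.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --bootstrap-actions Path=s3://my-bucket/bootstrap/install-deps.sh \
  --steps Type=Spark,Name="ETL step",ActionOnFailure=CONTINUE,Args=[--class,com.example.Main,s3://my-bucket/jars/etl.jar]
```

Choosing the instance family (compute-, memory-, or storage-optimized) against the job's profile is exactly the kind of decision the cluster-level article will dig into.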
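At the job level, most of the settings above are passed to spark-submit. A minimal sketch, with memory, core, and partition values chosen purely to illustrate the flags rather than as tuned recommendations:

```sh
# Hypothetical example: the class name, jar path, and all sizing values
# are assumptions; tune them against your own workload.
spark-submit \
  --class com.example.Main \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 4g \
  --executor-memory 8g \
  --executor-cores 4 \
  --num-executors 10 \
  --conf spark.executor.memoryOverhead=1g \
  --conf spark.sql.shuffle.partitions=200 \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC" \
  s3://my-bucket/jars/etl.jar
```

Executor memory, cores, overhead, parallelism, and the garbage collector all interact, which is why getting these flags right is a series topic of its own.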
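On the I/O side, partitioning, bucketing, and compression are mostly DataFrameWriter choices. A minimal Scala sketch, assuming a hypothetical events dataset with event_date and user_id columns:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("io-optimization-sketch")
  .getOrCreate()

// Hypothetical input: the path and the event_date/user_id columns are assumed.
val events = spark.read.parquet("s3://my-bucket/raw/events/")

// Partition on a low-cardinality column, bucket on a frequent join key,
// and compress the output. bucketBy requires a metastore-backed table,
// hence saveAsTable rather than a plain save.
events.write
  .partitionBy("event_date")
  .bucketBy(16, "user_id")
  .sortBy("user_id")
  .option("compression", "snappy")
  .format("parquet")
  .saveAsTable("events_bucketed")
```

Downstream reads that filter on event_date or join on user_id can then skip files and avoid shuffles, which is where the I/O-level wins come from.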
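At the code level, salting is the least obvious of these techniques, so here is a minimal Scala sketch of salting a skewed aggregation key. The column names, input path, and salt factor are illustrative assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .appName("salting-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical skewed dataset: a few customer_id values dominate the rows.
val orders = spark.read.parquet("s3://my-bucket/raw/orders/")

val saltBuckets = 8 // illustrative; size it to the observed skew

// Stage 1: append a random salt so one hot key spreads across many partitions.
val partial = orders
  .withColumn("salt", (rand() * saltBuckets).cast("int"))
  .groupBy($"customer_id", $"salt")
  .agg(sum($"amount").as("partial_amount"))

// Stage 2: drop the salt and combine the partial aggregates per key.
val totals = partial
  .groupBy($"customer_id")
  .agg(sum($"partial_amount").as("total_amount"))

totals.show()
```

Shuffle avoidance, early filtering, and caching each deserve the same treatment, and they will get their own examples in the code-level article.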