Optimizing Spark I/O
Reading and writing files is part of most Spark jobs, and there are a few approaches and techniques worth applying to make that I/O efficient. This article refers to AWS S3; if you are not familiar with that storage service, please read the AWS Docs.
Using Local File System
Reading and writing data is very performant when it stays on cluster-local storage such as HDFS, so it is always worth checking how local storage performs compared to external storage such as AWS S3. There is always an overhead when interacting with external systems: common problems such as network or API issues can come up at any time and slow your job down.
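As a minimal sketch of such a comparison (the HDFS and S3 paths below are placeholders, and the datasets are assumed to already exist as Parquet), you can time a full read of the same data from both locations:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("io-comparison").getOrCreate()

// Time a full read of a Parquet dataset; the paths below are placeholders.
def timeReadMs(path: String): Long = {
  val start = System.nanoTime()
  spark.read.parquet(path).count()          // count() forces the whole dataset to be read
  (System.nanoTime() - start) / 1000000
}

println(s"HDFS read: ${timeReadMs("hdfs:///data/events")} ms")
println(s"S3 read:   ${timeReadMs("s3a://my-bucket/data/events")} ms")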
Data Partitioning
Partitioning the data improves overall performance: well-partitioned data, whether on local or external storage, can be read very quickly, reducing load time. It also helps Spark queries, since work is spread evenly across executors when the data is partitioned on well-distributed keys. Similarly, it gives a boost to other teams, who can load or access the data quickly using their own tools.
// by column (requires: import org.apache.spark.sql.functions.col; "date" is an example column)
dataframe.repartition(col("date"))
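As a rough sketch (assuming a dataframe with a date column and a hypothetical S3 bucket), partitioned data is typically written out with partitionBy so each key value lands in its own directory:
import org.apache.spark.sql.functions.col

dataframe
  .repartition(col("date"))                 // group rows for each date into the same in-memory partition
  .write
  .partitionBy("date")                      // one directory per date value on storage
  .mode("overwrite")
  .parquet("s3a://my-bucket/events_by_date")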
Data Bucketing
Bucketing is an optimization technique applied when writing data: it pre-shuffles rows into a fixed number of buckets on chosen columns so that later shuffles on those columns can be avoided. It is only worth bucketing dataframes that are shuffled on the same keys more than once, so that the pre-shuffle actually pays off (bucketing is expensive). Bucketing support is enabled by default in Spark via:
spark.sql.sources.bucketing.enabled
dataframe.write.bucketBy()
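A minimal sketch of a bucketed write, assuming a hypothetical user_id join key and 16 buckets; note that a bucketed write has to go through saveAsTable rather than a path-based save:
spark.conf.get("spark.sql.sources.bucketing.enabled")   // "true" by default

dataframe
  .write
  .bucketBy(16, "user_id")                  // 16 buckets hashed on the join key
  .sortBy("user_id")                        // optional: keep each bucket sorted
  .mode("overwrite")
  .saveAsTable("events_bucketed")           // bucketBy only works with saveAsTable, not path-based writes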
Recommended reading: Bucketing, and a great answer on Partitioning vs Bucketing.
Data Compression
It is always recommended to use compression together with a modern columnar format such as Apache Parquet or Apache ORC for better performance. Parquet supports the Snappy codec, which is very fast and offers a good compression ratio compared to the traditional gzip. Spark has built-in read and write functions for Parquet files:
spark.read.parquet()
dataframe.write.parquet()
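For example (the S3 path is a placeholder), Snappy is already Spark's default Parquet codec, but it can be set explicitly on the writer:
// Snappy is Spark's default Parquet codec; it is set explicitly here for clarity.
dataframe
  .write
  .option("compression", "snappy")
  .parquet("s3a://my-bucket/events_snappy")

val events = spark.read.parquet("s3a://my-bucket/events_snappy")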
Optimal File Size
File sizes should be roughly uniform across the dataset: neither a handful of 5 GB files nor millions of 5 KB files is ideal. Aim for a size that suits your partitioning and bucket counts (a commonly cited target is on the order of 128 MB to 1 GB per file).
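As a sketch (the partition count, record cap, and path are hypothetical and depend on your data volume), the number and size of output files can be controlled at write time:
dataframe
  .repartition(200)                          // roughly 200 output files, sized by total data volume / 200
  .write
  .option("maxRecordsPerFile", 1000000L)     // additionally cap the number of rows per file
  .mode("overwrite")
  .parquet("s3a://my-bucket/events_sized")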