Renaming Spark Part-NNNN Files on S3
We have seen a recurring issue with Spark jobs: because writes are distributed across executors, output files are named part-NNNN. Spark offers no direct way to choose the names before writing, and modifying the underlying write functions is not easy.
The practical way to carry out this task is to rename the files on S3 after they have been written. S3 has no native rename operation, so a rename is really a copy followed by a delete. If you only have a few files you can do this manually in the S3 web console, but if you have many files across different folders, the S3 SDK is the way to go.
The code is quite simple; below is an example in Scala:
// Required import (AWS SDK for Java v1)
import com.amazonaws.services.s3.AmazonS3ClientBuilder

// Initializing path variables
val location = "s3://bucket/"
val bucket_name = "bucket" // just the bucket name, not the path;
                           // it will not work with "s3://" or a "/" included

// Setting up the S3 client
val s3 = AmazonS3ClientBuilder.defaultClient()
val file_name = "part-NNNN"

// Copying the object to the new name
s3.copyObject(bucket_name, file_name, bucket_name, "renamed_file.txt")
// Deleting the old object; copy + delete together emulate a rename
s3.deleteObject(bucket_name, file_name)
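Since the copy and the delete always go together, it can help to wrap them in a small helper. A minimal sketch, where rename_object is a hypothetical name chosen here for illustration:

// Hypothetical helper: S3 has no native rename, so emulate it with copy + delete
def rename_object(bucket: String, from_key: String, to_key: String): Unit = {
  s3.copyObject(bucket, from_key, bucket, to_key)
  s3.deleteObject(bucket, from_key)
}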
To get the file names instead of hard-coding them, you can list the output location through the Hadoop FileSystem API:

// Required imports (Hadoop filesystem API)
import org.apache.hadoop.fs.{FileSystem, Path}

// Setting up the file system for the S3 location using java.net.URI
val fs = FileSystem.get(new java.net.URI(location), spark.sparkContext.hadoopConfiguration)
// Getting the list of files at that location
val file_list = fs.listStatus(new Path(location))
The file_list variable will hold the list of files, just like the output of the ls command.
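For instance, a minimal way to pull the plain file names out of it:

// Extract the bare file names from the FileStatus entries
val names = file_list.map(_.getPath.getName)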
This is the basic version of the code and it has to be adapted to your needs: for multiple files you will need a loop, and to find the part files among other objects (for example the _SUCCESS marker) you will need a pattern search, as sketched below.
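A rough sketch under a few assumptions: the part- prefix filter relies on Spark's default naming, rename_object is the hypothetical helper from above, and the output-$i.txt scheme is only an example.

// Pick out the part files and convert each Hadoop Path to an S3 object key
// (the key is the URI path relative to the bucket, without the leading "/")
val part_keys = file_list
  .map(_.getPath)
  .filter(_.getName.startsWith("part-"))
  .map(p => p.toUri.getPath.stripPrefix("/"))

// Rename each part file; the new naming scheme is illustrative
part_keys.zipWithIndex.foreach { case (old_key, i) =>
  rename_object(bucket_name, old_key, s"output-$i.txt")
}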