Crushing AVRO Small Files with Spark

Solving the many small files problem for AVRO

Ian Duffy

Software Engineer

Kevin Eid

Data Engineer

Posted on Feb 06, 2018


The Fashion Content Platform teams in Zalando Dublin handle large amounts of data on a daily basis. To make sense of it all, we utilise Hadoop (EMR) on AWS. In this post, we discuss a pipeline that is fed by a real-time system. Due to the variance in data volumes and the intervals at which these systems write to storage, they can produce a large number of small files.


While Hadoop is capable of processing large amounts of data, it typically works best with a small number of large files, not with a large number of small files. A small file is one that is smaller than the Hadoop Distributed File System (HDFS) block size (64MB by default). In MapReduce, every map task handles computation on a single input block. Having many small files means that there will be a lot of map tasks, each handling only a small amount of data. This creates a larger memory overhead and slows down the job. Additionally, when using AWS S3 as the backing store instead of HDFS, listing objects can take quite a long time, and even longer when lots of objects exist. [3]
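As a rough, hypothetical illustration of the overhead: roughly 10GB of data spread across 100,000 files of ~100KB each would spawn 100,000 map tasks, whereas the same data packed into 64MB blocks would need only about 160.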

Known solutions and why they don’t work for AVRO

This is a well-known problem; there are many utilities and approaches for solving the issue:

  1. s3-dist-cp - This is a utility created by Amazon Web Services (AWS). It is an adaptation of Hadoop’s DistCp utility for HDFS that supports S3. This utility enables you to solve the small file problem by aggregating files together using the --groupBy option and by setting a maximum size using the --targetSize option.
  2. Filecrush - This is a highly configurable tool designed for the sole purpose of “crushing” small files together to solve the small file problem.
  3. Avro-Tools - This supplies many different functions for reading and manipulating AVRO files. One of these functions is “concat” which works perfectly for merging AVRO files together. However, it’s designed to be used on a developer’s machine rather than on a large scale scheduled job.

While these utilities exist, they do not work for our use case. The data produced by our system is stored as AVRO files. These files contain a file header followed by one or more blocks of data; as such, a simple byte-level append will not work and results in corrupt data. Additionally, the Filecrush utility doesn't support reading files from AWS S3.


We decided to roll our own solution. The idea was straightforward: use Spark to create a simple job that reads the daily directory of raw AVRO data and re-partitions it, using the following equation to determine the number of partitions needed to write back the larger files:

number_of_partitions = input_size / (AVRO_COMPRESSION_RATIO * DEFAULT_HDFS_BLOCK_SIZE)
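For illustration, the calculation might look like the following Scala helper. The object name, the compression ratio and the block size constants are assumptions made for this sketch, not our production figures; the input size is taken from the Hadoop FileSystem API.

import org.apache.hadoop.fs.Path
import org.apache.spark.SparkContext

object PartitionCalculator {
  // Assumed values for the sketch; tune these for your own data.
  val AvroCompressionRatio = 0.6                  // compressed size / raw size
  val DefaultHdfsBlockSize = 64L * 1024 * 1024    // 64MB

  // Estimate how many partitions are needed so each output file is roughly one block.
  def numberOfPartitions(sc: SparkContext, inputDir: String): Int = {
    val path = new Path(inputDir)
    val fs = path.getFileSystem(sc.hadoopConfiguration)
    val inputSize = fs.getContentSummary(path).getLength // total bytes under inputDir
    math.max(1, (inputSize / (AvroCompressionRatio * DefaultHdfsBlockSize)).toInt)
  }
}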

Our initial approach used spark-avro by Databricks to read in the AVRO files and write out the grouped output. However, on validating the data, we noticed an issue: the schema in the output data was completely mangled. With no workaround to be found, we reached out to our resident big data guru Peter Barron, who saved the day by introducing us to Spark's newAPIHadoopFile and saveAsNewAPIHadoopFile methods, which allowed us to read and write GenericRecords of AVRO without modifying the schema.
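The following is a minimal sketch of that approach, not the exact production job: it reads GenericRecords with AvroKeyInputFormat, coalesces them into fewer partitions, and writes them back with AvroKeyOutputFormat so the schema passes through untouched. The object name, method signature, paths, schema argument and partition count are assumptions for illustration.

import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.{AvroJob, AvroKeyInputFormat, AvroKeyOutputFormat}
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkContext

object AvroCrusher {
  def crush(sc: SparkContext, inputDir: String, outputDir: String,
            schema: Schema, numPartitions: Int): Unit = {
    // Read the raw AVRO files as (AvroKey[GenericRecord], NullWritable) pairs.
    val records = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable,
      AvroKeyInputFormat[GenericRecord]](inputDir)

    // Carry the input schema through to the output unchanged.
    val job = Job.getInstance(sc.hadoopConfiguration)
    AvroJob.setOutputKeySchema(job, schema)

    records
      .coalesce(numPartitions) // fewer, larger output files
      .saveAsNewAPIHadoopFile(
        outputDir,
        classOf[AvroKey[GenericRecord]],
        classOf[NullWritable],
        classOf[AvroKeyOutputFormat[GenericRecord]],
        job.getConfiguration)
  }
}

Using coalesce rather than repartition avoids a full shuffle, which also sidesteps the fact that Avro's GenericRecord is not Java-serializable by default.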

Conclusion

In a nutshell, we were able to solve the many small files problem for AVRO by writing a Spark job that leverages the low-level functionality of the Hadoop file APIs. In effect, repartitioning the data into larger files lets future jobs work on bigger blocks of data, improving their speed by decreasing the number of map tasks needed and reducing storage overhead.


We're looking for software engineers and other talents. For details, check out our jobs page.

References

[1] Sachin Bendea and Rajashree Shedge, "Dealing with Small Files Problem in Hadoop Distributed File System", https://www.sciencedirect.com/science/article/pii/S1877050916002581

[2] "The Small Files Problem", Cloudera Engineering Blog, https://blog.cloudera.com/blog/2009/02/the-small-files-problem/

[3] "Integrating Spark At Petabyte Scale", Netflix, https://events.static.linuxfound.org/sites/events/files/slides/Netflix%20Integrating%20Spark%20at%20Petabyte%20Scale.pdf
