Big Data

File Format Load Speeds: Parquet vs. Avro

We do a lot of work using HDFS as our file system. It gives us a fault-tolerant, distributed environment to store our files. The two main file formats we use with HDFS are Avro and Parquet. Since Parquet is a compressed, columnar format, I wondered how the performance of Parquet writes compares to Avro writes. So I set up some tests to find out.

Since Impala cannot write to Avro, I need to perform these inserts in Hive (which makes the inserts slower). So I fire up my ole' Hive console and get started. First I need tables to insert into, so I created one that is Avro-backed and one that is Parquet-backed. Then I insert into each of these tables 8 times with an "insert overwrite table avro select * from insertTable". The source table, insertTable, has 913,304 rows and 63 columns, with a partition column on the month. The resulting sizes of the Avro and Parquet directories on HDFS are 673.6M and 76.8M respectively. A rough sketch of the statements is below, followed by a table of the write times.
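This is a minimal illustration rather than the exact DDL I used: the column names are placeholders (the real table has 63 columns), the "parquet" table name and the warehouse paths in the comments are assumptions on my part, and it assumes a Hive version (0.14 or later) where STORED AS AVRO and STORED AS PARQUET are available.

    -- Destination tables, one per storage format (column list abbreviated)
    CREATE TABLE avro    (col1 STRING, col2 INT /* ... 63 columns in total ... */) STORED AS AVRO;
    CREATE TABLE parquet (col1 STRING, col2 INT /* ... 63 columns in total ... */) STORED AS PARQUET;

    -- Each timed run overwrites a destination table with the full source table
    INSERT OVERWRITE TABLE avro SELECT * FROM insertTable;
    INSERT OVERWRITE TABLE parquet SELECT * FROM insertTable;

    -- Directory sizes were then compared on HDFS, along the lines of:
    --   hdfs dfs -du -s -h /user/hive/warehouse/avro
    --   hdfs dfs -du -s -h /user/hive/warehouse/parquet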

Run      Avro (s)    Parquet (s)
1        128.121     114.786
2         99.132     112.18
3         97.644     125.045
4         97.627      97.107
5         97.17       96.553
6         98.832     104.718
7         97.14      108.456
8         97.104      98.2
Average  101.59      107.13

So I am seeing Parquet at roughly one-ninth the size of Avro (76.8M vs. 673.6M) while taking only about 5% more time to write (107.13 vs. 101.59 on average). If the application's use pattern is write once, read many times, it makes sense to go with Parquet. But depending on your data profile, even if you are writing many times, it may still make sense to use Parquet. There may be other data sets that widen the gap between these two formats; if anyone has insight into what kind of data would do that, I could rerun these tests with updated data. Just leave a comment.

The next step is to compare the write performance of Parquet vs. Kudu.