ORC vs Parquet vs Avro: How to select the right file format for Hive?
ORC vs Parquet vs Avro: which one is the best of the lot? Anyone working with Hive has asked this question at some point, and there have been many interesting discussions around it.
Contributing my two cents, I'll answer it here as well. At Ellicium, we have come across this question many times and have answered it quite successfully, having run this comparison for several of our customers.
Why don’t you try it out yourself first?
You can use a sample dataset and get familiar with the comparison framework before trying out with actual data.
You can start with a commonly available dataset which you can download and use – https://openflights.org/data.html.
Download 'airlines.dat' and 'airports.dat', for example from https://raw.githubusercontent.com/jpatokal/openflights/master/data/airlines.dat.
The performance metrics to be captured during this process (to keep it simple for now) would be the time taken to load data into the Hive tables, the compression factor achieved, and the query execution time.
It is highly recommended to perform the ORC vs Parquet vs Avro comparison on large, real-world volumes of data to reach an appropriate conclusion for your requirements. The airlines data is used here only as an example.
Follow the steps below to start on this interesting and revealing journey –
Data Loading into Hive –
First step would be to get the data available in Hive.
We need three copies of the 'airlines' table and the 'airports' table created in Hive, storing data in ORC, Parquet and Avro format respectively. Let us call them 'airlines_orc', 'airlines_parquet' and 'airlines_avro', and similarly for the 'airports' table. There are various options for doing this.
One option would be to load the CSV data for airlines and airports directly into the Hive tables.
Another option would be to create an external table in Hive pointing to the source files on HDFS, and then move the data from the external table into each Hive table using the 'insert overwrite table' command.
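The external-table route can be sketched in HiveQL as below. This is only an illustration: the column layout is assumed from the airlines.dat file, and the HDFS path, table and column names are placeholders of our choosing.

```sql
-- External table over the raw CSV on HDFS (path and schema are assumptions;
-- note that quoted fields containing commas would need a CSV SerDe instead).
CREATE EXTERNAL TABLE airlines_ext (
  airlineid INT, name STRING, alias STRING, iata STRING,
  icao STRING, callsign STRING, country STRING, active STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/openflights/airlines';

-- One managed table per storage format under comparison; repeat likewise
-- with STORED AS PARQUET and STORED AS AVRO for airlines_parquet and
-- airlines_avro, and similarly for the 'airports' table.
CREATE TABLE airlines_orc (
  airlineid INT, name STRING, alias STRING, iata STRING,
  icao STRING, callsign STRING, country STRING, active STRING
) STORED AS ORC;

-- Time each load separately; this is the first metric to record.
INSERT OVERWRITE TABLE airlines_orc SELECT * FROM airlines_ext;
```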
Performance Metric – One of the performance metrics to capture here is the actual time taken to load data into 'airlines_orc', 'airlines_parquet' and 'airlines_avro'.
Check the total underlying file size to figure out the compression factor achieved, which becomes a considerable factor for huge datasets.
Also check the table properties to gain more insight into the underlying files created.
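A few HiveQL statements are enough for this inspection; the warehouse path in the comment is an assumption that DESCRIBE FORMATTED will confirm for your setup.

```sql
-- Populate and inspect table statistics; 'numFiles' and 'totalSize' in the
-- output reveal the compression achieved by each format.
ANALYZE TABLE airlines_orc COMPUTE STATISTICS;
DESCRIBE FORMATTED airlines_orc;
SHOW TBLPROPERTIES airlines_orc;
-- Or measure directly on HDFS (warehouse path is an assumption):
--   hdfs dfs -du -h /user/hive/warehouse/airlines_orc
```

Repeat for the Parquet and Avro tables and note the size ratios against the raw CSV.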
Query Execution –
Next, we move onto the most important task of capturing the performance metrics for query execution.
- Once the data has been loaded into Hive, we need to execute a couple of different types of queries and capture the execution time.
We can use the below type of queries for execution –
- A simple select query with order by.
- Simple count queries, e.g. select count(airlineid) from airlines.
- Distinct counts, e.g. select count(distinct airlineid) from airlines.
- Joins, e.g. joining the airlines and airports tables.
- Aggregate functions such as max and min.
- Group by with various combinations of having and order by clauses.
- Subqueries.
- Various combinations of joins and subqueries, joins and group by, subqueries and group by, case statements, etc.
- Queries using complex logic to answer some business questions.
- Most important and useful: frequently used queries based on the project or client requirements.
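The query types above can be sketched against the ORC tables as follows (run the same set against the Parquet and Avro tables). Column names such as airlineid and country are assumptions about the loaded schema.

```sql
-- Simple select with order by
SELECT * FROM airlines_orc ORDER BY name;

-- Plain and distinct counts
SELECT COUNT(airlineid) FROM airlines_orc;
SELECT COUNT(DISTINCT airlineid) FROM airlines_orc;

-- Join between the two tables (country is assumed common to both)
SELECT al.name AS airline_name, ap.name AS airport_name
FROM airlines_orc al JOIN airports_orc ap ON al.country = ap.country;

-- Aggregates, and group by with having and order by
SELECT MAX(airlineid), MIN(airlineid) FROM airlines_orc;
SELECT country, COUNT(*) AS cnt
FROM airlines_orc GROUP BY country HAVING COUNT(*) > 10 ORDER BY cnt DESC;

-- Subquery
SELECT name FROM airlines_orc
WHERE country IN (SELECT country FROM airports_orc);
```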
Performance Metric – Capture the execution time for these queries as well as other factors, such as the number of mappers and reducers used.
Once the above step has been completed, you would get a fair idea of which format suits your type of data and queries. To go further, we need to control the Hive environment for query execution. We can achieve this by setting various Hadoop parameters, such as the split size (which affects the number of mappers) and the number of reducers, and then executing all the above queries again, capturing the same performance metrics: query execution time, number of mappers and reducers used, and so on.
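A controlled run might pin down a couple of these knobs per Hive session; the values below are illustrative only, not recommendations.

```sql
-- Cap the input split size (~128 MB here) to influence the mapper count.
SET mapreduce.input.fileinputformat.split.maxsize=134217728;
-- Fix the reducer count so runs are comparable across formats.
SET mapreduce.job.reduces=4;
-- Now re-run the benchmark queries and capture the timings again.
```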
Comparison and Conclusion –
So, just a recap:
we have captured performance metrics for loading data into the Hive ORC, Parquet and Avro tables, and measured the compression factor achieved.
We have captured query execution times for various types of queries, ranging from simple to complex ones involving joins, aggregations, business logic, etc.
To confirm the outcome, we have further set up a controlled environment by tuning certain Hadoop parameters that impact Hive queries.
Based on all the results we have captured, it should now be pretty straightforward to conclude which is the best format to store and query your data on Hive, based on your requirements.
We have carried out this kind of comparison exercise, involving more complex steps, for many of our customers. If you ask me which file format is the best, the answer is: it depends on the customer's requirements.
Interestingly, we have found this choice to depend on various parameters: the amount of data the customer needs to manage, the kinds of queries executed most frequently, the available infrastructure, the optimizations performed, and so on. In some situations ORC gave better results based on these parameters; in others Parquet won the bet, and still others went with Avro.
Hope this provides a starting point for those looking to decide on a file format for storing huge amounts of data, and querying it quickly, based on their project requirements.
In my next blog, I will share our findings in one of our projects at Ellicium, wherein we compared the ORC, Parquet and Avro format files for data storage and query execution on Hive.
Stay tuned for this very interesting piece of information!
Read my previous article about the ORC file structure:
All You Need To Know About ORC File Structure In Depth : http://bit.ly/2shvver