All You Need To Know About ORC File Structure In Depth

Posted by Rohan Karanjawala on June 19, 2017 in Blog


Want to store data in Hive tables but wondering which file format to use, ORC or Parquet?

Well, this is a question that many have tried to answer in various ways.

As a first step towards finding this out, let us try to understand how the Optimized Row Columnar (ORC) file format differs from our usual flat file.

ORC is a columnar file format. You can visualize the structure of an ORC file as an area divided into a header, a body and a footer.

The header contains the text ‘ORC’ so that tools can determine the type of the file while processing it.
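As a small illustration of the header check described above, here is a minimal Python sketch that tests whether a file begins with the three bytes ‘ORC’. The function names has_orc_magic and looks_like_orc are hypothetical helpers made up for this example, not part of any ORC library, and a matching header alone does not prove that the rest of the file is valid.

```python
ORC_MAGIC = b"ORC"  # the text the header is described as containing


def has_orc_magic(data: bytes) -> bool:
    """Return True if the given bytes start with the ORC header text."""
    return data[:len(ORC_MAGIC)] == ORC_MAGIC


def looks_like_orc(path: str) -> bool:
    """Read only the first three bytes of a file and check them."""
    with open(path, "rb") as f:
        return has_orc_magic(f.read(len(ORC_MAGIC)))
```

A tool could call looks_like_orc on each incoming file before handing it to a real ORC reader, which is the kind of type detection the header is meant to support.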

 

 

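The body contains the actual data as well as the indexes. Data is stored in the ORC file as groups of rows called stripes. The Hive documentation describes a default stripe size of 250 MB; the value actually used depends on the writer and its orc.stripe.size setting, which defaults to 64 MB in recent Hive and ORC releases.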

Each stripe is further divided into three sections: an index section that contains a set of indexes for the stored data, the data itself, and a stripe footer. One interesting thing to note here is that both the index and data sections are stored column by column, so that only the columns holding the required data are read. Index data consists of the min and max values for each column as well as the row positions within each column. ORC indexes help to select the stripes that may contain the required data, as well as the row groups within those stripes. The stripe footer contains the encoding of each column and a directory of the streams along with their locations.
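To see how min and max values let a reader skip data, consider the following minimal Python sketch. It does not read a real ORC file: the RowGroupStats class, the groups_to_read function and the sample numbers are all assumptions invented for illustration, and the 10,000-row spacing between groups is just an example value.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Hypothetical index entry: where a row group starts and its value range."""
    first_row: int
    min_value: int
    max_value: int


def groups_to_read(index, lo, hi):
    """Keep only row groups whose [min, max] range overlaps the query range [lo, hi]."""
    return [g for g in index if g.max_value >= lo and g.min_value <= hi]


# Made-up index for one column, one entry per row group.
index = [
    RowGroupStats(first_row=0, min_value=1, max_value=95),
    RowGroupStats(first_row=10_000, min_value=96, max_value=210),
    RowGroupStats(first_row=20_000, min_value=211, max_value=400),
]

# Query: WHERE value BETWEEN 100 AND 150
selected = groups_to_read(index, 100, 150)
print([g.first_row for g in selected])  # prints [10000]
```

Only the second row group can contain matching values, so the other two are never read. The same overlap test applies at stripe level using the stripe statistics.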

 

The footer section consists of three parts: file metadata, file footer and postscript.

The file metadata section contains statistical information about the columns, recorded at stripe level. These statistics enable input split elimination based on predicate pushdown, with the predicates evaluated for each stripe. The file footer contains the list of stripes in the file, the number of rows per stripe, and the data type of each column. It also contains column-level aggregates such as count, min, max, and sum. The postscript contains file information such as the lengths of the file’s footer and metadata sections, the version of the file, and the compression parameters: the general compression used (e.g. none, zlib, or snappy) and the compression block size.
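Because the postscript sits at the very end of the file, a reader starts from the tail. According to the ORC specification, the final byte of the file holds the length of the serialized postscript. The minimal Python sketch below locates the postscript bytes on that basis. The function name split_tail is a hypothetical helper for this example, and decoding the postscript itself (a Protocol Buffers message) is deliberately left out.

```python
def split_tail(data: bytes):
    """Return (postscript_length, postscript_bytes) from the raw bytes of a file.

    Assumes the last byte stores the postscript length, as the ORC
    specification describes. The postscript content is not decoded here.
    """
    if not data:
        raise ValueError("empty input")
    ps_len = data[-1]  # final byte = postscript length
    if ps_len + 1 > len(data):
        raise ValueError("postscript length exceeds file size")
    postscript = data[-1 - ps_len:-1]  # the bytes just before the final byte
    return ps_len, postscript
```

In practice a reader would fetch only the last few kilobytes of the file, apply this step, and then use the footer and metadata lengths stored in the postscript to find the remaining tail sections.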

I’m sure you now have a much better understanding of the ORC file format structure, which should help you make a better decision when selecting a file format. Of course, your next step would be to compare file formats like ORC and Parquet and decide which one is better suited to your project.

Are you confused about where to begin this comparison? Well, in my next blog I will take you through how to compare various file formats, to help out those looking to do this exercise. Stay tuned!

About Author:

Rohan is a senior manager at Ellicium Solutions Pvt Ltd who looks after projects in the Big Data, IoT and Analytics areas, helping businesses stay ahead of the competition.