Optimal column count for ORC and Parquet

+1 vote

I am using PySpark to read the files and have a query regarding the maximum number of columns that can be handled:

  • What is the optimal column count for ORC and Parquet?
  • If we have 3000+ columns and 10 lakh+ (1 million+) records in a file, which of the two (ORC or Parquet) is more appropriate?
May 8, 2020 in Apache Spark by Amey

1 answer to this question.

+1 vote

Hi@Amey,

It depends on your use case. Both Parquet and ORC have their own advantages and disadvantages. If you are on Spark, Parquet is usually the better choice because Spark ships a mature vectorized Parquet reader (a native vectorized ORC reader only arrived later, in Spark 2.3+). It also depends on how nested your data is and how many columns there are: both formats store nested data in a hierarchical, tree-like structure, and the more nesting, the deeper the tree.

ORC, on the other hand, is designed around a flattened file layout. So if your data is flat with fewer columns, you can go with ORC; otherwise, Parquet will serve you better. Compression on flattened data works remarkably well in ORC.
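The rule of thumb above can be sketched as a toy helper — the thresholds here are illustrative assumptions, not Spark or file-format defaults:

```python
def suggest_format(num_columns: int, max_nesting_depth: int) -> str:
    """Pick between ORC and Parquet per the heuristic above:
    ORC favours flat data with fewer columns; Parquet handles wide
    and deeply nested schemas better. Thresholds are illustrative."""
    if max_nesting_depth == 0 and num_columns <= 1000:
        return "orc"
    return "parquet"

# The flat, 3000+ column table from the question:
print(suggest_format(3000, 0))  # parquet
# A small flat table:
print(suggest_format(200, 0))   # orc
```

For the asker's case (3000+ columns, flat), the heuristic points at Parquet; in practice, benchmarking both formats on a sample of the real data is the only reliable tiebreaker.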

Hope this will give you some idea.

answered May 8, 2020 by MD
