Hive has long been one of the preferred tools for querying large datasets, especially when queries require a full table scan.
For tables that are not partitioned, Hive reads every file in the table's data directory and then applies the query filters in a subsequent phase. Because all of the data must be read regardless of the filter, this becomes slow and expensive on large tables.
Users very often need to filter data on specific column values. To apply partitioning in Hive, users need to understand the domain of the data they are analyzing. With that knowledge, identifying the frequently queried or accessed columns becomes easy, and Hive's partitioning feature can then be applied to those columns.
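As a sketch, suppose the dataset is sales data that is frequently filtered by country and date; the table and column names below are illustrative, not taken from any particular schema:

```sql
-- Hypothetical sales table partitioned on the frequently queried
-- country and sale-date columns. Each distinct (country, dt) pair
-- becomes its own subdirectory under the table's data directory,
-- e.g. .../sales/country=IN/dt=2023-01-01/
CREATE TABLE sales (
    order_id BIGINT,
    product  STRING,
    amount   DOUBLE
)
PARTITIONED BY (country STRING, dt STRING)
STORED AS ORC;
```

Note that the partition columns are declared in the `PARTITIONED BY` clause rather than in the column list: in Hive, partition columns are virtual columns whose values are encoded in the directory path, not stored in the data files.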
Because partitions are horizontal slices of the data, large datasets can be separated into more manageable chunks.
When to use Hive partitioning:
Partitioning is recommended whenever a user wants the data contained in a table to be split across multiple sections of the Hive table.
Rows are segregated by the values of the partition columns and stored in their respective partitions. When a query filters on those columns, only the required partitions of the table are scanned, which reduces the time the query takes to return results.
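To illustrate partition pruning, assume a hypothetical `sales` table partitioned by `country` and `dt` string columns (the names are illustrative):

```sql
-- Load a file into one specific partition (static partitioning);
-- the values given here determine the target directory.
LOAD DATA INPATH '/staging/sales_in'
INTO TABLE sales PARTITION (country = 'IN', dt = '2023-01-01');

-- Because the WHERE clause filters on the partition columns, Hive
-- reads only the files under country=IN/dt=2023-01-01 and skips
-- every other partition entirely.
SELECT product, SUM(amount)
FROM sales
WHERE country = 'IN' AND dt = '2023-01-01'
GROUP BY product;
```

Without the partition filter, the same query would fall back to scanning every partition of the table.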