Big Data Analytics: BigQuery, Impala, and Drill


In the previous post, we discussed Apache Hive, which first brought SQL to Hadoop. Several SQL-on-Hadoop solutions now compete with Hive head-to-head. Today, we will look at Google BigQuery, Cloudera Impala, and Apache Drill, all of which trace their roots to Google Dremel, a system designed for interactive analysis of web-scale datasets. In a nutshell, they are native massively parallel processing (MPP) query engines that operate on read-only data.

Google BigQuery is the public implementation of Dremel. BigQuery provides the core set of features available in Dremel to third-party developers via a REST API. Impala is Cloudera’s open source SQL query engine that runs on Hadoop. It is modeled after Dremel and is Apache-licensed. Impala became generally available in May 2013. Drill is another open source project inspired by Dremel and is still incubating at Apache. Both Impala and Drill can query Hive tables directly; Impala, in fact, uses Hive’s metastore.

Hive is basically a front end that parses SQL statements, generates and optimizes logical plans, and translates them into physical plans that are finally executed by a back end such as MapReduce or Tez. Dremel and its derivatives are different in that they execute queries natively without translating them into MapReduce jobs. For example, the core Impala component is a daemon process that runs on each node of the cluster and acts as the query planner, coordinator, and execution engine. Each node can accept queries. The planner turns a request into collections of parallel plan fragments. The coordinator initiates execution on remote nodes in the cluster. The execution engine reads and writes data files and transmits intermediate query results back to the coordinator node.

The two core technologies of Dremel are columnar storage for nested data and the tree architecture for query execution:

Columnar Storage

Data is stored in a columnar fashion to achieve very high compression ratios and scan throughput.
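To make the idea concrete, here is a minimal Python sketch of why a columnar layout helps. It is not Dremel's actual format (which also handles nested records); the sample rows, the field names, and the choice of run-length encoding are all assumptions made for illustration.

```python
from itertools import groupby

# Hypothetical sample rows; the field names are illustrative only.
rows = [
    {"country": "US", "status": 200, "bytes": 512},
    {"country": "US", "status": 200, "bytes": 1024},
    {"country": "US", "status": 404, "bytes": 0},
    {"country": "DE", "status": 200, "bytes": 2048},
    {"country": "DE", "status": 200, "bytes": 256},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


columns = to_columns(rows)

# A query such as SUM(bytes) reads one column and skips the others.
total_bytes = sum(columns["bytes"])

# Values within a column tend to repeat, so simple encodings compress well.
encoded_country = run_length_encode(columns["country"])
```

The scan only touches the `bytes` list, and the `country` column shrinks from five values to two runs. Real columnar formats use more elaborate encodings, but the effect is the same in spirit.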

Tree Architecture

The architecture forms a massively parallel, distributed, multi-level serving tree: a query is pushed down the tree, and the results are aggregated on the way back up from the leaves.
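The push-down and aggregate-up flow can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the `Leaf` and `Node` classes are hypothetical names, the query is a fixed "count and sum of values above a threshold", and children are visited one after another, whereas a real serving tree would run them in parallel on separate servers.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Leaf:
    """A leaf server that scans its own slice of the data."""
    values: List[int]

    def execute(self, threshold):
        # Partial aggregate: (count, sum) of local values above the threshold.
        matched = [v for v in self.values if v > threshold]
        return len(matched), sum(matched)


@dataclass
class Node:
    """An intermediate or root server that forwards the query to its children."""
    children: list = field(default_factory=list)

    def execute(self, threshold):
        # Push the query down, then merge the children's partial aggregates.
        count, total = 0, 0
        for child in self.children:
            child_count, child_total = child.execute(threshold)
            count += child_count
            total += child_total
        return count, total


# A two-level tree: one root, two intermediate servers, four leaves.
root = Node([
    Node([Leaf([1, 5, 9]), Leaf([4, 8])]),
    Node([Leaf([7, 2]), Leaf([10])]),
])

count, total = root.execute(threshold=4)
```

Because count and sum are both mergeable, each level only passes a small partial result upward instead of the raw rows.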

These are good ideas and have been adopted by other systems. For example, Hive 0.13 has the ORC file format for columnar storage and can use Tez as the execution engine, which structures the computation as a directed acyclic graph. Both (along with other innovations) help a lot to improve the performance of Hive. However, the benchmark from Cloudera (the vendor of Impala) and the benchmark by AMPLab show that Impala still has the performance lead over Hive. It is well known that benchmarks are often biased by the hardware setting, software tweaks, the queries under test, etc. But it is still meaningful to find out which design choices and implementation details cause this performance difference, and doing so may help both communities improve their offerings in the future. What follows is a list of possible reasons:

As a native query engine, Impala avoids the startup overhead of MapReduce/Tez jobs. It is well known that MapReduce programs take some time before all nodes are running at full capacity. In Hive, every query suffers this “cold start” problem. In contrast, Impala daemon processes are started at boot time, and thus are always ready to execute a query.
Hadoop reuses JVM instances to partially reduce the startup overhead. However, this introduces another problem. The nodes in the Cloudera benchmark have 384 GB of memory. Such a big heap is a real challenge for the garbage collector of the reused JVM instances. Stop-the-world GC pauses may add high latency to queries. Impala, on the other hand, prefers such large memory.
Impala processes are multithreaded. Importantly, the scanning portion of plan fragments is multithreaded on SSD and also makes use of SSE4.2 instructions. The I/O and network systems are also highly multithreaded. Therefore, each Impala node runs more efficiently thanks to a high level of local parallelism.
Impala’s query execution is pipelined as much as possible. In the case of aggregation, the coordinator starts the final aggregation as soon as the pre-aggregation fragments have started to return results. In contrast, in MapReduce, sort and reduce can only start once all the mappers are done. Tez does not support pipelined execution yet.
MapReduce materializes all intermediate results. This enables better scalability and fault tolerance, but it also significantly slows down data processing. In contrast, Impala streams intermediate results between executors (at the cost of scalability, of course). Tez allows different types of input/output, including file, TCP, etc., but it seems that Hive does not yet use this feature to avoid unnecessary disk writes.
The reducer of MapReduce employs a pull model to get Map output partitions. For sorted output, Tez makes use of the MapReduce ShuffleHandler, which requires downstream Inputs to pull data over HTTP. With multiple reducers (or downstream Inputs) running simultaneously, it is highly likely that some of them will attempt to read from the same map node at the same time, inducing a large number of disk seeks and slowing the effective disk transfer rate.
Hive’s query expressions are generated at compile time, while Impala does run-time code generation for “big loops” using LLVM, which can produce more optimized code.
Tez allows complete control over the processing, e.g. stopping processing when limits are met. This is very useful for top-k calculation and straggler handling. Unfortunately, Hive does not currently use this feature. Incidentally, Dremel calculates approximate results for top-k and count-distinct using one-pass algorithms. It is not clear if Impala does the same.
During query execution, Dremel computes a histogram of tablet processing time. If a tablet takes a disproportionately long time to process, it is rescheduled to another server. If trading accuracy for speed is acceptable, Dremel can return the results before scanning all the data, which may reduce the response time significantly because a small fraction of the tablets often take a lot longer. It is not clear if Impala implements a similar mechanism, although straggler handling was stated on the roadmap.
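As a rough illustration of the straggler handling in the last item, the Python sketch below flags slow tablets and decides when a partial answer is good enough. The source does not say how Dremel decides what counts as "disproportionately long", so the median-times-factor rule, the function names, the 0.98 cutoff, and the sample timings are all assumptions for this sketch.

```python
from statistics import median


def find_stragglers(elapsed_seconds, factor=3.0):
    """Return ids of tablets whose processing time exceeds factor x the median.

    The median-times-factor rule is an assumption made for this sketch.
    """
    typical = median(elapsed_seconds.values())
    return sorted(
        tablet for tablet, secs in elapsed_seconds.items()
        if secs > factor * typical
    )


def can_return_early(done, total, min_fraction=0.98):
    """True once at least min_fraction of the tablets have been scanned."""
    return done / total >= min_fraction


# Hypothetical per-tablet processing times, in seconds.
elapsed = {"t1": 1.0, "t2": 1.2, "t3": 0.9, "t4": 6.5, "t5": 1.1}

# Tablets in this list would be candidates for rescheduling on another server.
stragglers = find_stragglers(elapsed)
```

With these timings only `t4` stands out. A scheduler could either re-dispatch it or, if approximate results are acceptable, answer the query once `can_return_early` is satisfied.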

As you can see, some of these reasons are actually about MapReduce or Tez. With the continuous improvement of MapReduce and Tez, Hive may avoid these problems in the future. Besides, the last two are features of Dremel, and it is not clear if Impala implements them.
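Among the MapReduce-related reasons, the difference between materializing and streaming intermediate results is perhaps the easiest to show in code. The Python sketch below is a single-process toy, not how either engine is implemented: the function names are hypothetical, a list stands in for the intermediate files MapReduce writes to disk, and the query is a simple count per key.

```python
from collections import Counter


def pre_aggregate(partition):
    """Yield partial (key, count) pairs for one partition."""
    for key, count in Counter(partition).items():
        yield key, count


def materialized_aggregate(partitions):
    """MapReduce style: stage every partial result first, then merge."""
    # The list below stands in for intermediate files written to disk.
    staged = [list(pre_aggregate(p)) for p in partitions]
    final = Counter()
    for partials in staged:
        for key, count in partials:
            final[key] += count
    return dict(final)


def pipelined_aggregate(partitions):
    """Streaming style: merge each partial result as soon as it arrives."""
    final = Counter()
    for p in partitions:
        for key, count in pre_aggregate(p):
            final[key] += count
    return dict(final)


# Hypothetical input: three partitions of keys.
partitions = [["a", "b", "a"], ["b", "c"], ["a"]]

materialized = materialized_aggregate(partitions)
pipelined = pipelined_aggregate(partitions)
```

Both functions return the same counts. The difference is that the pipelined version never holds the full set of partial results, which is where the latency saving comes from, and also why a failure midway is harder to recover from.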

In summary, Dremel and its derivatives provide an inexpensive way to do interactive big data analytics. The Hadoop ecosystem is now a real threat to traditional relational MPP data warehouse systems. The benchmark by AMPLab shows that Amazon Redshift (based on ParAccel by Actian) still has the performance lead over Impala, but the gap is small. With continuous improvements (e.g. both Hive and Impala are working on cost-based plan optimizers), we can expect SQL on Hadoop/HDFS to reach a higher level in the near future.

This blog was originally published at haifengl.wordpress.com/2015/01/06/big-data-analytics-tez/
