Looking for Apache HBase interview questions that employers frequently ask? This blog covers Apache HBase as part of our Hadoop Interview Questions series. I hope you have not missed the earlier blogs in the series.
After going through these HBase interview questions, you will have in-depth knowledge of the questions that employers frequently ask about HBase in Hadoop interviews. This will help you kickstart your career as a Big Data Engineer and become a Big Data certified professional.
If you have attended an HBase interview before, we encourage you to add your questions in the comments. We will be happy to answer them and spread the word to the community of fellow job seekers.
Now moving on, let us look at the Apache HBase interview questions.
The key components of HBase are ZooKeeper, RegionServer and HBase Master.
|Region Server||A table can be divided into several regions. A group of regions is served to the clients by a Region Server.|
|HMaster||It coordinates and manages the Region Servers (similar to how the NameNode manages DataNodes in HDFS).|
|ZooKeeper||ZooKeeper acts as a coordinator inside the HBase distributed environment. It helps maintain server state inside the cluster by communicating through sessions.|
The get() method is used to read data from a table.
Apache Hive is a data warehousing infrastructure built on top of Hadoop. It helps in querying data stored in HDFS for analysis using Hive Query Language (HQL), a SQL-like language that gets translated into MapReduce jobs. Hive performs batch processing on Hadoop.
Apache HBase is a NoSQL key/value store that runs on top of HDFS. Unlike Hive, HBase operations run in real time against its database rather than as MapReduce jobs. HBase partitions the tables, and the tables are further split into column families.
Hive and HBase are two different Hadoop-based technologies: Hive is an SQL-like engine that runs MapReduce jobs, and HBase is a NoSQL key/value database on Hadoop. We can use them together. Hive can be used for analytical queries while HBase is used for real-time querying. Data can even be read and written from HBase to Hive and vice versa.
HBase consists of:
Column Family is a collection of columns, whereas row is a collection of column families.
Standalone is the default mode of HBase. In standalone mode, HBase does not use HDFS (it uses the local filesystem instead), and it runs all HBase daemons and a local ZooKeeper in the same JVM process.
It is useful to modify, or extend, the behavior of a filter to gain additional control over the returned data. Filters of this type are known as decorating filters. They include SkipFilter and WhileMatchFilter.
A table can be divided into several regions. A group of regions is served to the clients by a Region Server.
Data Manipulation commands of HBase are:
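The following code is used to open an HBase connection; here, users is my HBase table. Note that this HTable constructor is deprecated since HBase 1.0, where obtaining a table through ConnectionFactory is preferred:
Configuration myConf = HBaseConfiguration.create();
HTable table = new HTable(myConf, "users");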
It is used to disable, drop and recreate the specified tables.
♣ Tip: To delete a table, first disable it, then drop it.
Once you issue a delete command in HBase for a cell, column or column family, the data is not deleted instantly. A tombstone marker is inserted instead. A tombstone is a special marker that is stored along with the standard data, and it hides all the deleted data.
The actual data is deleted at the time of major compaction. In a major compaction, HBase merges and rewrites the smaller HFiles of a region into a new HFile. In this process, data from the same column family is placed together in the new HFile, and deleted and expired cells are dropped. Until then, the results of every scan and get filter out the deleted cells.
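To make the tombstone idea concrete, here is a minimal sketch in plain Java. It is a toy model and not the HBase API: the class TombstoneSketch and all of its methods are hypothetical names chosen for this illustration, and it assumes a single column per row. It shows that a delete only appends a marker, that reads hide the masked cells, and that only the major compaction physically removes them.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of HBase deletes (hypothetical names, not the HBase API).
class TombstoneSketch {
    // One stored entry: a value, or a tombstone when value == null.
    static final class Cell {
        final String row;
        final long timestamp;
        final String value;

        Cell(String row, long timestamp, String value) {
            this.row = row;
            this.timestamp = timestamp;
            this.value = value;
        }

        boolean isTombstone() {
            return value == null;
        }
    }

    private final List<Cell> store = new ArrayList<>();

    void put(String row, long timestamp, String value) {
        store.add(new Cell(row, timestamp, value));
    }

    // A delete only appends a marker; existing cells stay in the store.
    void delete(String row, long timestamp) {
        store.add(new Cell(row, timestamp, null));
    }

    // True when a tombstone for the same row has an equal or newer timestamp.
    private boolean isMasked(Cell cell) {
        for (Cell other : store) {
            if (other.isTombstone() && other.row.equals(cell.row)
                    && other.timestamp >= cell.timestamp) {
                return true;
            }
        }
        return false;
    }

    // Reads return the newest cell that is not hidden by a tombstone.
    String get(String row) {
        Cell newest = null;
        for (Cell cell : store) {
            if (cell.isTombstone() || !cell.row.equals(row) || isMasked(cell)) {
                continue;
            }
            if (newest == null || cell.timestamp > newest.timestamp) {
                newest = cell;
            }
        }
        return newest == null ? null : newest.value;
    }

    // Major compaction rewrites the store without tombstones and masked cells.
    void majorCompact() {
        List<Cell> kept = new ArrayList<>();
        for (Cell cell : store) {
            if (!cell.isTombstone() && !isMasked(cell)) {
                kept.add(cell);
            }
        }
        store.clear();
        store.addAll(kept);
    }

    int physicalSize() {
        return store.size();
    }

    public static void main(String[] args) {
        TombstoneSketch table = new TombstoneSketch();
        table.put("r1", 1L, "a");
        table.delete("r1", 2L);
        System.out.println("visible value: " + table.get("r1"));
        System.out.println("entries before compaction: " + table.physicalSize());
        table.majorCompact();
        System.out.println("entries after compaction: " + table.physicalSize());
    }
}
```

Real HBase distinguishes several marker types and works per column family and version; the sketch collapses all of that into a single row-level marker.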
There are three types of tombstone markers in HBase:
The blocksize is configured per column family and the default value is 64 KB. This value can be changed as per requirements.
The ./bin/hbase shell command is used to run the HBase shell. Execute this command from the HBase directory.
The whoami command is used to show the current HBase user.
MSLAB stands for MemStore-Local Allocation Buffer. Whenever a request thread needs to insert data into a MemStore, it does not allocate the space for that data from the heap at large, but rather from a memory arena dedicated to the target region.
Lempel-Ziv-Oberhumer (LZO) is a lossless data compression algorithm that focuses on decompression speed.
HBase comes with a tool called hbck which is implemented by the HBaseFsck class. HBaseFsck (hbck) is a tool for checking for region consistency and table integrity problems and repairing a corrupted HBase. It works in two basic modes – a read-only inconsistency identifying mode and a multi-phase read-write repair mode.
REST stands for Representational State Transfer, which defines the semantics so that the protocol can be used in a generic way to address remote resources. It also provides support for different message formats, offering many choices for a client application to communicate with the server.
Apache Thrift is written in C++, but provides schema compilers for many programming languages, including Java, C++, Perl, PHP, Python, Ruby, and more.
Nagios is a very commonly used support tool for gaining qualitative data regarding cluster status. It polls current metrics on a regular basis and compares them with given thresholds.
ZooKeeper is used to maintain the configuration information and the communication between region servers and clients. It also provides distributed synchronization. It helps maintain server state inside the cluster by communicating through sessions.
Every Region Server, along with the HMaster server, sends a heartbeat to ZooKeeper at regular intervals, and ZooKeeper checks which servers are alive and available. It also provides server failure notifications so that recovery measures can be executed.
Catalog tables are used to maintain the metadata information.
HBase combines HFiles to reduce the storage and reduce the number of disk seeks needed for a read. This process is called compaction. Compaction chooses some HFiles from a region and combines them. There are two types of compactions.
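As a rough illustration of the idea, the sketch below merges several sorted in-memory maps, standing in for HFiles, into a single sorted map. The class CompactionSketch and its compact method are made-up names for this example and not part of HBase, and the sketch assumes the files are passed from oldest to newest so that the newest value for a row wins.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

// Toy compaction: merge several sorted "HFiles" into one (hypothetical names).
class CompactionSketch {
    // hfiles must be ordered oldest to newest; each map is row key -> value.
    static TreeMap<String, String> compact(List<TreeMap<String, String>> hfiles) {
        TreeMap<String, String> merged = new TreeMap<>();
        for (TreeMap<String, String> hfile : hfiles) {
            // Entries from newer files overwrite entries from older files.
            merged.putAll(hfile);
        }
        return merged;
    }

    public static void main(String[] args) {
        TreeMap<String, String> older = new TreeMap<>();
        older.put("row1", "a");
        older.put("row2", "b");
        TreeMap<String, String> newer = new TreeMap<>();
        newer.put("row2", "b2");
        newer.put("row3", "c");
        // One merged file means one place to look instead of two.
        System.out.println(compact(Arrays.asList(older, newer)));
    }
}
```

After the merge, a read for any row has to consult one file instead of several, which is where the saving in disk seeks comes from.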
HColumnDescriptor stores the information about a column family, such as compression settings, number of versions, etc. It is used as input when creating a table or adding a column family.
PageFilter accepts the page size as its parameter. It is an implementation of the Filter interface that limits results to a specific page size. It terminates scanning once the number of rows that passed the filter exceeds the given page size.
Syntax: PageFilter (<page_size>)
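The paging behavior can be sketched in plain Java without an HBase cluster. The example below is only a simulation under stated assumptions: PageScanSketch and scanPage are hypothetical names, and a sorted map stands in for a table. It scans rows in key order from a start row and stops as soon as one page is full.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy page-limited scan over a sorted table (hypothetical names, not the HBase API).
class PageScanSketch {
    // Returns at most pageSize row keys, starting at startRow (inclusive).
    static List<String> scanPage(NavigableMap<String, String> table, String startRow, int pageSize) {
        List<String> page = new ArrayList<>();
        for (String row : table.tailMap(startRow, true).keySet()) {
            if (page.size() >= pageSize) {
                break; // page is full, stop scanning
            }
            page.add(row);
        }
        return page;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> table = new TreeMap<>();
        for (int i = 1; i <= 5; i++) {
            table.put("row" + i, "value" + i);
        }
        System.out.println(scanPage(table, "row1", 2));
        System.out.println(scanPage(table, "row3", 2));
    }
}
```

A client would typically remember the last row of one page and start the next scan just after it.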
HBase schemas can be created or updated using the Apache HBase Shell or by using Admin in the Java API.
Creating table schema:
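import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf); // execute commands through admin
// Instantiating table descriptor class
HTableDescriptor t1 = new HTableDescriptor(TableName.valueOf("employee"));
// Adding column families to t1
t1.addFamily(new HColumnDescriptor("professional"));
t1.addFamily(new HColumnDescriptor("personal"));
// Create the table through admin
admin.createTable(t1);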
♣ Tip: Tables must be disabled when making ColumnFamily modifications.
String table = "myTable";
admin.disableTable(table);
admin.modifyColumn(table, cf2); // modifying existing ColumnFamily; cf2 is its HColumnDescriptor, defined elsewhere
admin.enableTable(table);
The filters that are supported by HBase are:
There are two broad strategies for performing HBase backups: backing up with a full cluster shutdown, and backing up on a live cluster. Each approach has benefits and limitations.
Full Shutdown Backup
Some environments can tolerate a periodic full shutdown of their HBase cluster, for example, if it is being used as a back-end process and not serving front-end webpages.
Live Cluster Backup
Environments that cannot tolerate downtime use a live cluster backup.
Failures are common in large distributed systems, and HBase is no exception.
If the server hosting a MemStore that has not yet been flushed crashes, the data that was in memory but not yet persisted is lost. HBase safeguards against that by writing to the WAL (write-ahead log) before the write completes. Every server that is part of the HBase cluster keeps a WAL to record changes as they happen. The WAL is a file on the underlying file system. A write isn't considered successful until the new WAL entry is successfully written. This guarantee makes HBase as durable as the file system backing it. Most of the time, HBase is backed by the Hadoop Distributed Filesystem (HDFS). If HBase goes down, the data that was not yet flushed from the MemStore to an HFile can be recovered by replaying the WAL.
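The write path described above can be sketched in a few lines of plain Java. This is a simplified model and not HBase code: WalSketch and its methods are hypothetical names, an in-memory list stands in for the durable WAL file, and a sorted map stands in for the MemStore.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy write-ahead log: log first, then apply to memory (hypothetical names).
class WalSketch {
    // Stands in for the durable log file on the underlying file system.
    private final List<String[]> wal = new ArrayList<>();
    // Stands in for the volatile, sorted MemStore.
    private Map<String, String> memStore = new TreeMap<>();

    void put(String row, String value) {
        wal.add(new String[] {row, value}); // 1. record the change in the WAL
        memStore.put(row, value);           // 2. only then apply it in memory
    }

    // Simulates a server crash: everything held in memory is lost.
    void crash() {
        memStore = new TreeMap<>();
    }

    // Recovery replays the WAL in order to rebuild the in-memory state.
    void recover() {
        for (String[] entry : wal) {
            memStore.put(entry[0], entry[1]);
        }
    }

    String get(String row) {
        return memStore.get(row);
    }

    public static void main(String[] args) {
        WalSketch server = new WalSketch();
        server.put("row1", "hello");
        server.crash();
        System.out.println("after crash: " + server.get("row1"));
        server.recover();
        System.out.println("after replay: " + server.get("row1"));
    }
}
```

Because the log entry is written before the in-memory update, every acknowledged write can be rebuilt after a crash by replaying the log from the start.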
A read goes through the following steps sequentially:
In addition to being a schema-less database, HBase is also versioned.
Every time you perform an operation on a cell, HBase implicitly stores a new version. Creating, modifying and deleting a cell are all treated identically: they all produce new versions. When a cell exceeds the maximum number of versions, the extra records are dropped during the major compaction.
Instead of deleting an entire cell, you can operate on a specific version within that cell. Values within a cell are versioned, and each version is identified by its timestamp. If a version is not specified, the current timestamp is used, so a read returns the most recent version. The default number of cell versions is three (in HBase 0.96 and later, the default is one).
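Here is a small sketch of how a versioned cell could behave, written in plain Java. The class VersionedCellSketch is a hypothetical model and not an HBase class; it keeps values keyed by timestamp, returns the newest one by default, and trims versions beyond the maximum only when majorCompact is called.

```java
import java.util.Comparator;
import java.util.TreeMap;

// Toy versioned cell: timestamp -> value, newest first (hypothetical names).
class VersionedCellSketch {
    private final int maxVersions;
    private final TreeMap<Long, String> versions =
            new TreeMap<>(Comparator.<Long>reverseOrder());

    VersionedCellSketch(int maxVersions) {
        this.maxVersions = maxVersions;
    }

    // Every write adds a new version instead of overwriting the old one.
    void put(long timestamp, String value) {
        versions.put(timestamp, value);
    }

    // Without an explicit version, the newest value is returned.
    String getLatest() {
        return versions.isEmpty() ? null : versions.firstEntry().getValue();
    }

    // A specific version is addressed by its timestamp.
    String get(long timestamp) {
        return versions.get(timestamp);
    }

    // Extra versions are dropped only at compaction time, oldest first.
    void majorCompact() {
        while (versions.size() > maxVersions) {
            versions.pollLastEntry();
        }
    }

    int storedVersions() {
        return versions.size();
    }

    public static void main(String[] args) {
        VersionedCellSketch cell = new VersionedCellSketch(3);
        for (long ts = 1; ts <= 4; ts++) {
            cell.put(ts, "v" + ts);
        }
        System.out.println("stored before compaction: " + cell.storedVersions());
        cell.majorCompact();
        System.out.println("stored after compaction: " + cell.storedVersions());
        System.out.println("latest: " + cell.getLatest());
    }
}
```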
HBase supports Bloom filters to improve the overall throughput of the cluster. An HBase Bloom filter is a space-efficient mechanism to test whether an HFile contains a specific row or row-col cell.
Without a Bloom filter, the only way to decide if a row key is present in an HFile is to check the HFile's block index, which stores the start row key of each block in the HFile. Many rows fall between two consecutive start keys, so HBase has to load the block and scan the block's keys to figure out if that row key actually exists.
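To show what such a membership test looks like, here is a minimal Bloom filter in plain Java. It is a generic sketch, not HBase's implementation: the class BloomFilterSketch, its sizes and its simple double-hashing scheme based on String.hashCode are assumptions made for this example. The key property is that a negative answer is always correct, so a file can be skipped without loading any block, while a positive answer only means the row may be present.

```java
import java.util.BitSet;

// Minimal Bloom filter over row keys (generic sketch, not HBase's implementation).
class BloomFilterSketch {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    BloomFilterSketch(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Derives the i-th bit position from two simple hashes of the key.
    private int index(String rowKey, int i) {
        int h1 = rowKey.hashCode();
        int h2 = (h1 >>> 16) | 1; // forced odd so successive probes differ
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String rowKey) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(rowKey, i));
        }
    }

    // false: the key was definitely never added. true: it may have been added.
    boolean mightContain(String rowKey) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(rowKey, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BloomFilterSketch filter = new BloomFilterSketch(1024, 3);
        filter.add("row-17");
        System.out.println("row-17 may be present: " + filter.mightContain("row-17"));
        System.out.println("row-99 may be present: " + filter.mightContain("row-99"));
    }
}
```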
I hope these Apache HBase Interview Questions were helpful for you. This is just a part of our Hadoop Interview Question series. Kindly refer to the links given below and enjoy the reading:
Got a question for us? Mention it in the comments section and we will get back to you.