First, understand Big Data and the challenges associated with it, so that you can see how Hadoop emerged as a solution to those problems. This What is Hadoop and Hadoop Tutorial blog will introduce you to that.
Then you should understand how the Hadoop architecture works with respect to HDFS, YARN and MapReduce.
Moving on, you should install Hadoop on your system so that you can start working with it. This will help you understand the practical aspects in detail.
Advancing ahead, take a deep dive into the Hadoop Ecosystem and learn the various tools inside it along with their functionalities, so that you learn how to create a tailored solution for your requirements.
Let us understand these in brief:
What is Big Data?
Big Data is a term used for a collection of data sets so large and complex that they are difficult to store and process using available database management tools or traditional data processing applications. The challenges include capturing, curating, storing, searching, sharing, transferring, analyzing and visualizing this data.
Hadoop & it’s architecture
For storage, Hadoop uses HDFS (Hadoop Distributed File System). The main components of HDFS are the NameNode and the DataNode.
NameNode
It is the master daemon that maintains and manages the DataNodes (slave nodes). It records the metadata of all the files stored in the cluster, e.g. the location of stored blocks, the size of the files, permissions, hierarchy, etc. It records each and every change that takes place in the file system metadata.
DataNode
These are slave daemons which run on each slave machine. The actual data is stored on the DataNodes. They are responsible for serving read and write requests from clients. They are also responsible for creating, deleting and replicating blocks based on the decisions taken by the NameNode.
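To see how the two daemons divide the work, here is a minimal sketch using the HDFS Java client API. The NameNode address and the file path are placeholder assumptions; the client asks the NameNode for metadata, while the file contents travel to and from the DataNodes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Point the client at the cluster; this NameNode address is an assumption.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        // Write a file: the NameNode records the metadata, while the
        // actual bytes are written to blocks on the DataNodes.
        Path file = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("Hello HDFS");
        }

        // Ask the NameNode for the metadata of the file we just wrote.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Size: " + status.getLen()
                + ", replication: " + status.getReplication());
    }
}
```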
For processing, we use YARN (Yet Another Resource Negotiator). The components of YARN are the ResourceManager and the NodeManager.
ResourceManager
It is a cluster-level component (one per cluster) and runs on the master machine. It manages resources and schedules applications running on top of YARN.
NodeManager
It is a node-level component (one on each node) and runs on each slave machine. It is responsible for managing containers and monitoring resource utilization in each container. It also keeps track of node health and log management. It continuously communicates with the ResourceManager to remain up to date.
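As an illustration of this relationship, the following sketch uses the YARN client API to ask the ResourceManager about the NodeManagers it is tracking. It assumes the Hadoop configuration on the classpath points at a running cluster.

```java
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class YarnNodesExample {
    public static void main(String[] args) throws Exception {
        // Connects to the ResourceManager configured in yarn-site.xml.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new Configuration());
        yarnClient.start();

        // The ResourceManager aggregates the heartbeats it receives
        // from every NodeManager into these node reports.
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId()
                    + " capability: " + node.getCapability());
        }
        yarnClient.stop();
    }
}
```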
So, you can perform parallel processing on HDFS using MapReduce.
MapReduce
It is the core processing component of the Hadoop Ecosystem, as it provides the logic of processing. In other words, MapReduce is a software framework which helps in writing applications that process large data sets using distributed and parallel algorithms inside the Hadoop environment. In a MapReduce program, Map() and Reduce() are the two functions. The Map function performs actions like filtering, grouping and sorting, while the Reduce function aggregates and summarizes the results produced by the Map function. The result generated by the Map function is a key-value pair (K, V), which acts as the input for the Reduce function.
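The classic example is word counting. Below is a minimal sketch of the two functions as Hadoop Mapper and Reducer classes: the Map side emits (word, 1) pairs, and the Reduce side sums them per word. Job setup and input/output paths are omitted for brevity.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map: emit a (word, 1) key-value pair for every word in the input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // the (K, V) pair fed to Reduce
            }
        }
    }

    // Reduce: aggregate the counts emitted by the Map function for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```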
Pig
Pig has two parts: Pig Latin, the language, and the Pig runtime, the execution environment. You can think of them as analogous to Java and the JVM: you write your data flows in Pig Latin, and the Pig runtime executes them on the cluster.
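As a small illustration of the language/runtime split, Pig also ships a Java API (PigServer) for running Pig Latin statements programmatically. The sketch below counts words in local mode; the input file name and output directory are placeholders.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        // The Pig runtime executes the Pig Latin statements; LOCAL mode is
        // used here so no cluster is required. The input path is hypothetical.
        PigServer pig = new PigServer(ExecType.LOCAL);
        pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE "
                + "FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
        pig.store("counts", "word_counts");  // writes the result out
    }
}
```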
Hive
Facebook created Hive for people who are fluent in SQL, making them feel at home while working in a Hadoop Ecosystem. Basically, Hive is a data warehousing component which performs reading, writing and managing large data sets in a distributed environment using a SQL-like interface.
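For example, you can talk to a running HiveServer2 instance over plain JDBC. The sketch below assumes the hive-jdbc driver is on the classpath and that a products table already exists; the server address, credentials and table name are all placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveExample {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver (needed on older JDBC setups).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // HiveServer2 address, database and credentials are assumptions.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "", "");
             Statement stmt = conn.createStatement()) {
            // Hive translates this SQL-like query into jobs on the cluster.
            ResultSet rs = stmt.executeQuery(
                    "SELECT category, COUNT(*) FROM products GROUP BY category");
            while (rs.next()) {
                System.out.println(rs.getString(1) + ": " + rs.getLong(2));
            }
        }
    }
}
```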
HBase
HBase is an open-source, non-relational, distributed database. In other words, it is a NoSQL database. It supports all types of data, which is why it is capable of handling anything and everything inside a Hadoop ecosystem. It is modelled after Google's BigTable, a distributed storage system designed to cope with large data sets.
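As a quick illustration, here is a minimal sketch using the HBase Java client that writes and reads one cell. It assumes a table named users with a column family info already exists; both names are placeholders.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Assumes a table 'users' with column family 'info' already exists.
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key -> column family:qualifier -> value.
            Put put = new Put(Bytes.toBytes("user1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Read the same cell back by row key.
            Result result = table.get(new Get(Bytes.toBytes("user1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}
```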
Hope this helps!