How to learn Big Data and the Hadoop ecosystem

0 votes
I have just started exploring Big Data technologies.

I'm new to the Hadoop framework.

But I'm getting confused by the many ecosystem components and frameworks. Could you please advise on a structured way to start learning?

Which is the main framework in this ecosystem?
Mar 27, 2018 in Big Data Hadoop by Ashish
• 2,650 points
984 views

1 answer to this question.

0 votes

First, understand Big Data and the challenges associated with it, so that you can see how Hadoop emerged as a solution to those problems. The What is Hadoop and Hadoop Tutorial blogs will introduce you to that.

Then you should understand how the Hadoop architecture works with respect to HDFS, YARN, and MapReduce.

Moving on, you should install Hadoop on your system so that you can start working with it. This will help you understand the practical aspects in detail.

After that, take a deep dive into the Hadoop ecosystem and learn the various tools inside it along with their functionalities, so that you can build a solution tailored to your requirements.

Let us go through these in brief:

What is Big Data?

Big Data is a term used for collections of data sets so large and complex that they are difficult to store and process using available database management tools or traditional data processing applications. The challenges include capturing, curating, storing, searching, sharing, transferring, analyzing, and visualizing this data.

Hadoop & its architecture

The main components of HDFS are the NameNode and the DataNodes.

NameNode

It is the master daemon that maintains and manages the DataNodes (slave nodes). It records the metadata of all the files stored in the cluster, e.g. the location of stored blocks, file sizes, permissions, hierarchy, etc. It records each and every change that takes place in the file system metadata.

DataNode

These are the slave daemons that run on each slave machine. The actual data is stored on the DataNodes. They are responsible for serving read and write requests from clients, and for creating, deleting, and replicating blocks based on decisions taken by the NameNode.
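To make this concrete, here is a minimal sketch (not from the original answer) of a client writing a file through the HDFS Java API. The NameNode resolves the path and records the metadata, while the bytes themselves are streamed to DataNodes as blocks. The fs.defaultFS address and the file path below are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; replace with your cluster's fs.defaultFS.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");

        FileSystem fs = FileSystem.get(conf);

        // The NameNode records the metadata for this path; the data
        // is stored as blocks on the DataNodes.
        Path file = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("Hello HDFS");
        }

        fs.close();
    }
}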

For processing, Hadoop uses YARN (Yet Another Resource Negotiator). The components of YARN are the ResourceManager and the NodeManager.

ResourceManager

It is a cluster-level component (one per cluster) and runs on the master machine. It manages resources and schedules applications running on top of YARN.

NodeManager

It is a node-level component (one on each node) and runs on each slave machine. It is responsible for managing containers and monitoring resource utilization in each container. It also keeps track of node health and log management, and continuously communicates with the ResourceManager to remain up to date.
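As a rough sketch of how these two daemons fit together, the YARN Java client API can ask the ResourceManager for reports about the NodeManagers it tracks. The ResourceManager address below is an assumption for a local setup.

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class YarnNodesExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed ResourceManager address for illustration.
        conf.set("yarn.resourcemanager.address", "localhost:8032");

        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // The ResourceManager aggregates the health and resource reports
        // that each NodeManager sends in its heartbeats.
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId() + " -> " + node.getCapability());
        }

        yarnClient.stop();
    }
}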

So, you can perform parallel processing on HDFS using MapReduce.

MapReduce

It is the core processing component of the Hadoop ecosystem, as it provides the logic of processing. In other words, MapReduce is a software framework that helps in writing applications that process large data sets using distributed and parallel algorithms inside the Hadoop environment. A MapReduce program has two functions, Map() and Reduce(). The Map function performs actions like filtering, grouping, and sorting, while the Reduce function aggregates and summarizes the results produced by the Map function. The result generated by the Map function is a set of key-value pairs (K, V), which acts as the input for the Reduce function.
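Here is a minimal word-count sketch of that Map/Reduce pattern, written against the standard Hadoop MapReduce Java API; the input and output paths are assumed to be passed on the command line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit a (word, 1) pair for every word in the input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts for each word key.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output paths are passed on the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

You would package this into a jar and submit it with hadoop jar wordcount.jar WordCount <input> <output>.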

Pig

Pig has two parts: Pig Latin, the language, and the Pig runtime, the execution environment. You can think of this as analogous to Java and the JVM.
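For a taste of what this looks like, here is a hedged sketch that embeds a few Pig Latin statements in Java through the PigServer API, running in local mode; the input file input.txt and its schema are assumptions.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        // Run Pig Latin statements from Java; LOCAL mode avoids needing a cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Assumed input file with one word per line (illustrative only).
        pig.registerQuery("words = LOAD 'input.txt' AS (word:chararray);");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");

        // The Pig runtime translates these statements into MapReduce jobs.
        pig.store("counts", "word_counts_out");

        pig.shutdown();
    }
}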

Hive

Facebook created Hive for people who are fluent in SQL, so Hive makes them feel at home while working in a Hadoop ecosystem. Basically, Hive is a data warehousing component that performs reading, writing, and managing of large data sets in a distributed environment using an SQL-like interface.
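As an illustration, Hive can be queried over JDBC through HiveServer2 using plain SQL-like statements. The connection URL and the users table below are assumptions for a local setup.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; the URL below assumes a local instance.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = con.createStatement()) {

            // Plain SQL-like statements against data stored in HDFS.
            stmt.execute("CREATE TABLE IF NOT EXISTS users (id INT, name STRING)");
            try (ResultSet rs = stmt.executeQuery("SELECT name FROM users LIMIT 10")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }
}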

HBase

HBase is an open-source, non-relational, distributed database; in other words, a NoSQL database. It supports all types of data, which is why it is capable of handling anything and everything inside a Hadoop ecosystem. It is modelled after Google's BigTable, a distributed storage system designed to cope with large data sets.
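To show the flavour of the HBase client API, here is a small Java sketch that writes and reads one cell; the table users and its column family info are assumed to exist already.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Connection settings come from hbase-site.xml on the classpath.
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             // Assumed pre-existing table 'users' with column family 'info'.
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write one cell keyed by row, column family, and qualifier.
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                          Bytes.toBytes("Ashish"));
            table.put(put);

            // Read it back by row key.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}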

Hope this helps

answered Mar 27, 2018 by kurt_cobain
• 9,350 points
