First, understand Big Data and the challenges associated with it, so that you can see how Hadoop emerged as a solution to those problems. This What is Hadoop and Hadoop Tutorial blog will introduce you to that.
Then you should understand how the Hadoop architecture works with respect to HDFS, YARN and MapReduce.
Moving on, you should install Hadoop on your system so that you can start working with it. This will help you understand the practical aspects in detail.
Advancing ahead, take a deep dive into the Hadoop Ecosystem and learn the various tools inside it along with their functionalities, so that you learn how to create a tailored solution for your requirements.
Let us understand these in brief:
What is Big Data?
Big Data is a term used for a collection of data sets so large and complex that they are difficult to store and process using available database management tools or traditional data processing applications. The challenges include capturing, curating, storing, searching, sharing, transferring, analyzing and visualizing this data.
Hadoop & it’s architecture
For storage, Hadoop uses HDFS (Hadoop Distributed File System). The main components of HDFS are the NameNode and the DataNode.
NameNode
It is the master daemon that maintains and manages the DataNodes (slave nodes). It records the metadata of all the files stored in the cluster, e.g. the location of stored blocks, the size of the files, permissions, hierarchy, etc. It records each and every change that takes place in the file system metadata.
DataNode
These are slave daemons which run on each slave machine. The actual data is stored on the DataNodes. They are responsible for serving read and write requests from clients. They are also responsible for creating, deleting and replicating blocks based on the decisions taken by the NameNode.
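To see how the two daemons divide the work, here is a minimal sketch using the HDFS Java client API. The NameNode address and the file path are placeholder assumptions; the client asks the NameNode for metadata, while the file contents travel to and from the DataNodes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Point the client at the cluster; this NameNode address is an assumption.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        // Write a file: the NameNode records the metadata, while the
        // actual bytes are written to blocks on the DataNodes.
        Path file = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("Hello HDFS");
        }

        // Ask the NameNode for the metadata of the file we just wrote.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Size: " + status.getLen()
                + ", replication: " + status.getReplication());
    }
}
```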
For processing, we use YARN (Yet Another Resource Negotiator). The components of YARN are the ResourceManager and the NodeManager.
ResourceManager
It is a cluster-level component (one per cluster) and runs on the master machine. It manages resources and schedules applications running on top of YARN.
NodeManager
It is a node-level component (one on each node) and runs on each slave machine. It is responsible for managing containers and monitoring resource utilization in each container. It also keeps track of node health and log management. It continuously communicates with the ResourceManager to remain up to date.
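As an illustration of this relationship, the following sketch uses the YARN client API to ask the ResourceManager about the NodeManagers it is tracking. It assumes the Hadoop configuration on the classpath points at a running cluster.

```java
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class YarnNodesExample {
    public static void main(String[] args) throws Exception {
        // Connects to the ResourceManager configured in yarn-site.xml.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new Configuration());
        yarnClient.start();

        // The ResourceManager aggregates the heartbeats it receives
        // from every NodeManager into these node reports.
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId()
                    + " capability: " + node.getCapability());
        }
        yarnClient.stop();
    }
}
```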
So, you can perform parallel processing on HDFS using MapReduce.
MapReduce
It is the core processing component of the Hadoop Ecosystem, as it provides the logic of processing. In other words, MapReduce is a software framework which helps in writing applications that process large data sets using distributed and parallel algorithms inside the Hadoop environment. In a MapReduce program, Map() and Reduce() are the two functions. The Map function performs actions like filtering, grouping and sorting, while the Reduce function aggregates and summarizes the results produced by the Map function. The result generated by the Map function is a key-value pair (K, V), which acts as the input for the Reduce function.
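The classic example is word counting. Below is a minimal sketch of the two functions as Hadoop Mapper and Reducer classes: the Map side emits (word, 1) pairs, and the Reduce side sums them per word. Job setup and input/output paths are omitted for brevity.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map: emit a (word, 1) key-value pair for every word in the input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // the (K, V) pair fed to Reduce
            }
        }
    }

    // Reduce: aggregate the counts emitted by the Map function for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```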
Pig
Pig has two parts: Pig Latin, the language, and the Pig runtime, the execution environment. You can think of them as analogous to Java and the JVM: you write your data flows in Pig Latin, and the Pig runtime executes them on the cluster.
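As a small illustration of the language/runtime split, Pig also ships a Java API (PigServer) for running Pig Latin statements programmatically. The sketch below counts words in local mode; the input file name and output directory are placeholders.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        // The Pig runtime executes the Pig Latin statements; LOCAL mode is
        // used here so no cluster is required. The input path is hypothetical.
        PigServer pig = new PigServer(ExecType.LOCAL);
        pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE "
                + "FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
        pig.store("counts", "word_counts");  // writes the result out
    }
}
```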
Hive
Facebook created Hive for people who are fluent in SQL, making them feel at home while working in a Hadoop Ecosystem. Basically, Hive is a data warehousing component which performs reading, writing and managing large data sets in a distributed environment using a SQL-like interface.
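For example, you can talk to a running HiveServer2 instance over plain JDBC. The sketch below assumes the hive-jdbc driver is on the classpath and that a products table already exists; the server address, credentials and table name are all placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveExample {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver (needed on older JDBC setups).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // HiveServer2 address, database and credentials are assumptions.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "", "");
             Statement stmt = conn.createStatement()) {
            // Hive translates this SQL-like query into jobs on the cluster.
            ResultSet rs = stmt.executeQuery(
                    "SELECT category, COUNT(*) FROM products GROUP BY category");
            while (rs.next()) {
                System.out.println(rs.getString(1) + ": " + rs.getLong(2));
            }
        }
    }
}
```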
HBase
HBase is an open-source, non-relational, distributed database. In other words, it is a NoSQL database. It supports all types of data, which is why it is capable of handling anything and everything inside a Hadoop ecosystem. It is modelled after Google's BigTable, a distributed storage system designed to cope with large data sets.
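As a quick illustration, here is a minimal sketch using the HBase Java client that writes and reads one cell. It assumes a table named users with a column family info already exists; both names are placeholders.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Assumes a table 'users' with column family 'info' already exists.
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key -> column family:qualifier -> value.
            Put put = new Put(Bytes.toBytes("user1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Read the same cell back by row key.
            Result result = table.get(new Get(Bytes.toBytes("user1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}
```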
Hope this helps!