Is fetching data with Apache Flume the same as web crawling?

0 votes
Hi team,

I would like to understand: when we fetch data from social media websites using Apache Flume, isn't that the same as web crawling?
Jul 11, 2019 in Apache Spark by Karan
918 views

1 answer to this question.

0 votes
A web crawler is a program or automated script that browses websites, mainly to create a copy of every visited page for later processing by a search engine, which indexes the downloaded pages to provide fast searches. Flume, by contrast, is a special-purpose tool designed to send data to HDFS and HBase. It has specific optimizations for HDFS and it integrates with Hadoop’s security.

Flume has a simple event-driven pipeline architecture with three important roles: Source, Channel and Sink.

-->Source defines where the data is coming from, for instance, a message queue or a file.

-->Sink defines the destination of the data pipelined from the various sources.

-->Channels are pipes which establish connections between sources and sinks.

Here, the source can be any API or repository, such as those provided by Twitter, Facebook, YouTube, etc., and the sink is wherever you want to store the data, such as HDFS or a Hive warehouse.
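
To make the three roles concrete, here is a minimal toy sketch in Python. It is not Flume's API or configuration: the function names `run_source` and `run_sink`, the event layout, and the list standing in for HDFS are all assumptions made purely for illustration.

```python
from queue import Queue

# Toy stand-ins for Flume's three roles; none of these names are Flume APIs.

def run_source(records, channel):
    """Source: wrap each incoming record as an event and push it into the channel."""
    for record in records:
        channel.put({"headers": {"origin": "demo"}, "body": record})

def run_sink(channel, destination):
    """Sink: drain the channel and write each event body to the destination."""
    while not channel.empty():
        event = channel.get()
        destination.append(event["body"])

channel = Queue(maxsize=100)   # Channel: the buffer connecting source and sink
destination = []               # hypothetical stand-in for HDFS or a Hive warehouse

run_source(["first post", "second post"], channel)
run_sink(channel, destination)
print(destination)
```

Note that nothing in this pipeline discovers new pages by following links. The source only hands over whatever it is given, which is the main difference from a crawler.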

The concepts behind the two technologies are quite similar, but they are used for different purposes: Flume only deals with all kinds of log data, whereas web scraping deals with data scraped from websites.
answered Jul 11, 2019 by Esha
