Is fetching data with Apache Flume the same as web crawling?

0 votes
Hi team,

I would like to understand: when we fetch data from social media websites using Apache Flume, isn't that the same as web crawling?
Jul 11 in Apache Spark by Karan
39 views

1 answer to this question.

0 votes
A web crawler is a program or automated script that browses websites, mainly to create a copy of every visited page for later processing by a search engine, which indexes the downloaded pages to provide fast searches. Flume, on the other hand, is a special-purpose tool designed to collect streaming event data (such as logs) and deliver it to stores like HDFS and HBase. It has specific optimizations for HDFS and it integrates with Hadoop's security.
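To make the crawling side concrete, here is a minimal sketch of the visit-and-follow-links loop. It is only an illustration: `crawl`, `fetch_links`, and the in-memory `SITE` dictionary are hypothetical names I made up, and a real crawler would download pages over HTTP and parse the links out of the HTML.

```python
from collections import deque


def crawl(start_url, fetch_links, max_pages=100):
    """Breadth-first crawl: visit a page, then queue every unseen link on it.

    fetch_links is a caller-supplied function (hypothetical here) that
    returns the list of links found on a given page.
    """
    visited = []                   # pages in the order they were visited
    seen = {start_url}             # pages already queued, to avoid revisiting
    frontier = deque([start_url])  # pages waiting to be visited

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# Assumed toy "website": each page maps to the links it contains.
SITE = {"/": ["/a", "/b"], "/a": ["/b", "/"], "/b": ["/c"], "/c": []}

pages = crawl("/", lambda url: SITE.get(url, []))
print(pages)
```

The point to notice is that the crawler discovers its own input by following links from page to page, which is exactly what Flume does not do.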

Flume has a simple event-driven pipeline architecture with three important roles: Source, Channel and Sink.

-->Source defines where the data is coming from, for instance, a message queue or a file.

-->Sink defines the destination of the data pipelined from various sources.

-->Channels are pipes which establish connections between sources and sinks.

Here, a source can be any API or repository, such as those provided by Twitter, Facebook, or YouTube, and a sink is wherever you want to store the data, such as HDFS or a Hive warehouse.
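The source, channel and sink roles described above can be sketched as a toy model. This is plain Python and not Flume's actual API or configuration format: `Channel`, `drain`, and `run_pipeline` are hypothetical names, and the list `stored` merely stands in for a destination such as HDFS.

```python
from collections import deque


class Channel:
    """Bounded FIFO buffer that sits between a source and a sink."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._events = deque()

    def put(self, event):
        """Accept an event from the source; return False if the buffer is full."""
        if len(self._events) >= self.capacity:
            return False
        self._events.append(event)
        return True

    def take(self):
        """Hand the oldest buffered event to the sink, or None if empty."""
        return self._events.popleft() if self._events else None


def drain(channel, stored):
    """Sink role: move every buffered event into the destination list."""
    event = channel.take()
    while event is not None:
        stored.append(event)
        event = channel.take()


def run_pipeline(incoming, capacity=2):
    """Source role: push incoming events into the channel, draining when full."""
    channel = Channel(capacity)
    stored = []  # stands in for HDFS or a Hive warehouse
    for event in incoming:
        if not channel.put(event):
            drain(channel, stored)  # channel full: let the sink catch up
            channel.put(event)
    drain(channel, stored)
    return stored


print(run_pipeline(["tweet-1", "tweet-2", "tweet-3"]))
```

Unlike the crawler, nothing here goes looking for data: events arrive from whatever the source is connected to, wait in the channel, and are written out by the sink in the same order.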

The concepts behind the two technologies are quite similar, since both collect data automatically, but they are used for different purposes: Flume deals with logs and other event data delivered by a source, whereas web scraping deals with content extracted from website pages.
answered Jul 11 by Esha
