Flume Twitter data file not generated in HDFS

Question

I am using Apache Hadoop on Windows 10. I have installed Flume and am trying to fetch tweets using the Twitter agent. The agent runs without errors and the tweets are shown in the command prompt, but I can't see any Twitter file generated in HDFS.

Please help me through this.


Sirajul · Answer 1 · Dec 2, 2019

It seems you have missed the source and sink bindings. Add these lines and then try again:

TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel


answered Dec 2, 2019 by Esha
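
The binding advice above can be checked mechanically. The following is a minimal Python sketch, not part of Flume: the helper names parse_properties and unbound_components are hypothetical, and it assumes the plain key = value layout used in this thread (no line continuations). It reports any source or sink that is not bound to a channel declared on the agent.

```python
def parse_properties(text):
    """Parse 'key = value' lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def unbound_components(props, agent):
    """Return the sources and sinks that lack a binding to a declared channel."""
    declared = set(props.get(f"{agent}.channels", "").split())
    problems = []
    for source in props.get(f"{agent}.sources", "").split():
        # A source may list several channels, separated by spaces.
        bound = set(props.get(f"{agent}.sources.{source}.channels", "").split())
        if not bound or not bound <= declared:
            problems.append(f"source {source}")
    for sink in props.get(f"{agent}.sinks", "").split():
        # A sink is bound to exactly one channel.
        channel = props.get(f"{agent}.sinks.{sink}.channel", "")
        if channel not in declared:
            problems.append(f"sink {sink}")
    return problems


# Sample config with the sink binding deliberately left out.
sample = """
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.channels = MemChannel
"""

print(unbound_components(parse_properties(sample), "TwitterAgent"))
```

With the sample above, only the sink is reported, because the line TwitterAgent.sinks.HDFS.channel = MemChannel is missing.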

I have the same problem and I already have this and it is still not working.

commented Aug 21, 2020 by anonymous

reshown Aug 21, 2020 by Sirajul

Hi,

Maybe your cluster is in safe mode, so check that first. If you still get the same result, paste your complete configuration here.

commented Aug 21, 2020 by MD
• 95,460 points

Hi, I checked and safe mode is off. I am working on Windows; do you think that could be the problem? This is my flume.conf file:

TwitterAgent.sources=Twitter
TwitterAgent.channels=MemChannel
TwitterAgent.sinks=HDFS

TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel

TwitterAgent.sources.Twitter.consumerKey =
TwitterAgent.sources.Twitter.consumerSecret =
TwitterAgent.sources.Twitter.accessToken =
TwitterAgent.sources.Twitter.accessTokenSecret =

TwitterAgent.sources.Twitter.keywords= big, data

TwitterAgent.sinks.HDFS.channel=MemChannel
TwitterAgent.sinks.HDFS.type=hdfs
TwitterAgent.sinks.HDFS.hdfs.path=hdfs://localhost:9000/flume
TwitterAgent.sinks.HDFS.hdfs.fileType=DataStream
TwitterAgent.sinks.HDFS.hdfs.writeformat=Text
TwitterAgent.sinks.HDFS.hdfs.batchSize= 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize=0
TwitterAgent.sinks.HDFS.hdfs.rollCount=10000
TwitterAgent.sinks.HDFS.hdfs.rollInterval=600

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity=10000
TwitterAgent.channels.MemChannel.transactionCapacity=1000

commented Aug 21, 2020 by anonymous

edited Aug 21, 2020 by MD

Hi,

Your configuration looks good. Check the permissions on the /flume folder in your HDFS cluster, and make sure your launch command matches the one below.

$ bin/flume-ng agent --conf ./conf/ -f conf/flume.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent

commented Aug 21, 2020 by MD

Hi,

I gave 777 permission to the folder.

I am using the command below because the one you posted gives me an error. This one runs, but it is still not storing data:

flume-ng agent --conf ./conf/ -f conf/flume.conf.txt -property "flume.root.logger=DEBUG,console" -n TwitterAgent

commented Aug 21, 2020 by anonymous

Hi,

I checked your configuration and found nothing wrong. Try the version below, which worked on my system earlier. It is essentially the same configuration, just rearranged.

# Naming the components on the current agent. 
TwitterAgent.sources = Twitter 
TwitterAgent.channels = MemChannel 
TwitterAgent.sinks = HDFS
  
# Describing/Configuring the source 
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.consumerKey = Your OAuth consumer key
TwitterAgent.sources.Twitter.consumerSecret = Your OAuth consumer secret 
TwitterAgent.sources.Twitter.accessToken = Your OAuth consumer key access token 
TwitterAgent.sources.Twitter.accessTokenSecret = Your OAuth consumer key access token secret 
TwitterAgent.sources.Twitter.keywords = tutorials point,java, bigdata, mapreduce, mahout, hbase, nosql
  
# Describing/Configuring the sink 

TwitterAgent.sinks.HDFS.type = hdfs 
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/user/Hadoop/twitter_data/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream 
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text 
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0 
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000 
 
# Describing/Configuring the channel 
TwitterAgent.channels.MemChannel.type = memory 
TwitterAgent.channels.MemChannel.capacity = 10000 
TwitterAgent.channels.MemChannel.transactionCapacity = 100
  
# Binding the source and sink to the channel 
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel

Now run the agent:

$ cd $FLUME_HOME
$ bin/flume-ng agent --conf ./conf/ -f conf/twitter.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent

commented Aug 21, 2020 by MD

Still not working :( but thank you for trying

commented Aug 21, 2020 by anonymous

It should work. Did you get any error or warning?

commented Aug 21, 2020 by MD

No, the tweets are being fetched but not stored.

commented Aug 22, 2020 by anonymous

It should work. Try one thing: store your tweets in the default directory of your HDFS cluster.

commented Aug 23, 2020 by MD

Which one is the default?

P.S. I switched to Ubuntu on a virtual machine and the same configuration works there.

commented Aug 25, 2020 by anonymous

Hi,

You can find the default directory name in your core-site.xml file. Yes, it should work on all systems. There is nothing wrong with your configuration.

commented Aug 25, 2020 by MD
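
To illustrate what comparing the sink path against the cluster's default filesystem might look like, here is a minimal Python sketch. It assumes the default filesystem is set through the fs.defaultFS property (fs.default.name on older releases) in core-site.xml. The inline XML and the helper name read_default_fs are illustrative and not taken from the poster's machine; the sink path is the hdfs.path value from the configuration posted earlier in the thread.

```python
import xml.etree.ElementTree as ET


def read_default_fs(core_site_xml):
    """Return the default filesystem URI from core-site.xml content, or None."""
    root = ET.fromstring(core_site_xml)
    for prop in root.findall("property"):
        name = prop.findtext("name", "").strip()
        if name in ("fs.defaultFS", "fs.default.name"):
            return prop.findtext("value", "").strip()
    return None


# Illustrative core-site.xml content (an assumption, not the poster's file).
sample = """<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>"""

default_fs = read_default_fs(sample)
sink_path = "hdfs://localhost:9000/flume"  # hdfs.path from the posted flume.conf

print("default filesystem:", default_fs)
print("sink path uses it:", sink_path.startswith(default_fs))
```

If the two do not agree (for example a different port), the sink would be writing somewhere other than the filesystem the hdfs commands are inspecting.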

Hi, I have a new error. When I run select * from tweets; in Hive, I get this error:

Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected character ('O' (code 79)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
at [Source: java.io.StringReader#5a82bc58; line: 1, column: 2]
Time taken: 1.698 seconds

Do you know how to solve it?

commented Aug 31, 2020 by anonymous

Hi,

As the error shows, there is a format problem: your dataset is stored in one format, but you are trying to read it as another. To read the data, you have to follow the format the files are actually in.

commented Sep 1, 2020 by MD

Hi,

I followed the tutorial and created the table with this command:

CREATE EXTERNAL TABLE tweets (
  id BIGINT,
  created_at STRING,
  source STRING,
  favorited BOOLEAN,
  retweet_count INT,
  retweeted_status STRUCT<
    text:STRING,
    userr:STRUCT<screen_name:STRING,name:STRING>>,
  entities STRUCT<
    urls:ARRAY<STRUCT<expanded_url:STRING>>,
    user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
    hashtags:ARRAY<STRUCT<text:STRING>>>,
  text STRING,
  userr STRUCT<
    screen_name:STRING,
    name:STRING,
    friends_count:INT,
    followers_count:INT,
    statuses_count:INT,
    verified:BOOLEAN,
    utc_offset:INT,
    time_zone:STRING>,
  in_reply_to_screen_name STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/user/root/flume/test';

My data should be in JSON format and I included the HCatalog JAR, so according to the tutorial it should work. Can you tell me how to follow the proper format?

commented Sep 1, 2020 by anonymous

Hi,

Right, your raw dataset is in JSON format. But when you created the table, you defined a schema. For example, you used BIGINT for the id field. So when you read your dataset, you have to use the same fields in the same order. Otherwise, it will not work.

commented Sep 1, 2020 by MD

Hi,

I fetched my data with Flume and stored it in HDFS, and then I just loaded the data from HDFS into the Hive table. So I don't understand: how should I determine the order in which my data is fetched?

commented Sep 2, 2020 by anonymous

Hi,

You defined a schema when you created the table with the command you pasted above. It works the same way as in SQL: we create a table with a schema and store our data in it, and when we read the data we have to follow that same schema. Please go through the tutorial you followed once more, and also try to analyze the dataset itself.

commented Sep 2, 2020 by MD

Hi,

This is the header of my HDFS file:

{"type":"record","name":"Doc","doc":"adoc","fields":[{"name":"id","type":"string"},{"name":"user_friends_count","type":["int","null"]},{"name":"user_location","type":["string","null"]},{"name":"user_description","type":["string","null"]},{"name":"user_statuses_count","type":["int","null"]},{"name":"user_followers_count","type":["int","null"]},{"name":"user_name","type":["string","null"]},{"name":"user_screen_name","type":["string","null"]},{"name":"created_at","type":["string","null"]},{"name":"text","type":["string","null"]},{"name":"retweet_count","type":["long","null"]},{"name":"retweeted","type":["boolean","null"]},{"name":"in_reply_to_user_id","type":["long","null"]},{"name":"source","type":["string","null"]},{"name":"in_reply_to_status_id","type":["long","null"]},{"name":"media_url_https","type":["string","null"]},{"name":"expanded_url","type":["string","null"]}]

I have tried everything and I can't find where the problem is. If you could check it and help me, I would be thankful!

commented Sep 2, 2020 by anonymous
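
One possible explanation, offered as an assumption rather than a confirmed diagnosis: the header above is an Avro record schema, and Avro container files begin with the four bytes Obj followed by 0x01, which would be consistent with the JSON parser failing on an unexpected 'O'. The Flume documentation describes org.apache.flume.source.twitter.TwitterSource as converting tweets to Avro. Below is a minimal Python sketch for checking the first bytes of a file copied out of HDFS (for example with hdfs dfs -get). The helper names and the sample bytes are hypothetical.

```python
import os
import tempfile

AVRO_MAGIC = b"Obj\x01"  # first four bytes of an Avro object container file


def sniff_format(head):
    """Guess the file format from its first bytes."""
    if head.startswith(AVRO_MAGIC):
        return "avro"
    if head.lstrip()[:1] in (b"{", b"["):
        return "json"
    return "unknown"


def sniff_file(path):
    """Read only the start of a local file and guess its format."""
    with open(path, "rb") as handle:
        return sniff_format(handle.read(64))


# Demo with a made-up local file standing in for one fetched from HDFS.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(AVRO_MAGIC + b"\x04\x14avro.codec")
    sample_path = tmp.name

print(sniff_file(sample_path))
os.remove(sample_path)
```

If the files do turn out to be Avro, a table declared with a JSON SerDe would not be able to read them. In that case either the table would need an Avro-aware definition or the agent would need a source that emits JSON.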

OK, let's go step by step. Are you able to create a table in Hive from your HDFS dataset? Also, please paste the link to the tutorial you followed.

commented Sep 2, 2020 by MD

Hi,

I am able to create the table and put data in it, but I can't run any query, and SELECT * FROM tweets; won't show me the data in the table.

I followed the Edureka tutorial: https://www.youtube.com/watch?v=1Ae4t0rjK3o&t=1633s

commented Sep 2, 2020 by anonymous

OK. Can you share the command you used to load your dataset into the Hive table?

commented Sep 2, 2020 by MD

LOAD DATA INPATH '/flume' INTO TABLE tweets;

commented Sep 2, 2020 by anonymous

I think this is the problem. Check the file names in HDFS. You should have multiple files, maybe something like /flume/flume.12345, so you have to give the file name. If that is not clear, go to the /flume folder in your HDFS, where you will find the file names, and paste a screenshot here.

commented Sep 3, 2020 by MD

I did that too. I used the file names, but it doesn't work either. Then I tried just /flume to see what would happen, and the result is the same.

commented Sep 3, 2020 by anonymous

It should work. OK, try loading the files from your local system instead.

commented Sep 3, 2020 by MD

I get the same error.

commented Sep 4, 2020 by anonymous

Hi,

I really don't know why you are getting this error. It should work. You are able to create a table but not able to load the data into it. OK, open your file from HDFS and compare its format with your table definition.

commented Sep 4, 2020 by MD

I matched the table format to the header format of the HDFS file and I still get the error.

commented Sep 5, 2020 by anonymous

Hi,

If it is still not working, try it on a Linux system and see whether you get the same error.

commented Sep 7, 2020 by MD

Hi,

Thank you, but I am already working on a Linux system and it is still not working.

commented Sep 10, 2020 by anonymous

Hi,

OK, I will try to reproduce your setup on my own system and get back to you.

commented Sep 11, 2020 by MD

Hi,

Did you try it?

commented Sep 14, 2020 by anonymous

Hi,

I tried creating a table in Hive and it works fine. In your case, there is nothing wrong with the steps, so I think the problem is with your Hive setup. Try creating a simple table and importing a simple text file from your local system or HDFS.