Flume Twitter data file not generated in HDFS

+1 vote

I am using Apache Hadoop on Windows 10. I have installed Flume and am trying to fetch tweets using the Twitter agent. The agent runs without errors and the tweets are shown in the command prompt, but I can't see any Twitter file generated in HDFS.

Please help me through this. 

Sep 26, 2019 in Big Data Hadoop by chandanarora
• 130 points
732 views
Did you manage to figure this out?
Did you try the below-given solution?
I have the same problem. I already have the source and sink bindings and it is still not working. Please help if you know the solution.


1 answer to this question.

0 votes

It seems you've missed binding the source and sink to the channel. Add these lines and try again:

TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel 
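
To confirm that the bindings are really present in the file the agent loads, they can be checked programmatically. The following is a minimal sketch and not part of Flume: `parse_properties` and `check_bindings` are hypothetical helper names, and the parser only handles simple `key = value` lines.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_bindings(props, agent):
    """Return a list of problems found in the source/sink channel bindings."""
    problems = []
    channels = set(props.get(f"{agent}.channels", "").split())
    # A source uses the plural key 'channels' and may list several channels.
    for source in props.get(f"{agent}.sources", "").split():
        bound = props.get(f"{agent}.sources.{source}.channels", "").split()
        if not bound:
            problems.append(f"source {source} has no 'channels' binding")
        for name in bound:
            if name not in channels:
                problems.append(f"source {source} uses undeclared channel {name}")
    # A sink uses the singular key 'channel' and reads from exactly one channel.
    for sink in props.get(f"{agent}.sinks", "").split():
        bound = props.get(f"{agent}.sinks.{sink}.channel", "")
        if not bound:
            problems.append(f"sink {sink} has no 'channel' binding")
        elif bound not in channels:
            problems.append(f"sink {sink} uses undeclared channel {bound}")
    return problems


# Hypothetical sample in which the sink binding has been left out.
SAMPLE = """
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.channels = MemChannel
"""
print(check_bindings(parse_properties(SAMPLE), "TwitterAgent"))
```

An empty list means every declared source and sink is bound to a declared channel. Note the asymmetry the check relies on: sources take `channels` (plural), sinks take `channel` (singular).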
answered Dec 2, 2019 by Esha
I have the same problem. I already have these bindings and it is still not working.

Hi,

Maybe your cluster is in safe mode, so check that first. If you still get the same output, paste your complete configuration here.
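
Safe mode can be queried with `hdfs dfsadmin -safemode get`. Below is a minimal sketch for wrapping that check in a script. It assumes `hdfs` is on the PATH and that the command prints a line containing `Safe mode is ON` or `Safe mode is OFF`; both function names are hypothetical.

```python
import subprocess


def safemode_is_on(report):
    """Interpret the text printed by `hdfs dfsadmin -safemode get`."""
    return "safe mode is on" in report.lower()


def query_safemode():
    """Ask the local Hadoop install whether the NameNode is in safe mode.

    Assumption: the `hdfs` launcher is on the PATH. On Windows the launcher
    is a .cmd script, so the first argument may need to be "hdfs.cmd".
    """
    result = subprocess.run(
        ["hdfs", "dfsadmin", "-safemode", "get"],
        capture_output=True,
        text=True,
        check=True,
    )
    return safemode_is_on(result.stdout)
```

If `query_safemode()` returns `True`, running `hdfs dfsadmin -safemode leave` asks the NameNode to leave safe mode.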

Hi, I checked and safe mode is off. I am working on Windows; could that be the problem? This is my flume.conf file:

TwitterAgent.sources=Twitter
TwitterAgent.channels=MemChannel
TwitterAgent.sinks=HDFS

TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel

TwitterAgent.sources.Twitter.consumerKey =
TwitterAgent.sources.Twitter.consumerSecret =
TwitterAgent.sources.Twitter.accessToken =
TwitterAgent.sources.Twitter.accessTokenSecret =

TwitterAgent.sources.Twitter.keywords= big, data

TwitterAgent.sinks.HDFS.channel=MemChannel
TwitterAgent.sinks.HDFS.type=hdfs
TwitterAgent.sinks.HDFS.hdfs.path=hdfs://localhost:9000/flume
TwitterAgent.sinks.HDFS.hdfs.fileType=DataStream
TwitterAgent.sinks.HDFS.hdfs.writeformat=Text
TwitterAgent.sinks.HDFS.hdfs.batchSize= 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize=0
TwitterAgent.sinks.HDFS.hdfs.rollCount=10000
TwitterAgent.sinks.HDFS.hdfs.rollInterval=600

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity=10000
TwitterAgent.channels.MemChannel.transactionCapacity=1000

Hi,

Your configuration looks good. Check the permissions of the /flume folder in your HDFS cluster, and make sure your launch command matches the one below.

$ bin/flume-ng agent --conf ./conf/ -f conf/flume.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent

Hi, 

I gave 777 permission to the folder.

I get an error when I use the command you posted, so I am using the one below. It runs, but it is still not storing data:

flume-ng agent --conf ./conf/ -f conf/flume.conf.txt -property "flume.root.logger=DEBUG,console" -n TwitterAgent

Hi, 

I checked your configuration and there is nothing wrong with it. Try the configuration below, which worked on my system earlier. It is essentially the same; I have only rearranged it.

# Naming the components on the current agent. 
TwitterAgent.sources = Twitter 
TwitterAgent.channels = MemChannel 
TwitterAgent.sinks = HDFS
  
# Describing/Configuring the source 
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.consumerKey = Your OAuth consumer key
TwitterAgent.sources.Twitter.consumerSecret = Your OAuth consumer secret 
TwitterAgent.sources.Twitter.accessToken = Your OAuth consumer key access token 
TwitterAgent.sources.Twitter.accessTokenSecret = Your OAuth consumer key access token secret 
TwitterAgent.sources.Twitter.keywords = tutorials point,java, bigdata, mapreduce, mahout, hbase, nosql
  
# Describing/Configuring the sink 

TwitterAgent.sinks.HDFS.type = hdfs 
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/user/Hadoop/twitter_data/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream 
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text 
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0 
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000 
 
# Describing/Configuring the channel 
TwitterAgent.channels.MemChannel.type = memory 
TwitterAgent.channels.MemChannel.capacity = 10000 
# Keep transactionCapacity at least as large as the sink's hdfs.batchSize (1000 above).
TwitterAgent.channels.MemChannel.transactionCapacity = 1000
  
# Binding the source and sink to the channel 
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel 

Now start the agent with this configuration:

$ cd $FLUME_HOME
$ bin/flume-ng agent --conf ./conf/ -f conf/twitter.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent
Still not working :( but thank you for trying
It should work. Did you get any error or warning?
No, the tweets are being fetched but not stored.

It should work. Try one thing: store your tweets in your default directory in the HDFS cluster.

Which one is the default?

P.S. I switched to Ubuntu on a virtual machine and the same configuration works there.

Hi,

You can find the default directory in your core-site.xml file. Yes, it should work on all systems; there is nothing wrong with your configuration.
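
For reference, core-site.xml holds the default file system URI under the `fs.defaultFS` property (`fs.default.name` in older releases), and relative HDFS paths normally resolve under `/user/<username>` on that file system. Below is a minimal sketch for reading the value. The function name `default_fs` and the sample XML are hypothetical; the sample value simply reuses the URI from this thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical core-site.xml content, reusing the URI that appears in this thread.
SAMPLE_CORE_SITE = """
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
"""


def default_fs(root):
    """Return the default file system URI from a parsed core-site.xml, or None."""
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        # fs.default.name is the older spelling of the same setting.
        if name in ("fs.defaultFS", "fs.default.name"):
            return (prop.findtext("value") or "").strip()
    return None


# For a real file, use: root = ET.parse(path_to_core_site).getroot()
root = ET.fromstring(SAMPLE_CORE_SITE.strip())
print(default_fs(root))
```

The sink's `hdfs.path` should point at the same host and port as this value.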

Hi, I have a new error. When I run select * from tweets; in Hive I get this error:

Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected character ('O' (code 79)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
at [Source: java.io.StringReader#5a82bc58; line: 1, column: 2]
Time taken: 1.698 seconds
Do you know how to solve it?
Hi,

As the error shows, there is a format problem: your dataset is stored in one format, but you are trying to read it as another. To read the data, the table has to match the format of the files.
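
One way to see which format the files are actually in is to look at their first bytes. Avro object container files begin with the bytes `Obj` followed by 0x01, and the 'O' (code 79) at column 2 in the error would be consistent with a JSON parser running into such a header, so that is one possibility worth ruling out. The following is a minimal sketch, assuming one of the files has been copied to local disk (for example with `hdfs dfs -get`); the function names are hypothetical.

```python
# Magic bytes at the start of every Avro object container file.
AVRO_MAGIC = b"Obj\x01"


def looks_like_avro(first_bytes):
    """Return True if the given leading bytes start with the Avro magic."""
    return first_bytes.startswith(AVRO_MAGIC)


def file_looks_like_avro(path):
    """Read only the first few bytes of a local file and test for the Avro magic."""
    with open(path, "rb") as handle:
        return looks_like_avro(handle.read(len(AVRO_MAGIC)))


# ord("O") is 79, the character code reported in the exception.
print(AVRO_MAGIC[0])
```

If the check returns `True`, the files are Avro rather than line-delimited JSON, and a JSON SerDe would not be able to read them as they are.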
Hi,

I followed the tutorial and created the table with this command:

CREATE EXTERNAL TABLE tweets (
  id BIGINT,
  created_at STRING,
  source STRING,
  favorited BOOLEAN,
  retweet_count INT,
  retweeted_status STRUCT<
    text:STRING,
    userr:STRUCT<screen_name:STRING,name:STRING>>,
  entities STRUCT<
    urls:ARRAY<STRUCT<expanded_url:STRING>>,
    user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
    hashtags:ARRAY<STRUCT<text:STRING>>>,
  text STRING,
  userr STRUCT<
    screen_name:STRING,
    name:STRING,
    friends_count:INT,
    followers_count:INT,
    statuses_count:INT,
    verified:BOOLEAN,
    utc_offset:INT,
    time_zone:STRING>,
  in_reply_to_screen_name STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/user/root/flume/test';

My data should be in JSON format and I included the HCatalog JAR, so according to the tutorial it should work just fine. Can you tell me how to follow the proper format?

Hi,

Right, your raw dataset is in JSON format. But when you created the table you defined a particular schema; for example, you used BIGINT for the id field. So when you read your dataset, it has to contain the same fields in the same order. Otherwise, it will not work.

Hi,

I fetched my data with Flume and stored it in HDFS, and then I just loaded the data from HDFS into the Hive table, so I don't understand how I should determine the order in which my data was fetched.

Hi,

You defined a schema when you created your table, as you pasted above. It is the same as working in SQL: we create a table with some schema and store our data in it, and when we need to read the data we have to follow that same schema. Go through the tutorial you followed once more, carefully, and also try to analyze the dataset.

Hi,

This is the header of my HDFS file:

{"type":"record","name":"Doc","doc":"adoc","fields":[{"name":"id","type":"string"},{"name":"user_friends_count","type":["int","null"]},{"name":"user_location","type":["string","null"]},{"name":"user_description","type":["string","null"]},{"name":"user_statuses_count","type":["int","null"]},{"name":"user_followers_count","type":["int","null"]},{"name":"user_name","type":["string","null"]},{"name":"user_screen_name","type":["string","null"]},{"name":"created_at","type":["string","null"]},{"name":"text","type":["string","null"]},{"name":"retweet_count","type":["long","null"]},{"name":"retweeted","type":["boolean","null"]},{"name":"in_reply_to_user_id","type":["long","null"]},{"name":"source","type":["string","null"]},{"name":"in_reply_to_status_id","type":["long","null"]},{"name":"media_url_https","type":["string","null"]},{"name":"expanded_url","type":["string","null"]}]

I have really tried everything and I can't find where the problem is. If you could check it out and help me, I would be thankful!
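
The header above has the shape of an Avro record schema with flat fields such as `user_name` and `user_screen_name`, while the table was declared with nested columns such as `userr` and `entities`. Below is a minimal sketch for listing the differences between the two. `SCHEMA` is a shortened, hypothetical copy of the header (with the closing brace added), and the function names are hypothetical as well.

```python
import json

# Shortened, hypothetical copy of the file header; only three fields are kept.
SCHEMA = """
{"type": "record", "name": "Doc", "fields": [
  {"name": "id", "type": "string"},
  {"name": "user_name", "type": ["string", "null"]},
  {"name": "text", "type": ["string", "null"]}
]}
"""

# Top-level column names taken from the CREATE EXTERNAL TABLE statement in this thread.
HIVE_COLUMNS = [
    "id", "created_at", "source", "favorited", "retweet_count",
    "retweeted_status", "entities", "text", "userr", "in_reply_to_screen_name",
]


def avro_field_names(schema_json):
    """Return the field names of an Avro record schema, in declaration order."""
    schema = json.loads(schema_json)
    return [field["name"] for field in schema["fields"]]


def compare_columns(schema_json, hive_columns):
    """Return (names only in the file schema, names only in the Hive table)."""
    fields = set(avro_field_names(schema_json))
    columns = set(hive_columns)
    return sorted(fields - columns), sorted(columns - fields)


only_in_file, only_in_table = compare_columns(SCHEMA, HIVE_COLUMNS)
print("only in file: ", only_in_file)
print("only in table:", only_in_table)
```

Names that appear on only one side are a sign that the table definition was written for a different tweet layout than the one the source actually produced.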
OK. Let's go step by step. Are you able to create a table in Hive from your HDFS dataset? Also, paste the link to the blog that you followed.

Hi,

I am able to create the table and put data in it, but I can't run any query, and SELECT * FROM tweets; won't show me the data in the table.

I followed the Edureka tutorial: https://www.youtube.com/watch?v=1Ae4t0rjK3o&t=1633s

OK. Can you share the command that you used to load your dataset into the Hive table?
LOAD DATA INPATH '/flume' INTO TABLE tweets;

I guess this is the problem. Check the file names in HDFS. You should have multiple files, maybe something like /flume/flume.12345, so you have to give the file name. If that is unclear, go to the /flume folder in HDFS, where you will find the file names, and paste a screenshot here.

I did that too. I used the file names and it doesn't work either. Then I tried just /flume to see what would happen, and the result is the same.

It should work. OK, try loading local files from your system and see what happens. You can refer to the blog below.

https://www.tutorialspoint.com/hive/hive_create_table.htm

I get the same error

Hi,

I really don't know why you are getting this error; it should work. You are able to create a table but not able to load the data into it. OK, open your file from HDFS and match its format against your table.

I matched the table format to the header format of the HDFS file and I still get the error.
Hi,

If it is still not working, try it on a Linux system and see whether you get the same error.
Hi,

Thank you, but I am already working on a Linux system and it is still not working.
Hi,

Ok, I will replicate your requirement in my own system and get back to you.
Hi,

Did you try it?

Hi,

I tried to create a table in Hive and it works fine. In your case, there is nothing wrong with the steps, but I think the problem is with Hive. Just create a simple table and try to import a simple text file from your local system or HDFS.
