What are the data quality checks we do in our real-time Big Data projects?

Example 1: How can we verify that the count of records loaded into HDFS matches the count in the source?
Example 2: How can we verify that the records loaded into HDFS are correct?
Sep 4 in Big Data Hadoop by Madhan

1 answer to this question.


You can use a checksum to compare the source file with the file uploaded to HDFS.

Try this (assuming the source file is on the local filesystem):

$ hdfs dfs -cat /file/in/hdfs | md5sum

$ cat /file/at/source | md5sum

If these two commands print the same hash, the file was not corrupted during the upload.
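
For the record-count check in Example 1, you can do the same comparison with line counts instead of hashes (this assumes a plain-text file where each record is one line):

$ hdfs dfs -cat /file/in/hdfs | wc -l

$ wc -l < /file/at/source

If both commands print the same number, all records were loaded.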

answered Sep 4 by Tina
Thanks for the help, but I am copying from a MySQL table to HDFS. In that scenario, if one record is corrupted, how can we know?

I'm not 100% sure, but I think you can use CRC32 checksums as follows:

For the MySQL table, use the command below to get the checksum:

CHECKSUM TABLE <tablename>;

And in HDFS, use this:

hdfs dfs -Ddfs.checksum.combine.mode=COMPOSITE_CRC -checksum /path/to/file
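
Note that CHECKSUM TABLE hashes MySQL's internal row representation, while the HDFS checksum covers the raw bytes of the exported file, so the two values usually won't be directly comparable; each is better suited to detecting changes on its own side. A simpler end-to-end completeness check (it won't catch corruption inside a row, but it will catch dropped or duplicated rows) is to compare row counts on both sides. Here is a minimal sketch, assuming a delimited-text export where one HDFS line equals one table row; the database, table, user, and path names are placeholders:

# Count rows in the MySQL table (placeholder connection details).
MYSQL_COUNT=$(mysql -N -u dbuser -p -e "SELECT COUNT(*) FROM mydb.mytable")

# Count lines across the files your loader wrote to HDFS.
HDFS_COUNT=$(hdfs dfs -cat /data/mytable/part-* | wc -l)

if [ "$MYSQL_COUNT" -eq "$HDFS_COUNT" ]; then
  echo "Row counts match: $MYSQL_COUNT"
else
  echo "Count mismatch: MySQL=$MYSQL_COUNT HDFS=$HDFS_COUNT"
fi

If you are copying the table with Sqoop, its --validate option performs this row-count comparison for you after the import:

sqoop import --connect jdbc:mysql://dbhost/mydb --table mytable --target-dir /data/mytable --validate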
