What is data modeling in Hadoop and how do you do it?

0 votes
I am currently bringing around 10 tables into Hadoop from an EDW (Enterprise Data Warehouse); these tables correspond closely to a star schema model. I'm using Sqoop to bring them all across, which results in 10 directories containing CSV files.

I'm looking for better ways to store these files before kicking off MR jobs. Should I follow some kind of model, or build an aggregate before working on MR jobs? Essentially, I'm looking for ways to store related data together.

Most of what I've found by searching covers storing trivial CSV files and reading them with opencsv. I'm looking for something a bit more involved, and not just for CSV files. If moving to another format works better than CSV, that's no problem.

It boils down to this: what is the best way to store a set of related data in HDFS so that MR jobs run well against it?
Sep 19, 2018 in Big Data Hadoop by Neha
• 6,180 points

1 answer to this question.

0 votes
I suggest spending some time with Apache Avro.

With Sqoop v1.3 and beyond you can import data from your relational data sources as Avro files, using a schema of your own design. What's nice about Avro is that it provides a lot of features in addition to being a serialization format.
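For example, here is a minimal sketch of such an import (the JDBC URL, credentials, table name, and target directory are placeholders, not taken from the question):

    # Import one star-schema table directly as Avro container files;
    # Sqoop derives the Avro schema from the table's column types.
    sqoop import \
        --connect jdbc:mysql://edw-host/edw \
        --username etl_user -P \
        --table SALES_FACT \
        --as-avrodatafile \
        --compression-codec org.apache.hadoop.io.compress.SnappyCodec \
        --target-dir /data/edw/sales_fact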

It stores the data and its schema in the same file, while remaining compact and efficient to serialize. It gives you versioning facilities, which are useful when bringing in updated data with a different schema. Hive supports it for both reading and writing, and MapReduce can use it seamlessly.
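As a rough sketch of wiring those files into Hive (the avro-tools version, paths, and table name are assumptions):

    # Pull the schema embedded in one of the imported files and
    # publish it to HDFS so Hive can reference it.
    hadoop jar avro-tools-1.8.2.jar getschema \
        /data/edw/sales_fact/part-m-00000.avro > sales_fact.avsc
    hdfs dfs -put sales_fact.avsc /data/edw/schemas/

    # Expose the same files to Hive through the Avro SerDe.
    hive -e "
    CREATE EXTERNAL TABLE sales_fact
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    LOCATION '/data/edw/sales_fact'
    TBLPROPERTIES ('avro.schema.url'='hdfs:///data/edw/schemas/sales_fact.avsc');"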

It can also be used as a generic interchange format between applications (not just Hadoop), making it an interesting option as a standard, cross-platform format for data exchange in your broader architecture.

That said, the plain CSV route also works fine.

Storing these files as CSV is fine, since you can process them with the text input format and read them through Hive by specifying the delimiter. You can change the delimiter from comma to pipe ("|") if you prefer; that's what I do most of the time. You also generally want large files in Hadoop, but if the data is large enough, it is worth partitioning the files into separate directories based on your partition column, with each partition amounting to a few hundred gigabytes.
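A minimal sketch of that setup, assuming hypothetical table and column names and a date-based partition column:

    # Import with a pipe delimiter into a date-partitioned directory
    # (connection string and table name are placeholders).
    sqoop import \
        --connect jdbc:mysql://edw-host/edw \
        --username etl_user -P \
        --table SALES_FACT \
        --fields-terminated-by '|' \
        --target-dir /data/edw/sales_fact_csv/sale_date=2018-09-01

    # Matching Hive table, partitioned by the same column.
    hive -e "
    CREATE EXTERNAL TABLE sales_fact_csv (
        id BIGINT, product_id INT, customer_id INT, amount DOUBLE)
    PARTITIONED BY (sale_date STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
    LOCATION '/data/edw/sales_fact_csv';
    ALTER TABLE sales_fact_csv ADD PARTITION (sale_date='2018-09-01');"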

It is also usually a better idea to keep most of the columns in a single, denormalized table than to have many small normalized tables, though that varies with your data size. And whenever you copy, move, or create data, make sure your applications do all the constraint checking, because it is difficult to make small changes to a table later on: even a small change means rewriting the complete file.
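As an illustration of that denormalization (table and column names are hypothetical, building on the sketches above; CTAS with Avro output assumes Hive 0.14 or later):

    # Flatten the star schema once, so MR jobs afterwards read a
    # single wide table instead of joining dimensions every time.
    hive -e "
    CREATE TABLE sales_flat STORED AS AVRO AS
    SELECT f.id, f.amount, f.sale_date,
           p.product_name, p.category,
           c.customer_name, c.region
    FROM   sales_fact_csv f
    JOIN   product_dim  p ON f.product_id  = p.product_id
    JOIN   customer_dim c ON f.customer_id = c.customer_id;"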
answered Sep 19, 2018 by Frankie
• 9,710 points

