How do you choose among the various file formats for storing and processing data with Apache Hadoop?

Can someone help me with this? I am new to Hadoop and am totally confused.
Sep 27, 2018 in Big Data Hadoop by shams

1 answer to this question.


The choice of a particular file format depends on the following factors:

  1. Schema evolution: whether fields can be added, altered, or renamed.
  2. Usage pattern: for example, reading 5 columns out of 50 versus reading most of the columns.
  3. Splittability: whether a file can be split and processed in parallel.
  4. Read, write, and transfer performance versus the storage space saved by block compression.

File formats that can be used with Hadoop include CSV, JSON, sequence files, Avro, and columnar formats such as Parquet.

CSV Files 

CSV files are a good fit for exchanging data between Hadoop and external systems. Avoid header and footer lines in CSV files: when a file is divided into splits for parallel processing, such a line lands inside one split and can be read as a data record.
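As a minimal sketch of the headerless approach, the Python snippet below writes CSV text with no header line and parses it in two pieces, roughly the way separate tasks would each read one split. The column names, sample rows, and helper functions are hypothetical and only serve the illustration; real jobs would read from HDFS rather than from an in-memory string.

```python
import csv
import io

# Hypothetical column names, kept outside the data file instead of in a header line.
COLUMNS = ["user_id", "country", "amount"]

def write_headerless_csv(rows):
    """Serialize rows as CSV text with no header or footer line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()

def read_split(split_text):
    """Parse one split of the file; every line in it is a data record."""
    reader = csv.reader(io.StringIO(split_text))
    return [dict(zip(COLUMNS, row)) for row in reader]

rows = [[1, "IN", 10.5], [2, "US", 7.0], [3, "DE", 3.25], [4, "IN", 1.0]]
text = write_headerless_csv(rows)

# Simulate two input splits by cutting the text at a line boundary.
lines = text.splitlines(keepends=True)
split_a, split_b = "".join(lines[:2]), "".join(lines[2:])

# Each split parses on its own because no line needs special treatment.
records = read_split(split_a) + read_split(split_b)
```

Because the column names live outside the file, both splits are handled identically. Note that the `csv` module returns every value as a string, so any type conversion is left to the reader.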

JSON Files

JSON files typically store one record per line. Each record carries its field names together with its data, which allows full schema evolution, and line-delimited files are splittable. However, JSON files do not support block-level compression.
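To illustrate the record-per-line layout, here is a small Python sketch using only the standard `json` module. The sample records and helper names are made up for the example; the second record has an extra field to show how a reader can tolerate records written at different times.

```python
import json

# Hypothetical sample records; the second one carries a newer, extra field.
records = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b", "email": "b@example.com"},
]

def to_json_lines(recs):
    """Write one self-describing JSON record per line."""
    return "".join(json.dumps(r) + "\n" for r in recs)

def parse_lines(text):
    """Parse each non-empty line independently of the others."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

text = to_json_lines(records)
parsed = parse_lines(text)

# Older records simply lack the new field, so the reader supplies a fallback.
emails = [r.get("email") for r in parsed]
```

Since every line parses on its own, a file can be cut at any line boundary and the pieces processed in parallel. The cost is that field names are repeated in every record.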

Avro Files

Avro is well suited to long-term storage of data with a schema. Avro files store the schema as metadata alongside the data, and also let you specify an independent schema for reading the files.
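The idea of reading with an independent schema can be sketched in plain Python. This is not the Avro library and does not read real Avro files; it only imitates the basic resolution rule (fields unknown to the reader are dropped, fields missing from the written data take the reader's default) on dictionaries. The schemas, the record, and the `resolve` helper are hypothetical.

```python
# Schemas in the JSON shape Avro uses for records; the field choices are hypothetical.
WRITER_SCHEMA = {"type": "record", "name": "User", "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "legacy_flag", "type": "boolean"},
]}

READER_SCHEMA = {"type": "record", "name": "User", "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": "string", "default": ""},
]}

def resolve(record, writer_schema, reader_schema):
    """Project a record written with writer_schema onto reader_schema."""
    written = {f["name"] for f in writer_schema["fields"]}
    out = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in written:
            out[name] = record[name]          # field exists in the stored data
        elif "default" in field:
            out[name] = field["default"]      # field added later: use the default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return out  # writer-only fields such as legacy_flag are dropped

stored = {"id": 7, "name": "shams", "legacy_flag": True}
resolved = resolve(stored, WRITER_SCHEMA, READER_SCHEMA)
```

In this sketch the old record stays readable after a field is added and another is retired, which is the property that makes a schema-carrying format attractive for long-term storage.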

Parquet Files

Parquet is a columnar file format that supports block-level compression and is optimized for query performance: a query can read only the columns it needs, for example 10 or fewer columns out of records with 50 or more.
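Why a columnar layout helps with narrow queries can be shown with a toy example in plain Python. This is only a sketch of the layout idea with made-up data, not Parquet itself: it pivots row records into per-column lists and counts how many values a 5-column query has to touch in each layout.

```python
NUM_COLUMNS = 50
NUM_ROWS = 4

# Hypothetical table: 4 rows of 50 integer columns named col0..col49.
rows = [
    {f"col{c}": r * NUM_COLUMNS + c for c in range(NUM_COLUMNS)}
    for r in range(NUM_ROWS)
]

def to_columnar(rows):
    """Pivot row-oriented records into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

def select(columns, wanted):
    """Return only the requested columns, leaving the others untouched."""
    return {name: columns[name] for name in wanted}

columns = to_columnar(rows)
wanted = ["col0", "col7", "col13", "col21", "col42"]
projection = select(columns, wanted)

# A row-oriented reader has to pass over every value of every row.
values_read_row_oriented = NUM_ROWS * NUM_COLUMNS
# A columnar reader only touches the values of the selected columns.
values_read_columnar = sum(len(v) for v in projection.values())
```

Here the columnar read touches 20 values against 200 for the row-oriented read, a ratio that matches the 5-of-50 column selection. Storing similar values together is also what tends to make per-column compression effective.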

answered Sep 27, 2018 by zombie


© 2018 Brain4ce Education Solutions Pvt. Ltd. All rights Reserved.