Apache Spark - Nested JSON array to flatten columns

–1 vote

Hello,
I have a JSON file that is nested and contains nested arrays. How can I use an Apache Spark Python script to flatten it into columns, so that I can load it through AWS Glue and query the data with AWS Athena or AWS Redshift?

Jan 2 in Big Data Hadoop by digger
• 26,550 points
1,644 views

1 answer to this question.

0 votes

It depends on the structure of your JSON file, but here is some code you can refer to. It flattens the document with plain Python and pandas:

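import json

import pandas as pd

# Load the nested JSON document from disk.
with open('user.txt') as f:
    json_data = json.load(f)


def flatten_json(y):
    """Flatten nested dicts and lists into one dict with underscore-joined keys."""
    out = {}

    def flatten(x, name=''):
        if isinstance(x, dict):
            # Descend into each key, extending the column-name prefix.
            for a in x:
                flatten(x[a], name + a + '_')
        elif isinstance(x, list):
            # List elements are keyed by their position in the list.
            for i, a in enumerate(x):
                flatten(a, name + str(i) + '_')
        else:
            # Leaf value: drop the trailing underscore from the key.
            out[name[:-1]] = x

    flatten(y)
    return out


flat = flatten_json(json_data)
# pd.json_normalize replaces pandas.io.json.json_normalize, deprecated since pandas 1.0.
dt = pd.json_normalize(flat)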

dt is your pandas DataFrame object containing the flattened JSON as a single row.
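To see which column names this kind of flattening produces, here is a minimal sketch that applies the same recursive idea to a small made-up record. The `flatten_record` helper, the `sep` parameter, and the `sample` data are all assumptions for illustration and are not taken from the question's actual file. It uses only the standard library, so it can be tried without pandas or Spark.

```python
def flatten_record(record, sep="_"):
    """Flatten nested dicts/lists into a single-level dict (hypothetical helper)."""
    out = {}

    def _walk(value, prefix):
        if isinstance(value, dict):
            # Each dict key extends the column-name path.
            for key, child in value.items():
                _walk(child, prefix + [str(key)])
        elif isinstance(value, list):
            # Each list element is addressed by its index.
            for index, child in enumerate(value):
                _walk(child, prefix + [str(index)])
        else:
            # Leaf value: join the path into one column name.
            out[sep.join(prefix)] = value

    _walk(record, [])
    return out


# Hypothetical sample record, invented for illustration only.
sample = {
    "id": 1,
    "user": {"name": "a", "tags": ["x", "y"]},
    "orders": [{"sku": "p1", "qty": 2}, {"sku": "p2", "qty": 1}],
}

flat_sample = flatten_record(sample)
for column, value in flat_sample.items():
    print(column, "=", value)
```

Note that every array element becomes its own column (orders_0_sku, orders_1_sku, and so on), so the number of columns grows with the longest array. If you would rather have one row per array element, that is a different approach from the one shown in this answer.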

answered Jan 2 by Omkar
• 67,660 points
