Apache Spark - Nested JSON array to flatten columns

Question

Hello,
I have a JSON document that is nested and contains nested arrays. How can I use an Apache Spark Python script to flatten it into columns, so that I can use it via AWS Glue and query the data with AWS Athena or AWS Redshift?

Omkar · Answer 1 · Jan 2, 2019

It depends on the structure of your JSON file, but here is some code you can refer to (it uses pandas rather than Spark):

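import json

import pandas as pd

# Load the nested JSON document from disk.
with open('user.txt') as f:
    json_data = json.load(f)


def flatten_json(y):
    """Flatten nested dicts and lists into one dict with '_'-joined keys."""
    out = {}

    def flatten(x, name=''):
        if isinstance(x, dict):
            # Descend into each key, extending the column-name prefix.
            for a in x:
                flatten(x[a], name + a + '_')
        elif isinstance(x, list):
            # Array elements are keyed by their position: 0, 1, 2, ...
            for i, a in enumerate(x):
                flatten(a, name + str(i) + '_')
        else:
            # Leaf value: drop the trailing '_' from the prefix.
            out[name[:-1]] = x

    flatten(y)
    return out


flat = flatten_json(json_data)
# pd.json_normalize replaces pandas.io.json.json_normalize,
# which is deprecated since pandas 1.0.
dt = pd.json_normalize(flat)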

dt is a one-row DataFrame containing the flattened JSON.
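To see what column names this approach produces, here is a minimal sketch that runs the same flatten_json logic on a small made-up record. The `sample` dictionary and its field names are hypothetical and only for illustration; the function is repeated so the snippet runs on its own, using only the standard library.

```python
def flatten_json(y):
    """Flatten nested dicts and lists into one dict with '_'-joined keys."""
    out = {}

    def flatten(x, name=''):
        if isinstance(x, dict):
            for key, value in x.items():
                flatten(value, name + key + '_')
        elif isinstance(x, list):
            for i, item in enumerate(x):
                flatten(item, name + str(i) + '_')
        else:
            out[name[:-1]] = x

    flatten(y)
    return out


# Hypothetical nested record with a nested object and two arrays.
sample = {
    "id": 1,
    "user": {"name": "alice", "tags": ["a", "b"]},
    "orders": [{"sku": "x1", "qty": 2}, {"sku": "y2", "qty": 1}],
}

flat = flatten_json(sample)
for column, value in flat.items():
    print(column, "=", value)
# id = 1
# user_name = alice
# user_tags_0 = a
# user_tags_1 = b
# orders_0_sku = x1
# orders_0_qty = 2
# orders_1_sku = y2
# orders_1_qty = 1
```

Each array element becomes its own set of columns (orders_0_..., orders_1_...), so the number of columns grows with the longest array. That may be fine for small, fixed-size arrays, but for variable-length arrays you would probably want one row per element instead.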

