Import zip files and process the Excel files (inside the zip files) using PySpark connected with pymongo

+2 votes

How can I import zip files and process the Excel files (inside the zip files) using PySpark connected with pymongo?

I have installed Spark, MongoDB, and Python to process the files (Excel, CSV, or JSON).

I used this code to connect PySpark with MongoDB:

from pyspark.sql import SparkSession

my_spark = SparkSession \
    .builder \
    .appName("myApp") \
    .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/test.coll") \
    .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/test.coll") \
    .getOrCreate()

But then I tried to import zip files (I don't want to have to open every file individually to process them).

Aug 10 in Python by Ahmed
• 310 points
43 views

1 answer to this question.

0 votes

I found this sample code:

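import io
import zipfile

def zip_extract(x):
    # x is a (path, content) pair from sc.binaryFiles; x[1] holds the raw zip bytes
    in_memory_data = io.BytesIO(x[1])
    with zipfile.ZipFile(in_memory_data, "r") as file_obj:
        # Map each member name to its uncompressed bytes
        return {name: file_obj.read(name) for name in file_obj.namelist()}

# sc is the SparkContext, e.g. my_spark.sparkContext with the session from the question
zips = sc.binaryFiles("dbfs:/mnt/vedant-demo/ONG/data/las_raw/D-Dfiles.zip")
files_data = zips.map(zip_extract)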

Check if this works. Source: https://gist.github.com/vedantja/bd74d0ba7c350dd348af1f92eadd0e76
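
Since the question is about Excel files specifically, here is a rough sketch of how the extraction step could keep only the Excel members of each archive. The names excel_members, build_demo_zip and EXCEL_SUFFIXES are made up for illustration and are not part of the gist; the demo zip is built in memory with placeholder bytes so the snippet can be tried without Spark.

```python
import io
import zipfile

# Assumption: these extensions are what counts as an "Excel file" here
EXCEL_SUFFIXES = (".xlsx", ".xls")

def excel_members(zip_bytes):
    """Return {member name: raw bytes} for the Excel files inside one zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes), "r") as archive:
        return {
            name: archive.read(name)
            for name in archive.namelist()
            if name.lower().endswith(EXCEL_SUFFIXES)
        }

def build_demo_zip():
    """Hypothetical helper: build a small in-memory zip for trying the function."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # Placeholder content, not a real workbook
        archive.writestr("reports/sales.xlsx", b"placeholder bytes")
        archive.writestr("notes.txt", b"ignored")
    return buffer.getvalue()

demo = excel_members(build_demo_zip())
print(sorted(demo))
```

With Spark, something like sc.binaryFiles(path).flatMap(lambda kv: excel_members(kv[1]).items()) should give (name, bytes) pairs. Each bytes value could then probably be parsed with pandas.read_excel(io.BytesIO(data)), assuming pandas and an Excel engine such as openpyxl are installed on the workers.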

answered Aug 19 by Reshma
