Primary keys in Apache Spark

Question

I have established a JDBC connection between Apache Spark and PostgreSQL. Now I want to insert data into my database. If I use append mode, I need to specify an ID for each DataFrame.Row. Is there any way for Spark to create primary keys?

ravikiran · Answer 1 · Sep 11, 2019

I found the following solution relatively straightforward when the behaviour of zipWithIndex() is what you want, i.e. when you need consecutive integers (starting at 0).

In this case, we're using PySpark and relying on a dictionary comprehension to map each original Row object to a new dictionary that fits a new schema, which includes the unique index.

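from pyspark.sql.types import LongType, StructField, StructType

# read the initial dataframe without index
dfNoIndex = sqlContext.read.parquet(dataframePath)
# Need to zip together with a unique integer

# First create a new schema with the index field ("uuid") prepended.
# LongType is used because zipWithIndex() can exceed the 32-bit integer range.
newSchema = StructType([StructField("uuid", LongType(), False)]
                       + dfNoIndex.schema.fields)
# zip with the index, then map each (row, index) pair to a dictionary which
# includes the new field; pair[0] is the Row and pair[1] is its index
# (Python 3 no longer allows tuple-unpacking lambdas or list + dict.items())
df = dfNoIndex.rdd.zipWithIndex()\
                      .map(lambda pair: {k: v
                                         for k, v
                                         in list(pair[0].asDict().items()) + [("uuid", pair[1])]})\
                      .toDF(newSchema)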
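
To see what the map step does to each record without needing a Spark cluster, here is a minimal plain-Python sketch of the same idea. The sample rows and the add_index helper are hypothetical, made up for illustration; zip() with range() stands in for zipWithIndex(), and each dictionary stands in for row.asDict().

```python
# Plain-Python stand-in for rdd.zipWithIndex().map(...); no Spark required.
# These sample rows are hypothetical and play the role of row.asDict().
rows = [
    {"name": "alice", "age": 30},
    {"name": "bob", "age": 25},
    {"name": "carol", "age": 41},
]


def add_index(pair):
    """Turn a (row_dict, index) pair into a dict that also carries the index."""
    row, idx = pair
    return {k: v for k, v in list(row.items()) + [("uuid", idx)]}


# zip(rows, range(...)) mimics zipWithIndex(): each row is paired with 0, 1, 2, ...
indexed = [add_index(pair) for pair in zip(rows, range(len(rows)))]

for record in indexed:
    print(record)
```

Note that the indices start at 0 on every run, so when appending to a table that already contains rows you would probably need to add an offset (for example, the current maximum ID) to avoid key collisions.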

Hope this helps