Primary keys in Apache Spark

I have established a JDBC connection between Apache Spark and PostgreSQL, and now I want to insert data into my database. If I use append mode, I need to specify an ID for each Row in the DataFrame. Is there any way for Spark to create primary keys?

Sep 11 in Apache Spark by nitinrawat895
• 10,760 points
89 views

1 answer to this question.

I found the following solution relatively straightforward when zipWithIndex() gives the desired behaviour, i.e. when you want consecutive integer IDs.

The example uses PySpark and a dictionary comprehension to map each original Row object to a new dictionary that fits a new schema including the unique index.

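from pyspark.sql.types import StructType, StructField, LongType

# Read the initial DataFrame, which has no index column
dfNoIndex = sqlContext.read.parquet(dataframePath)
# Each row needs to be zipped together with a unique integer

# First create a new schema with the "uuid" field prepended.
# LongType is used because indices can exceed the 32-bit range on large datasets.
newSchema = StructType([StructField("uuid", LongType(), False)]
                       + dfNoIndex.schema.fields)
# Zip with the index, then map each (row, index) pair to a dictionary that
# includes the new field. Python 3 does not allow "lambda (row, id): ...",
# and dict.items() must be turned into a list before concatenation.
df = dfNoIndex.rdd.zipWithIndex()\
                  .map(lambda pair: {k: v
                                     for k, v
                                     in list(pair[0].asDict().items()) + [("uuid", pair[1])]})\
                  .toDF(newSchema)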
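
To see what the mapping step does to each row, here is a minimal sketch on plain Python dictionaries, without Spark. The helpers zip_with_index and add_index are hypothetical stand-ins written for this illustration, not part of the PySpark API, and the start offset is an assumption added to show how new IDs could begin after the highest ID already in a table.

```python
def zip_with_index(rows, start=0):
    """Plain-Python stand-in for RDD.zipWithIndex(): pair each row with a
    consecutive integer. `start` is a hypothetical offset, not a Spark option."""
    return [(row, start + i) for i, row in enumerate(rows)]


def add_index(pair, field="uuid"):
    """Merge one (row, index) pair into a single dictionary."""
    row, idx = pair
    # Same dictionary comprehension as in the Spark version above
    return {k: v for k, v in list(row.items()) + [(field, idx)]}


rows = [{"name": "a"}, {"name": "b"}, {"name": "c"}]
indexed = [add_index(pair) for pair in zip_with_index(rows)]
print(indexed)
# [{'name': 'a', 'uuid': 0}, {'name': 'b', 'uuid': 1}, {'name': 'c', 'uuid': 2}]
```

When appending to a table that already contains rows, the same idea would apply with the offset set to one more than the current maximum ID, so that new keys do not collide with existing ones.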


Hope this helps.

answered Sep 11 by ravikiran
• 4,580 points
