Primary keys in Apache Spark

Question

I have a JDBC connection between Apache Spark and PostgreSQL and I want to insert some data into my database. When I use append mode I need to specify an id for each DataFrame.Row. Is there any way for Spark to create primary keys?

ravikiran · Answer

import sqlContext.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, LongType}

val df = sc.parallelize(Seq(
    ("a", -1.0), ("b", -2.0), ("c", -3.0))).toDF("foo", "bar")

Extract schema for further usage:

val schema = df.schema

Add id field:

val rows = df.rdd.zipWithUniqueId.map{
   case (r: Row, id: Long) => Row.fromSeq(id +: r.toSeq)}

Create DataFrame:

val dfWithPK = sqlContext.createDataFrame(
  rows, StructType(StructField("id", LongType, false) +: schema.fields))

The same thing in Python:

from pyspark.sql import Row
from pyspark.sql.types import StructField, StructType, LongType

row = Row("foo", "bar")

df = sc.parallelize([row("a", -1.0), row("b", -2.0), row("c", -3.0)]).toDF()

# Build the extended Row factory only after df exists, since it reads df.columns
row_with_index = Row(*["id"] + df.columns)

def make_row(columns):
    def _make_row(row, uid):
        row_dict = row.asDict()
        return row_with_index(*[uid] + [row_dict.get(c) for c in columns])
    return _make_row

f = make_row(df.columns)

df_with_pk = (df.rdd
    .zipWithUniqueId()
    .map(lambda x: f(*x))
    .toDF(StructType([StructField("id", LongType(), False)] + df.schema.fields)))

If you prefer consecutive numbers you can replace zipWithUniqueId with zipWithIndex, but it is a little more expensive.

Directly with the DataFrame API (universal across Scala, Python, Java, and R, with pretty much the same syntax):

Previously I missed the monotonically increasing id function (named monotonically_increasing_id in newer Spark releases), which should work just fine as long as you don't require consecutive numbers:

import org.apache.spark.sql.functions.monotonicallyIncreasingId

df.withColumn("id", monotonicallyIncreasingId).show()
// +---+----+-----------+
// |foo| bar|         id|
// +---+----+-----------+
// |  a|-1.0|17179869184|
// |  b|-2.0|42949672960|
// |  c|-3.0|60129542144|
// +---+----+-----------+

While useful, the monotonically increasing id is non-deterministic. Not only may the ids differ from execution to execution, but without additional tricks they cannot be used to identify rows when subsequent operations contain filters.

Note: It is also possible to use the rowNumber window function (named row_number in newer Spark releases):

from pyspark.sql.window import Window
from pyspark.sql.functions import rowNumber

w = Window().orderBy()
df.withColumn("id", rowNumber().over(w)).show()

Unfortunately:

WARN Window: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

So unless you have a natural way to partition your data and ensure uniqueness, it is not particularly useful at this moment.
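
To make the difference between the id schemes mentioned in this answer more concrete, here is a plain-Python sketch that needs no Spark installation. It is only a model, not Spark's implementation: the helper names (zip_with_unique_id, zip_with_index, decode_monotonic_id) and the sample partitions are made up for illustration. It assumes the documented behaviour that zipWithUniqueId gives the items of partition k the ids k, n+k, 2n+k, ... for n partitions, and that a monotonically increasing id stores the partition id in the upper 31 bits and the record number in the lower 33 bits.

```python
# Plain-Python model of the id schemes discussed above; no Spark needed.
# The partition contents below are made-up illustration data.

def zip_with_unique_id(partitions):
    """Model of zipWithUniqueId: item i of partition k gets id i * n + k."""
    n = len(partitions)
    return [
        (item, i * n + k)
        for k, part in enumerate(partitions)
        for i, item in enumerate(part)
    ]


def zip_with_index(partitions):
    """Model of zipWithIndex: consecutive ids across all partitions."""
    # The sizes of the earlier partitions must be known before the later
    # ones can be numbered; that extra pass is the additional cost.
    offsets, total = [], 0
    for part in partitions:
        offsets.append(total)
        total += len(part)
    return [
        (item, offsets[k] + i)
        for k, part in enumerate(partitions)
        for i, item in enumerate(part)
    ]


def decode_monotonic_id(value):
    """Split a monotonically increasing id into (partition id, record number)."""
    # Assumed layout: partition id in the upper 31 bits,
    # record number within the partition in the lower 33 bits.
    return value >> 33, value & ((1 << 33) - 1)


partitions = [["a", "b"], ["c"], ["d", "e", "f"]]
unique_ids = zip_with_unique_id(partitions)
index_ids = zip_with_index(partitions)

# The three ids shown in the output table earlier in this answer.
decoded = [decode_monotonic_id(v) for v in (17179869184, 42949672960, 60129542144)]

print(unique_ids)  # unique, but with gaps
print(index_ids)   # consecutive
print(decoded)     # (partition id, record number) pairs
```

Under these assumptions the three ids from the table decode to record 0 of partitions 2, 5 and 7. That is why the values are unique and increasing but far from consecutive, and why they change whenever the data is partitioned differently.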
