How to connect to Amazon Redshift from Apache Spark?


I'm trying to connect to Amazon Redshift from Spark, so I can combine the data I have on S3 with the data on our Redshift cluster. I found some documentation here on Spark's ability to connect via JDBC:

https://spark.apache.org/docs/1.3.1/sql-programming-guide.html#jdbc-to-other-databases

The load command seems fairly straightforward:

df = sqlContext.load(source="jdbc", url="jdbc:postgresql:dbserver", dbtable="schema.tablename")

But I'm not entirely sure how to deal with the SPARK_CLASSPATH variable. I'm running Spark locally for now through an IPython notebook (as part of the Spark distribution). Where do I define that variable so that Spark picks up the JDBC driver?

Any help or pointers to detailed tutorials are appreciated.

Aug 22, 2018 in AWS by datageek
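On the SPARK_CLASSPATH part of the question: a common approach (an assumption on my part, not from this thread) is to skip the environment variable and instead pass the JDBC driver jar on the command line when launching PySpark. A sketch, where the jar path and version are placeholders:

```shell
# Hedged sketch: the jar path/version below is a placeholder.
# Rather than exporting SPARK_CLASSPATH, hand the JDBC driver jar
# to the driver and executors at launch time:
pyspark --driver-class-path /path/to/postgresql-42.x.x.jar \
        --jars /path/to/postgresql-42.x.x.jar
```

With this, the driver class (e.g. `org.postgresql.Driver`) should be resolvable from inside the notebook session.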

1 answer to this question.


It turns out you just need a username/password to access Redshift from Spark. Using the Python API, it is done as follows:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.load(source="jdbc", 
                     url="jdbc:postgresql://host:port/dbserver?user=yourusername&password=secret", 
                     dbtable="schema.table")
Try it out. Hope it was helpful.
answered Aug 22, 2018 by Archana
Hi Archana,

I tried your code to connect PySpark to Redshift, but it gave me "AttributeError: type object 'SQLContext' has no attribute 'load'".

Thank you.
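The AttributeError above is expected on newer Spark: `SQLContext.load` was a Spark 1.x API and has since been removed in favor of the `DataFrameReader` interface. A sketch of the equivalent call on Spark 2.x+, where the host, database, and credentials are placeholders:

```python
# Sketch assuming Spark 2.x+ with a SparkSession named `spark` in
# scope and the Redshift/PostgreSQL JDBC driver on the classpath.
# All connection details below are placeholders.

def redshift_jdbc_url(host, port, database, user, password):
    """Build a Redshift JDBC URL from its parts (placeholder creds)."""
    return (f"jdbc:redshift://{host}:{port}/{database}"
            f"?user={user}&password={password}")

url = redshift_jdbc_url(
    "examplecluster.abc123.us-east-1.redshift.amazonaws.com",
    5439, "dev", "yourusername", "secret")

# The modern replacement for sqlContext.load(source="jdbc", ...):
# df = (spark.read
#       .format("jdbc")
#       .option("url", url)
#       .option("dbtable", "schema.table")
#       .load())
```

The `spark.read` lines are commented out here because they require a live cluster and driver jar; the URL-building helper is just an illustration of the connection string shape.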
