How to connect to Amazon Redshift from Apache Spark?


I'm trying to connect to Amazon Redshift from Spark, so I can combine the data I have on S3 with the data on our Redshift cluster. I found some documentation here on Spark's ability to connect via JDBC:

https://spark.apache.org/docs/1.3.1/sql-programming-guide.html#jdbc-to-other-databases

The load command seems fairly straightforward:

df = sqlContext.load(source="jdbc", url="jdbc:postgresql:dbserver", dbtable="schema.tablename")

But I'm not entirely sure how to deal with the SPARK_CLASSPATH variable. I'm running Spark locally for now through an IPython notebook (as part of the Spark distribution). Where do I define that variable so that Spark picks up the JDBC driver?

Any help or pointers to detailed tutorials are appreciated.

Aug 22, 2018 in AWS by datageek
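On the SPARK_CLASSPATH part of the question: a common approach (an assumption on my part, not from this thread) is to skip the environment variable and instead pass the JDBC driver jar on the command line when launching PySpark. A sketch, where the jar path and version are placeholders:

```shell
# Hedged sketch: the jar path/version below is a placeholder.
# Rather than exporting SPARK_CLASSPATH, hand the JDBC driver jar
# to the driver and executors at launch time:
pyspark --driver-class-path /path/to/postgresql-42.x.x.jar \
        --jars /path/to/postgresql-42.x.x.jar
```

With this, the driver class (e.g. `org.postgresql.Driver`) should be resolvable from inside the notebook session.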

1 answer to this question.


It turns out you just need a username/password to access Redshift from Spark. Using the Python API, it is done as follows:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.load(source="jdbc", 
                     url="jdbc:postgresql://host:port/dbserver?user=yourusername&password=secret", 
                     dbtable="schema.table")
Try it out. Hope it was helpful.
answered Aug 22, 2018 by Archana
Hi Archana,

I tried your code to connect PySpark to Redshift, but it gave me "AttributeError: type object 'SQLContext' has no attribute 'load'".

Thank you.
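The AttributeError above is expected on newer Spark: `SQLContext.load` was a Spark 1.x API and has since been removed in favor of the `DataFrameReader` interface. A sketch of the equivalent call on Spark 2.x+, where the host, database, and credentials are placeholders:

```python
# Sketch assuming Spark 2.x+ with a SparkSession named `spark` in
# scope and the Redshift/PostgreSQL JDBC driver on the classpath.
# All connection details below are placeholders.

def redshift_jdbc_url(host, port, database, user, password):
    """Build a Redshift JDBC URL from its parts (placeholder creds)."""
    return (f"jdbc:redshift://{host}:{port}/{database}"
            f"?user={user}&password={password}")

url = redshift_jdbc_url(
    "examplecluster.abc123.us-east-1.redshift.amazonaws.com",
    5439, "dev", "yourusername", "secret")

# The modern replacement for sqlContext.load(source="jdbc", ...):
# df = (spark.read
#       .format("jdbc")
#       .option("url", url)
#       .option("dbtable", "schema.table")
#       .load())
```

The `spark.read` lines are commented out here because they require a live cluster and driver jar; the URL-building helper is just an illustration of the connection string shape.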
