How to connect to Amazon Redshift in Apache Spark

0 votes

I'm trying to connect to Amazon Redshift via Spark so I can combine data that I have on S3 with data on our Redshift cluster. I found some documentation here on connecting over JDBC:

https://spark.apache.org/docs/1.3.1/sql-programming-guide.html#jdbc-to-other-databases

The load command seems fairly straightforward:

df = sqlContext.load(source="jdbc", url="jdbc:postgresql:dbserver", dbtable="schema.tablename")

But I'm not entirely sure how to deal with the SPARK_CLASSPATH variable. I'm running Spark locally for now through an IPython notebook (as part of the Spark distribution). Where do I define it so that Spark picks it up?

Any help or pointers to detailed tutorials are appreciated.

Aug 22, 2018 in AWS by datageek
• 2,530 points
7,103 views

1 answer to this question.

0 votes

It turns out you just need a username and password to access Redshift from Spark, passed as parameters in the JDBC URL. Using the Python API, it looks like this:

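from pyspark.sql import SQLContext

# sc is the SparkContext that the pyspark shell or notebook already provides
sqlContext = SQLContext(sc)
# Spark 1.3-style load; the credentials travel as JDBC URL parameters
df = sqlContext.load(source="jdbc",
                     url="jdbc:postgresql://host:port/dbserver?user=yourusername&password=secret",
                     dbtable="schema.table")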
Try it out. Hope it was helpful.
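
On the SPARK_CLASSPATH part of the question: the JDBC driver jar has to be visible to Spark before the SparkContext starts. One possible way to arrange that from Python, offered as a sketch and not the only option, is to set the PYSPARK_SUBMIT_ARGS environment variable before PySpark launches the JVM. The jar path and the helper name jdbc_submit_args below are hypothetical placeholders, and this assumes a notebook that creates its own SparkContext. If sc already exists when the notebook opens, the variable would have to be set in the shell before starting the notebook instead.

```python
import os


def jdbc_submit_args(jar_path):
    """Build a PYSPARK_SUBMIT_ARGS value that ships a JDBC driver jar
    to the executors and adds it to the driver classpath."""
    return "--jars {0} --driver-class-path {0} pyspark-shell".format(jar_path)


# Hypothetical location of the PostgreSQL JDBC driver jar (an assumption)
JAR_PATH = "/path/to/postgresql-jdbc.jar"

# Must run before the SparkContext is created, otherwise it has no effect
os.environ["PYSPARK_SUBMIT_ARGS"] = jdbc_submit_args(JAR_PATH)
```

After this cell, creating the SparkContext and SQLContext as in the snippet above should pick up the driver, provided the path points at a real jar.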
answered Aug 22, 2018 by Archana
• 4,170 points
Hi Archana,

I tried your code to connect PySpark to Redshift, but it gave me "AttributeError: type object 'SQLContext' has no attribute 'load'".

Thank you.
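
An error like this is consistent with running a newer Spark release: SQLContext.load was deprecated in Spark 1.4 and removed in Spark 2.0, where reads go through the DataFrameReader API (sqlContext.read or spark.read). A minimal sketch of the equivalent read follows. The helper names redshift_jdbc_options and load_redshift_table are made up for illustration, the connection values are placeholders, and it assumes the PostgreSQL JDBC driver is already on the classpath.

```python
def redshift_jdbc_options(host, port, database, table, user, password):
    """Collect the JDBC options for reading one table (PostgreSQL driver assumed)."""
    return {
        "url": "jdbc:postgresql://{0}:{1}/{2}".format(host, port, database),
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }


def load_redshift_table(reader, options):
    """Load a table through a DataFrameReader such as sqlContext.read or spark.read."""
    return reader.format("jdbc").options(**options).load()


# Placeholder connection values; 5439 is the default Redshift port
opts = redshift_jdbc_options("host", 5439, "dbserver", "schema.table",
                             "yourusername", "secret")

# With a live context (Spark 1.4 or later):
# df = load_redshift_table(sqlContext.read, opts)
```

Passing the user and password as options keeps them out of the URL string, though embedding them in the URL as in the answer above should work as well.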
