How to connect to Amazon Redshift from Apache Spark?

0 votes

I'm trying to connect to Amazon Redshift via Spark, so I can combine the data I have on S3 with the data on our Redshift cluster. I found some documentation here on Spark's ability to connect to other databases via JDBC:

https://spark.apache.org/docs/1.3.1/sql-programming-guide.html#jdbc-to-other-databases

The load command seems fairly straightforward:

df = sqlContext.load(source="jdbc", url="jdbc:postgresql:dbserver", dbtable="schema.tablename")

But I'm not entirely sure how to deal with the SPARK_CLASSPATH variable. I'm running Spark locally for now through an IPython notebook (launched as part of the Spark distribution). Where do I define that variable so that Spark picks up the JDBC driver?

Any help or pointers to detailed tutorials are appreciated.

Aug 22, 2018 in AWS by datageek
• 2,430 points
1,093 views

1 answer to this question.

0 votes

It turns out you just need a username and password to access Redshift from Spark, and it is done as follows (using the Python API):

from pyspark.sql import SQLContext

# sc is the SparkContext that the pyspark/IPython shell already provides
sqlContext = SQLContext(sc)

# URL format: jdbc:postgresql://<host>:<port>/<database>?user=...&password=...
df = sqlContext.load(source="jdbc",
                     url="jdbc:postgresql://host:port/dbname?user=yourusername&password=secret",
                     dbtable="schema.table")
Try it out. Hope it was helpful.
answered Aug 22, 2018 by Archana
• 4,090 points
