GCP Dataproc

+2 votes

How can I create a Google Cloud Storage bucket to use with my Google Cloud Dataproc cluster and copy a PySpark application to the bucket in my project? The PySpark app has been shared from a Cloud Storage bucket: gs://training/root.py. Can you please help me with the steps to take and, if possible, with pics?

Nov 26 by Deepthi
• 140 points
So, if I have understood correctly, you need a Cloud Storage bucket that can be used by your Google Cloud cluster, and you want to copy the PySpark application to that bucket?
Correct me if I am wrong.

Thank you!

1 answer to this question.

0 votes

Hey @Deepthi, you could do this:

Follow the steps below to prepare your project and run the code in this tutorial.

  1. Set up your project. If necessary, create a project with the Cloud Dataproc, Compute Engine, and Cloud Storage APIs enabled and the Cloud SDK installed on your local machine.

    • Select or create a GCP project.

    • Make sure that billing is enabled for your Google Cloud Platform project.

    • Enable the Cloud Dataproc, Compute Engine, and Cloud Storage APIs (a command-line sketch follows this list).

    • Install and initialize the Cloud SDK.
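
    If you prefer the command line, you can enable the APIs with gcloud. This is a minimal sketch, assuming the standard service names for these products; double-check them against your gcloud version:

      gcloud services enable dataproc.googleapis.com \
          compute.googleapis.com \
          storage-component.googleapis.com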

  2. Create a Cloud Storage bucket. You need a Cloud Storage bucket to hold the tutorial data. If you do not have one ready to use, create a new bucket in your project, either in the console (steps below) or with gsutil (see the sketch after these steps).

    1. In the GCP Console, go to the Cloud Storage Browser page.

    2. Click Create bucket.

    3. In the Create bucket dialog, specify the following attributes:

      • A unique bucket name.

      • A storage class.

      • A location where bucket data will be stored.

    4. Click Create.
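
    Alternatively, you can create the bucket from the command line with gsutil. A minimal sketch; the storage class and location below ("standard", "us-west1") are example values, not requirements:

      gsutil mb -c standard -l us-west1 gs://bucket-name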

  3. Set local environment variables. On your local machine, set your GCP project ID and the name of the Cloud Storage bucket you will use. Also provide the name and zone of an existing or new Cloud Dataproc cluster; you can create the cluster in the next step.

    PROJECT=project-id
    BUCKET_NAME=bucket-name
    CLUSTER=cluster-name
    ZONE=cluster-zone    # for example, "us-west1-a"

  4. Create a Cloud Dataproc cluster. Run the command below to create a single-node Cloud Dataproc cluster in the specified Compute Engine zone.

    gcloud dataproc clusters create $CLUSTER \
        --project=${PROJECT} \
        --zone=${ZONE} \
        --single-node
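
    To confirm the cluster is up, you can list the clusters in its region. A hedged sketch: the ${ZONE%-*} expansion, which strips the zone suffix to get the region (for example, "us-west1-a" becomes "us-west1"), is a shell convenience and not part of the tutorial:

      gcloud dataproc clusters list --region=${ZONE%-*}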
    
  5. Copy your PySpark application from the shared Cloud Storage bucket to your own Cloud Storage bucket.

    gsutil cp gs://training/root.py  gs://${BUCKET_NAME}
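
    To verify the copy and then run the application on your cluster, something like the following should work. The job-submission line is a hedged sketch; depending on your gcloud version you may also need to pass a --region flag:

      gsutil ls gs://${BUCKET_NAME}
      gcloud dataproc jobs submit pyspark gs://${BUCKET_NAME}/root.py \
          --cluster=${CLUSTER}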

For more info, refer to https://cloud.google.com/dataproc/docs/tutorials/gcs-connector-spark-tutorial

answered Nov 26 by Karan
• 6,650 points
