How to launch and configure an EMR cluster using boto

I'm trying to launch a cluster and run a job using only boto. I find lots of examples of creating job flows, but I can't for the life of me find an example that shows:

  1. How to define the cluster to be used (by cluster_id)
  2. How to configure and launch a cluster (for example, if I want to use spot instances for some task nodes)

Am I missing something?

Sep 12, 2018 in AWS by bug_seeker

1 answer to this question.

Boto and the underlying EMR API currently mix the terms cluster and job flow, and job flow is being deprecated. I treat them as synonyms.

You create a new cluster by calling the run_jobflow() method of a boto.emr.connection.EmrConnection object. It returns the cluster ID that EMR generates for you.

First, the mandatory setup:

#!/usr/bin/env python

import boto
import boto.emr
from boto.emr.instance_group import InstanceGroup

conn = boto.emr.connect_to_region('us-east-1')

Then we specify instance groups, including the spot price we want to pay for the TASK nodes:

instance_groups = []
instance_groups.append(InstanceGroup(
    num_instances=1,
    role="MASTER",
    type="m1.small",
    market="ON_DEMAND",
    name="Main node"))
instance_groups.append(InstanceGroup(
    num_instances=2,
    role="CORE",
    type="m1.small",
    market="ON_DEMAND",
    name="Worker nodes"))
instance_groups.append(InstanceGroup(
    num_instances=2,
    role="TASK",
    type="m1.small",
    market="SPOT",
    name="My cheap spot nodes",
    bidprice="0.002"))

Finally we start a new cluster:

cluster_id = conn.run_jobflow(
    "Name for my cluster",
    instance_groups=instance_groups,
    action_on_failure='TERMINATE_JOB_FLOW',
    keep_alive=True,
    enable_debugging=True,
    log_uri="s3://mybucket/logs/",
    hadoop_version=None,
    ami_version="2.4.9",
    steps=[],
    bootstrap_actions=[],
    ec2_keyname="my-ec2-key",
    visible_to_all_users=True,
    job_flow_role="EMR_EC2_DefaultRole",
    service_role="EMR_DefaultRole")
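
Because keep_alive=True, the cluster stays up after it has started, so it can help to wait until it has finished bootstrapping before using it. The following is only a minimal sketch, assuming that describe_jobflow() in your boto version returns an object with a state attribute (depending on the version, describe_cluster() may be the preferred call). The helper name wait_until_ready and its parameters are made up for illustration and are not part of boto:

```python
import time

# WAITING: cluster is idle with no step running; RUNNING: a step is executing.
READY_STATES = ("WAITING", "RUNNING")
# States the cluster will not leave again.
FINAL_STATES = ("COMPLETED", "TERMINATED", "FAILED")


def wait_until_ready(conn, cluster_id, poll_seconds=30, max_polls=60,
                     sleep=time.sleep):
    """Poll the cluster state and return the last state seen.

    Stops when the cluster is ready, has ended, or max_polls is reached.
    """
    state = None
    for _ in range(max_polls):
        # Assumption: the returned object exposes the state as .state
        state = conn.describe_jobflow(cluster_id).state
        if state in READY_STATES or state in FINAL_STATES:
            break
        sleep(poll_seconds)
    return state
```

With the conn and cluster_id from the snippets above, wait_until_ready(conn, cluster_id) should eventually return "WAITING" for a cluster started with no steps. When you are done with the cluster, conn.terminate_jobflow(cluster_id) shuts it down. Otherwise a keep-alive cluster keeps running and billing.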

We can also print the cluster ID if we care about that:

print("Starting cluster %s" % cluster_id)
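
To address the first point of the question (targeting a cluster by its ID): once you have the cluster ID, you can pass it to add_jobflow_steps() to queue work on that existing cluster instead of starting a new one. This is a rough sketch rather than a definitive recipe. The helper names submit_steps and example_streaming_step are hypothetical, and the S3 paths are placeholders:

```python
def submit_steps(conn, cluster_id, steps):
    """Queue steps on an existing cluster, addressed by its cluster ID."""
    # EMR cluster (job flow) IDs start with "j-".
    if not cluster_id.startswith("j-"):
        raise ValueError(
            "expected an EMR cluster ID such as 'j-...', got %r" % cluster_id)
    # add_jobflow_steps takes the ID plus a list of step objects.
    return conn.add_jobflow_steps(cluster_id, list(steps))


def example_streaming_step():
    """Build a sample Hadoop streaming step (all S3 paths are placeholders)."""
    from boto.emr.step import StreamingStep  # lazy import: needs boto installed
    return StreamingStep(
        name="Example word count",
        mapper="s3://mybucket/code/mapper.py",
        reducer="aggregate",
        input="s3://mybucket/input/",
        output="s3://mybucket/output/")
```

Using the conn and cluster_id from above, the call would be submit_steps(conn, cluster_id, [example_streaming_step()]). The same call works for any running cluster whose ID you already know, which is how you select the cluster a job runs on.
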
answered Sep 12, 2018 by Priyaj