AWS Data Pipeline Tutorial – A Data Workflow Orchestration Service

Last updated on May 22, 2019 • 5.8K Views
3 / 5 Blog from AWS Database Services


AWS Data Pipeline Tutorial

With advancements in technology & the ease of connectivity, the amount of data being generated is skyrocketing. Buried deep within this mountain of data is the “captive intelligence” that companies can use to expand and improve their business. Companies need to move, sort, filter, reformat, analyze, and report data in order to derive value from it. They may have to do this repetitively, and at a rapid pace, to remain competitive in the market. This is where Amazon's AWS Data Pipeline service comes in.

Let’s take a look at the topics covered in this AWS Data Pipeline tutorial:

Need for AWS Data Pipeline

Data is growing exponentially, and at an ever-increasing pace. Companies of all sizes are realizing that managing, processing, storing & migrating data has become more complicated & time-consuming than in the past. Listed below are some of the issues that companies face with ever-increasing data:

  • Huge volumes of data: There is a lot of raw & unprocessed data: log files, demographic data, data collected from sensors, transaction histories & a lot more.
  • Variety of formats: Data is available in multiple formats. Converting unstructured data into a compatible format is a complex & time-consuming task.
  • Different data stores: There is a wide variety of data storage options. Companies have their own data warehouses, cloud-based storage like Amazon S3, Amazon Relational Database Service (RDS) & database servers running on EC2 instances.
  • Time-consuming & costly: Managing bulk data is time-consuming & very expensive. A lot of money has to be spent on transforming, storing & processing data.

All these factors make it more complex & challenging for companies to manage data on their own. This is where AWS Data Pipeline can be useful. It makes it easier for users to integrate data that is spread across multiple AWS services and analyze it from a single location. So, through this AWS Data Pipeline Tutorial, let's explore Data Pipeline and its components.


What is AWS Data Pipeline?

AWS Data Pipeline is a web service that helps you reliably process and move data between different AWS compute and storage services, as well as on-premises data sources, at specified intervals.

With AWS Data Pipeline you can easily access data from the location where it is stored, transform & process it at scale, and efficiently transfer the results to AWS services such as Amazon S3, Amazon RDS, Amazon DynamoDB, and Amazon EMR. It allows you to create complex data processing workloads that are fault tolerant, repeatable, and highly available.  

Now why choose AWS Data Pipeline?

Benefits of AWS Data Pipeline


    • Provides a drag-and-drop console within the AWS interface
    • AWS Data Pipeline is built on a distributed, highly available infrastructure designed for fault tolerant execution of your activities
    • It provides a variety of features such as scheduling, dependency tracking, and error handling 
    • AWS Data Pipeline makes it equally easy to dispatch work to one machine or many, in serial or parallel
    • AWS Data Pipeline is inexpensive to use and is billed at a low monthly rate
    • Offers full control over the computational resources that execute your data pipeline logic


So, with benefits out of the way, let’s take a look at different components of AWS Data Pipeline & how they work together to manage your data.


Components of AWS Data Pipeline

AWS Data Pipeline is a web service that you can use to automate the movement and transformation of data. You can define data-driven workflows so that tasks can be dependent on the successful completion of previous tasks. You define the parameters of your data transformations and AWS Data Pipeline enforces the logic that you’ve set up.

 Fig 1: AWS Data Pipeline – AWS Data Pipeline Tutorial – Edureka

Basically, you always begin designing a pipeline by selecting its data nodes. Then Data Pipeline works with compute services to transform the data. A lot of extra data is often generated during this step, so optionally, you can add output data nodes where the results of transforming the data can be stored & accessed.

Data Nodes: In AWS Data Pipeline, a data node defines the location and type of data that a pipeline activity uses as input or output. It supports data nodes like:

  • DynamoDBDataNode
  • SqlDataNode
  • RedshiftDataNode
  • S3DataNode
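To make this concrete, here is a small sketch of how two data nodes might look in the JSON pipeline-definition syntax that the AWS CLI accepts. The table name and bucket path below are hypothetical placeholders, not values from this tutorial:

```python
import json

# Fragment of a pipeline definition (JSON syntax) showing two data nodes.
# The table name and bucket path are hypothetical placeholders.
pipeline_nodes = {
    "objects": [
        {
            "id": "SourceTable",
            "type": "DynamoDBDataNode",   # read input from a DynamoDB table
            "tableName": "customer-orders",
        },
        {
            "id": "OutputLocation",
            "type": "S3DataNode",         # write output to an S3 prefix
            "directoryPath": "s3://my-example-bucket/exports/",
        },
    ]
}

print(json.dumps(pipeline_nodes, indent=2))
```

Activities later in the definition refer to these nodes by their `id`.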

Now, let’s consider a real-time example to understand other components.

Use Case: Collect data from different data sources, perform Amazon Elastic MapReduce(EMR) analysis & generate weekly reports.


In this use case, we are designing a pipeline to extract data from data sources like Amazon S3 & DynamoDB to perform EMR analysis daily & generate weekly reports on data.

Now, the highlighted operations in that description (extract, analyze, generate) are called activities. Optionally, we can add preconditions that must be met before these activities run.

Activities: An activity is a pipeline component that defines the work to perform on a schedule, using a computational resource and, typically, input and output data nodes. Examples of activities are:

  • Moving data from one location to another
  • Running Hive queries
  • Generating Amazon EMR reports

Preconditions: A precondition is a pipeline component containing conditional statements that must be true before an activity can run.

  • Check whether source data is present before a pipeline activity attempts to copy it
  • Check whether a given database table exists
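As a sketch of how a precondition is wired to an activity in the JSON definition syntax, consider the fragment below. The S3 key and node ids are hypothetical placeholders; the point is only the `{"ref": ...}` link between the two objects:

```python
# Sketch of a precondition and an activity that depends on it.
# S3KeyExists succeeds only once the given key exists, so the copy
# activity will not start until the source file has landed.
input_ready = {
    "id": "InputReady",
    "type": "S3KeyExists",
    "s3Key": "s3://my-example-bucket/input/data.csv",  # placeholder key
}

copy_activity = {
    "id": "CopyData",
    "type": "CopyActivity",
    "precondition": {"ref": "InputReady"},  # waits on the precondition above
    "input": {"ref": "SourceNode"},         # placeholder data node ids
    "output": {"ref": "DestinationNode"},
}
```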

Resources: A resource is a computational resource that performs the work that a pipeline activity specifies.

  • An EC2 instance that performs the work defined by a pipeline activity
  • An Amazon EMR cluster that performs the work defined by a pipeline activity

Finally, we have a component called actions. 

Actions: Actions are steps that a pipeline component takes when certain events occur, such as success, failure, or late activities.

  • Send an SNS notification to a topic based on success, failure, or late activities
  • Trigger the cancellation of a pending or unfinished activity, resource, or data node 
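For example, a failure notification can be attached to an activity through an SnsAlarm object. In this sketch the topic ARN and ids are placeholders; `#{node.name}` is a Data Pipeline runtime expression expanded when the alarm fires:

```python
# An SnsAlarm action referenced by an activity's onFail field.
# The topic ARN below is a hypothetical placeholder.
failure_alarm = {
    "id": "FailureAlarm",
    "type": "SnsAlarm",
    "topicArn": "arn:aws:sns:us-east-1:123456789012:pipeline-alerts",
    "subject": "Pipeline activity failed",
    "message": "Activity #{node.name} failed.",  # expanded at run time
}

export_activity = {
    "id": "ExportJob",
    "type": "EmrActivity",
    "onFail": {"ref": "FailureAlarm"},  # fire the alarm if the activity fails
}
```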

Now that you have the basic idea of AWS Data Pipeline & its components, let’s see how it works. 

Demo on AWS Data Pipeline

In this demo part of the AWS Data Pipeline Tutorial, we are going to see how to copy the contents of a DynamoDB table to an S3 bucket. AWS Data Pipeline triggers an action to launch an EMR cluster with multiple EC2 instances (make sure to terminate them after you are done to avoid charges). The EMR cluster picks up the data from DynamoDB and writes it to the S3 bucket.
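Before clicking through the console, it may help to see roughly what the pipeline we are about to build looks like as a definition. The sketch below assembles one in Python; it loosely mirrors the idea behind AWS's built-in “Export DynamoDB table to S3” template, but the ids, roles, and field values here are illustrative assumptions, not the exact template:

```python
def build_export_definition(table_name, bucket):
    """Assemble an illustrative pipeline definition (JSON syntax) that
    exports a DynamoDB table to S3 using an EMR cluster."""
    return {
        "objects": [
            {   # Default object: settings inherited by every component
                "id": "Default",
                "scheduleType": "ondemand",
                "role": "DataPipelineDefaultRole",
                "resourceRole": "DataPipelineDefaultResourceRole",
            },
            {   # Resource: the EMR cluster that does the actual work
                "id": "ExportCluster",
                "type": "EmrCluster",
                "terminateAfter": "1 Hour",  # avoid stray EC2 charges
            },
            {   # Input data node: the DynamoDB table
                "id": "SourceTable",
                "type": "DynamoDBDataNode",
                "tableName": table_name,
            },
            {   # Output data node: where the export lands in S3
                "id": "BackupLocation",
                "type": "S3DataNode",
                "directoryPath": "s3://%s/backups/" % bucket,
            },
            {   # Activity: run the export on the cluster
                "id": "ExportActivity",
                "type": "EmrActivity",
                "runsOn": {"ref": "ExportCluster"},
                "input": {"ref": "SourceTable"},
                "output": {"ref": "BackupLocation"},
            },
        ]
    }

definition = build_export_definition("my-test-table", "my-example-bucket")
```

The console steps that follow are doing essentially this: defining data nodes, a resource, and an activity that ties them together.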

Creating an AWS Data Pipeline

Step 1: Create a DynamoDB table with sample test data.


Step 2: Create an S3 bucket for the DynamoDB table's data to be copied into.


Step 3: Access the AWS Data Pipeline console from your AWS Management Console & click on Get Started to create a data pipeline.


Step 4: Create the data pipeline. Give your pipeline a suitable name & an appropriate description. Specify the source & destination data node paths. Schedule your data pipeline & click on Activate.


Monitoring & Testing

Step 5: In the List Pipelines view, you can see the status as “WAITING FOR RUNNER”.


Step 6: After a few minutes, you can see that the status has changed to “RUNNING”. At this point, if you go to the EC2 console, you can see two new instances created automatically. This is the EMR cluster triggered by the pipeline.


Step 7: After the pipeline finishes, you can access the S3 bucket and check whether the .txt file was created. It contains the DynamoDB table's contents. Download it and open it in a text editor.

So, now you know how to use AWS Data Pipeline to export data from DynamoDB. Similarly, by reversing source & destination, you can import data from S3 into DynamoDB.

Go ahead and explore!

So this is it! I hope this AWS Data Pipeline Tutorial was informative and added value to your knowledge. If you are interested in taking your knowledge of Amazon Web Services to the next level, then enroll in the AWS Architect Certification Training course by Edureka.

Got a question for us? Please mention it in the comments section of “AWS Data Pipeline” and we will get back to you.
