Big Data in AWS | Getting started with Big Data in AWS

Big Data is no longer a new idea, and its effects are everywhere: in business, science, government, the arts, and beyond. AWS is a strong companion for processing and analyzing Big Data.

In this article, I am going to show how AWS tackles the challenges of Big Data. The sections below cover what Big Data and AWS are, why and how AWS handles Big Data, the AWS tools involved, and a demo.

What is Big Data?

You can think of Big Data as high-volume, high-velocity, and/or high-variety information assets that demand cost-effective, innovative forms of information processing to enable enhanced insight, decision making, and process automation.

Big Data is characterized by five important V’s, commonly listed as volume, velocity, variety, veracity, and value.

What is AWS?

AWS comprises many different cloud computing products and services. This highly profitable Amazon division provides servers, storage, networking, remote computing, email, and mobile development, along with security. Its two best-known products are EC2, Amazon’s virtual machine service, and S3, Amazon’s storage system. AWS is so large and so present in the computing world that it has been cited as at least 10 times the size of its nearest competitor, and it has hosted popular services such as Netflix and Instagram.

AWS is split into global regions (12 at the time of writing), each of which has multiple availability zones in which its servers are located. The regions are separated so that users can set geographical limits on their services, and so that data is more secure because the physical locations in which it is held are diversified.

Why Big Data in AWS?

Scientists, developers, and other technology enthusiasts from many different domains are taking advantage of AWS to perform big data analytics and meet the critical challenges of the increasing Vs of digital information. AWS offers you a portfolio of cloud computing services to help manage big data by significantly reducing costs, scaling to meet demand, and increasing the speed of innovation.

Amazon Web Services provides a fully integrated portfolio of cloud computing services that helps you build, secure, and deploy your big data applications. With AWS, you don’t need to procure hardware or maintain and scale infrastructure, so you can focus your resources on uncovering new insights. Since new features are added constantly, you’ll always be able to leverage the latest technologies without having to make long-term investment commitments.

How Can AWS Solve Big Data Challenges?

AWS solutions for Big Data

AWS has numerous solutions for all development and deployment purposes. Also, in the field of Data Science and Big Data, AWS has come up with recent developments in different aspects of Big Data handling. Before jumping to tools, let us understand different aspects of Big Data for which AWS can provide solutions.

Data Ingestion
Collecting the raw data — transactions, logs, mobile devices and more — is the first challenge many organizations face when dealing with big data. A good big data platform makes this step easier, allowing developers to ingest a wide variety of data — from structured to unstructured — at any speed — from real-time to batch.
Storage of Data
Any big data platform needs a secure, scalable, and durable repository to store data prior to or even after processing tasks. Depending on your specific requirements, you may also need temporary stores for data-in-transit.
Data Processing
This is the step where data is transformed from its raw state into a consumable format, usually by means of sorting, aggregating, joining, and even performing more advanced functions and algorithms. The resulting data sets are then stored for further processing or made available for consumption via business intelligence and data visualization tools.
Visualization
Big data is all about getting high value, actionable insights from your data assets. Ideally, data is available to stakeholders through self-service business intelligence and agile data visualization tools that allow for fast and easy exploration of datasets.

AWS Tools for Big Data

In the previous sections, we looked at the areas of Big Data where AWS can provide solutions. AWS has multiple tools and services in its arsenal that give customers Big Data capabilities.

Let us look at the solutions AWS provides for the different stages of handling Big Data.

Ingestion

Kinesis
Amazon Kinesis Firehose is a fully managed service for delivering real-time streaming data directly to Amazon S3. Kinesis Firehose automatically scales to match the volume and throughput of streaming data and requires no ongoing administration. You can configure Kinesis Firehose to transform streaming data before you store it in Amazon S3.
Snowball
You can use AWS Snowball to securely and efficiently migrate bulk data from on-premises storage platforms and Hadoop clusters to S3 buckets. After you create a job in the AWS Management Console, a Snowball appliance is automatically shipped to you. When the Snowball arrives, connect it to your local network, install the Snowball client on your on-premises data source, and then use the Snowball client to select and transfer the file directories to the Snowball device.
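
As a rough illustration of the streaming ingestion path described above, the sketch below shows how a single event might be sent to a Kinesis Firehose delivery stream. It assumes a boto3-style Firehose client is passed in (for example, one created with boto3.client("firehose")); the helper names build_firehose_record and send_event, and any stream name you pass, are hypothetical and not part of the original article.

```python
import json


def build_firehose_record(event: dict) -> dict:
    """Serialize one event as a newline-terminated JSON record.

    The trailing newline keeps records separable once Firehose
    concatenates them into objects in S3.
    """
    payload = json.dumps(event, separators=(",", ":")) + "\n"
    return {"Data": payload.encode("utf-8")}


def send_event(firehose_client, stream_name: str, event: dict) -> dict:
    """Send one event through a boto3-style Firehose client.

    firehose_client is assumed to expose put_record(), as the boto3
    Firehose client does; stream_name is a placeholder you supply.
    """
    return firehose_client.put_record(
        DeliveryStreamName=stream_name,
        Record=build_firehose_record(event),
    )
```

Passing the client in as a parameter keeps the sketch free of credentials and makes it easy to try out with a stub before pointing it at a real delivery stream.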

Storage

Amazon S3

Amazon S3 is a secure, highly scalable, durable object store with millisecond latency for data access. S3 can store any type of data from anywhere: websites and mobile apps, corporate applications, and data from IoT sensors or devices. It can also store and retrieve any amount of data with very high availability, and it is built from the ground up to deliver 99.999999999% (11 nines) of durability.
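
To get a feel for what 11 nines of durability means in practice, here is a small worked calculation. It assumes the figure is read as an annual, per-object durability and that objects are lost independently; the function names are illustrative only.

```python
from decimal import Decimal

# 99.999999999% durability, taken from the text above and interpreted
# (as an assumption) as the yearly survival probability of one object.
ANNUAL_DURABILITY = Decimal("0.99999999999")


def expected_annual_losses(object_count: int) -> Decimal:
    """Expected number of objects lost per year for a given object count."""
    return (Decimal(1) - ANNUAL_DURABILITY) * Decimal(object_count)


def years_per_lost_object(object_count: int) -> Decimal:
    """Average number of years between single-object losses."""
    return Decimal(1) / expected_annual_losses(object_count)
```

Under these assumptions, storing 10,000,000 objects gives an expected loss of 0.0001 objects per year, or roughly one object every 10,000 years.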

AWS Glue

AWS Glue is a fully managed service that provides a data catalog to make data in the data lake discoverable. Additionally, it has the ability to do extract, transform, and load (ETL) to prepare data for analysis. Also, the inbuilt data catalog is like a persistent metadata store for all data assets, making all of the data searchable, and queryable in a single view.

Processing

EMR
For big data processing using Spark and Hadoop, Amazon EMR provides a managed service that makes it easy, fast, and cost-effective to process vast amounts of data. Furthermore, EMR supports 19 different open-source projects, including Hadoop and Spark. It also comes with managed EMR Notebooks for data engineering, data science development, and collaboration.
Redshift
For data warehousing, Amazon Redshift provides the ability to run complex analytic queries against petabytes of structured data. It also includes Redshift Spectrum, which runs SQL queries directly against exabytes of structured or unstructured data in S3 without the need for data movement.

Visualizations

Amazon QuickSight
For dashboards and visualizations, Amazon QuickSight provides a fast, cloud-powered business analytics service. It makes it easy to build stunning visualizations and rich dashboards, which you can access from any browser or mobile device.

Demo – Analyzing Data of Endangered Species of Plants and Animals in Australia.

In this Demo, we will use sample data of endangered plant and animal species from the states and territories of Australia. Here we will create an EMR cluster and configure it to run multi-step Apache Hive jobs. The EMR cluster will have Apache Hive installed in it. This cluster will use EMRFS as the file system, so that its data input and output locations are mapped to an S3 bucket. The cluster will also use the same S3 bucket for storing log files.

We will now create a number of EMR steps in the cluster to process a sample set of data. Each of these steps will run a Hive script, and the final output will be saved to the S3 bucket. These steps will generate MapReduce logs because Hive commands are translated to MapReduce jobs at run time. The log files for each step are aggregated from the containers it spawns.

Sample Data

The sample data set for this use case is publicly available from the Australian government’s open data website. This data set is about threatened animal and plant species from different states and territories in Australia.

Processing Steps

The first EMR job step creates a Hive table as a schema for the underlying source file in S3. The second job step runs a query against the data. Similarly, the third and fourth steps each run a further query.

We will repeat these four steps a few times in an hour, simulating successive runs of a multi-step batch job. In a real-life scenario, the time between batch runs would normally be much longer; the short gap between successive runs here is intended to speed up our testing.

S3 Bucket and Folders

Before creating our EMR cluster, we had to create an S3 bucket to host its files. In our example, we name this bucket “arvind1-bucket”. The folders under this bucket are as follows:

The input folder holds the sample data
The scripts folder contains the Hive script files for EMR job steps
The output folder will hold the Hive program output
The EMR cluster uses the logs folder to save its log files.
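
The folder layout above can be sketched in code as follows. This is a minimal sketch, assuming a boto3-style S3 client is passed in and that "folders" are represented, as the S3 console does, by zero-byte objects whose keys end in a slash. The helper names folder_uri and create_folders are hypothetical.

```python
BUCKET = "arvind1-bucket"  # bucket name used in the article
FOLDERS = ("input", "scripts", "output", "logs")  # folders listed above


def folder_uri(bucket: str, folder: str) -> str:
    """Build the s3:// URI for one folder in the bucket."""
    return f"s3://{bucket}/{folder}/"


def create_folders(s3_client, bucket: str, folders=FOLDERS) -> list:
    """Create one zero-byte marker object per folder.

    s3_client is assumed to expose put_object(), as the boto3 S3
    client does. Returns the keys that were written.
    """
    keys = []
    for folder in folders:
        key = f"{folder}/"
        s3_client.put_object(Bucket=bucket, Key=key)
        keys.append(key)
    return keys
```

For example, folder_uri(BUCKET, "logs") yields the location that the cluster is later told to use for its log files.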

Hive Scripts for EMR Job Steps

1. This job step runs a Hive script to create an external Hive table. This table describes the tabular schema of the underlying CSV data file. The script for this is as follows:

CREATE EXTERNAL TABLE `threatened_species` (
  `scientific name` string,
  `common name` string,
  `current scientific name` string,
  `threatened status` string,
  `act` string,
  `nsw` string,
  `nt` string,
  `qld` string,
  `sa` string,
  `tas` string,
  `vic` string,
  `wa` string,
  `aci` string,
  `cki` string,
  `ci` string,
  `csi` string,
  `jbt` string,
  `nfi` string,
  `hmi` string,
  `aat` string,
  `cma` string,
  `listed sprat taxonid` bigint,
  `current sprat taxonid` bigint,
  `kingdom` string,
  `class` string,
  `profile` string,
  `date extracted` string,
  `nsl name` string,
  `family` string,
  `genus` string,
  `species` string,
  `infraspecific rank` string,
  `infraspecies` string,
  `species author` string,
  `infraspecies author` string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
-- LOCATION must point at the folder that holds the sample CSV data
LOCATION 's3://arvind1-bucket/input/';

2. This job step runs a query to calculate the top five endangered species in the state of New South Wales (NSW). The Hive query file name is endangeredSpeciesNSW.q and it’s shown below:

SELECT species, COUNT(nsw) AS number_of_endangered_species
FROM threatened_species
WHERE (nsw = 'Yes' OR nsw = 'Endangered')
  AND `threatened status` = 'Endangered'
GROUP BY species
HAVING COUNT(nsw) > 1
ORDER BY number_of_endangered_species DESC
LIMIT 5;

3. This job step runs a query to calculate the total number of endangered plant species for each plant family in Australia. The Hive query file name is endangeredPlantSpecies.q and is shown below

SELECT family, COUNT(species) AS number_of_endangered_species
FROM threatened_species
WHERE kingdom = 'Plantae'
  AND `threatened status` = 'Endangered'
GROUP BY family;

4. This step lists the common and scientific names of extinct animal species in Australia’s Queensland state. The script file is called extinctAnimalsQLD.q and is shown below:

SELECT `common name`, `scientific name`
FROM threatened_species
WHERE kingdom = 'Animalia'
  AND (qld = 'Yes' OR qld = 'Extinct')
  AND `threatened status` = 'Extinct';
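
To make the logic of the second job step (the top five endangered species in NSW) easier to follow, here is a plain-Python sketch of the same filter, group, and rank sequence. It is only an illustration of what the query computes, not how Hive executes it; the function name is hypothetical, and rows are assumed to be dictionaries keyed by the table's column names.

```python
from collections import Counter


def top_endangered_species(rows, limit=5):
    """Mirror the WHERE / GROUP BY / HAVING / ORDER BY / LIMIT steps.

    rows: iterable of dicts with at least the keys "species", "nsw"
    and "threatened status" (the column names from the Hive table).
    """
    # WHERE: keep rows flagged for NSW with an Endangered status,
    # then GROUP BY species and count the rows in each group.
    counts = Counter(
        row["species"]
        for row in rows
        if row["nsw"] in ("Yes", "Endangered")
        and row["threatened status"] == "Endangered"
    )
    # HAVING COUNT > 1, ORDER BY count DESC, LIMIT.
    ranked = [(name, n) for name, n in counts.most_common() if n > 1]
    return ranked[:limit]
```

Note that the order of species with equal counts is not defined by the Hive query; this sketch simply keeps them in first-seen order.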

Log Aggregation

Here we have also uploaded a JSON file called logAggregation.json in the scripts folder of the S3 bucket. We use this file for aggregating the YARN log files. Log aggregation is configured in the yarn-site.xml configuration file when the cluster starts up. The contents of logAggregation.json file are as follows:

[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.log-aggregation-enable": "true",
      "yarn.log-aggregation.retain-seconds": "-1",
      "yarn.nodemanager.remote-app-log-dir": "s3://arvind1-bucket/logs"
    }
  }
]
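
If you prefer to generate the log aggregation file rather than write it by hand, a small helper like the one below can produce it. This is a sketch only: the function names are hypothetical, while the classification and property names are the ones from logAggregation.json above. All property values are kept as strings, which is how they appear in the file.

```python
import json


def yarn_log_aggregation_config(log_dir_uri: str) -> list:
    """Build the yarn-site configuration used for log aggregation."""
    return [
        {
            "Classification": "yarn-site",
            "Properties": {
                "yarn.log-aggregation-enable": "true",
                # -1 keeps the aggregated logs indefinitely
                "yarn.log-aggregation.retain-seconds": "-1",
                "yarn.nodemanager.remote-app-log-dir": log_dir_uri,
            },
        }
    ]


def to_json(config: list) -> str:
    """Serialize the configuration with plain ASCII double quotes."""
    return json.dumps(config, indent=2)
```

Calling to_json(yarn_log_aggregation_config("s3://arvind1-bucket/logs")) gives text that can be saved as logAggregation.json and uploaded to the scripts folder.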

After you create the S3 bucket and copy the data and script files to their respective folders, it is time to set up an EMR cluster. The following steps describe the process as we create the cluster with mostly default settings.

EMR Cluster Setup

In the first step, to configure the cluster in the AWS console, we have kept all of the applications recommended by EMR, including Hive. We do not need to use AWS Glue for storing Hive metadata, nor are we adding any job step at this time. However, we do need to add a software setting. Note carefully how the path to the log aggregation JSON file is specified in this field.

In the next step, we have kept all the default settings. For the sake of our test, the cluster will have one master node and two core nodes. Each node is an m3.xlarge instance with a 10 GB root volume. We name the cluster arvind1-cluster in the next step and specify the custom S3 location for its log files.
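
The same cluster settings could also be expressed programmatically. The sketch below builds a request dictionary in the shape accepted by the boto3 EMR client's run_job_flow call; it is a simplified illustration, not the console's exact output. The function name is hypothetical, release_label is a placeholder you must supply (the article does not state the EMR release), only Hive is listed as an application for brevity, and the role names assume the default EMR roles.

```python
def build_cluster_request(name, log_uri, release_label, configurations):
    """Build keyword arguments for a boto3-style emr.run_job_flow() call.

    release_label is an assumption supplied by the caller; the
    configurations argument is the yarn-site classification list.
    """
    return {
        "Name": name,
        "LogUri": log_uri,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Hive"}],
        "Configurations": configurations,
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "Master",
                    "InstanceRole": "MASTER",
                    "InstanceType": "m3.xlarge",
                    "InstanceCount": 1,
                },
                {
                    "Name": "Core",
                    "InstanceRole": "CORE",
                    "InstanceType": "m3.xlarge",
                    "InstanceCount": 2,
                },
            ],
            # Keep the cluster waiting for steps instead of terminating.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # default instance profile
        "ServiceRole": "EMR_DefaultRole",      # default service role
    }
```

With a real client, the result would be passed as emr_client.run_job_flow(**request), using "arvind1-cluster" as the name and the logs folder of the bucket as the log URI.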

Finally, we specified an EC2 key pair for the purpose of accessing the cluster’s master node. There is no change in the default IAM roles for EMR, the EC2 instance profile, or the auto-scale options. The master and core nodes also use the security groups that are available by default. Normally, this is a default setup for an EMR cluster. Once everything is ready, the cluster enters the “waiting” status.

Submit Hive Job Steps

After this, we need to allow SSH access.

Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.
Choose Clusters.
Choose the Name of the cluster.
Under Security and access choose the Security groups for Master link.
Choose ElasticMapReduce-master from the list.
Choose Inbound, Edit.
Find the rule with the following settings and choose the x icon to delete it:
- Type SSH
- Port 22
- Source Custom 0.0.0.0/0
Scroll to the bottom of the list of rules and choose Add Rule.
For Type, select SSH. This automatically enters TCP for Protocol and 22 for Port Range.
For Source, select My IP. This automatically adds the IP address of your client computer as the source address. Alternatively, you can add a range of Custom trusted client IP addresses and choose Add Rule to create additional rules for other clients. In many network environments, IP addresses are allocated dynamically, so you may need to periodically edit security group rules to update the IP addresses of trusted clients.
Choose Save.
Optionally, choose ElasticMapReduce-slave from the list and repeat the steps above to allow the SSH client access to core and task nodes from trusted clients.
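
The console steps above add a single inbound rule that allows SSH only from a trusted address. As a hedged sketch, the same rule could be built in code and applied through a boto3-style EC2 client. The helper names and the group ID you pass in are hypothetical, and the sketch handles IPv4 addresses only.

```python
import ipaddress


def ssh_ingress_rule(client_ip: str, description: str = "trusted SSH client") -> dict:
    """Build an inbound rule allowing TCP port 22 from one IPv4 address."""
    address = ipaddress.ip_address(client_ip)  # raises ValueError if invalid
    if address.version != 4:
        raise ValueError("this sketch only handles IPv4 addresses")
    return {
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        # /32 restricts the rule to exactly this one address.
        "IpRanges": [{"CidrIp": f"{address}/32", "Description": description}],
    }


def allow_ssh(ec2_client, group_id: str, client_ip: str) -> dict:
    """Apply the rule through a boto3-style EC2 client.

    ec2_client is assumed to expose authorize_security_group_ingress(),
    as the boto3 EC2 client does.
    """
    return ec2_client.authorize_security_group_ingress(
        GroupId=group_id,
        IpPermissions=[ssh_ingress_rule(client_ip)],
    )
```

Validating the address before building the rule helps avoid accidentally opening port 22 to a malformed or overly broad source.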

Now that the EMR cluster is up and running, we have added four job steps. EMR runs these steps one after another, and they appear in order in the AWS EMR console.

Once we add the four steps, we can check that their status is shown as completed. If there is a problem with the execution of any step, the log files for that step can be used to diagnose and resolve it.
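
For readers who would rather submit the four steps from code than from the console, the sketch below builds step definitions in the shape accepted by the boto3 EMR client's add_job_flow_steps call. It assumes an EMR release that runs Hive scripts through command-runner.jar. The three query script names come from this article, while createTable.q is a hypothetical name for the table-creation script, and the helper names are illustrative.

```python
def hive_step(name: str, script_uri: str, action_on_failure: str = "CONTINUE") -> dict:
    """Build one EMR step that runs a Hive script stored in S3."""
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hive-script", "--run-hive-script", "--args", "-f", script_uri],
        },
    }


SCRIPTS = [
    ("Create Hive table", "createTable.q"),  # hypothetical file name
    ("Endangered species in NSW", "endangeredSpeciesNSW.q"),
    ("Endangered plant species", "endangeredPlantSpecies.q"),
    ("Extinct animals in QLD", "extinctAnimalsQLD.q"),
]


def build_steps(bucket: str, scripts=SCRIPTS) -> list:
    """Build the steps in order, reading scripts from the scripts folder."""
    return [
        hive_step(name, f"s3://{bucket}/scripts/{filename}")
        for name, filename in scripts
    ]


def submit_steps(emr_client, cluster_id: str, steps: list) -> dict:
    """Submit the steps through a boto3-style EMR client."""
    return emr_client.add_job_flow_steps(JobFlowId=cluster_id, Steps=steps)
```

Because the steps are plain dictionaries, the same list can be resubmitted several times in an hour to simulate successive batch runs.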

So this is it from my side in this article on Big Data in AWS. I hope you have understood everything that I have explained here.
