A Deep Dive Into Pig

Become a Certified Professional

One of the biggest reasons why the popularity of Hadoop skyrocketed in recent times is the fact that features like Pig and Hive run on top of it allowing non-programmers with functionality that was previously exclusive to Java programmers. These features were a consequence of the growing demand for Hadoop professionals. Other features that are used by Hadoop professionals from non-Java backgrounds are Flume, Sqoop, HBase and Oozie.

To understand why you do not need Java to learn Hadoop, do check out this blog.

Let’s understand how these features work.

We all know that programming knowledge is a necessity for writing MapReduce codes. But what if I have a tool that can do the coding if I would just provide the details? That is where Pig exhibits its muscle power. Pig uses a platform called Pig Latin that abstracts the programming from the Java MapReduce idiom into a notation which makes MapReduce programming high level, similar to that of SQL for RDBMS systems. The codes written in Pig Latin MapReduce automatically get converted to equivalent MapReduce functions. Isn’t that awesome? Another Mind-Blowing fact is that only 10 Lines of Pig is needed to replace 200 Lines of Java.

10 lines of Pig = 200 lines of Java

This not only means that non-Java professionals use Hadoop but also testifies the underlining fact that Pig is used by an equal number of technical developers.

Additionally, if you want to write your own MapReduce code, you can do that in any of the languages like Perl, Python, Ruby or C. Some basic operations that we can perform on any Dataset using Pig are Group, Join, Filter and Sort. These operations can be performed on structured, un-structured and also semi-structured data. They provide an ad-hoc way for creating and executing MapReduce jobs on very large data sets.

Next up, let’s understand Hive. It is an open source, peta-byte scale data warehousing framework based on Hadoop for data summarization, query and analysis. Hive provides an SQL-like interface for Hadoop. You can use Hive to read and write files on Hadoop and run your reports from a BI tool. Some typical functionality of Hadoop are:

Let me show you a demo using Pig on Clickstream data set
We will use this Clickstream data and perform Transformations, Joins and Groupings.

ClickStream is a series of mouse clicks made by a user while accessing the Internet especially as monitored to assess a person’s interests for marketing purposes. It is mainly used by online retail websites like Flipkart and Amazon who track your activities to generate recommendations. The Clickstream data set that we have made use of has the following fields:

1. Type of language supported by the web application

2. Browser type

3. Connection type

4. Country ID

5. Time Stamp

6. URL

7. User status

8. Type of User

It will look like this with the appropriate fields.

Below is the list of browser types that have been used by various people when surfing on a particular website. Among the list are browsers like Internet Explorer, Google Chrome, Lynx and so on.

Internet connection type can be Lan/ Modem/Wifi. See the image below for the complete list:

In the next image, you will find the list of countries from where the website has attracted audience along with their IDs.

Once we have gathered all the data sets, we have to launch Pig’s Grunt shell, which is launched in order to run the Pig commands.

The first thing we have to do on launching Grunt shell Is to load the Clickstream data into Pig’s relation. A relation is nothing but a table. Below is the command that we use to load a file residing in HDFS onto Pig’s relation.

We can verify the schema of the relation by the command describe click_stream.

We now need to add the reference files which will contain details about the list of countries with their IDs and the different browser types along with their IDs.

We now have two reference files, but they need to be connected to form a relation.
We run a connection_ref command to indicate the type of connection.

Now that we have a working connection and an established relation, we will show you how we can Transform that data.
For each record in Clickstream, we will generate a new record in a different format, i.e the transformed data. The new format will include fields like TimeStamp, Browser type, Country IDs and a few more.

We can perform a Filter operation to trim down the Big Data. The different types of users are Administrators, Guests or Bots. In our demo, I have filtered the list for the Guests.

If you remember, the Country ID is present in the Clickstream and we loaded a country_ref file containing the names of the countries along with its IDs. We can thus perform a Join operation between the two files and merge the data to derive insights.

If we have joined the data, then we can find out the different countries from where the users are by Grouping. Once we have this data, we can perform a Count operation to identify the number of users from a particular country.

It is no rocket science to derive insights from Big Data. These are just some of the many features that I have implemented and with tools like Hive, Hbase, Oozie, Sqoop and Flume there is a treasure of data yet to be explored. So those of you who are holding yourselves back from learning Hadoop, it’s time to change.

Got a question for us? Please mention them in the comments section and we will get back to you.

Related Posts:

Get Started with Big Data and Hadoop

4 Ways to Use R and Hadoop Together

Everything About Cloudera Certified Developer for Apache Hadoop

Recommended videos for you

Apache Spark For Faster Batch Processing

New-Age Search through Apache Solr

Power of Python With BigData

Apache Spark Redefining Big Data Processing

Top Hadoop Interview Questions and Answers – Ace Your Interview

Advanced Security In Hadoop Cluster

Improve Customer Service With Big Data

What Is Hadoop – All You Need To Know About Hadoop

Filtering on HBase Using MapReduce Filtering Pattern

What is Big Data and Why Learn Hadoop!!!

Big Data – XML Parsing With MapReduce

Bulk Loading Into HBase With MapReduce

Reduce Side Joins With MapReduce

Big Data Processing With Apache Spark

Is Hadoop A Necessity For Data Science?

Pig Tutorial – Know Everything About Apache Pig Script

Secure Your Hadoop Cluster With Kerberos

Tailored Big Data Solutions Using MapReduce Design Patterns

HBase Tutorial – A Complete Guide On Apache HBase

Spark SQL | Apache Spark

Recommended blogs for you

Splunk Careers – Your Pathway To Hot Big Data Jobs

Hadoop Career: Career in Big Data Analytics

Why Should a Mainframe Professional Move to Big Data and Hadoop?

DynamoDB vs MongoDB: Which One Meets Your Business Needs Better?

Azure Data Engineer Roadmap in 2025

Introduction to Apache Hive

Apache Pig UDF: Part 3 – Store Functions

Apache Flume Tutorial : Twitter Data Streaming

Game Changing Big Data Use Cases

Hive and Yarn Examples on Spark

Apache Flink: The Next Gen Big Data Analytics Framework For Stream And Batch Data Processing

Brief Introduction to Oozie

Implementing Hadoop & R Analytic Skills in Banking Domain

Switching Careers: From Java to Big Data / Hadoop

Increasing Demand for ‘ Hadoop and NoSQL Skills ’

Steps to Create UDF in Apache Pig

Splunk Lookup and Fields: Splunk Knowledge Objects

Top Hadoop Interview Questions On Apache PIG For 2025

Hadoop Administration Interview Questions and Answers For 2025

Helpful Hadoop Shell Commands

Join the discussionCancel reply

Trending Courses in Big Data

Microsoft Azure Data Engineering Training Cou ...

Microsoft Fabric DP-700 Certification Trainin ...

PySpark Certification Training Course

Applied Data Engineering on Azure Cloud Cours ...

Big Data Hadoop Certification Training Course

Apache Kafka Certification Training Course

ELK Stack Training & Certification

Apache Spark and Scala Certification Training ...

Splunk Certification Training: Power User and ...

Comprehensive MapReduce Certification Trainin ...

Browse Categories

Subscribe to our Newsletter, and get personalized recommendations.

A Deep Dive Into Pig