Operators in Apache Pig: Part 1- Relational Operators - Edureka

Big Data and Hadoop (165 Blogs) Become a Certified Professional

Become a Certified Professional

This post is about the operators in Apache Pig. Apache Pig also enables you to write complex data transformations without the knowledge of Java, making it really important for the Big Data Hadoop Certification projects. Let’s take a quick look at what Pig and Pig Latin is and the different modes in which they can be operated, before heading on to Operators.

What is Apache Pig?

Apache Pig is a high-level procedural language for querying large data sets using Hadoop and the Map Reduce Platform. It is a Java package, where the scripts can be executed from any language implementation running on the JVM. This is greatly used in iterative processes.

Apache Pig simplifies the use of Hadoop by allowing SQL-like queries to a distributed dataset and makes it possible to create complex tasks to process large volumes of data quickly and effectively. The best feature of Pig is that, it backs many relational features like Join, Group and Aggregate.

I know Pig sounds a lot more like an ETL tool and it does have many features common with ETL tools. But the advantage of Pig over ETL tools is that it can run on many servers simultaneously.

What is Apache Pig Latin?

Apache Pig create a simpler procedural language abstraction over Map Reduce to expose a more Structured Query Language (SQL)-like interface for Hadoop applications called Apache Pig Latin, So instead of writing a separate Map Reduce application, you can write a single script in Apache Pig Latin that is automatically parallelized and distributed across a cluster. In simple words, Pig Latin, is a sequence of simple statements taking an input and producing an output. The input and output data are composed of bags, maps, tuples and scalar. Learn more about Big Data and its applications from the Azure Data Engineering Course in Hyderabad.

Apache Pig Execution Modes:

Apache Pig has two execution modes:

Local Mode

In ‘Local Mode’, the source data would be picked from the local directory in your computer system. The MapReduce mode can be specified using ‘pig –x local’ command.

MapReduce Mode:

To run Pig in MapReduce mode, you need access to Hadoop cluster and HDFS installation. The MapReduce mode can be specified using the ‘pig’ command.

Apache Pig Operators:

Apache Pig Operators is a high-level procedural language for querying large data sets using Hadoop and the Map Reduce Platform. A Pig Latin statement is an operator that takes a relation as input and produces another relation as output. These operators are the main tools Pig Latin provides to operate on the data. They allow you to transform it by sorting, grouping, joining, projecting, and filtering. You can get a better understanding of the Azure Data Engineering certification.

Let’s create two files to run the commands:

We have two files with name ‘first’ and ‘second.’ The first file contains three fields: user, url & id.

The second file contain two fields: url & rating. These two files are CSV files.

The Apache Pig operators can be classified as: Relational and Diagnostic.

Relational Operators:

Relational operators are the main tools Pig Latin provides to operate on the data. It allows you to transform the data by sorting, grouping, joining, projecting and filtering. This section covers the basic relational operators.

LOAD:

LOAD operator is used to load data from the file system or HDFS storage into a Pig relation.

In this example, the Load operator loads data from file ‘first’ to form relation ‘loading1’. The field names are user, url, id.

FOREACH:

This operator generates data transformations based on columns of data. It is used to add or remove fields from a relation. Use FOREACH-GENERATE operation to work with columns of data.

FOREACH Result:

FILTER:

This operator selects tuples from a relation based on a condition.

In this example, we are filtering the record from ‘loading1’ when the condition ‘id’ is greater than 8.

FILTER Result:

JOIN:

JOIN operator is used to perform an inner, equijoin join of two or more relations based on common field values. The JOIN operator always performs an inner join. Inner joins ignore null keys, so it makes sense to filter them out before the join.

In this example, join the two relations based on the column ‘url’ from ‘loading1’ and ‘loading2’.

JOIN Result:

ORDER BY:

Order By is used to sort a relation based on one or more fields. You can do sorting in ascending or descending order using ASC and DESC keywords.

In below example, we are sorting data in loading2 in ascending order on ratings field.

ORDER BY Result:

DISTINCT:

Distinct removes duplicate tuples in a relation.Lets take an input file as below, which has amr,crap,8 and amr,myblog,10 twice in the file. When we apply distinct on the data in this file, duplicate entries are removed.

DISTINCT Result:

STORE:

Store is used to save results to the file system.

Here we are saving loading3 data into a file named storing on HDFS.

STORE Result:

GROUP:

The GROUP operator groups together the tuples with the same group key (key field). The key field will be a tuple if the group key has more than one field, otherwise it will be the same type as that of the group key. The result of a GROUP operation is a relation that includes one tuple per group.

In this example, group th

e relation ‘loading1’ by column url.

GROUP Result:

COGROUP:

COGROUP is same as GROUP operator. For readability, programmers usually use GROUP when only one relation is involved and COGROUP when multiple relations re involved.

In this example group the ‘loading1’ and ‘loading2’ by url field in both relations.

COGROUP Result:

CROSS:

The CROSS operator is used to compute the cross product (Cartesian product) of two or more relations.

Applying cross product on loading1 and loading2.

CROSS Result:

LIMIT:

LIMIT operator is used to limit the number of output tuples. If the specified number of output tuples is equal to or exceeds the number of tuples in the relation, the output will include all tuples in the relation.

LIMIT Result:

SPLIT:

SPLIT operator is used to partition the contents of a relation into two or more relations based on some expression. Depending on the conditions stated in the expression.

Split the loading2 into two relations x and y. x relation created by loading2 contain the fields that the rating is greater than 8 and y relation contain fields that rating is less than or equal to 8.

Got a question for us? Please mention them in the comments section and we will get back to you.

Related Posts:

Operators in Apache Pig – Diagnostic Operators

Steps to Create UDF in Apache Pig

Introduction to Pig

Recommended videos for you

Secure Your Hadoop Cluster With Kerberos

Big Data – XML Parsing With MapReduce

Streaming With Apache Spark and Scala

Introduction to Apache Solr-1

Apache Spark Will Replace Hadoop ! Know Why

is-it-the-right-time-for-me-to-learn-hadoop-find-out.jpg

Is It The Right Time For Me To Learn Hadoop ? Find out.

When not to use Hadoop

Real-Time Analytics with Apache Storm

Introduction to Big Data TDD and Pig Unit

MapReduce-Tutorial-What-is-MapReduce-Hadoop-MapReduce-Tutorial-Edureka.jpeg

MapReduce Tutorial – All You Need To Know About MapReduce

Advanced Security In Hadoop Cluster

5 Scenarios: When To Use & When Not to Use Hadoop

Python-for-Big-Data-Analytics-1-Python-Hadoop-Tutorial-for-Beginners-Python-Tutorial-Edureka.jpeg

Python for Big Data Analytics

Bulk Loading Into HBase With MapReduce

Apache-Hadoop-Tutorial-Hadoop-Tutorial-For-Beginners-Big-Data-Hadoop-Hadoop-Training-Edureka.jpeg

Hadoop Tutorial – A Complete Tutorial For Hadoop

boost-your-data-career-with-predictive-analytics-learn-how.jpg

Boost Your Data Career with Predictive Analytics! Learn How ?

Webinar: Introduction to Big Data & Hadoop

Hadoop Cluster With High Availability

Hive-Tutorial-1-Hive-Tutorial-for-Beginners-Understanding-Hive-In-Depth-Edureka.jpeg

Hive Tutorial – Understanding Hive In Depth

Distributed Cache With MapReduce

Recommended blogs for you

Spark Java Tutorial : Your One Stop Solution to Spark in Java

Azure Data Factory Vs Databricks

Big Data Characteristics: Know the 5’Vs of Big Data

Steps to Create UDF in Apache Pig

How to Create a Pipeline in Azure Data Factory Step-by-Step

Spark SQL Tutorial – Understanding Spark SQL With Examples

What is Azure Data Factory – Here’s Everything You Need to Know

Apache Spark Lighting up the Big Data World

Spark Tutorial: Real Time Cluster Computing Framework

Applying Hadoop with Data Science

Top 50+ Apache Spark Interview Questions and Answers for 2025

Apache Pig Installation on Linux

Jobs In Hadoop

PySpark MLlib Tutorial : Machine Learning with PySpark

Introduction to Spark with Python – PySpark for Beginners

Introduction of Hadoop Architecture

Hive Data Models: Designing Efficient Data Structures

Big Prospects for Big Data

Everything About Cloudera Certified Administrator for Apache Hadoop (CCAH)

Hadoop Admin Responsibilities

Comments

14 Comments

Sudip Saha says:
Apr 2, 2018 at 7:58 am GMT
What should be the laptop configuration(RAM capacity) for installing hadoop and its ecosystem tools.???
Reply
- EdurekaSupport says:
  Apr 10, 2018 at 9:55 am GMT
  Hey Sudip, having 8GB of RAM would be the minimum requirement. However to run it properly, you will need to have 16GB of processing power in your system. Cheers :)
  Reply
varun says:
Jun 15, 2016 at 7:31 am GMT
Apache Pig language is quiet technical
Reply
pradeep says:
Jun 7, 2016 at 6:41 pm GMT
I am trying to search, how many relation I have created till now..Is there command ?
Reply
pradeep thatisetty says:
Sep 14, 2015 at 5:26 pm GMT
Within the Apache Pig Execution Modes section in the local mode it is mentioned that The MapReduce mode can be specified using ‘pig –x local’ command. Should it be Local mode can be specified using ‘pig –x local’ command.
Reply
bindu thimmapuram says:
Aug 18, 2015 at 9:50 pm GMT
How do we achieve the functionality of left or right joins using pig?
Reply
- bindu thimmapuram says:
  Aug 18, 2015 at 10:28 pm GMT
  Well I got the answer by myself….Details of joins is available at
  Reply
Sushobhit Rajan says:
Oct 5, 2014 at 10:44 am GMT
Hi Nice bog, but if you will add UDF also it will be a very good place to come back again and again for revision.
Reply
- EdurekaSupport says:
  Oct 7, 2014 at 5:34 am GMT
  Thanks Sushobhit!! You read about UDF here: https://www.edureka.co/blog/apache-pig-udf-store-functions/
  Reply
  - Sushobhit Rajan says:
    Oct 9, 2014 at 12:31 pm GMT
    Thanks :)
    Reply
Swagata says:
Jun 14, 2014 at 7:33 pm GMT
For completeness, would like example of DISTINCT, ORDER and STORE as well
Reply
- EdurekaSupport says:
  Jun 20, 2014 at 11:43 am GMT
  Hi Swagata thanks for visiting. We have updated the blog with the requested information.
  Reply
  - Sharad says:
    Mar 22, 2016 at 10:04 am GMT
    Please add Flatten as well…
    Reply
Samarjit Mahapatra says:
Jan 10, 2014 at 5:17 pm GMT
Good one
Reply

Join the discussionCancel reply

REGISTER FOR FREE WEBINAR

webinar_success

Thank you for registering Join Edureka Meetup community for 100+ Free Webinars each month JOIN MEETUP GROUP

Operators in Apache Pig: Part 1- Relational Operators

edureka.co