
Operators in Apache Pig: Part 2- Diagnostic Operators

Last updated on May 22, 2019

This is the second post in our series on Apache Pig operators. It covers the diagnostic operators in Apache Pig. For more background, refer to our previous post on relational operators.

To run the commands, we use two files named 'first' and 'second'. The first file contains three fields: user, url, and id.

[Screenshot: contents of the 'first' file]

The second file contains two fields: url and rating. Both files are CSV files.

[Screenshot: contents of the 'second' file]
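As a minimal sketch, the two files described above could be loaded as follows. The field types and the relation name 'loading2' are assumptions made for illustration; the relation name 'loading1' is the one used in the DUMP example later in this post.

```pig
-- Load the 'first' file: three comma-separated fields per line.
-- The field types are assumed; adjust them to match your data.
loading1 = LOAD 'first' USING PigStorage(',') AS (user:chararray, url:chararray, id:int);

-- Load the 'second' file: two comma-separated fields per line.
-- 'loading2' is a hypothetical relation name.
loading2 = LOAD 'second' USING PigStorage(',') AS (url:chararray, rating:int);
```

Pig evaluates these statements lazily: nothing is read from disk until an operator such as DUMP or STORE asks for output.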

Diagnostic Operators:

DUMP:

The DUMP operator runs the Pig Latin statements that define a relation and displays the results on the screen. In this example, DUMP prints the contents of the relation 'loading1' to the screen.

[Screenshot: DUMP command]

DUMP Result:

[Screenshot: DUMP output]
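A minimal sketch of a DUMP call, assuming the 'first' file is loaded with the schema shown in the comments (the field types are an assumption):

```pig
-- Define the relation; the field types are assumed for illustration.
loading1 = LOAD 'first' USING PigStorage(',') AS (user:chararray, url:chararray, id:int);

-- Trigger execution and print every record to the console.
-- Each record is printed as a tuple, in the form (user,url,id).
DUMP loading1;
```

Because DUMP prints the whole relation, it is best suited to small inputs or to relations that have already been reduced, for example with LIMIT.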

DESCRIBE:

Use the DESCRIBE operator to review the schema of a particular relation. It is most useful when debugging a script.

[Screenshot: DESCRIBE command and output]
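A minimal sketch of DESCRIBE, again assuming the 'first' file and an illustrative schema:

```pig
-- Define the relation; the field types are assumed for illustration.
loading1 = LOAD 'first' USING PigStorage(',') AS (user:chararray, url:chararray, id:int);

-- Print the schema of the relation without running a job.
-- The output has a form like:
--   loading1: {user: chararray,url: chararray,id: int}
DESCRIBE loading1;
```

Since DESCRIBE only inspects the schema, it returns immediately, which makes it a cheap first check when a later statement complains about a missing or mistyped field.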

ILLUSTRATE:

The ILLUSTRATE operator shows how data is transformed through a sequence of Pig Latin statements. It is your best friend when debugging a script, and it alone might be a good reason to choose Pig over other tools.

[Screenshot: ILLUSTRATE command and output]
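A minimal sketch of ILLUSTRATE on a small pipeline that joins the two sample files on url. The field types and the relation names 'loading2' and 'joined' are assumptions made for illustration:

```pig
-- Load both sample files; the field types are assumed.
loading1 = LOAD 'first' USING PigStorage(',') AS (user:chararray, url:chararray, id:int);
loading2 = LOAD 'second' USING PigStorage(',') AS (url:chararray, rating:int);

-- Join the two relations on the shared url field.
joined = JOIN loading1 BY url, loading2 BY url;

-- Show a small sample of records at each step of the pipeline
-- (after each LOAD and after the JOIN).
ILLUSTRATE joined;
```

ILLUSTRATE works on a sample of the input rather than the full data set, so it is a quick way to check that each step produces records of the shape you expect.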

EXPLAIN:

The EXPLAIN operator prints the logical, physical, and MapReduce execution plans that are used to compute a relation.

[Screenshots: EXPLAIN command and output]
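A minimal sketch of EXPLAIN on a small aggregation over the 'second' file. The field types and the relation names 'loading2', 'grouped', and 'ratings_by_url' are assumptions made for illustration:

```pig
-- Load the 'second' file; the field types are assumed.
loading2 = LOAD 'second' USING PigStorage(',') AS (url:chararray, rating:int);

-- Compute the average rating per url.
grouped = GROUP loading2 BY url;
ratings_by_url = FOREACH grouped GENERATE group AS url, AVG(loading2.rating) AS avg_rating;

-- Print the execution plans for the final relation.
EXPLAIN ratings_by_url;
```

Reading the plans from top to bottom shows how the GROUP and FOREACH statements are translated into physical operators and then into MapReduce jobs.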

Improvements in Apache Pig 0.12.0

At the time of writing, 0.12.0 is the current version of Apache Pig. This release includes several new features, such as the ASSERT operator, the IN operator, and the CASE expression.

Assert Operator:

The ASSERT operator can be used for data validation. For example, the following script fails if any value of a0 is not greater than 0:

    -- Fail the script if any record violates the condition a0 > 0.
    a = load 'something' as (a0: int, a1: int);
    assert a by a0 > 0, 'a cannot be negative for reasons';

IN Operator:

Previously, Pig had no IN operator. To imitate one, users had to chain several OR conditions, as in the example below:

    a = LOAD '1.txt' USING PigStorage(',') AS (i:int);
    b = FILTER a BY
        (i == 1) OR
        (i == 22) OR
        (i == 333) OR
        (i == 4444) OR
        (i == 55555);

Now this type of expression can be rewritten more compactly using the IN operator:

    a = LOAD '1.txt' USING PigStorage(',') AS (i:int);
    b = FILTER a BY i IN (1, 22, 333, 4444, 55555);

CASE Expression:

Earlier, Pig had no support for a CASE expression. To mimic one, users often used nested bincond operators, which could become unreadable with multiple levels of nesting. The following is an example of the type of CASE expression that Pig now supports:

    Case_operator = FOREACH foo GENERATE (
        CASE i % 3
            WHEN 0 THEN '3n'
            WHEN 1 THEN '3n+1'
            ELSE '3n+2'
        END
    );
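For comparison, the nested bincond form that the CASE expression above replaces might look like the sketch below. The LOAD statement, the file name '1.txt', and the relation name 'bincond_version' are assumptions added so that the snippet is self-contained:

```pig
-- 'foo' is assumed to hold a single int field named i.
foo = LOAD '1.txt' USING PigStorage(',') AS (i:int);

-- Each bincond has the form (condition ? value_if_true : value_if_false),
-- so a three-way choice needs one bincond nested inside another.
bincond_version = FOREACH foo GENERATE
    (i % 3 == 0 ? '3n' : (i % 3 == 1 ? '3n+1' : '3n+2'));
```

Every additional branch adds another level of parentheses to the bincond form, while the CASE form only needs one more WHEN line.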

Got a question for us? Please mention it in the comments section and we will get back to you.


Comments
14 Comments
  • Sindhuja Devaraj says:

    is there a command to join two files without duplicate columns?

  • teja says:

    Very good blog. Easy to understand! Thank you, Edureka!

    • EdurekaSupport says:

      Hi Teja,
      Thank you so much for your great feedback. We hope that you will find our blog useful in future as well.
      Keep visiting the Edureka Blog page for the latest posts on this link: https://www.edureka.co/blog/

  • prcahi says:

    Hi All,

    I need to put IF, then IF, ELSE IF conditions. How can I do that in Pig? Please let me know. Thanks in advance.

  • bindu thimmapuram says:

    Nice Blog!! simple and to the point

    • EdurekaSupport says:

      Hi Bindu,
      Thank you for your positive feedback. We hope that you will find our blog useful in future as well. Keep visiting the Edureka Blog page for latest posts on this link:
      https://www.edureka.co/blog/

  • bhaskar says:

    If I want to use an IN clause with MATCHES, is there a way?

  • devinder says:

    What is the significance of the output given by the EXPLAIN command? Please give details with an example.

    • EdurekaSupport says:

      Hi Devinder, we use the EXPLAIN operator to review the logical, physical, and MapReduce execution plans that are used to compute the specified relation. If no script is given, the logical plan shows a pipeline of operators to be executed to build the relation; type checking and backend-independent optimizations (such as applying filters early on) also apply. The physical plan shows how the logical operators are translated to backend-specific physical operators; some backend optimizations also apply. The MapReduce plan shows how the physical operators are grouped into MapReduce jobs. If a script without an alias is specified, EXPLAIN outputs the entire execution graph (logical, physical, or MapReduce). If a script with an alias is specified, it outputs the plan for the given alias.

  • devinder says:

    I am using Apache Pig version 0.12.0-cdh5.2.1 and ILLUSTRATE is giving an error:
    ERROR 2997: Encountered IOException. Exception
    It seems it is not supported.

    • EdurekaSupport says:

      Hi Devinder, can you please share more details about the error? Meanwhile, can you try running this command in Pig's local mode and check the result?

  • Sushobhit Rajan says:

    Nicely explained. If any new updates are coming for this page, please let me know.

    • EdurekaSupport says:

      Thanks Sushobhit! You can get regular updates by subscribing to our blog. You can use the Subscription form on the right side of this post.
