Published on Feb 13,2018

This is the second post in our series on Apache Pig operators. It covers the diagnostic operators in Apache Pig. For more background, see our previous post on relational operators.

Let's create two files to run the commands against, named 'first' and 'second'. The first file contains three fields: user, url and id.

The second file contains two fields: url and rating. Both are CSV files.
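As a rough sketch, the two files could be loaded into relations as follows. The relation name loading1 is the one used later in this post; loading2, the field types, and the assumption that the files sit in the current working directory are illustrative guesses.

```pig
-- Load the comma-separated file 'first' with its three fields.
-- The field types are assumptions; adjust them to match your data.
loading1 = LOAD 'first' USING PigStorage(',')
           AS (user:chararray, url:chararray, id:int);

-- Load the comma-separated file 'second' with its two fields.
-- 'loading2' is a hypothetical relation name.
loading2 = LOAD 'second' USING PigStorage(',')
           AS (url:chararray, rating:int);
```

Nothing is executed at this point. Pig only builds a plan until it reaches a statement such as DUMP or STORE.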

Diagnostic Operators:

DUMP:

The DUMP operator runs the Pig Latin statements that define a relation and displays the results on the screen. In this example, the operator prints the contents of the relation 'loading1' to the screen.
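A minimal sketch of such a DUMP, assuming 'loading1' is loaded from the file 'first' with the three fields described earlier (the field types are assumptions):

```pig
-- Define the relation; nothing runs yet.
loading1 = LOAD 'first' USING PigStorage(',')
           AS (user:chararray, url:chararray, id:int);

-- DUMP triggers execution and prints every tuple of loading1 to the console.
DUMP loading1;
```

DUMP is best suited to small data sets and interactive debugging. For large results, STORE is usually the better choice.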


DESCRIBE:

Use the DESCRIBE operator to review the schema of a particular relation. It is especially useful when debugging a script, because it shows the field names and types Pig has assigned at a given step.
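For illustration, DESCRIBE can be applied to 'loading1' like this. The schema in the LOAD statement is an assumption based on the three fields described earlier.

```pig
-- Load with an explicit schema so DESCRIBE has field names and types to report.
loading1 = LOAD 'first' USING PigStorage(',')
           AS (user:chararray, url:chararray, id:int);

-- Prints the relation name followed by its field names and types.
-- Unlike DUMP, this does not read the data.
DESCRIBE loading1;
```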

ILLUSTRATE:

The ILLUSTRATE operator shows how data is transformed through a sequence of Pig Latin statements. It is your best friend when it comes to debugging a script, and it alone might be a good reason to choose Pig over the alternatives.
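One way to try ILLUSTRATE is on a join of the two files by url. This is only a sketch: the relation names 'loading2' and 'joined' and the field types are assumptions, not taken from the original screenshots.

```pig
-- Load both files with explicit schemas (types are assumptions).
loading1 = LOAD 'first' USING PigStorage(',')
           AS (user:chararray, url:chararray, id:int);
loading2 = LOAD 'second' USING PigStorage(',')
           AS (url:chararray, rating:int);

-- Join the two relations on their shared url field.
joined = JOIN loading1 BY url, loading2 BY url;

-- Shows a small sample of data flowing through each step:
-- the two loads and then the join.
ILLUSTRATE joined;
```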

EXPLAIN:

The EXPLAIN operator prints the logical, physical, and MapReduce execution plans that Pig uses to compute a relation.
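A short sketch of EXPLAIN on a joined relation follows. The names 'loading2' and 'joined' and the field types are illustrative assumptions.

```pig
-- Load both files with explicit schemas (types are assumptions).
loading1 = LOAD 'first' USING PigStorage(',')
           AS (user:chararray, url:chararray, id:int);
loading2 = LOAD 'second' USING PigStorage(',')
           AS (url:chararray, rating:int);

-- Join on the shared url field.
joined = JOIN loading1 BY url, loading2 BY url;

-- Prints the execution plans for computing 'joined' without running the job.
EXPLAIN joined;
```

Reading the plans is a convenient way to check how many MapReduce jobs a script will launch before you run it.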

Improvements in Apache Pig 0.12.0

Apache Pig 0.12.0 introduced several new features, including the ASSERT operator, the IN operator, and the CASE expression.

ASSERT Operator:

The ASSERT operator can be used for data validation. For example, the following script will fail if any value of a0 is not greater than zero:

a = LOAD 'something' AS (a0:int, a1:int);

ASSERT a BY a0 > 0, 'a cannot be negative for reasons';

IN Operator:

Previously, Pig had no support for an IN operator. To imitate an IN operation, users had to chain several comparisons together with OR, as shown in the example below:

a = LOAD '1.txt' USING PigStorage(',') AS (i:int);

b = FILTER a BY
    (i == 1) OR
    (i == 22) OR
    (i == 333) OR
    (i == 4444) OR
    (i == 55555);

Now, this type of expression can be rewritten more compactly using the IN operator:

a = LOAD '1.txt' USING PigStorage(',') AS (i:int);

b = FILTER a BY i IN (1, 22, 333, 4444, 55555);

CASE Expression:

Earlier, Pig had no support for a CASE expression. To mimic it, users often used nested bincond operators, which could become unreadable when there were multiple levels of nesting. The following is an example of the type of CASE expression that Pig now supports:

Case_operator = FOREACH foo GENERATE (
    CASE i % 3
        WHEN 0 THEN '3n'
        WHEN 1 THEN '3n+1'
        ELSE '3n+2'
    END
);
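For comparison, the same logic written with nested bincond operators might look like the sketch below. The input file 'numbers.txt' and the alias 'bincond_version' are hypothetical names chosen for illustration.

```pig
-- Hypothetical input: one integer per line.
foo = LOAD 'numbers.txt' AS (i:int);

-- Nested bincond (condition ? value_if_true : value_if_false).
-- Each additional branch adds another level of nesting.
bincond_version = FOREACH foo GENERATE (
    i % 3 == 0 ? '3n' : (i % 3 == 1 ? '3n+1' : '3n+2')
);
```

With only three branches the bincond form is still manageable, but every extra case adds another set of parentheses, which is where CASE becomes easier to read.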

Got a question for us? Please mention it in the comments section and we will get back to you.

About Author
Jayakrishna Kalva