Please explain the difference between cogroup and full outer join in spark.
Jul 14, 2019

## 1 answer to this question.

Please go through the explanation below:

### Full Outer Join

A full outer join on RDDs works the same way as a FULL OUTER JOIN in SQL.

• FULL OUTER JOIN returns all records from both tables, whether or not the other table has a match; where one side has no match, its columns are filled with nulls (None in Spark).
• FULL OUTER JOIN can potentially return very large result sets.
• FULL JOIN and FULL OUTER JOIN are the same.
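In Spark, `rdd1.fullOuterJoin(rdd2)` produces `(key, (v, w))` pairs where the missing side is `None`. As a rough illustration of those semantics only (the `full_outer_join` helper below is a hypothetical pure-Python sketch, not the Spark API):

```python
def full_outer_join(left, right):
    # left/right: lists of (key, value) pairs, standing in for small keyed RDDs.
    # Emits (key, (left_value, right_value)); a missing side becomes None,
    # mirroring RDD.fullOuterJoin's behavior.
    keys = {k for k, _ in left} | {k for k, _ in right}
    out = []
    for k in sorted(keys):
        lvals = [v for key, v in left if key == k] or [None]
        rvals = [v for key, v in right if key == k] or [None]
        for lv in lvals:
            for rv in rvals:
                out.append((k, (lv, rv)))
    return out

# Keys present on only one side still appear in the result:
full_outer_join([(1, 'a')], [(2, 'b')])
# -> [(1, ('a', None)), (2, (None, 'b'))]
```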

### Group and Co-group

The GROUP and COGROUP operators are identical; by convention, GROUP is used in statements involving one relation and COGROUP in statements involving two or more relations. (The examples below are written in Pig Latin, where these semantics are easiest to see; Spark's RDD API offers the analogous groupByKey and cogroup operations.)

Suppose we have a relation A like the one below:

```
A = LOAD 'student' AS (name:chararray, age:int, gpa:float);

DUMP A;
(John,18,4.0F)
(Mary,19,3.8F)
(Bill,20,3.9F)
(Joe,18,3.8F)

B = GROUP A BY age;

DUMP B;
(18,{(John,18,4.0F),(Joe,18,3.8F)})
(19,{(Mary,19,3.8F)})
(20,{(Bill,20,3.9F)})
```
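In Spark terms, GROUP on a single relation corresponds to groupByKey on a keyed RDD. As an illustration of that semantics only (the `group_by_key` helper below is a hypothetical pure-Python sketch, not the Spark API):

```python
def group_by_key(pairs):
    # pairs: list of (key, value) tuples, standing in for a small keyed RDD.
    # Emits (key, [values]) per distinct key, like RDD.groupByKey().
    groups = {}
    for k, v in pairs:
        groups.setdefault(k, []).append(v)
    return sorted(groups.items())

# Grouping the students by age, as in the Pig example above:
group_by_key([(18, 'John'), (19, 'Mary'), (20, 'Bill'), (18, 'Joe')])
# -> [(18, ['John', 'Joe']), (19, ['Mary']), (20, ['Bill'])]
```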

Now let's use COGROUP.

Suppose we have two relations, A and B, as below:

```
A = LOAD 'data1' AS (owner:chararray, pet:chararray);

DUMP A;
(Alice,turtle)
(Alice,goldfish)
(Alice,cat)
(Bob,dog)
(Bob,cat)

B = LOAD 'data2' AS (friend1:chararray, friend2:chararray);

DUMP B;
(Cindy,Alice)
(Mark,Alice)
(Paul,Bob)
(Paul,Jane)

X = COGROUP A BY owner, B BY friend2;

DUMP X;
(Alice,{(Alice,turtle),(Alice,goldfish),(Alice,cat)},{(Cindy,Alice),(Mark,Alice)})
(Bob,{(Bob,dog),(Bob,cat)},{(Paul,Bob)})
(Jane,{},{(Paul,Jane)})
```

In the above example, each output tuple contains the group key followed by two bags: the first bag holds the tuples from the first relation with that key, and the second bag holds the tuples from the second relation with that key. If a relation has no tuples for a key, its bag is empty (see Jane above). This is the key difference from a join: cogroup keeps every key from both inputs and returns the matching values as whole collections, whereas a full outer join flattens them into one output record per matching pair.
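Spark's `rdd1.cogroup(rdd2)` has the same shape: one `(key, (values_from_rdd1, values_from_rdd2))` record per key, keeping keys that appear in only one input. As an illustration of that semantics only (the `cogroup` helper below is a hypothetical pure-Python sketch, not the Spark API):

```python
def cogroup(left, right):
    # left/right: lists of (key, value) pairs, standing in for small keyed RDDs.
    # Emits (key, (left_values, right_values)) per distinct key; a side with
    # no tuples for the key gets an empty list, like the empty bag above.
    keys = {k for k, _ in left} | {k for k, _ in right}
    return [
        (k,
         ([v for key, v in left if key == k],
          [v for key, v in right if key == k]))
        for k in sorted(keys)
    ]

# The pets/friends example, with B keyed by friend2:
pets = [('Alice', 'turtle'), ('Alice', 'goldfish'), ('Alice', 'cat'),
        ('Bob', 'dog'), ('Bob', 'cat')]
friends = [('Alice', 'Cindy'), ('Alice', 'Mark'),
           ('Bob', 'Paul'), ('Jane', 'Paul')]
cogroup(pets, friends)
# Jane appears with an empty left-hand list, just like the empty bag in Pig.
```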

answered Jul 14, 2019 by Kiran
