
Talend Big Data Tutorial – A Revolution In Big Data


The tech market has seen a lot of change in recent years. It is no longer news that the popularity of open source software has shot up with the growing interest in big data and analytics. Among all the open source ETL software present in the market, Talend is one of the most widely used. As already discussed in my previous blog, Talend provides various open source integration software and services for big data, cloud storage, data integration, etc. In this Talend Big Data Tutorial blog, I will talk about how you can use Talend with various Big Data technologies like HDFS, Hive and Pig.


To get a better idea of how Talend works with big data, let’s start with the basics of big data itself. You may also go through this recording of Talend Big Data Tutorial where our Talend Training experts have explained the topics in a detailed manner with examples.


Big Data – Talend Big Data Tutorial

Big Data refers to data sets that are so large and complex that they can’t be processed using conventional data management tools. These huge data sets can be present in structured, semi-structured or unstructured format. They are generally streams of data composed of auto-generated reports, logs, results of customer behavior analysis, or a combination of various data sources. The main features of Big Data are popularly known as the 5 V’s of Big Data: Volume, Velocity, Variety, Veracity and Value.

To analyze such humongous data sets, you need the distributed computing power of hundreds or even thousands of computers which can analyze the data in parallel and store the results centrally. Hadoop, an open source software framework, fulfills this requirement perfectly. At its core is a distributed file system which splits the gathered information into a number of data blocks, which are in turn distributed across multiple systems on the network. Hadoop provides enormous storage for almost all data types and immense processing ability, along with the power to handle a virtually limitless number of tasks or jobs executing simultaneously.

In the next section of this Talend Big Data Tutorial blog, I will be talking about how you can use big data and Talend together.


Talend For Big Data – Talend Big Data Tutorial

Talend Open Studio (TOS) for Big Data is built on top of Talend’s data integration solutions. It is open source software and provides an easy-to-use graphical development environment. It is a powerful tool which leverages the Apache Hadoop platform and helps users access, transform, move and synchronize big data. It makes the user’s interaction with big data sources and targets really simple, as they don’t have to learn or write any complicated code to work with them.

All you need to do is configure the big data connection and then perform simple drag and drop operations. At the back end, Talend Open Studio (TOS) for Big Data will automatically generate the underlying code. You can then easily deploy your job as a service or as a standalone Job which runs natively against big data technologies like HDFS, Hive and Pig.

Following is a pictorial representation of the functional architecture of Talend big data.

TOS_BD - Talend Big Data Tutorial - Edureka

But, before I introduce Talend Open Studio, let me first explain a little about HDFS and MapReduce and how they work without Talend.

Introduction To Big Data Components – Talend Big Data Tutorial

Hadoop, as mentioned, is a powerful tool for handling Big Data. But have you ever wondered how it manages to handle these huge datasets? Well, Hadoop is powered by two core modules which handle big data quite efficiently. They are:

  1. HDFS (Hadoop Distributed File System)
  2. MapReduce

Let’s talk about them one by one:

HDFS (Hadoop Distributed File System)

HDFS is the file management system of the Hadoop platform, used to store data across multiple servers in a cluster. The datasets are broken down into a number of blocks which are distributed across various nodes throughout the cluster. On top of that, to ensure data durability, HDFS keeps replicas of these data blocks on different nodes. Thus, if one node fails, the other live nodes will still hold a copy of each of its data blocks.
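To build some intuition for block placement and replication, here is a tiny, purely illustrative Java sketch. This is not Hadoop code; the class HdfsReplicationSim and its methods are hypothetical. It splits data into blocks, replicates each block across nodes, and shows that the data survives a node failure:

```java
import java.util.*;

public class HdfsReplicationSim {
    // Split data into fixed-size blocks, as HDFS does (the real default block
    // size is 128 MB; we use 4 characters here just for illustration).
    static List<String> splitIntoBlocks(String data, int blockSize) {
        List<String> blocks = new ArrayList<>();
        for (int i = 0; i < data.length(); i += blockSize) {
            blocks.add(data.substring(i, Math.min(i + blockSize, data.length())));
        }
        return blocks;
    }

    // Place each block on `replication` different nodes (simple round-robin).
    static Map<Integer, List<String>> placeBlocks(List<String> blocks,
                                                  int numNodes, int replication) {
        Map<Integer, List<String>> nodes = new HashMap<>();
        for (int n = 0; n < numNodes; n++) nodes.put(n, new ArrayList<>());
        for (int b = 0; b < blocks.size(); b++) {
            for (int r = 0; r < replication; r++) {
                nodes.get((b + r) % numNodes).add(blocks.get(b));
            }
        }
        return nodes;
    }

    public static void main(String[] args) {
        List<String> blocks = splitIntoBlocks("ABCDEFGHIJKL", 4); // ABCD, EFGH, IJKL
        Map<Integer, List<String>> cluster = placeBlocks(blocks, 4, 3);

        cluster.remove(0); // simulate node 0 failing

        // Every block is still present on at least one live node
        for (String block : blocks) {
            boolean found = cluster.values().stream().anyMatch(n -> n.contains(block));
            System.out.println(block + " recoverable: " + found);
        }
    }
}
```

With a replication factor of 3 (the HDFS default), every block remains available even after a node is lost.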

MapReduce

MapReduce is the data processing framework of Hadoop. It is used to create applications which can process the files stored in a distributed environment such as HDFS. A MapReduce application mainly has two functions which run as tasks on various nodes in the cluster. Those two functions are:

  1. Mappers: These functions read and process blocks of data to generate key-value pairs as intermediate output. This output is then fed as input to the Reducers.
  2. Reducers: These functions receive the key-value pairs produced by multiple Mapper functions and aggregate them into a smaller set of key-value pairs, which constitute the final output.
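The Mapper/Reducer flow above can be sketched as a simplified, single-machine Java simulation. The class MapReduceSim is hypothetical and uses no Hadoop APIs; in real MapReduce these phases run as distributed tasks across the cluster:

```java
import java.util.*;

public class MapReduceSim {
    // "Map" phase: emit a (word, 1) pair for every word in the input line
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    // "Shuffle + reduce" phase: group the pairs by key and sum the values
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> result =
                reduce(map("deer bear river car car river deer car"));
        System.out.println(result); // {bear=1, car=3, deer=2, river=2}
    }
}
```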

Let’s see a simple example of how we can count the unique words in a file using HDFS and MapReduce:

Here we have a text file in which some words are repeated.

Using MapReduce, we will count the number of times each word appears in the file and store the output in a new file. For this, you need a good knowledge of the Java programming language.

package co.edureka.mapreduce;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            // Emit a (word, 1) pair for every token in the line
            while (tokenizer.hasMoreTokens()) {
                context.write(new Text(tokenizer.nextToken()), new IntWritable(1));
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum up all the 1s emitted for this word
            int sum = 0;
            for (IntWritable x : values) {
                sum += x.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Counting Unique Words In A File");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // Configuring the input/output paths from the filesystem into the job
        FileInputFormat.addInputPath(job, new Path(args[0]));
        Path outputPath = new Path(args[1]);
        FileOutputFormat.setOutputPath(job, outputPath);

        // Deleting the output path automatically from HDFS so that we
        // don't have to delete it explicitly before every run
        outputPath.getFileSystem(conf).delete(outputPath, true);

        // Exit with status 0 on success, 1 on failure
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Woohh!! Now that’s a lot of code you have to write. Especially if you are not familiar with programming, it can become a big headache. It will also take a lot of effort and time to code and debug this program. But no need to worry! Talend can save you from writing all this code and make your work much easier, as you only need to drag and drop components in Talend’s workspace. At the back end, Talend will automatically generate this code for you. But for this, you need to have Talend for Big Data installed.

In the next section of this blog on Talend Big Data Tutorial, I will show a step by step installation of TOS for BD.

TOS Installation – Talend Big Data Tutorial

STEP 1: Go to https://www.talend.com/products/talend-open-studio.

STEP 2: Click on ‘Download Free Tool’.

STEP 3: If the download doesn’t start, click on ‘Restart download’.

STEP 4: Now extract the zip file.

STEP 5: Now go into the extracted folder and double-click on the TOS_BD-linux-gtk-x86_64 file.

STEP 6: Let the installation finish.

STEP 7: Click on ‘Create a new project’ and specify a meaningful name for your project.

STEP 8: Click on ‘Finish’ to go to the Open Studio GUI.

STEP 9: Right-click on the Welcome tab and select ‘Close’.

Welcome Tab - Talend Big Data Tutorial - Edureka

STEP 10: Now you should be able to see the TOS main page.

Main Page - Talend Big Data Talend Big Data Tutorial - Edureka

Big Data Components In Talend - Talend Big Data Tutorial

Talend provides a wide range of components which you can use to interact with HDFS and MapReduce.

In this Talend big data tutorial blog, I will explain the most important components belonging to the Big Data family:

HDFS

  • tHDFSConnection: This component helps in connecting to a given HDFS so that other Hadoop components can reuse the connection to communicate with it.
  • tHDFSPut: This component helps in copying files from a user-defined directory into HDFS, and is also capable of renaming them.
  • tHDFSGet: This component helps in copying files from HDFS into a user-defined directory, and is also capable of renaming them.
  • tHDFSInput: This component helps in extracting the data from an HDFS file so that other components can process it.
  • tHDFSOutput: This component helps in writing data flows into a given HDFS file system.

Hive

  • tHiveConnection: This component helps in establishing a Hive connection which can be reused by other Hive components.
  • tHiveInput: This component helps in executing select queries which extract the corresponding data and send it to the component that follows.
  • tHiveLoad: This component helps in writing data of different formats into a given Hive table, or in exporting data from a Hive table to a particular directory.
  • tHiveCreateTable: This component helps in connecting to the Hive database being used and creating a Hive table dedicated to data of the specified format.
  • tHiveClose: This component helps in closing the connection to a Hive database.

Pig

  • tPigLoad: This component helps in loading the original input data into an output stream in a single transaction, once the data has been validated.
  • tPigMap: This component helps in transforming data from one or more sources and then routing it to one or more destinations.
  • tPigAggregate: This component helps in adding one or more additional columns to the output of the grouped data, in order to generate data that can be used by Pig.
  • tPigJoin: This component helps in performing inner and outer joins of two files based on join keys, in order to create data to be used by Pig.

How Talend Makes Working With Big Data Easier - Talend Big Data Tutorial

Now let us try to execute the same program using Talend and see how Talend helps in executing this program easily.

STEP 1: Open Talend Studio For Big Data and create a new job.

STEP 2: Add tHDFSConnection Component and provide the necessary details in its component tab to set up the connection.

STEP 3: Now add a tHDFSPut component in order to upload your file to HDFS. Go to its component tab and specify the necessary details as shown:

STEP 4: Now add the rest of the components and link them together as shown:

STEP 5: Go to the component tab of the tHDFSInput component and enter the required details.

STEP 6: In the component tab of the tNormalize component, specify the details as shown:

STEP 7: Go to the component tab of the tAggregateRow component and provide the details as shown:

AggregateRow - Talend Big Data Tutorial - Edureka

STEP 8: Double-click on the tMap component and in the pop-up window, map the input table with the required output table as shown:

tMap - Talend Big Data Tutorial - Edureka

STEP 9: In the component tab of the tHDFSOutput component, specify the required details.

tHDFSOutput - Talend Big Data Tutorial - Edureka

STEP 10: From the Run tab, click on Run to execute the job. A successfully executed job will look like the one below:

STEP 11: It will give you the output on HDFS:

Result - Talend Big Data Tutorial - Edureka

So, this brings us to the end of this blog on Talend Big Data Tutorial. I tried my best to keep the concepts short and clear. Hope it helped you in understanding Talend and how it works with big data. 


Now that you are done with Talend Big Data Tutorial, check out the Talend For DI and Big Data by Edureka, a trusted online learning company with a network of more than 250,000 satisfied learners spread across the globe.

Got a question for us? Please mention it in the comments section and we will get back to you.
