In today’s data-driven world, organizations, machines, and gadgets of every size generate a huge amount of data. Your mobile phone, for example, generates some data each time you browse the web. Did you know that a commercial plane can generate up to 500GB of data per hour? I hope you can now imagine how large this data is, which is why it is known as Big Data. But all of this data is pretty much useless unless you perform ETL (extract, transform, load) operations on it, and believe me, that is certainly not an easy task. Moreover, the real-time, fast-paced nature of today’s business adds to the need for a tool that can integrate systems quickly and easily. Well, this is where Talend comes to the rescue. Through this blog on Talend Tutorial, I will explain how Talend helps you build, test, deploy, schedule and monitor the Jobs that process this data.
But before I proceed, let me list the topics I will be discussing today:
- What Is Talend?
- Introduction To Talend Open Studio
- TOS Installation
- TOS GUI
- Talend Job
- Talend Components and Connectors
- Context Variables
- First Job In Talend
You may also go through this recording of Talend Data Integration Tutorial where our Talend Training experts have explained the topics in a detailed manner with examples.
Talend Data Integration Tutorial | Talend Tutorial For Beginners | Edureka
What Is Talend? – Talend Tutorial
Talend is an open source software integration platform/vendor which offers data integration and data management solutions. This company provides various integration software and services for big data, cloud storage, data integration, data management, master data management, data quality, data preparation, and enterprise applications. Its headquarters are located in Redwood City, California.
Following are some of the major features of Talend:
Talend is considered to be a next-generation leader in cloud and big data integration software. It provides software that helps companies become data driven by making data more accessible, improving its quality and quickly moving it where it’s needed for real-time decision making. You can think of Talend as critical infrastructure for this data-driven world. Its open source approach breaks away from the traditional proprietary model while still providing powerful software solutions, and it offers the flexibility to meet the needs of all kinds of organizations. Being open source, it is backed by a huge community of developers. Talend publishes the code of its core modules under the GNU General Public License or the Apache License. From there, developers within the community can make changes and enhance the products, which in turn benefits other Talend users.
Various products offered by Talend are:
Introduction To Talend Open Studio (TOS) – Talend Tutorial
Talend Open Studio is an open source project that is based on Eclipse RCP. It supports ETL-oriented implementations and is generally provided for on-premises deployment. It is extensively used for integration between operational systems, ETL processes and data migration. Talend Open Studio for Data Integration is designed so that it can easily combine, convert and update data present at various locations across an organization. It acts as a code generator which produces data transformation scripts and the underlying programs in Java. It provides an interactive and user-friendly GUI which lets you access the metadata repository containing the definitions and configurations for each process performed in Talend. Below is the basic architecture of Talend Open Studio.
TOS Installation – Talend Tutorial
STEP 1: Go to: https://www.talend.com/download.
STEP 2: Click on ‘Download Free Tool’.
STEP 3: Again click on ‘Download Free Tool’ to get the zip file.
STEP 4: Now extract the zip file.
STEP 5: Now go into the extracted folder and double-click the TOS_DI-linux-gtk-x86_64 file.
STEP 6: Let the installation finish.
STEP 7: Click on ‘Create a new project’ and specify a meaningful name for your project.
STEP 8: Click on ‘Finish’ to go to the Open Studio GUI.
STEP 9: Right-click on the Welcome tab and select ‘Close’.
TOS GUI – Talend Tutorial
Now that you have downloaded and installed Talend Open Studio, let me give you a walkthrough of its GUI. Talend Open Studio consists of four major parts, as shown below.
The Repository collects all the technical items which can be used either to describe business models or design Jobs within Talend and displays them in a tree structure. From the Repository, you can access various Business Models, Job Designs, reusable routines, documentation as well as database connections. In other words, the Repository acts as a central store for all the elements which are necessary for any Job design or business modelling within a project.
The design window further consists of the following parts:
- Workspace: Here you can lay down the designs of your Jobs as well as the business models.
- Designer Tab: This tab opens by default when you create a Job and displays the Job in graphical mode.
- Code Tab: This tab lets you view the generated code and highlights possible language errors.
The Component Palette is docked at the top of the design workspace to help you draw the model corresponding to your workflow needs. Depending on your Job or the business model, you can drag and drop various technical components or shapes into your design workspace. There are more than 800 components available for you to choose from.
The configuration tabs are present in the lower half of the design window. There are various configuration tabs available in TOS. Each of these tabs opens a view which displays the properties of the current element in the workspace. The most frequently used configuration tabs are:
The Job tab provides various information about the current Job in the designer window including name, version, creation date and time etc.
The Context tab is used to set context variables and the different contexts in which they will be used.
The Component tab displays all the parameters that are required to configure a component. Basically, it collects all the information related to the graphical element selected in the design workspace.
The Run tab displays the progress of the execution of a Job. The logs shown here include any start, end and error messages.
Here you might ask ‘what is a Job’, as I have already used this term quite a few times till now. So, before diving any deeper let me first give you a brief about a Talend Job.
Talend Job – Talend Tutorial
A ‘Job’ in Talend is basically a customer requirement converted into a technical process. Technically, it is the basic executable unit of any process that is built using Talend. As you already know, TOS converts everything into Java code at the backend. Each Job is converted into a single Java class. Let me show you how you can create a Job in Talend.
- Right-click on the ‘Job Designs’ in the Repository and select ‘Create job’.
- Specify a meaningful name for your Job along with the purpose and description of it and click on ‘Finish’.
- Once you finish creating a Job, you will get access to the components present in the palette. Now you can drag any component you need from the palette and drop it in the workspace.
But in order to add a component to a Job, you first need to know what exactly components are, and how you can use multiple components together and connect them. So in the next part of this Talend tutorial, I will introduce you to the various components and connectors available in Talend.
Talend Components And Connectors – Talend Tutorial
Let’s start with Components.
A component is a functional piece that performs a single operation in Talend. Everything you see on the palette is a graphical representation of a component, and you can use any of them with a simple drag and drop. At the backend, a component is a snippet of Java code that is generated as part of a Job (which is basically a Java class). This Java code is automatically compiled when the Job is saved. A Talend Job may include one or more components depending on the requirement. One thing you need to know here is that Talend provides more than 800 components to choose from. For ease of access, all these components are grouped into a few families. In this Talend tutorial blog, I will introduce you to some of the most important and frequently used components of each family.
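To make the idea of "a Job is one Java class, and each component is a snippet of code inside it" a little more concrete, here is a minimal hand-written sketch. It is not the code that Talend generates; the class name `DemoJob` and the three methods are hypothetical stand-ins for an input, a processing and an output component.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical, hand-written illustration. This is NOT the code Talend generates.
class DemoJob {

    // Stand-in for an input component: produces rows (hard-coded for simplicity).
    static List<String> readRows() {
        return Arrays.asList("alice", "bob", "carol");
    }

    // Stand-in for a processing component: applies one operation to every row.
    static List<String> transform(List<String> rows) {
        List<String> out = new ArrayList<>();
        for (String row : rows) {
            out.add(row.toUpperCase(Locale.ROOT));
        }
        return out;
    }

    // Stand-in for an output component: prints each row to the console.
    static void logRows(List<String> rows) {
        for (String row : rows) {
            System.out.println(row);
        }
    }

    // Running the Job means running its components in the order they are linked.
    public static void main(String[] args) {
        logRows(transform(readRows()));
    }
}
```

Each method does exactly one thing, which mirrors the "single operation per component" idea described above.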
The Databases family provides Talend components which cover various needs like opening connections, reading and writing tables, committing transactions, performing rollback for error handling etc. Talend supports more than 40 RDBMS, some of which are MySQL, MS SQL Server, Hive, Amazon, Azure etc. Following are some of the most commonly used MySQL components:
- tMysqlConnection: This component opens a new connection to the database for the current transaction.
- tMysqlInput: This component reads a database and extracts fields based on the query.
- tMysqlOutput: This component writes, updates, changes or deletes entries in a database.
- tMysqlClose: This component closes the transaction committed in the connected database.
The File family groups together various components which read and write data in all types of files like Delimited, Positional, XML, Excel etc. Moreover, it also provides a number of components which help in performing various tasks like unarchiving, deleting, copying, comparing etc. This family is further divided into subfamilies like Input, Output, and Management. A few commonly used components of this family are:
- tFileInputDelimited: This component reads a given file row by row with fields separated using some specified character.
- tFileInputExcel: This component reads an Excel file (.xls or .xlsx) and extracts data line by line.
- tFileOutputXML: This component writes the data to an XML file.
- tFileList: This component retrieves a set of files or folders based on a filemask pattern and iterates over them.
- tFileArchive: This component zips one or more files according to the parameters defined and places the archive created in the selected directory.
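As a rough illustration of what "reading a file row by row with fields separated by a specified character" means, the following plain-Java sketch splits each line on a separator. It does not use any Talend API; the class name `DelimitedReaderSketch`, the method `parse` and the sample data are all assumptions made for this example. For a real file, the lines could come from `Files.readAllLines(path)` instead of a hard-coded list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of delimited-file parsing; not Talend's implementation.
class DelimitedReaderSketch {

    // Splits every non-empty line into fields using the given separator.
    static List<String[]> parse(List<String> lines, String separator) {
        String splitRegex = Pattern.quote(separator); // treat the separator literally
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            rows.add(line.split(splitRegex, -1)); // -1 keeps trailing empty fields
        }
        return rows;
    }

    public static void main(String[] args) {
        // Sample data invented for this example.
        List<String> lines = Arrays.asList("id;name;city", "1;Asha;Pune", "2;Ravi;");
        for (String[] row : parse(lines, ";")) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Quoting the separator matters because characters such as `|` or `.` have a special meaning in regular expressions, and `String.split` interprets its argument as a regex.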
The Internet family includes all of the components that help in accessing information from the Internet through various means like Web services, RSS flows, SCP, MOM, Emails, FTP etc. A few of the commonly used components of this family are:
- tFTPGet: This component helps in retrieving the specified files via an FTP connection.
- tFTPPut: This component copies the selected files via an FTP connection.
- tHttpRequest: This component sends an HTTP request to the server end and receives the corresponding response from the server end.
- tSendMail: This component is used to send emails and attachments to the defined recipients.
Logs & Errors
This family groups together all the components which are dedicated to catching log information and handling Job errors. Following are the most commonly used components of this family:
- tLogRow: This component allows you to write row data into the Job log file, or to the console window.
- tLogCatcher: This component collects the log data and encapsulates it to pass it on to the defined output.
- tWarn: This component triggers a warning, often caught by the tLogCatcher component for an exhaustive log.
- tDie: This component sends a message to a tLogCatcher and terminates the Job with a specified exit code.
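The interplay between a warning, a fatal error and a catcher such as tLogCatcher can be pictured with a small plain-Java sketch. Everything here (the class `LogCatcherSketch`, the exception `JobDied`, the message texts and the exit code 4) is hypothetical and only mimics the behaviour described in the list above; it is not how Talend implements these components.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the warn / die / catch pattern; not Talend code.
class LogCatcherSketch {

    // Unchecked exception that carries the exit code of a fatal error.
    static class JobDied extends RuntimeException {
        final int exitCode;

        JobDied(String message, int exitCode) {
            super(message);
            this.exitCode = exitCode;
        }
    }

    // Plays the role of the catcher: it collects every message it is sent.
    private final List<String> entries = new ArrayList<>();

    // Like a warning: record the message and let the job continue.
    void warn(String message) {
        entries.add("WARN: " + message);
    }

    // Like a fatal error: record the message, then stop the job.
    void die(String message, int exitCode) {
        entries.add("DIE(" + exitCode + "): " + message);
        throw new JobDied(message, exitCode);
    }

    List<String> entries() {
        return entries;
    }

    public static void main(String[] args) {
        LogCatcherSketch log = new LogCatcherSketch();
        try {
            log.warn("input file is empty");
            log.die("database is unreachable", 4);
            log.warn("this line is never reached");
        } catch (JobDied e) {
            log.entries().forEach(System.out::println);
            System.out.println("job ended with exit code " + e.exitCode);
        }
    }
}
```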
The Misc family gathers miscellaneous components covering various needs like the creation of sets of dummy data rows, buffering data, loading context variables etc. A few important components of this family are:
- tMsgBox: This component opens a dialogue box with a clickable OK button.
- tRowGenerator: This component is used to generate as many rows and fields as are required using random values which are taken from a list.
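The idea behind generating dummy rows from a list of values can be sketched in a few lines of plain Java. The class `RowGeneratorSketch`, its `generate` method and the sample city names are assumptions for illustration only, and the fixed seed is a choice made here so that repeated runs produce the same rows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of dummy-row generation; not Talend's implementation.
class RowGeneratorSketch {

    // Builds rowCount rows of fieldCount fields, each picked at random from values.
    static List<List<String>> generate(int rowCount, int fieldCount, List<String> values, long seed) {
        Random random = new Random(seed); // a fixed seed makes runs repeatable
        List<List<String>> rows = new ArrayList<>();
        for (int r = 0; r < rowCount; r++) {
            List<String> row = new ArrayList<>();
            for (int f = 0; f < fieldCount; f++) {
                row.add(values.get(random.nextInt(values.size())));
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Sample values invented for this example.
        List<String> cities = Arrays.asList("Pune", "Delhi", "Mumbai");
        for (List<String> row : generate(5, 2, cities, 42L)) {
            System.out.println(row);
        }
    }
}
```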
The Orchestration family includes various components which help to sequence or orchestrate tasks and processing Jobs or SubJobs etc. Commonly used components from this family are:
- tLoop: This component helps in executing a task or a Job automatically, based on a loop with the specified number of iterations.
- tPrejob: This component helps in triggering a task required before the execution of a Job.
- tPostjob: This component helps in triggering a task required after the execution of a Job.
- tSleep: This component helps in implementing a pause within a Job execution.
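The way these four components fit together (set-up first, a repeated task with a pause, clean-up last) can be sketched in plain Java as follows. The class `OrchestrationSketch`, the step names and the use of `finally` for the clean-up step are assumptions of this sketch, not a description of Talend's generated code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of pre-job, loop, sleep and post-job ordering; not Talend code.
class OrchestrationSketch {

    // Runs set-up, then the task `iterations` times with a pause after each run,
    // then clean-up. Returns the order in which the steps ran.
    static List<String> run(int iterations, long pauseMillis) {
        List<String> trace = new ArrayList<>();
        trace.add("prejob"); // set-up that must happen first
        try {
            for (int i = 1; i <= iterations; i++) {
                trace.add("iteration " + i); // the repeated task
                Thread.sleep(pauseMillis);   // pause after each iteration
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt flag and stop looping
        } finally {
            trace.add("postjob"); // clean-up that must happen last
        }
        return trace;
    }

    public static void main(String[] args) {
        for (String step : run(3, 100)) {
            System.out.println(step);
        }
    }
}
```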
Now that you know the components, let’s quickly take a look at the connectors or the links which help in connecting these components together in a Job.
Talend provides various types of connections to enable the communication between the components:
The Row connection deals with the actual data flow. Following are the types of Row connections supported by Talend:
- Multiple Input/Output
The Iterate connection is used to perform a loop on files contained in a directory, on rows contained in a file or on the database entries. Unlike other types of connections, the name of this Iterate link is read-only.
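Looping over the files contained in a directory, which is the first use case mentioned above, looks roughly like this in plain Java. The class `IterateSketch`, the file names and the `*.csv` pattern are made up for the example; the sketch only shows the "one pass per matching file" idea and is not Talend's own implementation.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of iterating over files in a directory; not Talend code.
class IterateSketch {

    // Returns the sorted names of the entries in dir that match the glob pattern.
    static List<String> matchingFiles(Path dir, String glob) {
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
            for (Path file : stream) { // one iteration per matching entry
                names.add(file.getFileName().toString());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(names); // directory order is not guaranteed
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway directory with invented file names.
        Path dir = Files.createTempDirectory("iterate-demo");
        Files.createFile(dir.resolve("sales_1.csv"));
        Files.createFile(dir.resolve("sales_2.csv"));
        Files.createFile(dir.resolve("notes.txt"));
        for (String name : matchingFiles(dir, "*.csv")) {
            System.out.println("processing " + name);
        }
    }
}
```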
The Trigger connection is used to create a dependency between Jobs or SubJobs which are triggered one after the other according to the trigger’s nature. Trigger connections fall into two categories:
- SubJob triggers: On Subjob Ok, On Subjob Error and Run if
- Component triggers: On Component Ok, On Component Error and Run if
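A trigger can be thought of as a rule for what runs next, depending on how the previous step ended or on a condition such as ‘Run if’. The plain-Java sketch below illustrates that idea under my own assumptions: the class `TriggerSketch`, the success and failure branches and the trace messages are hypothetical and are not taken from Talend.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of trigger-style chaining; not Talend code.
class TriggerSketch {

    // Runs the first subjob, then records which follow-up steps would be triggered.
    static List<String> run(Runnable firstSubjob, boolean condition) {
        List<String> trace = new ArrayList<>();
        try {
            firstSubjob.run();
            trace.add("first subjob ok -> run next subjob");
        } catch (RuntimeException e) {
            trace.add("first subjob failed -> run error handler");
        }
        if (condition) { // a "run if" style trigger fires only when its condition holds
            trace.add("condition true -> run conditional subjob");
        }
        return trace;
    }

    public static void main(String[] args) {
        System.out.println(run(() -> { }, false));
        System.out.println(run(() -> { throw new IllegalStateException("boom"); }, true));
    }
}
```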
The Link connection can be used only with the ELT components. It is used to transfer the table schema information to the ELT mapper component in order to be used in specific DB query statements.
Metadata – Talend Tutorial
Metadata in Talend is definitional data which provides information about the other data managed within Talend Studio. You can find the Metadata in the Repository area of TOS. In the Repository Metadata, you can store metadata about the various data sources that you may use. This comes in handy while developing any project, as you can use these data sources later in your Jobs just by dragging an object from the repository and dropping it in the workspace.
In the Repository, you can store metadata for various data sources like delimited files, positional files, XML files, databases, FTP, Azure, Salesforce etc.
Context Variables – Talend Tutorial
Context variables are user-defined parameters that Talend passes into a Job at runtime. Their values may change as the Job is promoted from the Development to the Test and Production environments. So, once these variables are set correctly for each environment, you can execute a Job easily in any of these environments. Another use of context variables is to define values which are commonly used within a project. You can create context variables in three ways:
Embedded Context Variables
These context variables are embedded in the Job and are configured much like any other component parameters in the Context Tab below the Job Designer.
Repository Context Variables
These are created when context variables are used or needed in more than one Job. They are centrally maintained in the repository, which makes them accessible across Jobs.
External Context Variables
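As a rough illustration of values that live outside the Job and change per environment, the following plain-Java sketch loads key=value settings and builds a database URL from them. The class `ContextSketch`, the keys `db_host`, `db_port` and `db_name`, and the sample values are all hypothetical; the sketch uses `java.util.Properties` and is not how Talend loads contexts internally.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical sketch of externally supplied context values; not Talend code.
class ContextSketch {

    // Parses key=value text (the content of a context file) into Properties.
    static Properties load(String text) {
        Properties context = new Properties();
        try {
            context.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return context;
    }

    // Builds a JDBC-style URL from the hypothetical keys db_host, db_port and db_name.
    static String jdbcUrl(Properties context) {
        return "jdbc:mysql://" + context.getProperty("db_host")
                + ":" + context.getProperty("db_port")
                + "/" + context.getProperty("db_name");
    }

    public static void main(String[] args) {
        // Invented values; in practice each environment would have its own file.
        String dev = "db_host=localhost\ndb_port=3306\ndb_name=sales_dev\n";
        String prod = "db_host=db.example.com\ndb_port=3306\ndb_name=sales\n";

        String environment = args.length > 0 ? args[0] : "dev";
        Properties context = load(environment.equals("prod") ? prod : dev);
        System.out.println(jdbcUrl(context));
    }
}
```

The Job logic stays the same in every environment; only the supplied values differ.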
Now, I think you are ready to design your first Job in Talend.
In the next section of this Talend tutorial blog, I will show you a step by step demonstration of a simple Talend Job which you can easily execute.
First Job In Talend – Talend Tutorial
Following is a demo in which you will first establish a connection with the database, read data from two different external Excel files, merge them and insert the result into a database table. You will then write the contents of the new table to a new Excel file and, finally, close the connection once the transfer is complete.
Let’s see how to execute it, step by step:
STEP 1: In this demo, I am using an external context file for the database details. To do so, you first need to create a context file with all the necessary database details.
STEP 2: Create a new Job. Go to its ‘Contexts’ tab and add the following details:
STEP 3: Now, add a ‘tPrejob’ and a ‘tMysqlConnection’ component to the workspace and link them together as shown below. This will establish the connection with the database before the actual Job is executed. Then go to the ‘Component’ tab of the ‘tMysqlConnection’ component and add the necessary details:
STEP 4: Add two ‘tFileInputExcel’ components and a ‘tMap’ component to the workspace and link them as shown.