Talend Open Studio Tutorials

 

Tutorial 12: Writing and Reading Data in HDFS

 

 

In this tutorial, you will generate random data and write it to HDFS. Then, you will read the data back from HDFS, sort it, and display the result in the console.
 

This tutorial uses Talend Data Fabric Studio version 6 and a Hadoop cluster: Cloudera CDH version 5.4.

 

1.  Create a new standard Job

  1. Ensure that the Integration perspective is selected.
  2. To ensure that the Hadoop cluster connection and the HDFS connection metadata have been created in the Project Repository, expand Hadoop Cluster.
  3. In the Repository, expand Job Designs, right-click Standard, and click Create Standard Job. In the Name field of the New Job wizard, type ReadWriteHDFS. In the Purpose field, type Read/Write data in HDFS, and in the Description field, type Standard job to write and read customers data to and from HDFS. Click Finish.

The Job opens in the Job Designer.

 

2.  Add and configure a tRowGenerator component to generate random customer data

  1. To generate random customer data, in the Job Designer, add a tRowGenerator component.
  2. To set the schema and function parameters for the tRowGenerator component, double-click the tRowGenerator_1 component.
  3. To add columns to the schema, click the [+] icon three times and type the column names as CustomerID, FirstName, and LastName. Next, you will configure the attributes for these fields.
  4. To change the Type for the CustomerID column, click the Type field and click Integer. Then, set the Functions field of the three columns to random(int,int), TalendDataGenerator.getFirstName(), and TalendDataGenerator.getLastName(), respectively.
  5. In the table, select the CustomerID column, then, in the Functions parameters tab, set the max value to 1000.
  6. In the Number of Rows for RowGenerator field, type 1000, and click OK to save the configuration. (A sketch of the kind of rows this configuration produces follows this list.)
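
For reference, the following stand-alone Java sketch illustrates the kind of rows this configuration produces. It only stands in for the Talend routines: java.util.Random replaces random(int,int), and small hard-coded name lists replace TalendDataGenerator.getFirstName() and getLastName(), so the class and the sample names are illustrative, not code generated by the Studio.

```java
import java.util.Random;

public class CustomerRowSketch {
    // Illustrative name lists standing in for the TalendDataGenerator routines.
    private static final String[] FIRST_NAMES = {"John", "Mary", "Peter", "Laura"};
    private static final String[] LAST_NAMES  = {"Smith", "Jones", "Brown", "Garcia"};

    public static void main(String[] args) {
        Random rng = new Random();
        // 1000 rows, matching the "Number of Rows for RowGenerator" setting.
        for (int i = 0; i < 1000; i++) {
            int customerId = rng.nextInt(1000);  // random(int,int) with a max value of 1000
            String firstName = FIRST_NAMES[rng.nextInt(FIRST_NAMES.length)];
            String lastName  = LAST_NAMES[rng.nextInt(LAST_NAMES.length)];
            System.out.println(customerId + ";" + firstName + ";" + lastName);
        }
    }
}
```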

 

3.  Write data to HDFS

For this, you will create a new tHDFSOutput component that reuses the existing HDFS metadata available in the Project Repository.

  1. From the Repository, under Metadata > HadoopCluster > MyHadoopCluster > HDFS, click MyHadoopCluster_HDFS and drag it to the Job Designer.
  2. In the Components list, select tHDFSOutput and click OK.
  3. Create a flow of data from the tRowGenerator_1 component to the MyHadoopCluster_HDFS component by linking the two components with the Main row, and then double-click the MyHadoopCluster_HDFS component to open the Component view.

Note that the component is already configured with the pre-defined HDFS metadata connection information.

  4. In the File Name box, type "/user/student/CustomersData" and in the Action list, select Overwrite.

The first subjob, which writes data to HDFS, is now complete. It takes the data generated in the tRowGenerator component you created earlier and writes it to HDFS using the connection defined in the Repository metadata.
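
For context, tHDFSOutput writes the incoming rows through the Hadoop FileSystem API using the connection details stored in that metadata. A minimal stand-alone sketch of an equivalent write is shown below; the namenode URI, the user name, the two sample rows, and the ";" separator are illustrative placeholders, not values taken from this tutorial.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteCustomersToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder namenode URI and user; in the Job they come from the HDFS metadata.
        FileSystem fs = FileSystem.get(new URI("hdfs://namenode:8020"), conf, "student");
        // The boolean "true" overwrites an existing file, like the Overwrite action.
        try (FSDataOutputStream out = fs.create(new Path("/user/student/CustomersData"), true)) {
            out.writeBytes("1;John;Smith\n");  // placeholder delimited rows
            out.writeBytes("2;Jane;Doe\n");
        }
        fs.close();
    }
}
```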

 

4.  Read data from HDFS

Next, you will build a subjob that reads the customer data from HDFS, sorts it, and displays it in the console. To read the customer data from HDFS, you will create a new tHDFSInput component that reuses the existing HDFS metadata available in the Project Repository.

  1. From the Repository, under Metadata > HadoopCluster > MyHadoopCluster > HDFS, click MyHadoopCluster_HDFS and drag it to the Job Designer.
  2. In the Components list, select tHDFSInput and click OK.
  3. To open the component view of the MyHadoopCluster_HDFS input component, double-click the MyHadoopCluster_HDFS input component.

Note that the component is already configured with the pre-defined HDFS metadata connection information.

  4. In the File Name box, type "/user/student/CustomersData". (An equivalent read with the Hadoop FileSystem API is sketched after this step.)
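
For reference, reading the same file back with the Hadoop FileSystem API looks roughly like the sketch below; as in the earlier write sketch, the namenode URI and user name are placeholders for the values held in your HDFS metadata.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadCustomersFromHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder namenode URI and user; tHDFSInput takes these from the Repository metadata.
        FileSystem fs = FileSystem.get(new URI("hdfs://namenode:8020"), conf, "student");
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/user/student/CustomersData"))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);  // one delimited customer row per line
            }
        }
        fs.close();
    }
}
```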

 

5.  Specify the schema in the MyHadoopCluster_HDFS input component to read the data from HDFS

  1. To open the schema editor, in the Component view of the MyHadoopCluster_HDFS input component, click Edit schema.
  2. To add columns to the schema, click the [+] icon three times and type the column names as CustomerID, FirstName, and LastName.
  3. To change the Type for the CustomerID column, click the Type field and click Integer.

Note: This schema is the same as in tRowGenerator and tHDFSOutput. You can copy it from either of those components and paste it in this schema.

  4. Connect the tRowGenerator component to the MyHadoopCluster_HDFS input component using the OnSubjobOk trigger. (A sketch showing how the schema maps onto a row read from HDFS follows this list.)
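
To make the schema concrete, the following illustrative sketch shows how one delimited line read from HDFS maps onto the three columns, assuming the default ";" field separator. The CustomerRow class is a hypothetical helper for this tutorial, not code generated by the Studio.

```java
public class CustomerRow {
    public int customerId;    // CustomerID: Integer
    public String firstName;  // FirstName: String
    public String lastName;   // LastName: String

    // Parse one delimited line such as "42;John;Smith" into the schema's three columns.
    public static CustomerRow parse(String line) {
        String[] fields = line.split(";");
        CustomerRow row = new CustomerRow();
        row.customerId = Integer.parseInt(fields[0]);
        row.firstName = fields[1];
        row.lastName = fields[2];
        return row;
    }

    public static void main(String[] args) {
        CustomerRow row = parse("42;John;Smith");
        System.out.println(row.customerId + " / " + row.firstName + " / " + row.lastName);
    }
}
```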

 

6.  Sort data in the ascending order of customer ID, using the tSortRow component

  1. Add a tSortRow component and connect it to the MyHadoopCluster_HDFS input component with the Main row.
  2. To open the Component view of the tSortRow component, double-click the component.
  3. To configure the schema, click Sync columns.
  4. To add new criteria to the Criteria table, click the [+] icon and in the Schema column, type CustomerID. In the sort num or alpha? column, select num and in the Order asc or desc? column, select asc.
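
The single criterion configured here (CustomerID, num, asc) amounts to a numeric ascending sort on the first column. A minimal plain-Java illustration, using a few made-up delimited rows:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SortCustomers {
    public static void main(String[] args) {
        // Delimited rows as they might come back from HDFS (illustrative values).
        List<String> rows = new ArrayList<>(List.of(
                "731;Mary;Brown",
                "12;John;Smith",
                "408;Laura;Garcia"));

        // "sort num" + "order asc" on the first column (CustomerID), like the tSortRow criterion.
        rows.sort(Comparator.comparingInt(line -> Integer.parseInt(line.split(";")[0])));

        rows.forEach(System.out::println);
    }
}
```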

 

7.  Display the sorted data in the console using a tLogRow component

  1. Add a tLogRow component and connect it to the tSortRow component with the Main row.
  2. To open the Component view of the tLogRow component, double-click the component.
  3. In the Mode panel, select Table.

Your Job is now ready to run. First, it generates data and writes it to HDFS. Then, it reads the data from HDFS, sorts it, and displays it in the console.  

 

8.  Run the Job and observe the result in the console

  1. To run the Job, in the Run view, click Run.

The sorted data is displayed in the console.

 
