Adding a Dataset from HDFS with Talend Data Preparation 6.3.1

Talend Version        6.3.1

Summary

This article explains how to enable the From HDFS option in the Talend Data Preparation Web UI by configuring and starting the Components Catalog server.

Additional Versions  
Product Data Preparation
Component Big Data
Problem Description

You can use Talend Data Preparation version 2.0.0 with Talend Data Fabric 6.3.1.

 

According to the Talend Data Preparation start and stop sequences documentation, Data Preparation and its dependencies should be started in the following order:

  1. Apache Zookeeper
  2. Apache Kafka
  3. MongoDB
  4. Talend Administration Center
  5. Talend Dictionary Service
  6. Talend Data Preparation

When you start the components in this order, the Data Preparation Web UI offers no entry for adding a Dataset from HDFS. If you click Datasets > Add Dataset, you see only the options shown here:

 

[Screenshot sanshdfs.png: the Add Dataset menu, with no From HDFS option]

 

Problem root cause

In Talend 6.3.1, the Components Catalog server is not started by default, nor is it configured to allow you to add a Dataset from HDFS. This is why the From HDFS entry is not available in the Data Preparation Web UI.
Solution or Workaround

After ensuring that the Components Catalog server has been installed, use the following instructions to enable it for use with Talend Data Preparation in a Big Data context; this adds a From HDFS entry to the Datasets > Add Dataset menu. For details, see Configuring the Components Catalog server.

  1. Navigate to the Components_Catalog_Path/config/application.properties file and open it for editing.
  2. Add the following line to the file:

    hadoop.conf.dir=path_to_Hadoop_configuration_directory

    where path_to_Hadoop_configuration_directory is a directory that contains Hadoop configuration files (such as core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml) related to the Hadoop Cluster.
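
    For example, on a cluster whose client configuration is deployed under /etc/hadoop/conf (a common location on Cloudera clusters, but an assumption here; use your own configuration directory), the line would read:

    hadoop.conf.dir=/etc/hadoop/conf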

  3. Start the Components Catalog server using the following start sequence:

    1. Apache Zookeeper
    2. Apache Kafka
    3. MongoDB
    4. Talend Administration Center
    5. Talend Dictionary Service
    6. Components Catalog Server (the newly added step)
    7. Talend Data Preparation
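
    On a Linux installation, that sequence might look like the following shell sketch. The script names and locations below are illustrative assumptions, not Talend-documented commands; use the start scripts shipped with your own installation:

    # Illustrative start order -- adjust every path to your installation
    $ZOOKEEPER_HOME/bin/zkServer.sh start
    $KAFKA_HOME/bin/kafka-server-start.sh -daemon $KAFKA_HOME/config/server.properties
    mongod --config /etc/mongod.conf --fork
    TAC_Tomcat_Path/bin/startup.sh           # Talend Administration Center (runs in Tomcat)
    Dictionary_Service_Path/start.sh         # hypothetical script name
    Components_Catalog_Path/start.sh         # hypothetical script name
    Data_Preparation_Path/start.sh           # hypothetical script name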

    Now you will be able to add a Dataset from HDFS using the Data Preparation Web UI:

    [Screenshot avec hdfs2.png: the Add Dataset menu now showing a From HDFS entry]

     

  4. Click From HDFS, then provide details to add the Dataset from HDFS:

    [Screenshot datasetconfig2.png: the From HDFS Dataset configuration form]

     

    The path parameter is set to hdfs://arthur.cdh573:8020/user/cloudera/hb_test.csv, where:
    • arthur.cdh573 is the host where the NameNode server is running
    • 8020 is the listening NameNode port for the HDFS protocol
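
    If you are unsure of the NameNode host and port, they can usually be read from the fs.defaultFS property in core-site.xml (the configuration path here is an assumption):

    grep -A1 'fs.defaultFS' /etc/hadoop/conf/core-site.xml
    # for this example, the output would resemble:
    #   <name>fs.defaultFS</name>
    #   <value>hdfs://arthur.cdh573:8020</value>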

     

    In this example, the hb_test.csv file is stored in the /user/cloudera HDFS directory, and the file content is:

    1,one
    2,two
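
    To reproduce this example, you could create the file locally and copy it to HDFS with the standard HDFS command line (assuming the cloudera user has write access to /user/cloudera):

    printf '1,one\n2,two\n' > hb_test.csv
    hdfs dfs -put hb_test.csv /user/cloudera/hb_test.csv
    hdfs dfs -cat /user/cloudera/hb_test.csv   # prints the two lines above
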
  5. Click ADD DATASET to create a Dataset in Data Preparation from the hb_test.csv file (stored in HDFS):

    [Screenshot resultat.png: the new Dataset created from hb_test.csv]

     
