Data Preparation - Adding a database type to connect to Hive in Azure HDInsight

Introduction

Talend Data Preparation lets you make a direct connection to various types of databases. You can use this direct connection as a source to create new datasets. This article explains the procedure to add a new database type to Talend Data Preparation to connect to Hive in Azure HDInsight.

 

HDInsight is a Hadoop service offering hosted in Azure that enables clusters of managed Hadoop instances. HDInsight deploys and provisions Apache Hadoop clusters in the cloud, providing a software framework designed to manage, analyze, and report on Big Data with high reliability and availability. HDInsight uses the Hortonworks Data Platform (HDP) Hadoop distribution.

 

Environment

  • Talend Data Preparation 7.1.1
  • HDInsight cluster
  • The steps in this article are done using a Windows environment, but the process of setting up a new database type is the same on Linux machines

 

Prerequisites

  • An Azure account with permissions for an HDInsight cluster and Hive tables
  • An existing Hive table

 

Procedure

  1. Download the following JAR files:

    • hive_jdbc-*.*
    • hive-services-*.*
    • hive-metastore-*.jar
    • hive-common-*.jar
    • hive-cli-*.jar
    • hive-exec-*.jar
    • libfb303-*.jar
    • libthrift-*.jar

     

  2. Create a folder named jdbc-drivers in components_catalog_path/.m2/

    1.jpg

     

  3. Copy the JAR files to this folder:

    2.jpg

     

  4. Update the file located at components_catalog_path/config/jdbc_config.json. For this article, the jdbc_config.json file is located as shown in the image below:

     

    3.jpg

     

  5. Edit the jdbc_config.json file as shown below:

    • id: The value to be displayed on the database drop-down menu in Data Preparation
    • class: The Hive driver class for communicating with the Hive database
    • url: The HDInsight URL
    • path: List all the required JAR file details. The path follows the pattern mvn:jdbc-drivers/my_database_name/my_version.

      Sample path for Hive:

      ,
      {
          "id": "Hive HDInsight",
          "class": "org.apache.hive.jdbc.HiveDriver",
          "url": "jdbc:hive2://YOURCLUSTER.azurehdinsight.net:443/default;transportMode=http;ssl=true;httpPath=/hive2",
          "paths": [
                                      {"path": "mvn:jdbc-drivers/hive-jdbc-1.2.1000.2.6.0.3-8/6.4.0"}      ,
                                      {"path": "mvn:jdbc-drivers/hive-service-1.2.1000.2.6.0.3-8/6.4.0"},
                                      {"path": "mvn:jdbc-drivers/libthrift/0.9.3"}
                           ]
          }
  6. Log in to Data Preparation.
  7. Navigate to DATASETS.
  8. Click ADD DATASET and select Database.

    4.jpg

     

  9. You have a new option, Hive HDInsight, available for Database type. Select Hive HDInsight and provide the details below:

     

    • Dataset name: hive_dataset, or any other name you choose
    • Database type: Hive HDInsight
    • JDBC URL: jdbc:hive2://YOURCLUSTER.azurehdinsight.net:443/default;transportMode=http;ssl=true;httpPath=/hive2, where YOURCLUSTER is the name of the HDInsight cluster
    • Username: The one you use for connecting to the HDInsight cluster
    • Password: The one you use for connecting to the HDInsight cluster

      5.jpg

       

  10. Click TEST CONNECTION.

    6.jpg

     

  11. This article uses the query select * from hivesampletable, however, change Query as required and click ADD DATASET.
  12. A new dataset, hive_dataset, is created by connecting to Hive in HDInsight.

    7.jpg

 

Conclusion

Data preparation is a very powerful yet simple-to-use tool for creating datasets and preparations to deliver cleansed, structured, enriched data to business users. This article explained the process to add a database type using an example of Hive in HDInsight while working with Data Preparation on premises.

Version history
Revision #:
34 of 34
Last update:
‎08-18-2019 09:20 AM
Updated by: