Connecting to Hive on HDInsight 3.6 using Talend Cloud

Overview

This article shows you how to connect to Hive running on Microsoft Azure HDInsight by creating Jobs in Talend Studio, publishing them to Talend Cloud, and then running them on the Talend Cloud platform.

 

Prerequisites

 

Configuring the Hadoop Cluster Connection

  1. In Studio, navigate to Repository > Metadata > Hadoop Cluster, double-click your connection (hdi), then click Next.

  2. In the Update Hadoop Cluster Connection - Step 2/2 window, configure the WebHCat configuration, HDInsight configuration, and Azure Storage configuration details, as shown below.

  3. Click Export as context. Notice that the parameters are prefixed with context. Exporting the connection properties as a context automatically creates a Context Group in the Talend Studio Repository, with a user-defined name for the group. In this instance, the context group created is hdicluster.

    prep_1_hdi_metadata.png

     

  4. Open the hdicluster context group.

  5. Review the list of property names exported as variables, and the values of those properties. Notice the Default heading of the Value column; this is the context used to define the variables.

    An advantage of creating context groups is being able to pass different values for context variables, depending on the context environment (Development/Test/Production), while running the Job. Another advantage is that you can use the same context group for different Jobs. For more information on context groups, see Creating the context group and contexts in the Talend Help Center.

    prep_2_context_group.png
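
The idea behind context groups can be modeled in a few lines of Python. This is only a sketch of the concept, not how Talend implements contexts: the variable names (webhcat_hostname, webhcat_port, hdi_username) and all values are hypothetical placeholders, not the names Studio generates when you export the connection. It shows one set of variable names whose values change with the selected context environment.

```python
# Illustrative model of a context group. All names and values are
# hypothetical placeholders, not what Talend Studio actually exports.
CONTEXT_GROUP = {
    "Default": {
        "webhcat_hostname": "dev-cluster.example.net",
        "webhcat_port": "443",
        "hdi_username": "dev_user",
    },
    "Production": {
        "webhcat_hostname": "prod-cluster.example.net",
        "webhcat_port": "443",
        "hdi_username": "prod_user",
    },
}


def select_context(group, environment):
    """Return a copy of the variables defined for one context environment."""
    if environment not in group:
        raise KeyError(f"unknown context environment: {environment}")
    return dict(group[environment])


def describe_connection(variables):
    """The same 'Job' logic works with whichever context is passed in."""
    return (
        f"{variables['hdi_username']}@"
        f"{variables['webhcat_hostname']}:{variables['webhcat_port']}"
    )


print(describe_connection(select_context(CONTEXT_GROUP, "Default")))
print(describe_connection(select_context(CONTEXT_GROUP, "Production")))
```

The function that consumes the variables never changes; only the selected environment does, which is what lets one context group serve several Jobs and several environments.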

     

Creating a Job to fetch data from HDInsight Hive

Now that you've created the cluster connection metadata and context group, you are ready to create a Standard Job that connects to the Hive table on Azure HDInsight. You'll then run the Job as a task on the Talend Cloud platform.

  1. Create a sample Job using the tHiveConnection, tHiveInput, and tLogRow components.

    1_job_designer.png

     

  2. Configure the tHiveConnection component.

    Note: The WebHCat configuration, HDInsight configuration, and Windows Azure Storage configuration property values come from the hdicluster context group that you created in the Configuring the Hadoop Cluster Connection section of this article.

    • Property Type: select Built-In from the drop-down menu
    • Distribution: select Microsoft HD Insight from the drop-down menu
    • Connection: enter the Hive database name

      2_hiveConnection.png

       

  3. Configure the tHiveInput component.

    • Schema: select Built-In from the drop-down menu
    • Table Name: enter the name of the Hive table used to fetch the data
    • Query Type: select Built-In from the drop-down menu

      Note: The Guess Schema button retrieves the schema of the given table name, and uses the Guess Query to create a query from the schema automatically, then adds the query into the Query section.

    • Query: contains the query generated automatically by Guess Query; alternatively, you can enter a custom query on the Hive table

    3_hiveInput.png

     

  4. Use the tLogRow component to print the query output to the console.

  5. Click Run to execute the Job and ensure that it connects successfully.
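
To make the Guess Query behavior from step 3 more concrete, the following Python sketch builds a SELECT statement from a table name and a list of column names. It is an approximation under stated assumptions: the exact text and formatting that Studio generates may differ, and the table name customers and its columns are hypothetical.

```python
def build_select_query(table, columns):
    """Build a SELECT statement that lists every schema column explicitly."""
    if not columns:
        raise ValueError("schema must contain at least one column")
    # Qualify each column with the table name, then join them with commas.
    column_list = ", ".join(f"{table}.{name}" for name in columns)
    return f"SELECT {column_list} FROM {table}"


# Hypothetical table and schema, used only for illustration.
query = build_select_query("customers", ["id", "name", "city"])
print(query)
```

Listing the columns explicitly, rather than using SELECT *, keeps the query aligned with the schema defined on the component, so a later change to the Hive table does not silently change the shape of the rows the Job receives.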

 

Connecting Studio to Talend Cloud

Before publishing the Job to Talend Cloud, you must configure and verify the connection settings.

  1. In Studio, navigate to Window > Preferences > Talend > Talend Cloud.
  2. Enter the Talend Cloud Account Username and Account Password.
  3. Select the Advanced check box, then verify or select the connection URL to the relevant Talend Cloud.

  4. Click Test Connection to verify that the settings are correct. You should see a Service available message if your connection is successful.

    9_talendCloudConnection.png

     

  5. Click OK to close the Preferences dialog box.

 

Publishing the Job to Talend Cloud

This section shows you how to use the Talend Studio GUI to publish to the cloud. Note that you can automate this process with CI if needed.

  1. In the Repository view, right-click the Job and select Publish to Cloud.

    cloud-publish-option.png

    The Publish to Cloud window appears. The Last Cloud Version dialog box is empty unless you previously published the Job with a different version number.

     

  2. Enter the version of your Job in the Publish with Version field.

  3. Select the workspace where you want to publish the Job from the Workspace pull-down menu.

  4. Select the Export Artifact Screenshot and Update corresponding Job task check boxes according to your requirements. If you have published the Job before, an option to update the Job flow appears. For more information on these options, see the Publishing to Talend Cloud page in the Talend Data Fabric Studio User Guide.

    4_publishTC.png

     

  5. Click Finish to publish the Job to Talend Cloud.

     

 

Running the Job on Talend Cloud Engine

  1. To run the Job on Talend Cloud Engine, log in to your Talend Cloud account and open the Management Console app.

  2. Locate the Job you published by navigating to Management > Environment > Workspace > Artifacts (choose the same environment and workspace you selected when you published the Job from Talend Studio).

    10_tcdashboard.png

     

  3. Click the Artifacts link to see a list of Jobs published as artifacts.

    5_artifactsList.png

     

  4. Select the Tasks icon (below the Artifacts link) to view the corresponding task related to the artifact.

    6_tasksList.png

     

  5. Click the task to open the details page. It contains the following information:

    • The task title/name (editable by using the pencil icon)
    • Execution – defining the run type (manual/scheduled) and the runtime (Cloud Engine/Remote Engine)
    • Configuration – containing the context variables defined for the context
    • Options – Run Now, Copy, and Delete

    7_taskDetails.png

     

  6. Explore the Run History section. In this example, V1 of the Job was run on the Cloud Engine, and V2 of the Job was run on the Remote Engine. For more information on Cloud Engines, Remote Engines, and managing environments, see the Managing environments page in the Talend Cloud Management Console User Guide.

    7_taskDetails-RunHistory.png

     

  7. Expand each run, then click the View Logs button to view the logs generated for each run.

    8_logs.png

     

Conclusion

Using Talend Cloud to build ingestion Jobs with Microsoft Azure HDInsight as a target is straightforward. The most important aspects of building such a Job are configuring the connection details properly and ensuring that the required ports are open so that Talend Cloud and Microsoft Azure HDInsight can communicate.
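
One quick way to check that a required port is reachable is a plain TCP connection test, sketched below in Python. This is a generic check, not a Talend or Azure tool, and it only tests reachability from the machine where it runs (for example, a Remote Engine host), not from the Talend Cloud Engine. The hostname in the comment is a hypothetical placeholder, and port 443 is an assumption based on HDInsight exposing its public endpoints over HTTPS; confirm the ports your cluster actually requires.

```python
import socket


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Example call (hypothetical cluster name, replace with your own):
# port_is_open("mycluster.azurehdinsight.net", 443)
```

A True result only shows that the port accepts TCP connections; it does not validate credentials or the WebHCat and storage settings configured earlier.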

If you do not want data to transit through the Talend Cloud Engine, then you can provision a Remote Engine in your own subnet and have Talend Cloud manage the deployment and execution of the Job. The Job then runs on your Remote Engine, and the data is local to your private subnet.

Version history
Revision #: 18 of 18
Last update: 04-14-2019 01:54 AM