This article shows you how to connect to Hive running on Microsoft Azure HDInsight by creating Jobs in Talend Studio, publishing them to Talend Cloud, and then running them on the Talend Cloud platform.
In Studio, navigate to Repository > Metadata > Hadoop Cluster, double-click your connection (hdi), then click Next.
In the Update Hadoop Cluster Connection - Step 2/2 window, enter the WebHCat, HDInsight, and Azure Storage configuration details, as shown below.
Click Export as context. Notice that the parameters are now prefixed with context. Exporting the connection properties as a context automatically creates a context group in the Talend Studio Repository, with the name you specify. In this example, the context group is named hdicluster.
Open the hdicluster context group.
Review the list of property names exported as variables and their values. Notice that the Value column is headed Default; this is the context in which the variables are defined.
One advantage of context groups is that you can pass different values for context variables, depending on the context environment (Development/Test/Production), when running the Job. Another is that you can reuse the same context group across different Jobs. For more information on context groups, see Creating the context group and contexts in the Talend Help Center.
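The environment-switching behavior of a context group can be sketched as follows. This is a conceptual illustration only, not Talend code (Talend generates Java for this internally); the variable names, host names, and the hdicluster group contents are hypothetical.

```python
# Hypothetical sketch of how a context group resolves variables per
# environment. One value set per context, mirroring a context group
# such as "hdicluster" with Default/Test/Production contexts.
CONTEXT_GROUP = {
    "Default": {"hive_host": "hdi-dev.azurehdinsight.net", "hive_port": 443},
    "Test": {"hive_host": "hdi-test.azurehdinsight.net", "hive_port": 443},
    "Production": {"hive_host": "hdi-prod.azurehdinsight.net", "hive_port": 443},
}

def resolve_context(environment: str) -> dict:
    """Return the variable values for the chosen context,
    falling back to Default for any variable not overridden."""
    values = dict(CONTEXT_GROUP["Default"])
    values.update(CONTEXT_GROUP.get(environment, {}))
    return values

# The same Job logic reads the variable without hard-coding the host;
# only the selected context changes between runs.
ctx = resolve_context("Production")
print(ctx["hive_host"])  # hdi-prod.azurehdinsight.net
```

Because the Job references only the variable names, promoting it from Development to Production is a matter of selecting a different context at run time, not editing the Job.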
Now that you've created the cluster connection metadata and context group, you are ready to create a Standard Job that connects to the Hive table on Azure HDInsight; then you'll run the Job as a task on Talend Cloud platform.
Create a sample Job using the tHiveConnection, tHiveInput, and tLogRow components.
Configure the tHiveConnection component.
Note: The WebHCat configuration, HDInsight configuration, and Windows Azure Storage configuration property values come from the hdicluster context group you created in the Configuring the Hadoop Cluster Connection section of this article.
Configure the tHiveInput component.
Query Type: select Built-In from the drop-down menu
Note: The Guess Schema button retrieves the schema of the specified table, and the Guess Query button uses that schema to generate a query automatically, adding it to the Query field.
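Conceptually, generating a query from a guessed schema looks like the sketch below. This is an illustration of the idea, not Talend's actual implementation; the table and column names are hypothetical.

```python
# Hypothetical sketch of a "guess query" step: build a SELECT statement
# from a table name and its retrieved schema (the list of column names).
def guess_query(table: str, columns: list[str]) -> str:
    """Generate a simple Hive SELECT from a schema, qualifying each
    column name with the table name."""
    cols = ", ".join(f"{table}.{c}" for c in columns)
    return f"SELECT {cols} FROM {table}"

print(guess_query("customers", ["id", "name", "city"]))
# SELECT customers.id, customers.name, customers.city FROM customers
```

You can always edit the generated query in the Query field afterward, for example to add a WHERE clause or limit the rows returned.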
Use the tLogRow component to print the query output to the console.
Click Run to execute the Job and ensure that it connects successfully.
Before publishing the Job to Talend Cloud, you must configure and verify the connection settings.
Select the Advanced check box, then verify or select the connection URL to the relevant Talend Cloud.
This section shows you how to use the Talend Studio GUI to publish to the cloud. Note that you can automate this process with CI if needed.
In the Repository view, right-click the Job and select Publish to Cloud.
The Publish to Cloud window appears. The Last Cloud Version field is empty unless you previously published the Job with a different version number.
Enter the version of your Job in the Publish with Version field.
Select the workspace where you want to publish the Job from the Workspace pull-down menu.
Select the Export Artifact Screenshot and Update corresponding Job task check boxes according to your requirements. If you have published the Job before, an option to update the Job flow appears. For more information on these options, see the Publishing to Talend Cloud page in the Talend Data Fabric Studio User Guide.
Click Finish to publish the Job to Talend Cloud.
To run the Job on a Talend Cloud Engine, log in to your Talend Cloud account and open the Management Console application.
Locate the Job you published by navigating to Management > Environment > Workspace > Artifacts (choose the same environment and workspace you selected when you published the Job from Talend Studio).
Click the Artifacts link to see a list of Jobs published as artifacts.
Select the Tasks icon (below the Artifacts link) to view the corresponding task related to the artifact.
Click the task to open its details page.
Explore the Run History section. In this example, V1 of the Job was run on the Cloud Engine, and V2 of the Job was run on the Remote Engine. For more information on Cloud Engines, Remote Engines, and managing environments, see the Managing environments page in the Talend Cloud Management Console User Guide.
Expand each run, then click the View Logs button to view the logs generated for each run.
Building ingestion Jobs with Talend Cloud and Microsoft Azure HDInsight as a target is straightforward. The most important aspects are configuring the connection details correctly and opening the required ports so that Talend Cloud and Microsoft Azure HDInsight can communicate.
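Before running the Job, it can help to confirm that the cluster endpoint is reachable from the engine that will execute it. The sketch below is a minimal connectivity check, assuming HTTPS access on port 443 (the port HDInsight's gateway uses for WebHCat); the host name is a placeholder, and this is not a Talend feature.

```python
# Hypothetical connectivity check: verify that a TCP port is reachable
# before scheduling the Job. For HDInsight, WebHCat is reached through
# the cluster gateway over HTTPS on port 443.
import socket

def is_port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds
    within the timeout, False otherwise."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (placeholder cluster name):
# is_port_open("yourcluster.azurehdinsight.net", 443)
```

If the check fails from a Remote Engine but succeeds elsewhere, review the firewall and subnet rules between the engine and the cluster.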
If you do not want data to transit through the Talend Cloud Engine, then you can provision a Remote Engine in your own subnet and have Talend Cloud manage the deployment and execution of the Job. The Job then runs on your Remote Engine, and the data is local to your private subnet.