This article shows you how to use Talend Cloud to analyze data stored on Azure HDInsight Hive tables.
The article is a continuation of the Hive Connection to Microsoft HDInsight from Talend Studio and Talend Cloud article, available on the Talend Community Knowledge Base (KB).
This article was written using Talend Studio 7.1.1.
This section looks at a Talend Data Integration (DI) Standard Job used to run Hive analytics. The use case scenarios for this article include filtering, aggregation, sorting, and joins on data stored in two different Hive tables.
The diagram above shows the overall flow of the use case. The various stages involved in the use case are:
Using the Publish to Cloud option in Studio, the Job is published to Talend Cloud.
Note: You can also publish the Job using the Continuous Integration (CI) process. For more information on publishing to the cloud using Continuous Integration, see Implementation with Talend Cloud, on the Talend Help Center.
A corresponding task is automatically created in Talend Cloud.
Note: You can use the same artifact to create one or more tasks manually in Talend Cloud.
The Task is executed on the Cloud Engine.
Sample data for the Job is created using a set of DI Jobs that use a tRowGenerator component to generate random customer data, and a tHiveLoad component to load data into HDInsight Hive.
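The tRowGenerator step could be sketched in plain Python as follows. This is a hedged illustration only: the field names, value ranges, and loyalty-group labels are assumptions for the example, not taken from the actual Job.

```python
import random

# Assumed loyalty-group labels and state codes for illustration only.
LOYALTY_GROUPS = ["Platinum", "Gold", "Silver"]
STATE_CODES = ["NY", "CA", "TX", "WA"]
FIRST_NAMES = ["Ada", "Ben", "Cara", "Dev"]
LAST_NAMES = ["Lee", "Ng", "Ortiz", "Smith"]

def generate_customers(n, seed=42):
    """Generate n random customer records, similar in spirit to tRowGenerator."""
    rng = random.Random(seed)  # fixed seed for reproducible sample data
    rows = []
    for customer_id in range(1, n + 1):
        first = rng.choice(FIRST_NAMES)
        last = rng.choice(LAST_NAMES)
        rows.append({
            "customer_id": customer_id,
            "first_name": first,
            "last_name": last,
            "age": rng.randint(18, 80),
            "gender": rng.choice(["F", "M"]),
            "phone": f"555-{rng.randint(1000, 9999)}",
            "email": f"{first.lower()}.{last.lower()}@example.com",
            "loyalty_group": rng.choice(LOYALTY_GROUPS),
            "state_code": rng.choice(STATE_CODES),
        })
    return rows
```

In the actual Job, the generated rows are then loaded into the HDInsight Hive table by the tHiveLoad component.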
In this case, the state_codes table data is loaded using the Hive UI in HDInsight Ambari. This results in two Hive tables: the customers table contains example attributes of the customers’ personal information, such as customer ID, first name, last name, age, gender, phone, email, loyalty group, and US state code. The second table, state_codes, contains a list of US state codes and state names (used in the join use case).
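The two tables might be defined roughly as shown below. This is a sketch only; the exact column names and types are assumptions based on the attributes listed above, not the actual DDL used in the article.

```sql
-- Hypothetical schemas for the two Hive tables (column names assumed).
CREATE TABLE customers (
  customer_id   INT,
  first_name    STRING,
  last_name     STRING,
  age           INT,
  gender        STRING,
  phone         STRING,
  email         STRING,
  loyalty_group STRING,
  state_code    STRING
);

CREATE TABLE state_codes (
  state_code STRING,
  state_name STRING
);
```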
The diagram above illustrates the Job, and the highlighted areas represent the use case as follows:
Uses tAggregate components to find the count of members and non-members based on customer loyalty groups. tLogRow components are used to output the aggregated data into the console.
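The aggregation performed by the tAggregate components corresponds roughly to the following HiveQL; the column names are assumptions for illustration, not taken from the Job's configuration.

```sql
-- Approximate HiveQL equivalent of the tAggregate step:
-- count customers per loyalty group (column names assumed).
SELECT loyalty_group,
       COUNT(*) AS customer_count
FROM customers
GROUP BY loyalty_group;
```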
In a real-world scenario, you can use the aggregated output for further analysis, for example, displaying it in a BI dashboard or feeding it to machine learning algorithms.
Three outputs are defined, one each for:
After joining the data and creating the necessary outputs in the tMap, three tAggregate components are used to group the respective output data by loyalty_group and generate the count of customers for each group. The configuration details for each of the tAggregate components are shown below:
The tSort component sorts the state-wise Platinum members in descending order of customersid count.
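Taken together, the tMap join, tAggregate grouping, and tSort step are roughly equivalent to a single HiveQL query like the one below. This is a hedged sketch: the join keys, column names, and the 'Platinum' label are assumptions for illustration.

```sql
-- Sketch of the join, aggregation, and descending sort performed by
-- the tMap, tAggregate, and tSort components (names assumed).
SELECT s.state_name,
       COUNT(c.customer_id) AS customer_count
FROM customers c
JOIN state_codes s
  ON c.state_code = s.state_code
WHERE c.loyalty_group = 'Platinum'   -- assumed loyalty-group label
GROUP BY s.state_name
ORDER BY customer_count DESC;
```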
Three tLogRow components are used to display the respective outputs to the console.
The tHiveInput (customers table) component query is configured to limit the results displayed in the console output.
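A query of the assumed shape below could be used in tHiveInput to cap the number of rows shown in the console; the table name and limit value here are illustrative, not the Job's actual configuration.

```sql
-- Assumed shape of the tHiveInput query limiting console output.
SELECT * FROM customers LIMIT 10;
```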
Because the sample data is generated programmatically, the distribution of values is uniform; in a real-world scenario, a similar analysis on actual data would return more varied, insightful results.
This section shows you how to use a Task on Talend Cloud to run the Job.
Publishing the Job to Talend Cloud creates the Task, HiveAnalysis_CustomerLoyalty.
In this case, the Task is run manually. Talend Cloud provides options to run tasks manually, on a daily, weekly, or monthly schedule, or triggered through a webhook. A task can also be part of a plan: a set of tasks scheduled to execute in a defined order, depending on the requirement.
Task executions are added to the Run History. On a successful run of the Job, the logs contain the console output; you can display the logs by clicking View Logs in the Run History.
This article helps you understand how to create analytics use cases for Hive data stored on an Azure HDInsight cluster with the help of a Talend Job, run as a Task on Talend Cloud. Talend Cloud makes it easy to run Tasks for Hive analytics, using either Cloud Engines or Remote Engines.