Microsoft Azure HDInsight Hive analytics from Talend Cloud

Overview

This article shows you how to use Talend Cloud to analyze data stored on Azure HDInsight Hive tables.

 

The article is a continuation of the Hive Connection to Microsoft HDInsight from Talend Studio and Talend Cloud article, available on the Talend Community Knowledge Base (KB).

 

Environment

This article was written using Talend Studio 7.1.1.

 

Prerequisites

  • An Azure HDInsight cluster with Hive, Talend Studio, and access to Talend Cloud, configured as described in the Hive Connection to Microsoft HDInsight from Talend Studio and Talend Cloud article referenced above

Talend DI Job for analysis on HDInsight Hive data

This section looks at a Talend Data Integration (DI) Standard Job used to run Hive analytics. The use case scenarios for this article include filtering, aggregation, sorting, and joins on data stored in two different Hive tables.

hiveanalysis.png

 

The diagram above shows the overall flow of the use case. The various stages involved in the use case are:

  • The primary data source for the Job is HDInsight Hive. The two Hive tables are customers and state_codes.
  • The Job includes filtering, aggregation, sorting, and joins, on data stored in the Hive tables.
  • Using the Publish to Cloud option in Studio, the Job is published to Talend Cloud.

    Note: You can also publish the Job using the Continuous Integration (CI) process. For more information on publishing to the cloud using Continuous Integration, see Implementation with Talend Cloud, on the Talend Help Center.

  • The Job is deployed as an artifact in Talend Cloud.
  • A corresponding task is automatically created in Talend Cloud.

    Note: You can use the same artifact to create one or more tasks manually in Talend Cloud.

  • The Task is executed on the Cloud Engine.

    Note: You can also execute the Task on a Remote Engine. For more information on Remote Engines, see Working with Talend Remote Engines, on the Talend Help Center.

 

Taking a closer look at the Job

  • Sample data for the Job is created using a set of DI Jobs that use a tRowGenerator component to generate random customer data, and a tHiveLoad component to load data into HDInsight Hive.

    data_generation_job.png

     

  • In this case, the state_codes table data is loaded using the Hive UI in HDInsight Ambari. This results in two Hive tables: customers, which contains example attributes of the customers’ personal information, such as customer ID, first name, last name, age, gender, phone, email, loyalty group, and US state code; and state_codes, which contains a list of US state codes and state names, used for the join use case. (An illustrative HiveQL sketch of these tables appears after this list.)

    job_usecases.png

     

  • The diagram above illustrates the Job, and the highlighted areas represent the use cases as follows:

    1. Uses tAggregate components to find the count of members and non-members based on customer loyalty groups. tLogRow components are used to output the aggregated data to the console.

      In a real-world scenario, you can use the aggregated output for further analysis, for example, displaying it in a BI dashboard or feeding it into machine learning algorithms.

    2. Counts the members by state, sorts them in descending order of member count, and displays the output in the console.
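
For reference, the two Hive tables described above might look like the following HiveQL sketch. This is illustrative only: the column names follow the descriptions in this article, while the exact names, types, and storage formats used in the original Job are not shown in the article and should be treated as assumptions.

    -- Illustrative sketch only: column names follow the article's descriptions;
    -- types and storage formats are assumptions.
    CREATE TABLE customers (
      customerid    INT,
      firstname     STRING,
      lastname      STRING,
      age           INT,
      gender        STRING,
      phone         STRING,
      email         STRING,
      loyalty_group INT,        -- 1 or 2 = Platinum, greater than 2 = Gold
      member        BOOLEAN,    -- true for loyalty members
      stateid       STRING      -- US state code, joins to state_codes
    )
    STORED AS ORC;

    CREATE TABLE state_codes (
      stateid   STRING,         -- US state code
      statename STRING          -- full state name
    )
    STORED AS TEXTFILE;         -- loaded through the Hive UI in Ambari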

 

Configuring the components

  • tHiveConnection component (HDI_Hive) is used to connect to Hive running on the Azure HDInsight cluster
  • tHiveInput components (customers and state_codes) are used to fetch data from Hive tables
  • tMap component is used to join the customers table with the state_codes table to get state names based on the stateid column, and to create necessary outputs for aggregating the data

    tmap.png

     

    There are three outputs, defined as follows:

    1. Rows with loyalty_group values of 1 or 2 are filtered as Platinum customers
    2. Rows with loyalty_group values greater than 2 are set as Gold customers
    3. Rows with loyalty_group values of 1 or 2 and a member value of true are filtered to derive state-wise Platinum members
  • After joining the data and creating the necessary outputs in the tMap, three tAggregate components are used to group the respective output data by loyalty_group and generate the count of customers for each group. The configuration details for each of the tAggregate components are shown below:

    taggregate_1.png

     

    taggregate_2.png

     

    taggregate_3.png

     

  • The tSort component is used to sort the state-wise Platinum members in descending order by customersid count.

    tsort.png

     

  • Three tLogRow components are used to display the respective outputs to the console.

    studio_output_1.png

    studio_output_2.png

     

  • The tHiveInput (customers table) component query is configured to limit the results displayed in the console output. (A HiveQL sketch approximating the Job’s queries appears after the screenshots below.)

    studio_output1000_1.png

    studio_output1000_2.png
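
The components above implement this logic graphically in Talend Studio. As a rough point of comparison only, the same analysis could be expressed as standalone HiveQL along the lines of the sketch below; the filter conditions follow the component descriptions above, while the exact queries generated by the Job are not shown in the article, and the row limit value is an assumption.

    -- 1. Platinum customers (loyalty_group 1 or 2), counted per loyalty group
    SELECT loyalty_group, COUNT(*) AS customer_count
    FROM customers
    WHERE loyalty_group IN (1, 2)
    GROUP BY loyalty_group;

    -- 2. Gold customers (loyalty_group greater than 2), counted per loyalty group
    SELECT loyalty_group, COUNT(*) AS customer_count
    FROM customers
    WHERE loyalty_group > 2
    GROUP BY loyalty_group;

    -- 3. State-wise Platinum members, joined to state_codes for the state name
    --    and sorted in descending order of member count
    SELECT s.statename, COUNT(*) AS member_count
    FROM customers c
    JOIN state_codes s ON c.stateid = s.stateid
    WHERE c.loyalty_group IN (1, 2)
      AND c.member = true
    GROUP BY s.statename
    ORDER BY member_count DESC;

    -- The tHiveInput (customers) query limits the rows it reads, for example:
    SELECT * FROM customers LIMIT 1000;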

     

The sample data is generated programmatically, so the distribution of values is fairly even; in a real-world scenario, a similar analysis would return more varied, insightful results.

 

Creating a Hive analysis Task in Talend Cloud

This section shows you how to use a Task on Talend Cloud to run the Job.

  • Publishing the Job to Talend Cloud creates the Task, HiveAnalysis_CustomerLoyalty.

    TalendCloud_tasks.png

    In this case, the Task is run manually. Talend Cloud provides options to run tasks manually, on a daily, weekly, or monthly schedule, or triggered through a webhook. A task can also be part of a plan; a plan is a set of tasks scheduled to execute in a defined order, depending on the requirement.

  • Task executions are added to the Run History. On a successful run of the Job, the logs contain the console output; you can display them by clicking View Logs in the Run History.

    TalendCloud_logs_1.png

    TalendCloud_logs_2.png

    TalendCloud_logs_3.png

     

Conclusion

This article helps you understand how to build analytics use cases for Hive data stored on an Azure HDInsight cluster, using a Talend Job run as a Task on Talend Cloud. Talend Cloud makes it easy to run Tasks for Hive analytics, using either Cloud Engines or Remote Engines.
