Boosting your Talend Big Data Platform

Article contributed by Deloitte Solutions Network (SNET)

Authors: Sanyyam Kumar Gupta, Shruti Daga, and Suman Chakraborty

September 20, 2017

 

Executive Summary

This article highlights leading practices for managing Talend in a Hadoop cluster ecosystem. As firms move to execute business processes in near-real time, IT systems need to provide high resilience as well as good performance, and a poor initial setup can jeopardize the overall success of the platform. The following guidelines can help you achieve a stable platform and reduce the likelihood of setbacks.

 

Best Practices for Talend Big Data platform

The following suggestions apply to Talend Big Data Enterprise version 6.2.1, deployed on Linux servers and connected to a Hortonworks Data Platform (HDP) cluster:

  • Optimizing high availability using a shared file system

    To keep Talend Administration Center (TAC) highly available, the secondary TAC must be able to resume operations at the point where the primary TAC left off. This ensures seamless service during both unexpected outages and planned maintenance. To make this possible, information such as audit reports, Job logs, and deployed Jobs must be available to both TACs. The best practice is therefore to host a Network File System (NFS) mount point on its own independent server that both TACs can access. This is a prerequisite for high availability (HA) and, if not already in place, should be addressed immediately in existing HA environments.
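
    As an illustration only, the following Python sketch shows a pre-flight check that could be run on each TAC host to confirm that the shared location is mounted and writable. The mount path /mnt/talend_shared and the subdirectory names are hypothetical; replace them with the paths configured in your own TAC.

```python
import os
import tempfile

# Hypothetical layout: adjust to the paths configured in your own TAC.
SHARED_ROOT = "/mnt/talend_shared"
REQUIRED_SUBDIRS = ("audit_reports", "job_logs", "deployed_jobs")


def check_shared_storage(root, subdirs=REQUIRED_SUBDIRS, require_mount=True):
    """Return a list of problems; an empty list means the check passed."""
    if not os.path.isdir(root):
        return [f"{root} does not exist or is not a directory"]
    problems = []
    if require_mount and not os.path.ismount(root):
        problems.append(f"{root} is not a mount point (is the NFS share mounted?)")
    for name in subdirs:
        path = os.path.join(root, name)
        if not os.path.isdir(path):
            problems.append(f"missing directory: {path}")
            continue
        try:
            # Creating and removing a temporary file proves write access.
            with tempfile.NamedTemporaryFile(dir=path):
                pass
        except OSError as exc:
            problems.append(f"cannot write to {path}: {exc}")
    return problems


if __name__ == "__main__":
    for problem in check_shared_storage(SHARED_ROOT):
        print("WARNING:", problem)
```

    Running the same check on both TAC hosts gives a quick indication that each one sees the same shared directories before a failover is needed.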

     

  • Improving parallelism success using Tez

    Tez does not support executing multiple Hive queries in parallel in the same HiveServer2 session. Where you implement query parallelism using tHiveInput components, deselect Use an existing connection. This ensures that the same Tez Application Master does not end up executing multiple queries; instead, each subsequent query waits for a new Hive session before it is submitted. This applies only to the Tez execution mode, not to MapReduce (MR) mode.

     

  • Commit node resources to Talend

    Dedicate the JobServer node to running only the JobServer daemon. Other applications sharing the node can adversely affect the node's star rating, which results in inaccurate server allocation when Jobs are executed through a virtual server.

     

    If JobServer nodes also run Hadoop daemons, schedule your resource-intensive Big Data Jobs during off hours or intervals of low usage. This leaves sufficient node resources for standard Jobs.

     

  • Virtual Server allocation

    When allocating JobServers to virtual servers, separate the JobServers by project based on utilization and number of users. Allocate more JobServers to heavily used projects than to less-utilized ones. Enforce Access Control Lists (ACLs) to grant each project access to its allocated virtual servers. This gives projects fair access to cluster resources.
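
    One possible way to turn utilization figures into JobServer counts is a proportional split. The Python sketch below is an assumption, not a Talend feature: the function name, the project names, and the weights (for example, user counts or Job runs per day) are all hypothetical. It gives every project at least one JobServer and shares out the rest with the largest-remainder method.

```python
def allocate_jobservers(total_servers, project_weights):
    """Split total_servers across projects in proportion to their weights.

    Every project gets at least one JobServer. The remaining servers are
    shared out with the largest-remainder method, so the counts always sum
    to total_servers.
    """
    if total_servers < len(project_weights):
        raise ValueError("need at least one JobServer per project")
    if any(weight < 0 for weight in project_weights.values()):
        raise ValueError("weights must not be negative")
    total_weight = sum(project_weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")

    # Servers left after the guaranteed one-per-project minimum.
    spare = total_servers - len(project_weights)
    exact = {p: spare * w / total_weight for p, w in project_weights.items()}
    allocation = {p: 1 + int(share) for p, share in exact.items()}

    # Hand out what rounding down left over, largest fractional part first.
    leftover = total_servers - sum(allocation.values())
    by_remainder = sorted(exact, key=lambda p: exact[p] - int(exact[p]), reverse=True)
    for project in by_remainder[:leftover]:
        allocation[project] += 1
    return allocation


if __name__ == "__main__":
    # Hypothetical weights, e.g. share of daily Job executions per project.
    print(allocate_jobservers(10, {"finance": 60, "marketing": 30, "hr": 10}))
```

    The resulting counts are only a starting point; the actual grouping into virtual servers and the ACLs are still configured in Talend Administration Center.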

     

  • Limiting thread allocation

    If a large number of Jobs run at the same time, deselect Use Independent Process. This limits the number of threads used for Job execution at the server level and helps keep Job execution within the available resources. Alternatively, increase the nproc (maximum number of processes) and nofile (maximum number of open files) limits for the talend user to 99999.
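
    As a rough sketch, the Python snippet below renders the corresponding entries in the standard limits.conf format (domain, type, item, value) and reports which soft limits of the current process fall short of the target. The helper names are hypothetical, and where the entries belong (for example /etc/security/limits.conf, or service-level settings if JobServer is started by an init system) depends on your Linux distribution.

```python
import resource

TARGET = 99999  # value suggested in the text above


def limits_conf_lines(user="talend", value=TARGET):
    """Render limits.conf entries of the form '<domain> <type> <item> <value>'."""
    return [
        f"{user} {kind} {item} {value}"
        for item in ("nproc", "nofile")
        for kind in ("soft", "hard")
    ]


def current_shortfalls(target=TARGET):
    """Return the soft limits of the current process that are below target.

    Run this as the talend user to see the limits a JobServer started from
    that session would inherit.
    """
    checks = {"nofile": resource.RLIMIT_NOFILE, "nproc": resource.RLIMIT_NPROC}
    shortfalls = {}
    for name, rlimit in checks.items():
        soft, _hard = resource.getrlimit(rlimit)
        if soft != resource.RLIM_INFINITY and soft < target:
            shortfalls[name] = soft
    return shortfalls


if __name__ == "__main__":
    print("\n".join(limits_conf_lines()))
    print("Below target:", current_shortfalls())
```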

     

  • Setting monitoring alerts

    Notifications are a good way to reduce manual monitoring. Create them in Talend Administration Center and integrate them with your SMTP server so that they are sent by email. Notifications should be mandatory for two activities in particular:

    • JobServers’ status: Create an alert for all administrators to be notified when a JobServer goes down.
    • Monitoring Job status: Developers can create custom alerts using the Notification functionality to receive status changes of the Job being run.
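
    The notifications themselves are configured in Talend Administration Center. Purely as a supplementary illustration of the idea behind the JobServer alert, the Python sketch below checks whether a host accepts TCP connections on a port and builds an email listing the unreachable servers. The host names, addresses, and the port number 8000 are assumptions for the example; use the ports defined in your own JobServer configuration.

```python
import socket
from email.message import EmailMessage


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def build_alert(down_servers, sender, recipients):
    """Build (but do not send) an email listing unreachable JobServers."""
    message = EmailMessage()
    message["Subject"] = f"{len(down_servers)} JobServer(s) unreachable"
    message["From"] = sender
    message["To"] = ", ".join(recipients)
    body = "\n".join(f"- {host}:{port}" for host, port in down_servers)
    message.set_content(
        "The following JobServers did not accept a connection:\n" + body
    )
    return message


if __name__ == "__main__":
    # Hypothetical JobServer hosts and port.
    servers = [("jobserver01.example.com", 8000), ("jobserver02.example.com", 8000)]
    down = [(h, p) for h, p in servers if not is_port_open(h, p)]
    if down:
        print(build_alert(down, "tac@example.com", ["admins@example.com"]))
```

    To actually deliver the message, it could be passed to smtplib.SMTP(...).send_message() using the same SMTP server that TAC is integrated with.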

     

  • JobServer as a service

    On Linux, install JobServer to run as a service. If JobServer is started as a script by any user other than the talend user, its functionality will be severely affected.

     

  • Drools – Memory optimization

    Drools is a RAM-intensive application. Unless there is a definite business requirement for customized business rules, turn off the Drools service to save significant memory on the TAC servers.

     

  • Leverage jstatd to optimize queries

    The jstatd tool provides an interface that allows remote monitoring tools to attach to Java Virtual Machines (JVMs) running on the host. Using jvisualvm (jvisualvm.exe on Windows), which is installed as part of the Java Development Kit, you can monitor the memory, CPU, threads, and classes of the Java process on each JobServer. This allows you to track high-usage intervals and manage your cluster more efficiently. Once the resource-intensive Jobs and workflows have been identified using jstatd, leadership can direct efforts toward tuning specific queries, which can significantly improve overall performance.
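
    Starting jstatd requires a security policy file. The Python sketch below writes the policy documented for JDK 8 and earlier (where jstatd is loaded from tools.jar) and assembles the launch command. The helper names and the target directory are hypothetical, and the default RMI registry port 1099 is used; note that this policy grants broad permissions, so restrict network access to the port accordingly.

```python
import os

# Policy text as documented for JDK 8 and earlier, where jstatd lives in tools.jar.
JSTATD_POLICY = (
    'grant codebase "file:${java.home}/../lib/tools.jar" {\n'
    "    permission java.security.AllPermission;\n"
    "};\n"
)


def write_policy(directory, filename="jstatd.all.policy"):
    """Write the jstatd policy file into directory and return its path."""
    path = os.path.join(directory, filename)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(JSTATD_POLICY)
    return path


def jstatd_command(policy_path, port=1099):
    """Build the argument list that starts jstatd with the given policy and port."""
    return ["jstatd", f"-J-Djava.security.policy={policy_path}", "-p", str(port)]


if __name__ == "__main__":
    policy = write_policy(os.getcwd())
    # Pass this list to subprocess.Popen on the JobServer host to start jstatd.
    print(" ".join(jstatd_command(policy)))
```

    Once jstatd is running on a JobServer host, the host can be added as a remote host in jvisualvm to view its JVMs.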

     

  • Metadata (Database) Management

    TAC is essentially a UI that presents metadata stored in an RDBMS. It is therefore crucial to tune the integration between TAC and the database so that the data displayed is current and accurate. To reduce latency, keep the database on the same network as the Talend setup. Schedule daily backups of the database using the backup capability offered by TAC. For Enterprise Edition databases, handle the backups on the database side. For a highly available environment, host this database on a separate server that is independent of the two TAC servers.

     

 

Conclusion

This leading practices document explores the gap between an organization's default Talend Big Data settings and the performance-optimizing techniques that can lead to much faster resolution of business issues. It highlights a few ways in which Talend administrators can tune environments to best suit their use case, along with tips for improving an existing Talend environment so that it can support complex analytical solutions.

 

Note that proper utilization of the Talend Big Data platform can prevent excessive cost overhead and expose lapses in existing hardware instances. It can also improve product reliability and enhance customer satisfaction by allocating resources properly to user Jobs.

 

Contacts

For more information about this article, please contact:

Shruti Daga

Consultant

Deloitte Consulting LLP

Deloitte C Block

RMZ Futura, Plot No 14 & 15

Hyderabad 500081, India

+1 615 718 7608

shrdaga@deloitte.com

 

Sanyyam Gupta

Consultant

Deloitte Consulting LLP

Deloitte C Block

RMZ Futura, Plot No 14 & 15

Hyderabad 500081, India

+1 615 718 4227

sanyygupta@deloitte.com

 

Suman Chakraborty

Senior Consultant

Deloitte Consulting LLP

Deloitte C Block

RMZ Futura, Plot No 14 & 15

Hyderabad 500081, India

+1 615 718 6627

sumchakraborty@deloitte.com

 

As used in this document, “Deloitte” means Deloitte Consulting LLP, a subsidiary of Deloitte LLP. Please see http://www.deloitte.com/us/about for a detailed description of our legal structure. Certain services may not be available to attest clients under the rules and regulations of public accounting.

 

This communication contains general information only, and none of Deloitte Touche Tohmatsu Limited, its member firms or their related entities (collectively, the "Deloitte Network"), is, by means of this communication, rendering professional advice or services. Before making any decisions or taking any action that may affect your finances, or your business, you should consult a qualified professional adviser. No entity in the Deloitte Network shall be responsible for any loss whatsoever sustained by any person who relies on this communication.

 

Copyright © 2017 Deloitte Development LLC. All rights reserved.

Last update: 10-06-2017 10:54 AM