Best practices for deploying Talend Jobs, and Frequently Asked Questions on libraries

Overview

Working with Talend products involves using many applications such as Talend Studio, managing deployments through Talend Administration Center (TAC), and using runtime environments like JobServers where Jobs run. This article provides best practices that Talend suggests you follow as you build, publish, and deploy your Jobs.

 

Prerequisites

  • Basic understanding of the purpose of different applications within Talend Platform such as Studio, Nexus, Git, TAC, CommandLine, and JobServers.
  • Knowledge about the software development life cycle.

 

Best Practices for deploying Talend Jobs

When Talend Studio builds a Job, it packages the Job in a Zip file before pushing it to an artifact repository such as Nexus, and every third-party library required to run the Job is included in the same artifact. To deploy these Job artifacts to JobServers, Talend recommends using artifact tasks to create the Job tasks that deploy and run them. The following diagram shows, step by step, how these components work together.
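
Because each Job artifact is a self-contained Zip file, you can check which libraries it bundles before deploying it. The following is a minimal sketch using only the Python standard library. The entry names in the sample archive (demo_job_run.sh, demo_job_0_1.jar, lib/log4j.jar) are hypothetical and only stand in for a real published Job Zip file.

```python
import io
import zipfile


def list_bundled_jars(archive):
    """Return the sorted paths of all JAR entries inside a Job archive.

    `archive` can be a file path or a file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        return sorted(n for n in zf.namelist() if n.lower().endswith(".jar"))


# Build a tiny in-memory archive that stands in for a published Job Zip.
# All entry names below are hypothetical.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("demo_job/demo_job_run.sh", "#!/bin/sh\n")
    zf.writestr("demo_job/demo_job_0_1.jar", b"")
    zf.writestr("lib/log4j.jar", b"")

for jar in list_bundled_jars(buffer):
    print(jar)
```

To inspect a real artifact, pass the path of the downloaded Zip file to list_bundled_jars instead of the in-memory buffer.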

Flow.png

 

Step 1: Connecting to Studio for the first time

When Talend Platform is freshly installed and you start Studio for the first time, Studio pulls all the necessary libraries from the Talend online Nexus repository at https://talend-update.talend.com/nexus/. Because these libraries are downloaded to the local machine, you can find them in the Studio_HOME\configuration directory. They are not pushed to the Nexus instance installed as part of Talend Platform, because TAC has not yet been configured with the necessary settings.

 

Step 2: Log in to TAC and configure

Within TAC, you configure the values for Artifact Repository and User Libraries in the Configuration section. Then you can create Users, Projects, and Project Authorization sections with Git credentials. See the article Custom libraries process for more information.

 

Note: The Git configuration in TAC is only required if you are building the artifacts from within TAC, and is not required when using Studio to build artifacts or for pushing/pulling content to source control servers.

 

Step 3: Creating Local/Remote Projects and working with Git/Nexus

When you connect to a local project, Studio has no access to TAC and therefore cannot reach Git, so the libraries on your local machine are not automatically synchronized with the Nexus instance deployed in the Talend installation.

 

Working in a collaborative environment usually requires a remote project, which you create by providing valid TAC credentials so that Studio can use the Git and Nexus connection details. When you connect to the remote project, Studio receives the Nexus details from TAC, and the libraries in its local .m2 repository are synchronized (uploaded and downloaded) automatically with Nexus.
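
The two-way synchronization can be pictured as a pair of set differences. This is only an illustrative sketch, not Studio's actual synchronization logic, which this article does not describe. The function name plan_sync and the sample library names are hypothetical.

```python
def plan_sync(local, remote):
    """Work out which libraries would move in each direction.

    `local` stands for the libraries in the local .m2 repository and
    `remote` for those already in Nexus; both are iterables of names.
    """
    local, remote = set(local), set(remote)
    return {
        "upload": sorted(local - remote),    # present locally, missing in Nexus
        "download": sorted(remote - local),  # present in Nexus, missing locally
    }


# Hypothetical library names, for illustration only.
plan = plan_sync(
    local=["commons-lang.jar", "custom-driver.jar"],
    remote=["commons-lang.jar", "team-shared.jar"],
)
print(plan)
```

With a local project there is no Nexus side at all, which is why no synchronization takes place in that case.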

 

It’s important that you follow some best practices for working with Git/SVN.

  1. Do not work directly on the master branch. Check out the project and work on a local copy instead.
  2. For features and defect fixes, create branches. Branching in TAC makes the branch available to all projects, so developers should take care in how they use branches across projects. This can cause confusion at first, so spend time defining the correct development workflow or software development life cycle (SDLC) before rolling out version control and artifact tools within teams.
  3. Use tagging to mark releases, so you can roll back to or check out a specific point in time if needed.
  4. Before merging a branch to master, ensure your local copy does not conflict with check-ins made by other members of the team.
  5. Usually, if a project has more than around 200 Jobs, Talend recommends looking for opportunities to split the project into multiple projects, so the loading time for Studio is reduced and doesn't become a problem.
  6. If antivirus software is in use, whitelist your libraries folder (STUDIO_HOME\configuration\.m2\repository) so it doesn’t get scanned each time, which causes delays during checkouts.
  7. Even when working in Shared Desktop environments like Citrix, it’s not recommended that you work remotely directly in Git. You should create a shared file system (mount) where the workspace can be accessible even when you log in with a different session.
  8. When working with Git, create branches and perform merges using Talend Studio or TAC rather than GitHub or similar interfaces.
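
To check a project against the guideline of roughly 200 Jobs, you could count the Jobs in a checked-out copy. The sketch below assumes that Job definitions are stored as "<name>_<version>.item" files under a "process" folder of the project directory. That layout is an assumption, so adjust the folder name and pattern to match your own project. The function names are hypothetical.

```python
import re
from pathlib import Path

JOB_LIMIT = 200  # approximate threshold quoted in the list above

# Assumption: one "<name>_<major>.<minor>.item" file per Job version.
ITEM_PATTERN = re.compile(r"^(?P<name>.+)_\d+\.\d+\.item$")


def count_jobs(project_dir):
    """Count distinct Job names, ignoring multiple versions of the same Job."""
    root = Path(project_dir, "process")
    if not root.is_dir():
        return 0
    names = set()
    for item in root.rglob("*.item"):
        match = ITEM_PATTERN.match(item.name)
        if match:
            names.add(match.group("name"))
    return len(names)


def should_consider_split(project_dir, limit=JOB_LIMIT):
    """Return True when the project holds more Jobs than the guideline."""
    return count_jobs(project_dir) > limit
```

The threshold is a guideline rather than a hard limit, so treat a True result as a prompt to review the project structure.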

 

Step 4: Publishing Binaries to Local Nexus

Publish directly to the Nexus SNAPSHOT folder while a Job is in development, and move to the RELEASE folder once the Job is stable. The Nexus details configured under Window > Preferences > Talend > Nexus > Artifact Repository are used only for publishing binaries to the Nexus repository. Studio does not use them for regular synchronization of third-party or other custom libraries. See the article Custom libraries process for more information.
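
One way to picture the SNAPSHOT and RELEASE split is through the Maven convention in which versions still under development carry a -SNAPSHOT suffix. The sketch below routes a version string on that basis. The repository names "snapshots" and "releases" and the function name are hypothetical, and this is not a description of how Studio itself chooses the target.

```python
def target_repository(version, snapshots="snapshots", releases="releases"):
    """Pick a repository name from a Maven-style version string.

    Versions ending in "-SNAPSHOT" are treated as development builds;
    everything else is treated as a stable release.
    """
    if version.endswith("-SNAPSHOT"):
        return snapshots
    return releases


print(target_repository("0.1.0-SNAPSHOT"))
print(target_repository("0.1.0"))
```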

 

Notes:

  1. All the JAR files necessary for running Jobs are packaged as part of the Job Zip file.
  2. The third-party JAR files get pushed to Nexus only when connected to the remote projects using TAC. This process is completely independent of the process to publish code from Studio to Nexus.
  3. No Java source code is included in the binaries unless you specifically select it when publishing the Jobs to Nexus. The binaries therefore contain only context, scripts, and JAR files with bytecode for execution.

 

Step 5: Create Artifact Tasks in TAC for deployment

When deploying Jobs to JobServers, use artifact tasks, because the binaries are then not regenerated from the Jobs. Instead, the binaries that were published to Nexus earlier are reused.

 

Step 6: Deploying to Execution Servers

Deploy the Job artifacts to JobServers and run them. The JobServer cache is best utilized when you use artifact tasks for deployment, since this avoids creating separate folders every time you deploy, as happens when you use Studio to deploy remotely to JobServers.

 

Notes:

  1. It’s better to utilize Virtual Servers in clustered mode, so TAC can find the best JobServer available to deploy and run the Job automatically. Note this is not an HA configuration, and Tasks on an automatic schedule will start from the beginning.
  2. An Execution Plan can be utilized for scheduling and handling checkpoint failures.
  3. Although normal task creation in TAC lets you build Jobs from Git or from an export made in Studio, tasks created from an artifact are the better choice.

 

Step 7: Logging

Log output, filtered by the Log4j logging levels set in the Jobs, is recorded in the ELK-based Logger. Data from the Statistics and FlowMeter components used in the Jobs is recorded in AMC.

 

Step 8: Monitoring & Dashboard Reports

AMC can be leveraged to use FlowMeter, Statistics, and Logs. The Logger built using the ELK stack can be leveraged to see the raw logs from log files and build custom dashboards in Kibana. It's a best practice to enable these in all the environments so that features can be learned and best utilized.

 

Frequently Asked Questions on libraries

  1. Question: If Studio does not publish Jobs to Nexus, how do the libraries get synchronized to the local Nexus?

    Answer: When you connect to a remote project in Studio, the libraries get synchronized automatically between the local .m2 folder and Nexus. See the article Custom libraries process for more information.

     

  2. Question: Since TAC uses CommandLine for deploying and running Jobs, why are the libraries not getting synchronized to CommandLine .m2 repositories?

    Answer: CommandLine repositories (Talend_Home\cmdline\studio\configuration) only get synchronized with libraries from Nexus when CommandLine is used for generating binaries using TAC Job Conductor or API, and not when binaries from Nexus are deployed directly to the JobServer. As a Talend best practice, it is important to note that CommandLine should not be used to build Jobs; Maven should be used instead.

     

  3. Question: What if CommandLine does not have all required libraries to run a Job, and the local Nexus also does not have them?

    Answer: If Studio has internet access and libraries are still missing, then the setup is incorrect. If the entire Talend installation, including Studios, has no internet access, the libraries must be copied manually to Nexus from an internet-enabled Studio. In both cases, see the article Custom libraries process for more information.

     

  4. Question: Why does deploying Spark Jobs from Studio take so much time?

    Answer: Normally, the Studio on which developers work is far from the edge server nodes of the JobServers. Therefore, Talend recommends using TAC to deploy binary artifacts, as TAC servers are usually part of the same VPN as the JobServers.

     

  5. Question: When running a Job from TAC or Studio, do the libraries get copied to the JobServer every time?

    Answer: Each publish of a Job creates a completely standalone package, including all required libraries. So the libraries are copied to the JobServer every time, but the JobServer maintains a cache of the libraries it needs.
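
The effect of such a cache can be illustrated with a small content-addressed store, where a library that arrives again in a later package is recognized by its checksum and not stored twice. This is purely an illustration of the idea. The article does not describe how the JobServer implements its cache, and the class, function, and file names below are hypothetical.

```python
import hashlib


class LibraryCache:
    """Toy cache that stores each distinct library content only once."""

    def __init__(self):
        self._by_digest = {}
        self.hits = 0
        self.misses = 0

    def store(self, content):
        """Store library bytes and return their SHA-256 digest."""
        digest = hashlib.sha256(content).hexdigest()
        if digest in self._by_digest:
            self.hits += 1  # already cached, nothing new to keep
        else:
            self._by_digest[digest] = content
            self.misses += 1
        return digest

    def __len__(self):
        return len(self._by_digest)


def deploy(cache, package):
    """Pass every library of a standalone package through the cache."""
    return {name: cache.store(content) for name, content in package.items()}


cache = LibraryCache()
deploy(cache, {"a.jar": b"A", "b.jar": b"B"})  # first Job package
deploy(cache, {"a.jar": b"A", "c.jar": b"C"})  # second package shares a.jar
print(len(cache), cache.hits, cache.misses)
```

Both packages are complete on their own, yet the shared library is kept only once on the receiving side.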

 

Reference

https://community.talend.com/t5/Architecture-Best-Practices-and/Custom-libraries-process/ta-p/30623

Version history

Revision 11 of 11, last updated 04-09-2018 07:12 PM