Working with Talend products involves several applications: Talend Studio for building Jobs, Talend Administration Center (TAC) for managing deployments, and runtime environments such as JobServers where Jobs run. This article describes the best practices that Talend recommends as you build, publish, and deploy your Jobs.
When Talend Studio builds a Job, it packages the Job in a Zip file before pushing it to an artifact repository such as Nexus. All the third-party libraries required to run the Job are included in the same artifact. To deploy these Job artifacts, Talend recommends using artifact tasks to create the Job tasks that deploy and run them on JobServers. The following flow describes, step by step, how these components work together.
When Talend Platform is freshly installed and you start Studio for the first time, Studio pulls all the necessary libraries from the Talend Online Nexus repository at https://talend-update.talend.com/nexus/. Because these libraries are downloaded to the local machine, they can be found in the Studio_HOME\configuration directory. They are not pushed to the Nexus installed as part of Talend Platform, because TAC has not yet been configured with the necessary settings.
Within TAC, you configure the values for Artifact Repository and User Libraries in the Configuration section. Then you can create Users, Projects, and Project Authorization sections with Git credentials. See the article Custom libraries process for more information.
Note: The Git configuration in TAC is only required if you are building the artifacts from within TAC, and is not required when using Studio to build artifacts or for pushing/pulling content to source control servers.
When you connect to a local project, Studio has no connection to TAC, so it cannot access Git, and the libraries on your local machine are not automatically synchronized with the Nexus deployed in the Talend installation.
Working in a collaborative environment usually requires a remote project: you provide valid TAC credentials so that Studio can use the Git and Nexus connection details held in TAC. When you connect to the remote project, Studio receives the Nexus details from TAC, and the libraries in its local .m2 repository are synchronized (uploaded and downloaded) automatically with Nexus.
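The two-way synchronization described above can be pictured as a set comparison. The following is only a conceptual sketch, not Studio's actual algorithm: the function name plan_sync and the library file names are hypothetical, and it assumes that a library is identified by its file name alone.

```python
from __future__ import annotations


def plan_sync(local: set[str], remote: set[str]) -> dict[str, list[str]]:
    """Work out which libraries would move in each direction.

    Conceptual illustration only: libraries present only in the local
    .m2 repository would be uploaded, and libraries present only in
    Nexus would be downloaded.
    """
    return {
        "upload": sorted(local - remote),
        "download": sorted(remote - local),
    }


# Hypothetical library names, used only for illustration.
local_m2 = {"commons-lang3-3.12.0.jar", "custom-udf-1.0.jar"}
nexus = {"commons-lang3-3.12.0.jar", "ojdbc8-19.3.jar"}
print(plan_sync(local_m2, nexus))
```

With a local project there is no remote set to compare against, which is why no synchronization takes place in that case.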
It’s important that you follow some best practices for working with Git/SVN.
Talend recommends publishing directly to the Nexus SNAPSHOT folder during development, and moving to the RELEASE folder once the Job is stable. The Nexus details configured in Window > Preferences > Talend > Nexus > Artifact Repository are used only for publishing binaries to the Nexus repository. Studio does not use them for the regular synchronization of third-party or other custom libraries. See the article Custom libraries process for more information.
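The SNAPSHOT and RELEASE split follows the usual Maven convention, in which a version ending in -SNAPSHOT is a development build. The sketch below illustrates that convention only; the function names and the "snapshots" and "releases" labels are assumptions for illustration, not Talend or Nexus settings.

```python
SNAPSHOT_SUFFIX = "-SNAPSHOT"


def target_repository(version: str) -> str:
    """Pick the repository kind for a Maven-style version string.

    "snapshots" and "releases" are illustrative labels, not the names
    of repositories in any particular Nexus installation.
    """
    return "snapshots" if version.endswith(SNAPSHOT_SUFFIX) else "releases"


def to_release_version(version: str) -> str:
    """Drop the -SNAPSHOT suffix once a Job is considered stable."""
    if version.endswith(SNAPSHOT_SUFFIX):
        return version[: -len(SNAPSHOT_SUFFIX)]
    return version


dev_version = "1.2.0-SNAPSHOT"
stable_version = to_release_version(dev_version)
print(dev_version, "->", target_repository(dev_version))
print(stable_version, "->", target_repository(stable_version))
```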
When deploying Jobs to JobServers, Talend recommends using artifact tasks, because they reuse the binaries that were already published to Nexus instead of regenerating them from the Jobs.
Deploy the Job artifacts to JobServers and run them. The JobServer cache is used most efficiently with artifact tasks, because they avoid creating a separate folder on every deployment, which is what happens when you deploy remotely to JobServers from Studio.
Depending on the Log4j logging levels set in the Jobs and on the Statistics and FlowMeter components used, log messages are recorded in the ELK-based Logger, and statistics and flow metrics are recorded in AMC.
AMC can be used to view FlowMeter, Statistics, and Logs data. The Logger, built on the ELK stack, can be used to view the raw logs from log files and to build custom dashboards in Kibana. It is a best practice to enable both in all environments so that teams become familiar with these features and make full use of them.
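An artifact task points at a published binary rather than at Job sources. As a rough illustration, assuming the standard Maven 2 repository layout and a release (non-snapshot) version, the relative location of a published Job Zip file can be derived from its coordinates. The function name and the group and artifact IDs below are hypothetical.

```python
def artifact_path(group_id: str, artifact_id: str, version: str,
                  packaging: str = "zip") -> str:
    """Build the relative path of an artifact in a Maven 2 layout.

    Assumes a release version; snapshot builds are commonly stored
    under timestamped file names, which this sketch does not model.
    """
    group_path = group_id.replace(".", "/")
    file_name = f"{artifact_id}-{version}.{packaging}"
    return f"{group_path}/{artifact_id}/{version}/{file_name}"


# Hypothetical coordinates, used only for illustration.
print(artifact_path("org.example.jobs", "load_customers", "1.2.0"))
```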
Question: If Studio does not publish Jobs to Nexus, how do the libraries get synchronized to the local Nexus?
Answer: When you connect to a remote project in Studio, the libraries get synchronized automatically between the local .m2 folder and Nexus. See the article Custom libraries process for more information.
Question: Since TAC uses CommandLine for deploying and running Jobs, why are the libraries not getting synchronized to CommandLine .m2 repositories?
Answer: CommandLine repositories (Talend_Home\cmdline\studio\configuration) are synchronized with libraries from Nexus only when CommandLine generates binaries through the TAC Job Conductor or the API, not when binaries from Nexus are deployed directly to the JobServer. Note that Talend's best practice is to build Jobs with Maven rather than with CommandLine.
Question: What if CommandLine does not have all required libraries to run a Job, and the local Nexus also does not have them?
Answer: If Studio has internet access and the libraries are still missing, then the setup is incorrect. If the entire Talend installation, including every Studio, has no internet access, the libraries must be copied manually to Nexus from an internet-enabled Studio. In both cases, see the article Custom libraries process for more information.
Question: Why does deploying Spark Jobs from Studio take so much time?
Answer: The Studio machines that developers work on are normally located far from the edge nodes that host the JobServers, so transferring the binaries between them is slow. Talend therefore recommends using TAC to deploy binary artifacts, because TAC servers are usually part of the same VPN as the JobServers.
Question: When running a Job from TAC or Studio, do the libraries get copied to the Job Server every time?
Answer: Each publish of a Job creates a completely standalone package, including all required libraries. So the libraries are copied to the JobServer every time, but the JobServer maintains a cache of the libraries it needs.
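The effect of such a cache can be illustrated with a small sketch. This is not the JobServer's actual implementation, which this article does not describe: the LibraryCache class, the choice of a SHA-256 content digest as the cache key, and the file names are all assumptions made for illustration.

```python
from __future__ import annotations

import hashlib


class LibraryCache:
    """Toy cache that stores each distinct library only once."""

    def __init__(self) -> None:
        # Maps a content digest to the library bytes (assumed keying scheme).
        self._store: dict[str, bytes] = {}

    def deploy(self, package: dict[str, bytes]) -> list[str]:
        """Receive a full package and return the names newly cached."""
        added = []
        for name, content in package.items():
            digest = hashlib.sha256(content).hexdigest()
            if digest not in self._store:
                self._store[digest] = content
                added.append(name)
        return added


cache = LibraryCache()
# Each package is standalone and carries every library it needs.
first = cache.deploy({"driver.jar": b"driver-bytes", "utils.jar": b"utils-bytes"})
second = cache.deploy({"driver.jar": b"driver-bytes", "extra.jar": b"extra-bytes"})
print(first)
print(second)
```

In this sketch every package still arrives complete, but only libraries not seen before are added to the cache on the second deployment.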