To connect each user's Studio to Git over SSH, repeat steps 1-10 for every user and register each key individually. Also, Studio is sometimes installed under one account and then launched from a different user account. In that case Studio looks for the keys, known hosts, and config files under the current user's ~/.ssh/ directory, where they may not exist. Resolution: Create the .ssh directory under your current user directory if it does not exist yet, then copy all the keys from the original account's ~/.ssh to the current user's ~/.ssh.
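The resolution above can also be scripted. The following is a minimal Python sketch, not part of the original steps: the function name copy_ssh_files and the example paths are hypothetical, and it assumes the current user is allowed to read the other account's ~/.ssh directory.

```python
import shutil
import stat
from pathlib import Path


def copy_ssh_files(source_dir, target_dir):
    """Copy every regular file from source_dir into target_dir.

    target_dir is created if it does not exist yet. Returns the copied names.
    """
    source = Path(source_dir)
    target = Path(target_dir)
    target.mkdir(mode=0o700, parents=True, exist_ok=True)
    copied = []
    for item in sorted(source.iterdir()):
        if item.is_file():
            destination = target / item.name
            shutil.copy2(item, destination)
            # Keep keys and config readable and writable by the owner only.
            destination.chmod(stat.S_IRUSR | stat.S_IWUSR)
            copied.append(item.name)
    return copied
```

For example, copy_ssh_files("/home/talend/.ssh", Path.home() / ".ssh") would copy the keys of a hypothetical talend account into the current user's directory.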
You can use a Cache or Hash component to load your rules, and then use them in combination with tJava or tJavaRow to apply the rules to the individual rows of data, splitting the column values on their delimiters.
Hi,
1st: In most cases, save does not work because the RSA key is not trusted. When this occurs, go back to the SSH key deployment section of your account. You may want to review the key, as there may be an issue with it. Another option is to add the key directly to the project instead of to the user. To do this, go to the project itself -> Settings -> Deploy keys and add the RSA key to the project.
2nd: Please check that the user and hostname are correct. For a URL such as email@example.com:arthur-talend/bitbucket-test.git, the git user must be replaced with the appropriate user, and the host with the correct hostname.
3rd: In a few cases, we have found that re-creating the repository after configuring SSH has worked.
Let us know if this helps. Thank you.
Working with Talend products involves using many applications such as Talend Studio, managing deployments through Talend Administration Center (TAC), and using runtime environments like JobServers where Jobs run. This article provides best practices that Talend suggests you follow as you build, publish, and deploy your Jobs.
Basic understanding of the purpose of different applications within Talend Platform such as Studio, Nexus, Git, TAC, CommandLine, and JobServers.
Knowledge about the software development life cycle.
Best Practices for deploying Talend Jobs
When Talend Studio does a build, it's very important to understand that Jobs are packaged in Zip files before they're pushed to an artifact repository like Nexus, and that all the necessary third-party tools required to run the Job are also present in the same artifact. When deploying these Job artifacts to JobServers, Talend recommends using artifact tasks to create Job tasks for deploying and running them on JobServers. The following flow depicts the step-by-step flow of how things work when using these various components.
Step 1: Connecting to Studio for the first time
When Talend Platform is freshly installed and you start Studio for the first time, the Talend online Nexus repository at https://talend-update.talend.com/nexus/ is invoked to pull all the necessary libraries. Because these libraries are pulled to the local machine, they can be found in the Studio_HOME\configuration directory. They are not pushed to the Nexus installed as part of Talend Platform, because TAC has not yet been set up with the necessary configuration.
Step 2: Log in to TAC and configure
Within TAC, you configure the values for Artifact Repository and User Libraries in the Configuration section. Then you can create Users, Projects, and Project Authorization sections with Git credentials. See the article Custom libraries process for more information.
Note: The Git configuration in TAC is only required if you are building the artifacts from within TAC, and is not required when using Studio to build artifacts or for pushing/pulling content to source control servers.
Step 3: Creating Local/Remote Projects and working with Git/Nexus
When you connect to a local project, it’s not possible to access Git because access to TAC is not available, so the libraries present on your local machine will not be automatically synchronized with the Nexus deployed in the Talend installation.
Working in a collaborative environment often makes it mandatory to create a remote project by providing valid TAC credentials so Studio can leverage the connection details of Git and Nexus. When you connect to the remote project, Studio receives the Nexus details from the TAC and so its local .m2 repository libraries will be synchronized (uploaded/downloaded) automatically with Nexus.
It’s important that you follow some best practices for working with Git/SVN.
Working directly on the master branch is not recommended. It is usually a best practice to check out and have a local copy of projects.
For features and defect fixes, it is recommended that you create branches. Branching in TAC makes that branch available to all projects, so developers should be more careful about how to leverage branches across projects. This can initially create confusion and it is important to spend time to define the correct development workflow or software development life cycle (SDLC) before rolling out version control and artifact tools within teams.
Tagging should also be used to mark specific releases, so that you can roll back to, or check out from, a specific point in time if needed.
Before merging a branch into the master branch, ensure your local copy is not in conflict with check-ins done by other members of the team.
Usually, if a project has more than around 200 Jobs, Talend recommends looking for opportunities to split the project into multiple projects, so the loading time for Studio is reduced and doesn't become a problem.
If antivirus software is in use, whitelist your libraries folder (STUDIO_HOME\configuration\.m2\repository) so it doesn’t get scanned each time, causing delays during checkouts.
Even when working in Shared Desktop environments like Citrix, it’s not recommended that you work remotely directly in Git. You should create a shared file system (mount) where the workspace can be accessible even when you log in with a different session.
When working with Git, it is recommended that you create branches and perform merges using Talend Studio or TAC, rather than using GitHub or other similar interfaces.
Step 4: Publishing Binaries to Local Nexus
It is recommended that you publish directly to the Nexus SNAPSHOT folder when in development, and move to the RELEASE folder once the Job is stable. The Nexus details that are configured in the Windows > Preferences > Talend > Nexus > Artifact Repository are only used for publishing binaries to the Nexus repository. They are not used by Studio for regular synchronization of third-party or other custom libraries. See the article Custom libraries process for more information.
All the JAR files necessary for running Jobs are packaged as part of the Job Zip file.
The third-party JAR files get pushed to Nexus only when connected to the remote projects using TAC. This process is completely independent of the process to publish code from Studio to Nexus.
No Java code is included in the binaries unless specifically selected when publishing the Jobs to Nexus. So the binaries only contain context, scripts, and JAR files with byte code for execution.
Step 5: Create Artifact Tasks in TAC for deployment
When deploying Jobs to JobServers, it is recommended that you use artifact tasks, because then you are not regenerating the binaries from Jobs and you are using the binaries that were published to Nexus earlier.
Step 6: Deploying to Execution Servers
Deploy the Job artifacts to JobServers and run them. You will find that the cache of JobServers is best utilized when using artifact tasks for deployment, since it avoids creating separate folders every time you deploy, as when you use Studio to remotely deploy to JobServers.
It’s better to utilize Virtual Servers in clustered mode, so TAC can find the best JobServer available to deploy and run the Job automatically. Note this is not an HA configuration, and Tasks on an automatic schedule will start from the beginning.
An Execution Plan can be utilized for scheduling and handling checkpoint failures.
Although TAC can also build Jobs from Git, or use an export from Studio, when creating normal tasks, tasks created from an artifact provide greater benefits.
Step 7: Logging
Depending on the LOG4J logging levels set in Jobs and the Statistics/FlowMeter components used, the corresponding logs are written to the ELK Logger and AMC, respectively.
Step 8: Monitoring & Dashboard Reports
AMC can be leveraged to use FlowMeter, Statistics, and Logs. The Logger built using the ELK stack can be leveraged to see the raw logs from log files and build custom dashboards in Kibana. It's a best practice to enable these in all the environments so that features can be learned and best utilized.
Frequently Asked Questions on libraries
Question: If Studio does not publish Jobs to Nexus, how do the libraries get synchronized to the local Nexus?
Answer: When you connect to a remote project in Studio, the libraries get synchronized automatically between the local .m2 folder and Nexus. See the article Custom libraries process for more information.
Question: Since TAC uses CommandLine for deploying and running Jobs, why are the libraries not getting synchronized to CommandLine .m2 repositories?
Answer: CommandLine repositories (Talend_Home\cmdline\studio\configuration) only get synchronized with libraries from Nexus when CommandLine is used for generating binaries using TAC Job Conductor or API, and not when binaries from Nexus are deployed directly to the JobServer. As a Talend best practice, it is important to note that CommandLine should not be used to build Jobs; Maven should be used instead.
Question: What if CommandLine does not have all required libraries to run a Job, and the local Nexus also does not have them?
Answer: If Studio has internet access, then the setup is incorrect. If the entire Talend installation, including Studios, has no internet access, the libraries need to be copied manually to Nexus from an internet-enabled Studio. In both cases, see the article Custom libraries process for more information.
Question: Why does deploying Spark Jobs from Studio take so much time?
Answer: Normally, the Studio on which developers are working is in a distant location compared to the edge server nodes of JobServers. Therefore, Talend recommends using TAC to deploy binary artifacts, as TAC servers are usually part of the same VPN where the JobServers are located.
Question: When running a Job from TAC or Studio, do the libraries get copied to the Job Server every time?
Answer: Each publish of a Job creates a completely standalone package, including all required libraries. So the libraries are copied to the JobServer every time, but the JobServer maintains a cache of the libraries it needs.
In today’s IT paradigm, where Cloud and On-Premise systems work in a distributed way, exposing APIs has become the norm. RESTFUL APIs, being light on both memory and bandwidth, are increasingly used to expose key business functionality. With security in mind, this article gives a tour of how to enable SAML-based authentication and authorization leveraging Talend components, the IAM service, and other supporting platform tools.
Basic knowledge of WS-Security and Apache CXF
An installed subscription version of Talend Studio, IAM, Runtime, and ESB modules
Basic understanding of the JSON format and how to parse it for extracting values and writing it
Knowledge of calling SOAP/REST web services with SAML-based authentication
RESTFUL API Creation and Creation of Build
A simple RESTFUL API that fetches meeting times from a CSV flat file and returns the content as XML is shown below.
Create the build by right-clicking the Job and selecting the Build Job option. Select the build type OSGI bundle for ESB.
Environment Setup, Deployment, and Configuration
After installing Talend Data Fabric (specifically ESB, TAC, IAM, and Runtime) ensure these services are up and running.
The following snapshot shows checking if the services are running:
If they aren't, start the services using this command:
service Talend-XXX start
Create Groups and Users in Talend IAM. When creating the users, be sure to map them to the employee roles as given below. You need to log in with proper Admin rights to create the necessary users.
To ensure that the STS and Authorization services are up and running in the Runtime service, log in to the Karaf client with the default credentials karaf/karaf (unless they have been changed).
Check that the services are up and running. If not, use the highlighted commands below to ensure they are running. Verify once started using the commands again.
Deploy the RESTFUL API Job, either using TAC (this is the recommended way) or by copying it directly to the /runtime/deploy folder. For the current exercise, copy it directly to the deploy folder.
Copy the flat file to the location configured in the Job. For the above example Job, use the following location:
Apache Syncope is required for the authentication and authorization scenario. In a default installation, TAC contacts Syncope to pull down the roles/groups used to create authorization policies. To have authentication use Syncope as well, type tesb:switch-sts-tidm in the Karaf container; this switches authentication from JAAS to Syncope.
Verification of users in TAC and resource access configuration
With TAC running, log on to: http://localhost:8080/org.talend.administrator-6.4.1.
Go to Users and change Type to ESB, then grant all Roles to the user you logged in as. Click Validate, then Save.
An ESB Infrastructure / Authorization tab will appear.
Important: If the setup is new, then you will not see an Authorization tab. Set up the user with the appropriate Type and Roles as shown below:
Resource configuration in TAC for RESTFUL webservice against roles/users from Talend IAM
In Authorization, under Roles, select all then select the Role (Group) you associated with the user in Syncope. If it fails, verify that the IAM service is working, and log in to re-verify.
Under Resources, click Add and Individual Resource.
For the resource, specify:
Click Show in the bottom bar. Change the default action to GET and click the role you have configured.
After some time, the authorization policy should be synced to the PDP repo, and the new invocation on the REST endpoint using the token retrieved from SoapUI should be authorized.
Testing the RESTFUL web service using SoapUI
You are going to use SoapUI to test the RESTFUL service, so if it’s not already installed you need to install it. For example, you can install SoapUI on CentOS 7 using the following instructions:
Creation of SOAP Request and Invocation
You can create the SOAP Request as shown in this snapshot, following the instructions given in this blog:
Before firing the SOAP request, you need to ask the STS for a SAML Token with the role information encoded in it. So in SoapUI, add the following to the RequestSecurityToken part of the request:
<Claims Dialect="http://schemas.xmlsoap.org/ws/2005/05/identity" xmlns="http://docs.oasis-open.org/ws-sx/ws-trust/200512">
<ClaimType Uri="http://schemas.xmlsoap.org/ws/2005/05/identity/claims/role" xmlns="http://schemas.xmlsoap.org/ws/2005/05/identity"/>
</Claims>
The username and password are the same ones you created in the Syncope portal.
The SOAP Response on successful invocation of STS is shown below:
Switching to RAW XML and extracting SAML Assertion
Once you have a successful response, then click on the RAW tab and copy the SOAP Body into an editor, stripping everything before <saml2:Assertion> and after </saml2:Assertion> at the end. It's important to use RAW before copying, as any whitespace change will break signature validation.
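The extraction step can also be done programmatically, which avoids accidental whitespace changes in an editor. The following is a minimal Python sketch, not part of the original procedure: the function name extract_assertion is hypothetical, and it assumes the raw response contains exactly one assertion using the saml2: prefix, as in the tags named above.

```python
def extract_assertion(raw_soap):
    """Return the <saml2:Assertion>...</saml2:Assertion> element unchanged.

    The text is sliced, never re-parsed or re-serialized, so every byte
    of whitespace inside the assertion is preserved.
    """
    start_tag = "<saml2:Assertion"
    end_tag = "</saml2:Assertion>"
    start = raw_soap.find(start_tag)
    end = raw_soap.find(end_tag, start)
    if start == -1 or end == -1:
        raise ValueError("no saml2:Assertion element found")
    return raw_soap[start:end + len(end_tag)]
```

Slicing the string instead of parsing it with an XML library is deliberate: re-serializing the XML could alter whitespace or attribute order and invalidate the signature.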
Deflating and Base64 encoding
To call the REST API using an authentication token, you need to deflate and base64-encode a SAML Message before sending it.
Copy the SAML Assertion as extracted above and go to: https://www.samltool.com/encode.php. Paste it into the first field, and click on Deflate and Encode the XML. Copy the resulting text in Deflated and Encoded XML.
Important: Deflating had to be done because OPEN UI does not provide direct support for calling a RESTFUL API with a SAML Token, as per their support tickets.
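If you prefer not to paste an assertion into an online tool, the same transformation can be sketched locally. The following Python example is an illustration under an assumption: that "Deflate and Encode" means raw DEFLATE compression (no zlib header) followed by Base64 encoding. The function names are hypothetical, and the result should be compared with the online tool's output before relying on it.

```python
import base64
import zlib


def deflate_and_encode(xml_text):
    """Raw-DEFLATE the XML text, then Base64-encode the compressed bytes."""
    # wbits=-15 asks zlib for a raw DEFLATE stream without header or checksum.
    compressor = zlib.compressobj(level=9, method=zlib.DEFLATED, wbits=-15)
    raw = compressor.compress(xml_text.encode("utf-8")) + compressor.flush()
    return base64.b64encode(raw).decode("ascii")


def decode_and_inflate(token):
    """Reverse of deflate_and_encode, useful for checking a token."""
    raw = base64.b64decode(token)
    return zlib.decompress(raw, -15).decode("utf-8")
```

Running decode_and_inflate on a generated token and comparing the result with the original assertion is a quick way to confirm that nothing was altered.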
Invocation of REST service using a curl command
curl -v -H "Authorization: SAML <token>" http://localhost:8088/webinar/meetingtimes
Replace <token> with the content you copied above from samltool.com. Ensure the RESTFUL URL in the curl command is as configured in the component of the Job Flow. Hit Enter to see the results.
Note: The POSTMAN Chrome plugin can also be used to send the same request as the curl command.
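The same request can be sent from a script. The following Python sketch mirrors the curl command using only the standard library; the function names are hypothetical, and the URL must match the one configured in your Job.

```python
import urllib.request


def build_saml_request(url, token):
    """Build a GET request carrying the deflated, Base64-encoded SAML token."""
    headers = {"Authorization": "SAML " + token}
    return urllib.request.Request(url, headers=headers, method="GET")


def call_service(url, token):
    """Send the request and return the HTTP status and the response body."""
    with urllib.request.urlopen(build_saml_request(url, token)) as response:
        return response.status, response.read().decode("utf-8")
```

For example, call_service("http://localhost:8088/webinar/meetingtimes", token) sends the request shown in the curl command. Note that urlopen raises urllib.error.HTTPError for 4xx and 5xx responses, so a rejected token surfaces as an exception.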
Hurray!! You can congratulate yourself that you have your RESTFUL API working with SAML Authentication/Authorization enabled using Talend IAM.
Having access control on data when executing Jobs on an execution server in Talend is one of the critical business requirements that clients look for in a platform. The need for compliance can be due to regulatory reasons or internal business confidentiality. So users often create separate Service accounts to manage the access control in the JobServer and execute Job tasks accordingly. But this can be cumbersome, time consuming, and limit audit capabilities if done outside the toolsets of Talend.
When using Talend Administration Center (TAC), one improvement is to deploy multiple JobServers, with each server mapped to an individual service account that controls access to data. Once you lock the service account and JobServer combinations, no user can be configured to use RUN AS capabilities for executing JobServer tasks, because sudo capability (in other words, running programs with the security privileges of another user) can be curtailed when creating the service accounts. Following this design approach, this article describes a simple Job design to automate the creation of multiple JobServers, and how to configure them in TAC to enable access control when executing the Jobs.
Creating a DI Job to create multiple JobServers
As shown in the diagram above, automating the creation of multiple JobServers involves the following steps.
Validate the context values in the properties file for ports and other details.
Fetch the JobServer Zip file from the shared folder and unarchive it in the current folder.
Create a copy of the extracted Zip file.
Modify the JobServer Properties files according to the configuration for the host machine.
Rename the JobServer according to the Service Account or other business conventions.
Note: This process is explained in the Talend Data Fabric installation guide provided as part of your product installation, and is available in the Talend Help Center. Navigate to Installing and Configuring Talend server modules, then scroll down to the Installing and configuring your JobServers section.
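To make the unarchive and rename steps concrete, here is a minimal Python sketch of the same idea outside of a Talend Job. It is not the Job from this article: the function name unpack_jobserver is hypothetical, and it assumes the JobServer archive is a Zip file that should be extracted into a folder named after the service account, next to the archive.

```python
import zipfile
from pathlib import Path


def unpack_jobserver(archive_path, service_account):
    """Extract the JobServer Zip into <archive folder>/<service_account>."""
    archive = Path(archive_path)
    target = archive.parent / service_account
    if target.exists():
        # Refuse to overwrite a JobServer that was already provisioned.
        raise FileExistsError(str(target))
    with zipfile.ZipFile(archive) as bundle:
        bundle.extractall(target)
    return target
```

Python's zipfile module does not restore Unix permission bits on extraction, so shell scripts in the extracted folder may need to be made executable again afterwards.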
Details of the configurable default values of the Job context are listed below:
Running the DI Job
From Studio, build the preceding Job as a binary so it can be used as a utility without further need for Studio.
Copy the binary utility to any appropriate place and extract it.
Navigate to the folder \src\main\resources\proj631\jobserver_0_1\contexts.
Edit and modify the Default.properties file to change the values according to the HOST environment where you plan to deploy the JobServer using the available ports. Importantly, ensure there are no firewalls or other software rules restricting access to the configured ports.
If you don’t want to modify the Default.properties file directly, you can set the parameters at runtime by passing them inline when you execute the Job, for example:
JobServer_run.bat --context=Default --context_param service_account=svc_talend --context_param MONITORING_PORT=8878 ……
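When the utility is launched from a script for several service accounts, the inline parameters can be generated instead of typed by hand. The following Python sketch builds the argument list in the form shown above; the function name build_job_command is hypothetical, and only the two parameters that appear in the example command are used.

```python
def build_job_command(launcher, context, params):
    """Return the launcher command as a list of arguments.

    Each entry of params becomes one "--context_param name=value" pair.
    """
    command = [launcher, "--context=" + context]
    for name, value in params.items():
        command += ["--context_param", "{}={}".format(name, value)]
    return command


example = build_job_command(
    "JobServer_run.bat",
    "Default",
    {"service_account": "svc_talend", "MONITORING_PORT": 8878},
)
```

The resulting list can be passed to subprocess.run, which avoids quoting problems when a value contains spaces.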
Running a JobServer as service
After the Job finishes successfully, a new folder will be created in the same folder where the Archive file is placed. The folder will be named according to the service account name given.
Navigate to the folder \svc_talend\Talend-JobServer-20170623_1246-V6.4.1\conf\opensuse_service.
Follow the instructions in the README.txt file to enable the JobServer as a Service.
Setting up TAC with JobServer / Service Accounts mapping to restrict access control
In TAC, access to resources is provided through the project(s). All the resources are tied to the project, and then you give users access to the project(s).
Setting up projects
In a Production TAC, you can create projects with the None storage option as shown below. In this case, there is no source repository like SVN or Git behind this project. The project is simply a Label to attach our resources to for deployment of the binaries and security purposes.
Depending on how closely you want to manage access to the JobServers, and thus to the Service Accounts, you may decide to create one or more such projects. You can even create 21 such projects to enable you to manage access to the JobServers on a very granular level.
Setting up project authorization
Once you have the projects created, you can assign Ops Users the Operation Manager role and give them Read access to the project(s). In the diagram below, User1 has access to two projects named TENANT1 and TENANT2.
Setting up JobServer agents
Set up one JobServer for each Service Account by duplicating the JobServer directory so that you have one directory for each JobServer, as shown below:
Edit the start_jobserver.sh and stop_jobserver.sh shell scripts to use the correct directory path for each JobServer.
Edit the TalendJobServer.properties file so that the three ports used for the command, file transfer, and monitoring ports are all different in all the JobServers.
Set up the RUN_AS_WHITELIST parameter for each JobServer, as shown below, to further ensure that no user other than the whitelisted one can execute Jobs through this JobServer.
Set up a user in the users.csv file that you will use when setting up the JobServer. This user is a JobServer User, and they can be different for each JobServer. This is to prevent anyone from connecting to the JobServer from outside the TAC, for example from the Studio, and submitting Jobs through that JobServer. Unless they provide the correct authorization, they cannot do so.
The whole JobServer folder for each Service Account is owned by that Service Account. So only a root user, or a user that is allowed to sudo to that Service Account, can see the configuration files.
The JobServer is set up to start under that Service Account.
The Service Account has access to all folders required for Jobs running under that Service Account to work.
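The requirement that the command, file transfer, and monitoring ports differ across all JobServers is easy to get wrong when directories are duplicated by hand. The following Python sketch checks for collisions. It is an illustration only: the function names are hypothetical, and it assumes that the port settings are the properties whose names end in _PORT, so adjust the filter to the actual keys in your TalendJobServer.properties files.

```python
def read_ports(properties_text):
    """Return {property name: port} for every key ending in _PORT (assumed)."""
    ports = {}
    for line in properties_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        if key.endswith("_PORT"):
            ports[key] = int(value.strip())
    return ports


def find_port_conflicts(jobservers):
    """jobservers maps a JobServer name to the text of its properties file.

    Returns a list of (port, first owner, conflicting owner) tuples.
    """
    owners = {}
    conflicts = []
    for name, text in jobservers.items():
        for port in read_ports(text).values():
            if port in owners:
                conflicts.append((port, owners[port], name))
            else:
                owners[port] = name
    return conflicts
```

An empty result means that every port is used by exactly one JobServer on the host.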
Creating a Keytab
You need a Kerberos keytab for each Service Account. The keytab is placed in the home directory of that Service Account, so only processes started under that Service Account will be able to access the keytab file.
As an example, details on how to create a keytab on Windows are available on this ktpass page, but follow the correct process for your operating system.
Setting up the JobServers in TAC
Set up the JobServers in the TAC with the correct ports as shown below.
Setting up JobServer authorization in TAC
Associate the Server in the TAC to the corresponding project(s). You can associate one JobServer to one project, or many JobServers to one project. This depends on the level of granularity you want. In the beginning, it may be easier to assign one JobServer to one project.
Managing access rights
Once you have set up all the projects and servers, limit the Rights of the Operation Manager to only the two items related to Job Conductor, as shown below.
Configuring Job properties
Since the Production TAC only has projects with the None storage option, you need to provide Jobs as Zip files or from Nexus. When the Job Conductor imports a Zip from disk or from Nexus, it looks at the jobInfo.properties file to know which project to link this Job to. This properties file is provided within the Zip file of the Job. If you need to change the project so that a project label for a different tenant can be attached to the Job, you can modify the project= attribute in this file to match your project in Production. This can be automated through a build process. For example, this allows a binary Job built from a project called xyz_dev to be attached to project TENANT1 in Production.
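As one possible way to automate this change in a build process, the following Python sketch copies a Job Zip and rewrites the project= line of jobInfo.properties on the way. It is an assumption-laden illustration: the function name retarget_job_zip is hypothetical, and it treats any entry named jobInfo.properties as the file to change, wherever it sits in the archive.

```python
import zipfile


def retarget_job_zip(source_zip, target_zip, new_project):
    """Copy source_zip to target_zip, replacing the project= value."""
    with zipfile.ZipFile(source_zip) as src, \
            zipfile.ZipFile(target_zip, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename.split("/")[-1] == "jobInfo.properties":
                # latin-1 round-trips every byte, so other lines stay intact.
                lines = data.decode("latin-1").splitlines(keepends=True)
                rewritten = []
                for line in lines:
                    if line.startswith("project="):
                        ending = line[len(line.rstrip("\r\n")):]
                        line = "project=" + new_project + ending
                    rewritten.append(line)
                data = "".join(rewritten).encode("latin-1")
            dst.writestr(info, data)
```

For example, retarget_job_zip("job_xyz_dev.zip", "job_tenant1.zip", "TENANT1") would produce a copy of a hypothetical job_xyz_dev.zip that the Job Conductor links to the TENANT1 project. The original archive is left untouched.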
Creating Tasks and Execution
When creating and editing tasks in the Job Conductor, users are only allowed to associate a Job with a JobServer that they have access to. So a user can never use the RUN AS feature of TAC to run a Job against a Service Account that they do not have access to.