Talend MetaServlet is a REST web service that allows you to administer Talend Administration Center (TAC) programmatically. You can perform any TAC function from a script or through the REST API. For example, you can generate, build, and export Jobs developed in Talend Studio onto Job servers, or you can export Services, Routes, and data service Jobs onto Runtime servers.
Talend MetaServlet provides two access points:
Using the Console:
Using the REST API:
http://localhost:8080/org.talend.administrator/metaServlet?<Base64-encoded UserRequest>
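As a sketch of the REST access point (the host, credentials, and listProjects action below are placeholder assumptions, not values from this article), a UserRequest can be Base64 encoded and appended to the URL like this:

```shell
# Hypothetical listProjects request; replace host and credentials with your own.
request='{"actionName":"listProjects","authUser":"admin@company.com","authPass":"admin"}'

# Base64-encode the JSON body (-w 0 disables line wrapping on GNU coreutils)
encoded=$(printf '%s' "$request" | base64 -w 0)

# The encoded request is appended directly to the metaServlet URL:
echo "http://localhost:8080/org.talend.administrator/metaServlet?$encoded"

# The call itself could then be issued with curl, for example:
# curl -s "http://localhost:8080/org.talend.administrator/metaServlet?$encoded"
```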
The runTask and migrateDatabase actions support two execution modes:
Synchronous: the command waits for the response, so the result is returned in the same call.
Asynchronous: the command does not wait for the response, so you need to execute a separate query to get the status of the previous command.
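As an illustration of the non-waiting mode (the task ID, credentials, and the execRequestId placeholder below are assumptions for this sketch), a runTask call that does not wait is followed by a separate status query:

```shell
# Launch the task without waiting for it to finish.
run_req='{"actionName":"runTask","authUser":"admin@company.com","authPass":"admin","taskId":42,"mode":"asynchronous"}'

# Later, poll for the outcome with a second request, using the execution
# request ID returned by the first call.
status_req='{"actionName":"getTaskExecutionStatus","authUser":"admin@company.com","authPass":"admin","taskId":42,"execRequestId":"EXEC_REQUEST_ID_FROM_FIRST_CALL"}'

echo "$run_req"
echo "$status_req"
```

Each JSON body would be Base64 encoded before being sent through the REST access point, or passed as-is via the -json-params option of the shell caller shown later in this article.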
Talend MetaServlet can do a large number of tasks:
Manage Users and Usergroups:
createUser, listUsers, getUserInfo, deleteUser, createAuthorization, userExist, resetPassword, addUsersToUserGroup, createUserGroup, updateUserGroup, removeUsersFromUserGroup, listUserGroups, setRoleLimitation
Create, Update, Delete, and List Projects, Project References, and Branches:
createProject, deleteProject, projectExist, updateProject, listProjects, createSandboxProject, enableSandboxProject, createServerProjectAuthorization, createTag, deleteProjectReference, createProjectReference, deleteBranch, createBranch
Create, Update, Delete, and List Tasks, Triggers, and Plans:
createTask, associatePreGeneratedJob, updateTask, deleteTask, getNameByTaskId, getTaskIdByName, runTask, stopTask, createTrigger, updateTrigger, deleteTrigger, listTrigger, requestPauseTriggers, requestResumeTriggers, deletePlan, getTaskExecutionStatus, runPlanFromNode, runPlan, importExecutionPlan, listExecutionPlans, getTaskStatus, getTasksRelatedToJobs, listTasks, taskLog, pauseTask, requestDeploy, requestGenerate
Manage Servers and Virtual Servers:
addServer, updateServer, listServer, removeServer, createVirtualServer, updateVirtualServer, addServersToVirtualServer, listVirtualServers, removeVirtualServer, removeServerProjectAuthorization, removeServersFromVirtualServer
Other useful commands:
branchExist, help, setLicenseKey, getLicenseKey, setLicenseKeyByValue, revokeLicenseKeyByValue, migrateDatabase, getLibLocation, getUpdateRepositoryUrl
In addition, there are a number of useful commands for ESB projects.
An example shell script
The following code is an example of how to use MetaServlet commands in a shell script. This example checks whether a project exists by using the "projectExist" command:
Get some input parameters such as the TAC details:
#Write some input parameters.
# (the values below are placeholders; replace them with your own TAC details)
# link tac variable
tac=http://localhost:8080/org.talend.administrator
# userId tac variable
userId=admin@company.com
# password tac variable
password=admin
# contextName tac variable
contextName=tac631
Input some parameters into the shell script. These can be used anytime in the script:
# input parameters for script shell
echo projectName: $1
echo type import: $2
echo nexusArtifactId: $3
echo nexusGroupId: $4
echo nexusJobVersionSuffix: $5
Check whether a project exists by using the "projectExist" MetaServlet command:
#Check whether a project exists
outid=$(/talend/app/Talend-6.4.1/apache-tomcat-8.5.15/webapps/tac631/WEB-INF/classes/MetaServletCaller.sh -url $tac -json-params='{"actionName":"projectExist","authUser":"'$userId'","authPass":"'$password'","project":"'$1'"}')
echo $outid
You can extend this script to do whatever you require.
Note: When using the REST API rather than a script, all User Requests (commands) must be Base64 encoded. This is not necessary when using a shell script to issue commands.
This article describes how to use Talend Data Preparation (Data Prep) as part of a data ingestion pipeline. This is a common business use case in which data from various disparate systems is required to be translated to a common format, standardized, reformatted, validated, and then passed in that standard format to other systems.
This example shows how you can build a Talend Job to ingest data from a database, convert it to a common format, run a Talend Data Preparation against that data, and then pass it on to other systems, in this case exporting it as both a Comma-Separated Values (CSV) file and an MS Excel file. Any of these systems can, of course, be replaced; what remains common is translating the data to a shared format and running a data preparation.
Building the Talend Job
The Job you will build is shown below:
Each component will be dealt with separately.
Input the Data
In this example, you will input data from a MySQL database. You set up a database table in MySQL using some example data; a section of the data is shown below. This shows the first part of your use case, getting data from a system in a specific format. You could use any number of different systems, but this example uses a database.
To represent that data in Talend you need to define a schema, as shown below:
You will then use the tMysqlInput component to get the data from this table. The component configuration is shown below:
Transform the Data
You now need to transform the data into the common format you are using in your use case. This common format is shown below, and contains the following columns:
This is your common format. In this use case, you want to translate ALL incoming data into this format. To do so, use a tMap component and map the fields as shown below:
Run a Data Preparation
The next step is to use a Data Preparation to run against the data. To do so, use the tDataprepRun component. On the component configuration tab, you can either use an existing preparation, or create a new one. In this example, you will create a new one.
On the Component tab, click Create a new one, then select the Edit preparation in your browser icon.
The Data Preparation home page window opens in your web browser and the Data Preparation you just created is displayed:
Select the Data Preparation to edit it. You can choose any number of things to do to the data; this example's use case is shown below:
This example shows selecting a few simple functions for a small number of columns in the data. You may choose to follow this example or choose your own.
Manage the Output
In this example, you will send the data to two different places and formats. You will use the tReplicate component to send multiple copies of the output to one comma separated value file and one MS Excel spreadsheet. This is shown below:
Configure the components like this:
Replicate the input/output schema in the tReplicate Component:
Configure the final output components like this:
For the tFileOutputExcel component:
For the tFileOutputDelimited component:
Run the Job
The Job is now built and configured. The next step is to run it, as shown below:
Once run, you should get output similar to this:
Examine the Output
In this example, you have a dataset containing 6040 records. You can now examine the output files, either by opening them or using the Talend Data Viewer:
Comparing this to the input data, you can see that the data has indeed been transformed by running through a Data Preparation.
The correct selection of a blocking strategy, and in particular, the choice of Blocking Keys is crucial to the success of any data matching project, whether using Talend DQ or MDM. This article will outline the principles that should be used when making such a selection.
Data Matching Overview
Data Matching is based on the idea of sorting. It is very difficult to match items within large ‘piles’ or ‘blocks’; it is much more efficient to sort the data into smaller blocks of similar attributes. It is akin to the ‘matching the socks’ problem: it is much easier and quicker to match socks if you first sort them into a number of piles, or blocks, of, say, similar colors, and then look for identical socks within those blocks. The same principle applies to all data matching problems: data sets should first be sorted into blocks of similar or identical attributes. What is important is that these blocks should all be of a similar size, or as close as possible, and that not too many or too few blocks are chosen.
So, how do you choose those blocks, and how many do you need? One principle of data matching states that you should always match on attributes that are unlikely to change. It is less useful to match a person based on their address, which is likely to change over time, than on their name, which is not. All things, whether persons or objects, have attributes that are less likely to change than others. For an object such as a cup, the attributes that identify that particular cup could be size, shape, color, and so on. For a person, they are likely to be first name, surname, sex, date of birth, and social security number, for example. These attributes are much less likely to change than address or phone number. The first step in selecting blocking keys is to select the attributes you will block on.
In the case of a person, there are likely to be around half a dozen attributes to choose from, and the same number is probably realistic for objects too. This is a good number, not too many and not too few. Blocking on these attributes automatically reduces the work from comparing every record against every other record to comparing records only within a manageable number of blocks.
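The effect of blocking on workload can be seen with some simple arithmetic (the record and block counts below are illustrative, not from any particular dataset):

```shell
# Pairwise comparisons needed with and without blocking, for 10,000
# records split into 100 equally sized blocks.
n=10000
blocks=100
per_block=$((n / blocks))

no_blocking=$(( n * (n - 1) / 2 ))
with_blocking=$(( blocks * per_block * (per_block - 1) / 2 ))

echo "Without blocking: $no_blocking comparisons"   # 49995000
echo "With blocking:    $with_blocking comparisons" # 495000
```

A hundred-fold reduction, which is also why evenly sized blocks matter: one oversized block would dominate the total comparison count.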
Matching and Blocking in Talend Components
In Talend DI and Talend MDM, there are two different but related ways to set up blocking and matching. The sections below will deal with each of these separately.
Blocking and Matching in Talend DI/DQ
The following DI Components are used to do data matching:
Each of these components is fully described and detailed in the Talend Component Reference Guide. This article is concerned with the tRecordMatching component.
This component joins two tables by matching on several columns, using a wide variety of comparison algorithms. It compares columns from the main flow with reference columns from the lookup flow and, according to the matching strategy you define, outputs three possible results: the matched data, the possible matched data, and the rejected or ‘non-matched’ data. In other words: records that match, those that don't, and those you are uncertain about. The confidence levels of your matching allow you to define matching and non-matching thresholds. When deciding your matching strategy, the user-defined matching scores are critical in determining which of the three possible outcomes applies. Above the matching threshold, records are matched; below the non-matching threshold, records are not matched. Those in between require manual intervention; ideally, you want to keep these to a minimum.
As an overview, the tRecordMatching component is configured thus:
The 'Blocking Selection’ columns are in effect the Blocks. There are Matching functions that can be selected within each Block, and there is also a ‘Weight’ assigned to each Matching Function.
Matching is done in the way described above in the theory of matching. Records are matched within the various Blocks, and each Match is rated with a weight, or ‘measure of importance or confidence’ in that match. The choice of matching algorithm within each block is varied, and would depend on the field that is being matched. As described above, you should try to choose fields whose values are unlikely to change. The simple example above shows blocking on Firstname and Lastname. For Matching, we have selected ID Number, Date of Birth, and Sex. Slightly more weight is assigned to ID Number over DOB, as we have more confidence in that value. We assign a low weight to Sex as this can easily be mixed up or missed (in this simple dataset). In this example, we therefore put people with the same names into the same blocks, and then we match within those blocks. Note, this is an example. The strategy adopted is very dependent on the dataset being used. Quality of the data is hugely important.
At the simplest level, the matching works by assigning a probability to the match on each attribute, then multiplying that probability by the attribute's weight to build up a total score for each match.
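A toy calculation along those lines (the probabilities and weights below are made up for illustration; they are not Talend internals):

```shell
# Weighted match score: each attribute's match probability is multiplied
# by its weight, the products are summed, and the sum is normalised by
# the total weight to give an overall confidence for the record pair.
awk 'BEGIN {
  score  = 0.95 * 3   # ID number:     strong match, weight 3
  score += 0.90 * 2   # date of birth: strong match, weight 2
  score += 0.50 * 1   # sex:           unreliable,   weight 1
  total_weight = 3 + 2 + 1
  printf "normalised score: %.3f\n", score / total_weight
}'
```

The resulting score is then compared against the matching and non-matching thresholds to decide which of the three outputs the record pair is sent to.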
As discussed, there can only be three possible outcomes from any matching Job:
Records that match
Records that don’t match
Records we are uncertain about (and usually need manual checking)
The weights, as you have seen, are used to set those thresholds for matches, non-matches, and uncertain or possible matches. This is done in a Talend Job and is shown below. Here you are matching persons against a list of existing customers to see if they exist already. After matching, there are three different results and these records are sent down three different paths as shown:
From experience, for data containing personal information, the best attributes to match on would be (listed in descending order of importance):
Personal Identifier of some sort, for example NI Number or SSN
Date of Birth
Last part of the address
Phone number/email address
The matching strategy being used is very dependent on the data being used. A fundamental principle is that data used for matching should be of high quality. It should be as standardized and complete as possible. For example, there should be no missing fields if possible. Unstandardized data with missing fields will result in a very low quality of matches. The old adage of 'Garbage In = Garbage Out' applies.
Again, the most important thing to remember is that the matching fields you choose should be attributes in the data that are unlikely to change, and therefore uniquely identify the object in question.
Blocking and Matching in Talend MDM
In Talend MDM, matching is done by using a built-in algorithm, which is a type of Entity Resolution algorithm called a ‘Swoosh’ Algorithm, technically the ‘T-Swoosh’ variant of that algorithm.
Blocking is achieved using blocking keys, which can be defined. Data is matched to reference data on the MDM server, it is then standardized, blocked using blocking keys, and then matched using the T-Swoosh algorithm. Again, three possible outcomes are possible. In MDM we define them as Matches, Unmatched, and Suspects. The diagram below illustrates the MDM matching process.
Blocking keys in MDM are defined for each entity that will need probabilistic matching. To do this, you need to create a new element in the MDM Data Model to store the Blocking Key value. Note: it should be made non-visible in the MDM Web UI, since you don’t want anyone changing it.
The next step is to create a trigger to automatically compute a new blocking key when creating/editing a new entity instance. In the example below, a blocking key is generated for the person “John Smith” based on his name and phone number.
The generated key uses the first letter of his First name, the first three letters of his Surname and the first part of his phone number. The blocking key thus generated is “JSMI0141”.
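The same rule can be sketched in a few lines of shell (this is just an illustration of how the key is constructed, not actual MDM trigger code; the phone number is a made-up example):

```shell
# Blocking key = first letter of the first name + first three letters of
# the surname + first segment of the phone number, upper-cased.
first_name="John"
surname="Smith"
phone="0141 496 0000"   # hypothetical UK-style number

key=$(printf '%s%s%s' "${first_name:0:1}" "${surname:0:3}" "${phone%% *}" \
      | tr '[:lower:]' '[:upper:]')
echo "$key"   # JSMI0141
```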
Now of course, there is a significant trade-off here. Blocking keys must create groups small enough to lower the number of comparisons that MDM needs to do, but blocking groups must be big enough not to drop “true positives”. In a nutshell, it is all about Accuracy vs Performance.
The choice of blocking key is very dependent on the fields in the data. As already discussed, ideally you should use attributes that are unlikely to change. Name is a good choice, but phone number (as in the example above) is not always the best choice. How often do people change their phone numbers? A better choice would be to include date of birth or sex, if these fields exist. It’s all about what is in the data and selecting the best fields for matching.
Another consideration is not to choose fields that can be nullable. This will seriously affect the performance of your matching, as they will produce keys that are incomplete. Also, do not choose fields that are primary keys in your database.
In conclusion, there are a number of considerations to weigh when selecting a blocking strategy for your data matching. This article outlined the most important ones.