Talend Data Preparation is a self-service application that enables you to simplify and expedite the time-consuming process of preparing data for analysis or other data-driven tasks.
This article explains the best practices that Talend suggests you follow when working with Talend Data Preparation.
While working with large datasets, various inputs, and large teams, it is important to classify datasets and preparations. Talend recommends the following best practices to categorize the artifacts.
Although naming conventions vary by person and organization, following an agreed convention makes it significantly easier for those who come after you to understand what the system does and how to fix or extend it for new business needs. While working with Data Preparation, the best practice is to follow the agreed naming standards for folders, preparations, datasets, and context variables.
Use the following guidelines to name folders for Preparations:
Use camel case
Separate with underscores
Do not use whitespace
Use only alphanumeric characters
Avoid general folder names
Avoid short forms
Preparations and datasets are typically local to a project, so you can set their naming conventions either globally at the organization level, or locally at the project level. Ensure that the naming conventions are strictly followed. Some guidelines are:
Include the name of the extracted source
Prefix or suffix the dataset name with the extraction date
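As an illustration (not a Talend API), the naming guidelines above can be encoded in a small helper that derives a dataset name from the extracted source name and the extraction date; the helper name and date format are assumptions:

```python
from datetime import date

def build_dataset_name(source: str, extracted: date) -> str:
    """Build a dataset name following the guidelines above:
    underscores instead of whitespace, alphanumeric characters only,
    and the extraction date as a suffix (illustrative convention)."""
    cleaned = source.strip().replace(" ", "_")
    cleaned = "".join(ch for ch in cleaned if ch.isalnum() or ch == "_")
    return f"{cleaned}_{extracted.strftime('%Y%m%d')}"

print(build_dataset_name("customer accounts", date(2024, 1, 15)))
# customer_accounts_20240115
```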
Guidelines for using context variables while calling data preparations from Talend Data Integration or Big Data Jobs are:
Create additional contexts for project-specific requirements
Limit the number of additional contexts you create to fewer than three per project
Prefer a common context group over many project-specific contexts
Give context variables descriptive names
Avoid one-character context variables, for example, a, b, c
Avoid generic names like var1 or var2
Use folder structures to group items with similar categories or behaviors. Because folder structures are specific to each project, Talend recommends that you define them in the project's initial phases. Figure 6 shows an example of a folder structure used in a bank, where the folders are divided by module. Group datasets by:
Data profiling and data discovery allow you to analyze and identify the relationships between your data. This section explains some of the best practices for discovering and profiling data.
Picking the right data is about finding the data best suited for a specific purpose. It is important to note that this should not only be about finding the data you need right now, but it should also make it easier to find data later, when similar needs arise. Best practices for picking the right data are:
Explore and find the data best suited for a specific purpose
Avoid data with multiple nulls or same/repeated values
Select values close to the source - avoid calculated or derived values
Avoid intermediate values
Extract data across multiple platforms
Determine data suitability (for example, discovery, reporting, monitoring, and decision making)
Filter data to select a subset that meets the rules and conditions
Know the source of the data so that you can source it repeatedly
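The "avoid data with multiple nulls or repeated values" guideline above can be checked programmatically before a dataset is imported. A minimal sketch, assuming the `column_quality` helper and its metrics as illustrative conventions (this is not a Talend Data Preparation function):

```python
def column_quality(values):
    """Return the null ratio and the share of the most frequent non-null
    value for one column; a high value on either metric suggests the
    column is a poor pick (illustrative heuristic only)."""
    nulls = sum(1 for v in values if v is None or v == "")
    non_null = [v for v in values if v is not None and v != ""]
    top_share = (max(non_null.count(v) for v in set(non_null)) / len(non_null)
                 if non_null else 0.0)
    return nulls / len(values), top_share

# Toy income column with nulls and heavy repetition, as in Figure 7.
incomes = [50000, None, None, 50000, 50000, "", 62000, 50000]
null_ratio, repeat_ratio = column_quality(incomes)
print(null_ratio, repeat_ratio)  # 0.375 0.8 -- many nulls and repeats
```

Columns scoring high on either ratio are candidates to discard or to re-source.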
Figure 7 shows some guidelines for what to avoid while picking the right data. This sample dataset of 10,000 employee income records has multiple null values, negative values for defaulters, and repeating names and addresses. This data is of poor quality and should be discarded. Bring in additional sample data to ensure you are picking the right data.
Understanding data is essential in assessing data quality and accuracy. It is also important to check how the data fits with governance rules and policies. Once you understand the data, you can determine the right level of quality for the data. Best practices for understanding the data are:
Learn data, file, and database formats
Use visualization capabilities to examine the current state of the data
Spot irregularities and inconsistencies in the data
Use profiling to generate data quality metrics and statistical analysis of the data
Understand the limitations of the data
As highlighted below, Talend Data Preparation assists in the process of understanding data.
Data preparation always starts with a raw data file, which comes in many shapes and sizes. Mainframe data is different from PC data, spreadsheet data is formatted differently from web data, and so forth. In the age of big data, there is a lot of variance in source files.
Ensure that the data types used are accurate. Look at what each field contains. For example, if a field is typed as a number, check that it actually contains numeric values rather than, say, a phone number or postal code. Likewise, a field typed as character should not contain exclusively numeric data.
Talend Data Preparation displays the detected data type for each column of the successfully read input, as shown in Figure 9.
Note: By using a data dictionary, you can set the type needed for every column.
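A quick way to apply the same type check outside the tool is to scan a column declared as numeric for values that fail to parse. A sketch under that assumption (the `invalid_numeric` helper is illustrative, not a Talend API):

```python
def invalid_numeric(values):
    """Return the values that cannot be parsed as numbers -- a quick
    check that a column declared numeric really holds numeric data."""
    bad = []
    for v in values:
        try:
            float(v)
        except (TypeError, ValueError):
            bad.append(v)
    return bad

# A postal code or phone number slipping into a numeric column is caught:
print(invalid_numeric(["42", "7.5", "90210-1234", "(555) 123-4567"]))
# ['90210-1234', '(555) 123-4567']
```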
Data integration involves combining data residing in different sources and providing users with a unified view of them. Talend Data Preparation provides a platform where you can integrate data while discovering and profiling. This section explains some of the best practices to keep in mind while integrating data.
Once you have assessed the data's quality and accuracy, and have determined the right level of quality for the purpose of the data, as a best practice you must improve the data by:
Cleansing the data
Noting missing data
Performing identity resolution
Refining and merging-purging the data
Data Preparation offers numerous functions for improving the data as shown in Figure 10.
A powerful feature of Data Preparation is the ability to integrate datasets. This takes data preparation to the next level, as now a business can perform simple joins and lookups while preparing rules. As a best practice, integrate data to suit the following needs:
Validating new sources
Integrating and blending data with data from other sources
Restructuring the data according to the needed format for business intelligence, integration, blending, and analysis
Transposing the data
The following screenshot is an example of combining two datasets in Data Preparation.
The following best practices describe the techniques to keep in mind while cleansing, standardizing, and shaping data.
Talend Data Preparation is a powerful tool enabling business users to transform their data. Most of the simple yet important transformations can now be applied with simple clicks. As a best practice, Talend recommends:
Creating generalized rules to transform data
Applying transformation functions to structured and unstructured data
Enriching and completing the data
Determining the levels of aggregation needed to answer business questions
Using filters to tailor data for reports or analysis
Incorporating formulas for manipulation requirements
While making preparations, ensure that the data is accurate and that it makes sense. This is an important step and requires some knowledge of the subject area that the dataset relates to. There is no single approach to verifying data accuracy.
The basic idea is to formulate some properties that you think the data should exhibit, and test the data to see if those properties are satisfied. Essentially, you are trying to figure out whether the data really is what you have been told it is. In this example, the ID always has to be an 18-digit number, so there is a preparation to validate the ID length.
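For the 18-digit ID rule described above, the equivalent property test can be sketched with a regular expression. The function name is an illustrative assumption; in Data Preparation itself you would express this as a preparation step:

```python
import re

# The rule from the example above: an ID must be exactly 18 digits.
ID_PATTERN = re.compile(r"^\d{18}$")

def is_valid_id(value: str) -> bool:
    """True only for strings consisting of exactly 18 digits."""
    return bool(ID_PATTERN.match(value))

print(is_valid_id("123456789012345678"))  # True: 18 digits
print(is_valid_id("12345"))               # False: too short
```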
Outliers are data points that are distant from the rest of the distribution: either very large or very small values compared with the rest of the dataset.
Outliers are problematic because they can severely compromise the outcome of an analysis. For example, a single outlier can have a significant impact on the value of the mean, because the mean is supposed to represent the center of the data; in a sense, that one outlier renders the mean useless.
When faced with outliers, the most common strategy is to delete them. However, it depends on the individual project requirements.
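One common way to detect outliers before deciding what to do with them is the 1.5 * IQR (interquartile range) rule. A self-contained sketch, with a simplified linear-interpolation quartile (all names here are illustrative, not Talend functions):

```python
def iqr_outliers(values):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
    Deleting the flagged points is one strategy; the right choice
    depends on the project."""
    xs = sorted(values)

    def quartile(q):
        # Linear-interpolation percentile over the sorted sample.
        pos = q * (len(xs) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(xs) - 1)
        return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

    q1, q3 = quartile(0.25), quartile(0.75)
    spread = 1.5 * (q3 - q1)
    return [v for v in values if v < q1 - spread or v > q3 + spread]

print(iqr_outliers([10, 12, 11, 13, 12, 11, 300]))  # [300]
```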
Talend Data Preparation identifies outliers and makes it easy to apply the functions shown in Figure 14 to handle them.
Data enrichment is a value-adding process that provides the customer with more information about the data. Use the following methods to enrich data.
Missing values pose a risk to the data being analyzed and are probably one of the most common data problems you will encounter. As a best practice, Talend recommends that you resolve missing values. The right method depends on the project, but you can:
Replace the missing values with an appropriate value
Replace them with a flag to indicate a blank
Delete the row/record
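The three options above can be sketched on a toy record set (the data and field names are purely illustrative):

```python
records = [
    {"name": "Ana", "income": 52000},
    {"name": "Raj", "income": None},
    {"name": "Lee", "income": 61000},
]

# Option 1: replace the missing value with an appropriate value (here, 0).
filled = [dict(r, income=r["income"] if r["income"] is not None else 0)
          for r in records]

# Option 2: replace it with a flag that marks the blank explicitly.
flagged = [dict(r, income=r["income"] if r["income"] is not None else "MISSING")
           for r in records]

# Option 3: delete the row/record entirely.
dropped = [r for r in records if r["income"] is not None]

print(filled[1]["income"], flagged[1]["income"], len(dropped))  # 0 MISSING 2
```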
Reusability is the best reward in the coding world. It saves a lot of time and effort and makes the whole software development lifecycle easier. With Talend Data Preparation, you can share the preparations and datasets with individual users, or with a group of users. Best practices include:
Sharing and reusing data preparations
Placing the shareable preparation in a shared folder, thereby enabling collaborative work
Follow the methods given below to secure data while working with Talend Data Preparation.
As a best practice, masking is an excellent way to protect sensitive data such as names, addresses, credit cards, or social security numbers. To protect the original data while having a functional substitute, you can use the Mask data (obfuscation) function.
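Outside the product, the same idea can be sketched in two flavors: masking characters in place, and replacing a value with a stable pseudonym that still works as a functional substitute. Neither snippet is Talend's Mask data (obfuscation) function; both are illustrative:

```python
import hashlib

def mask_value(value: str, keep_last: int = 4) -> str:
    """Mask all but the last few characters -- a simple character-level
    obfuscation sketch."""
    if len(value) <= keep_last:
        return "*" * len(value)
    return "*" * (len(value) - keep_last) + value[-keep_last:]

def pseudonymize(value: str) -> str:
    """Replace a value with a stable hash prefix: records can still be
    joined on the substitute without exposing the original."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]

print(mask_value("4111111111111111"))  # ************1111
```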
Adding versions to your preparation is an excellent way to track the changes made to the preparation over time, and versions also ensure that Talend Jobs always use the same state of a preparation. Even while the preparation is still being worked on, its versions can be used in Data Integration as well as Big Data Jobs.
Capture the state of your preparation by creating a version, as shown in Figure 18.
Preparation versions are propagated when sharing or moving a preparation across your folder structure, but not when you copy it or apply it to a new dataset.
Talend Data Preparation logs allow you to analyze and debug the activity of Talend Data Preparation. By default, Talend Data Preparation logs in two different places: in the console and a log file. The location of this log file depends on the version of Talend Data Preparation that you are using:
Data_Preparation_Path/data/logs/app.log for Talend Data Preparation
AppData/Roaming/Talend/dataprep/logs/app.log for Talend Data Preparation Free Desktop on Windows
Library/Application Support/Talend/dataprep/logs/app.log for Talend Data Preparation Free Desktop on MacOS
As a best practice, Talend recommends that you change the default location of the log file, which you can configure by editing the logging.file property in the application.properties file.
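Assuming a Unix-style server path (the path below is an example only, not a Talend default), the entry in application.properties might look like:

```properties
# application.properties -- relocate the Talend Data Preparation log file
# (the path is illustrative; choose a location appropriate to your server)
logging.file=/var/log/talend/dataprep/app.log
```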
Your data is stored in different locations, depending on the version of Talend Data Preparation you are using.
Talend Data Preparation
If you are a subscription user, nothing is saved directly on your computer.
Sample data is cached temporarily on the remote Talend Data Preparation server, to improve the product responsiveness. In addition, CSV and Excel datasets are stored permanently on the remote Talend Data Preparation server.
Talend Data Preparation Free Desktop is meant to work locally on your computer, without the need for an internet connection. Therefore, when you use a dataset from a local file such as a CSV or Excel file, the data is copied locally to one of the following folders, depending on your operating system:
OS X: /Users/your_user_name/Library/Application Support/Talend/dataprep/store
A center of excellence is a group or team that leads other employees and the organization as a whole in some particular area of focus such as a technology, skill, or discipline. As a best practice, build a center of excellence as suggested below.
As you deal with raw data, Talend recommends that you build knowledge while you analyze the data. You can:
Discover and learn data relationships within and across sources, and find out how the data fits together
Use analytics to discover patterns
Define the data by collaborating with other business users to define shared rules, business policies, and ownership
Build knowledge with a catalog, glossary, or metadata repository
Gain high-level insights to get the big picture of the data and its context
While it is important to build and enhance your knowledge, it is equally important to document the gained knowledge. In particular, every project must maintain a document for:
Source data lineage
History of changes applied during cleansing
Relationships to other data
Data usage recommendations
Associated data governance policies
Identified data stewards
As you analyze and understand your data, Talend recommends that you store it in a data dictionary. This helps other users identify the data they are working with, and establish the relationships between various data.
A data dictionary is a metadata description of the features included in the dataset.
In Figure 19, the input file has a language column. When the input is first read, the entries containing two languages are marked as invalid.
Using the data dictionary, when you change the metadata to accept more than one language as valid input, Data Preparation shows it as a valid record.
Backing up Talend Data Preparation and the Talend Data Dictionary on a regular basis is important to ensure you can recover from a data loss scenario, or any other causes of data corruption or deletion.
To create a copy of the Talend Data Preparation instance, back up MongoDB, the folders containing your data, the configuration files, and the logs.
Talend Dictionary Service stores all the predefined semantic types used in Talend Data Preparation. It also stores all the custom types created by users, and all the modifications done on existing types.
To back up a Talend Dictionary Service instance, back up MongoDB, and the changes made to the predefined semantic types.
Talend Data Preparation lets you operationalize the recipes you will use in Talend Studio. This section covers the best practices for operationalizing.
The best practice when using Talend Data Preparation is to set up one instance for each environment of your production chain.
Talend only supports promoting a preparation between identical product versions. To promote a preparation from one environment to the other, you have to export it from the source environment, then import it back to your target environment. For the import to work, a dataset with the same name and schema as the one that the export was based on must exist on the target environment.
Sometimes transformations are either too complex or too bulky to create in a simple form. To help you in such scenarios, Talend offers a hybrid preparation environment. As a best practice, leverage Talend Studio to create real-time datasets, and use those datasets for preparations.
Leverage the tDatasetOutput component for output in Create mode
Figure 21 shows the tDatasetOutput component properties:
Running the Job creates the dataset in Talend Data Preparation as shown below.
The tDataprepRun component allows you to reuse an existing preparation, made in Talend Data Preparation, directly in a Data Integration Job. In other words, you can operationalize the process of applying a preparation to input files that have the same model.
The figure below shows the usage of a preparation/recipe in a Talend Job.
You can select a specific preparation as shown below.
Or you can specify a dynamic preparation as shown in Figure 25. By using a dynamic preparation with context variables, you could build a single Job template to use across projects/organizations.
Note: To use the tDataprepRun component with Talend Data Preparation Cloud, you must have the 6.4.1 version of Talend Studio installed.
What if your business does not need sampling, but needs real live data for analysis? Because the Job is designed in Talend Studio, you can take advantage of the full palette of components and their Data Quality or Big Data capabilities. Unlike a local file import, where the data is stored in the Talend Data Preparation server for as long as the file exists, a live dataset only retrieves this sample data temporarily.
It is possible to retrieve the result of Talend Cloud flows that were executed on a Talend Cloud engine, as well as on remote engines.
Use a preparation as part of a Data Integration flow, or a Talend Spark Batch or Streaming Job in Talend Studio.
The live dataset feature allows you to create a Job in Talend Studio, execute it on demand using Talend Cloud as a flow, and retrieve a dataset with the sample data directly in Talend Data Preparation Cloud.
The screenshots below show an example of a Job creating a live dataset:
Note: To create live datasets, you must have the 6.4.1 version of Talend Studio installed, patched with at least the 0.19.3 version of the Talend Data Preparation components.