This article introduces the Talend Data Quality (DQ) profiling analysis types, and explains what each type does.
Talend Data Quality profiling tools, such as prebuilt analysis types and routines, help you profile your data. Talend recommends using Talend Data Quality first to improve the quality of your data, because incomplete, low-quality datasets produce poor analysis results; analysis is best done on complete, clean datasets.
Data profiling is about collecting and analyzing information about your data. It is important because it allows you to detect and understand data issues, then fix them. Data profiling consists of three phases:
Structure Discovery—Analyzing the structure of your data helps you discover if your data is consistent and in a valid format. This type of analysis uses simple statistics to provide information about the structure and validity of the data.
Relationship Discovery—Discovering the relationships within your data helps you find connections between different data sets.
Content Discovery—Content discovery focuses on data quality. For example, to be useful, data needs to be standardized and formatted correctly and completely. This type of analysis helps you discover this information.
The details of the analysis routines are documented in the relevant Talend product documentation, specifically the Data Quality User Guide; this article explains why you would choose specific analysis types.
In data profiling, you generally run the analysis on a sample of the data rather than on the entire dataset, especially when the dataset is large.
From the Profiling perspective of Studio, select Data Profiling and right-click Analyses.
Select New Analysis to build a new DQ analysis. You can also create folders to organize your analyses and import/export analyses. In Talend DQ 7.1, there are five main groups of analysis types, and within each type, there are a varying number of analysis routines.
It is important that you choose the correct type for your business needs and that you understand precisely what each type does. This section covers each analysis type.
This analysis allows you to build an overview of the structure of your database, catalog, or schema by analyzing the database content. It can show you which tables are full, which are empty, and how the structure is built; it also analyzes key structural elements such as the number of tables in your database, the number of rows per table, the number of primary keys, and the number of indexes. You can run it on the complete database or on a representative sample of tables.
Use this type of analysis to gain insight into the structure of your database.
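To make the idea concrete, the kind of structural summary this analysis produces can be sketched with a query against a database's own metadata. The following is a minimal illustration, not Talend's implementation, using Python's built-in sqlite3 module; the table names and the `database_overview` helper are invented for the example.

```python
import sqlite3

def database_overview(conn):
    """Return a per-table row count, mimicking the structural
    summary a connection analysis produces."""
    cur = conn.cursor()
    # sqlite_master lists every table defined in the database.
    cur.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    tables = [row[0] for row in cur.fetchall()]
    overview = {}
    for table in tables:
        cur.execute(f'SELECT COUNT(*) FROM "{table}"')
        overview[table] = cur.fetchone()[0]
    return overview

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])
print(database_overview(conn))  # one empty table, one full table
```

A real connection analysis also reports primary keys and indexes, which live in the same kind of catalog metadata.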
Use this analysis to discover relationships between tables, such as the relationship between foreign keys in one table and primary keys in another. Cross Table Analysis has only one analysis routine: Redundancy Analysis.
You can use Redundancy Analysis in two ways: you can use it to compare identical columns in different tables, or you can use it to match foreign keys in one table to primary keys in the other table, and vice versa. Note that Redundancy Analysis only supports database tables.
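The foreign-key-to-primary-key matching described above boils down to counting how many values in one column have a counterpart in another. This is a minimal sketch of that idea, not Talend's implementation; the column names and the `redundancy_match` helper are illustrative.

```python
def redundancy_match(foreign_keys, primary_keys):
    """Count foreign-key values with a matching primary key,
    and those that are orphaned (no match)."""
    pk_set = set(primary_keys)
    matched = sum(1 for fk in foreign_keys if fk in pk_set)
    return {"matched": matched, "orphaned": len(foreign_keys) - matched}

orders_customer_id = [1, 2, 2, 5]   # foreign keys in an orders table
customers_id = [1, 2, 3, 4]         # primary keys in a customers table
print(redundancy_match(orders_customer_id, customers_id))
# {'matched': 3, 'orphaned': 1}
```

Running the same comparison in the other direction (primary keys against foreign keys) reveals primary keys that are never referenced, which is why the routine supports both directions.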
The main use of Table Analysis is to analyze the data in single tables. However, it is possible to analyze data in multiple tables, as explained in the Data Quality User Guide; this generally involves creating business rules containing WHERE clauses. In that case, you create a dependencies analysis that examines the differences and dependencies between the two tables. The results can show whether values in one table are determined by, or dependent on, values in the other.
There are four analysis routines available within Table Analysis, including Match Analysis and Business Rule Analysis. These two routines are used more often than the others, as they can be vital in discovering your data. Match Analysis, for example, allows you to identify how many duplicates are in your data.
The one analysis that differs slightly from the others is the Column Set Analysis. This analysis works on a set of columns, not on separate columns as Column Analysis does. The statistics presented in the analysis results are the row count, distinct count, unique count, and duplicate count, each computed across all the records in the set. The Java engine can also apply patterns to each column, and the analysis returns the number of records that match all of the patterns on the list.
However, use this analysis with caution: it retrieves as many rows as there are distinct rows in the analyzed set, which can be a considerable number.
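The four whole-record statistics can be sketched in a few lines. This is an illustrative computation under the usual definitions (a duplicate is a distinct record that occurs more than once), not Talend's code; the `column_set_statistics` name and sample rows are invented.

```python
from collections import Counter

def column_set_statistics(rows):
    """Row, distinct, unique, and duplicate counts over whole records,
    as a Column Set Analysis reports them."""
    counts = Counter(rows)
    distinct = len(counts)
    unique = sum(1 for c in counts.values() if c == 1)
    return {
        "row_count": len(rows),
        "distinct_count": distinct,
        "unique_count": unique,
        "duplicate_count": distinct - unique,  # distinct records seen more than once
    }

rows = [("Ann", "Paris"), ("Bob", "Lyon"), ("Ann", "Paris"), ("Eve", "Nice")]
print(column_set_statistics(rows))
# {'row_count': 4, 'distinct_count': 3, 'unique_count': 2, 'duplicate_count': 1}
```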
An important note: you can create DQ rules and assign them to the analysis at a later stage.
This analysis type allows you to examine patterns and frequencies within columns of data. There are six different types of analysis routines to choose from, and they work against single columns in your data.
These analysis types help you examine columns in databases and also in delimited files. You can run this analysis on several columns, but each column is analyzed separately, without reference to any other.
There are different types of pattern, frequencies, and statistical analyses that can be performed, and these are all fully documented in the Data Quality User Guide.
Note that if you use the Java engine to run a Column Analysis on big datasets, or on data with numerous data issues, such as corrupt or missing data, then the analysis can be very resource-hungry, requiring a lot of memory. Talend recommends defining a high maximum memory size threshold before you run this analysis.
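The pattern and frequency statistics at the heart of Column Analysis can be illustrated with a short sketch. This is not Talend's implementation; the `column_profile` helper, the regular expression, and the sample postcodes are all invented for the example.

```python
import re
from collections import Counter

def column_profile(values, pattern):
    """Frequency table plus pattern-match statistics for one column."""
    freq = Counter(values)
    matches = sum(1 for v in values if re.fullmatch(pattern, v))
    return {
        "frequencies": dict(freq),
        "pattern_matches": matches,
        "pattern_violations": len(values) - matches,
    }

postcodes = ["75001", "69002", "ABCDE", "75001"]
profile = column_profile(postcodes, r"\d{5}")  # five-digit postcode pattern
print(profile["pattern_matches"], profile["pattern_violations"])  # 3 1
```

Each column is profiled independently in this way, which is why a Column Analysis over several columns never relates one column to another.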
One important analysis routine is the Semantic-aware Analysis. This analysis type uses a wizard that automatically configures a Column Analysis with the appropriate indicators after examining your data. The wizard tries to guess, within limits, the correct meaning of each column and finds the related concept in the semantic repository. Once a concept is identified, the applicable indicators can be found automatically and added to the columns.
You can enrich the semantic repository with your own entries, building it up over time and making it more useful for everyone who uses that repository.
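The concept-guessing step can be pictured as matching a sample of values against recognizers stored in a repository. The sketch below is a loose analogy, not Talend's semantic engine; the `SEMANTIC_REPOSITORY` mapping, the `guess_concept` helper, and the 80% threshold are all assumptions made for illustration.

```python
import re

# Hypothetical mini "semantic repository": concept name -> recognizer pattern.
SEMANTIC_REPOSITORY = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "us_phone": re.compile(r"\d{3}-\d{3}-\d{4}"),
}

def guess_concept(values, threshold=0.8):
    """Return the first concept matching at least `threshold`
    of the sampled values, or None if nothing fits."""
    for concept, pattern in SEMANTIC_REPOSITORY.items():
        hits = sum(1 for v in values if pattern.fullmatch(v))
        if hits / len(values) >= threshold:
            return concept
    return None

sample = ["a@x.com", "b@y.org", "c@z.net", "d@w.io", "oops"]
print(guess_concept(sample))  # email
```

Adding your own entries to such a repository is what lets the wizard recognize domain-specific columns over time.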
This analysis works against multiple columns and allows you to discover correlations between multiple columns. In each of the analysis routines, there are a number of chart types available to visualize your data. These routines include numerical and time-based correlation analysis.
Two things to note:
The Nominal Correlation Analysis routine is a very useful type of analysis, as it analyzes correlations between nominal (categorical) columns in the same table. These correlations can be subtle, which makes the routine valuable in your data discovery process. Charts are used to display the results, and correlations between nominal values are represented by lines; the thicker the line, the weaker the association. This analysis type can therefore be used to identify problems or correlations that may need attention.
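The raw material behind such a chart is a co-occurrence count of value pairs across two nominal columns. This is a minimal sketch of that counting step only, not of Talend's charting or its association measure; the column contents and the `nominal_cooccurrence` name are invented.

```python
from collections import Counter

def nominal_cooccurrence(col_a, col_b):
    """Count how often each pair of nominal values occurs together —
    the counts a nominal correlation chart is drawn from."""
    return Counter(zip(col_a, col_b))

jobs = ["doctor", "doctor", "teacher", "teacher", "teacher"]
cities = ["Paris", "Paris", "Lyon", "Lyon", "Paris"]
for (job, city), n in sorted(nominal_cooccurrence(jobs, cities).items()):
    print(job, city, n)
# doctor Paris 2
# teacher Lyon 2
# teacher Paris 1
```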
The correct choice of analysis type is crucial when you do data profiling. You need to understand what each type does and why you would use it. This depends on your business use case, and applying the right analysis type will help you maximize the potential of your data profiling.