Creating sample data and building a data matching Job

Overview

This article shows you how to create sample demographic data (auto-generated, pseudo-random names, addresses, and dates of birth) by using freely available code, and then how to build a sample matching Job, all using Talend components.

 

Creating sample demographic data

You can generate sample data by using a tRowGenerator component, which lets you specify an arbitrary number of rows, define a schema, and assign values to the defined columns. However, the random values it assigns, although useful, do not resemble real-life data.

 

To create data that is as realistic as possible, use the Talend Jobs and routines stored in the LibTBEDataGenerator.zip file, available on the Talend by Example web site. The ZIP file (compatible with all Talend versions) contains the Jobs, JAR files, and components that you need to use the Job as a base for generating sample data.
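
To illustrate the difference between purely random values and values that look like real data, here is a minimal Java sketch that draws names from small lookup lists. It is not the code in LibTBEDataGenerator.zip; the class name `PersonSketch`, the lists, the date range, and the semicolon-delimited row layout are all assumptions made for this illustration.

```java
import java.time.LocalDate;
import java.util.List;
import java.util.Random;

final class PersonSketch {
    // Tiny hypothetical lookup lists; a real generator would use much larger reference data.
    static final List<String> FEMALE = List.of("Olivia", "Amelia", "Emily");
    static final List<String> MALE = List.of("Oliver", "George", "Harry");
    static final List<String> LAST = List.of("Smith", "Jones", "Taylor", "Brown");

    // Returns one randomly chosen element of the list.
    static String pick(List<String> values, Random rnd) {
        return values.get(rnd.nextInt(values.size()));
    }

    // Builds one semicolon-delimited row: gender;firstName;lastName;dateOfBirth
    static String nextPerson(Random rnd) {
        boolean female = rnd.nextBoolean();
        // The gender decides which first-name list is used (an assumed design).
        String first = pick(female ? FEMALE : MALE, rnd);
        String last = pick(LAST, rnd);
        // Assumed range: a date within about 65 years after 1940-01-01.
        LocalDate dob = LocalDate.of(1940, 1, 1).plusDays(rnd.nextInt(65 * 365));
        return (female ? "Female" : "Male") + ";" + first + ";" + last + ";" + dob;
    }

    public static void main(String[] args) {
        Random rnd = new Random(); // unseeded, so every run produces different rows
        for (int i = 0; i < 5; i++) {
            System.out.println(nextPerson(rnd));
        }
    }
}
```

Because the values come from lists of plausible names rather than random character strings, the rows resemble real records and are therefore more useful for testing matching rules.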

 

  1. Download and import the LibTBEDataGenerator.zip file into Studio as a new Talend Job.

  2. Create a copy of the Person_and_Address_Generator Job.

    p1.png

     

  3. Notice that the Job uses three tMap components that work in sequence. The first (MapGender) generates a random gender, Male or Female. The second (MapBasic) generates basic attributes such as name and address. The third (MapAdvanced), which is the one customized in this article, brings all of the generated values together. The MapAdvanced component mapping is shown below:

    p2.png

    The MapAdvanced mapping lets you generate a large number of fields that contain basic personal and demographic information. The TBEDataGenerator class generates these values, as shown below:

    p3.png

    You can customize the class and mapping for your use case. For example, you can add a simple expression to generate a random UK mobile phone number. This expression builds a number of the correct length from random digits, as shown below:

    p4.png

     

  4. Once you have the fields you want, initialize and run the Job. The tJavaRow component specifies how many random lines of data you want to generate.

    p5.png

    p6.png

     

  5. Review the output file. In this example, the data is output into a delimited file. You can choose a different output type. Note that the data is randomly generated and that rerunning the Job generates entirely different data.

    p7.png
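
The custom phone-number expression mentioned in step 3 can be approximated in plain Java as follows. This is a sketch under stated assumptions, not the expression used in the article: the class and method names are hypothetical, and the result only has the shape of a UK mobile number (the "07" prefix followed by nine digits, 11 digits in total); it is not guaranteed to fall in an allocated number range.

```java
import java.util.Random;

final class UkMobileSketch {
    // Returns 11 digits: the "07" prefix followed by nine random digits.
    static String randomUkMobile(Random rnd) {
        StringBuilder sb = new StringBuilder("07");
        for (int i = 0; i < 9; i++) {
            sb.append(rnd.nextInt(10)); // one random digit, 0 to 9
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(randomUkMobile(new Random()));
    }
}
```

Appending digits one at a time keeps leading zeros, which would be lost if a single large random integer were generated and converted to text.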

     

Building a simple data matching Job

This section shows you how to build a simple data matching Job to test your data, and presents a simple example to illustrate how you can use the data. It is not meant as a full description of how to build a complete data matching Job. For a full description of how to build data matching Jobs, including best practices, see the Creating a Job to match data documentation available in Talend Help Center.

 

  1. Create a simple DI/DQ Job using the tRecordMatching component.

  2. Use tFileInputDelimited components to read the two files that you generated using the random generator Job in the Creating sample demographic data section, so that they can be compared.

    The input file contains one hundred thousand records, which the Job compares to a reference file of ten thousand records. The output is sent to three delimited files: one for the records that match, one for the records that do not match, and one for the records that are potential matches.

    p8.png

    The Job includes mapping components, but in this case they pass the data straight through without any changes. They are there so that you can change the data if needed.

    p9.png

     

  3. Use a tRecordMatching component to match the data. This example uses simple matching rules that you can adapt to your use case.

    1. In the Block Selection > Input Column, add the Country column. This provides a sufficient blocking strategy and produces blocks of roughly equal size.

    2. Match within those blocks on FirstName and LastName. This example uses the Soundex algorithm (phonetic matching) and the Jaro-Winkler algorithm (string similarity), but the choice is yours.

    3. Choose a Weight value for each key. The values are arbitrary and depend on your data; notice that in this example, more weight is given to LastName.

      p10.png

       

    4. In the Advanced settings tab, set the match threshold to 0.95 (that is, a score of 95% or above counts as a match) and the unmatch threshold to 0.90 (90%). Records that score between the two values are treated as potential matches. Tune these values for each set of data that you match.

      p11.png

       

  4. Run the Job, then view the sample results.

    p12.png
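
The matching rules in step 3 can be sketched in plain Java to show how the pieces fit together. This is an illustration only and does not reproduce the internal scoring of tRecordMatching. The class name, the choice of Soundex for FirstName and Jaro-Winkler for LastName, the 1:2 weights, and the all-or-nothing Soundex score are assumptions; the 0.95 and 0.90 thresholds are the ones used in this example.

```java
import java.util.Locale;

final class MatchSketch {
    // American Soundex: first letter plus three digits, for example "Robert" gives "R163".
    static String soundex(String s) {
        String codes = "01230120022455012623010202"; // digit per letter A..Z, 0 = not coded
        String up = s.toUpperCase(Locale.ROOT).replaceAll("[^A-Z]", "");
        if (up.isEmpty()) return "";
        StringBuilder out = new StringBuilder().append(up.charAt(0));
        char prev = codes.charAt(up.charAt(0) - 'A');
        for (int i = 1; i < up.length() && out.length() < 4; i++) {
            char c = up.charAt(i);
            char code = codes.charAt(c - 'A');
            if (code != '0' && code != prev) out.append(code);
            if (c != 'H' && c != 'W') prev = code; // H and W do not separate equal codes
        }
        while (out.length() < 4) out.append('0');
        return out.toString();
    }

    // Jaro similarity in [0, 1], based on matching characters and transpositions.
    static double jaro(String a, String b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;
        if (a.isEmpty() || b.isEmpty()) return 0.0;
        int window = Math.max(0, Math.max(a.length(), b.length()) / 2 - 1);
        boolean[] aHit = new boolean[a.length()];
        boolean[] bHit = new boolean[b.length()];
        int matches = 0;
        for (int i = 0; i < a.length(); i++) {
            int hi = Math.min(b.length() - 1, i + window);
            for (int j = Math.max(0, i - window); j <= hi; j++) {
                if (!bHit[j] && a.charAt(i) == b.charAt(j)) {
                    aHit[i] = true;
                    bHit[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;
        int transposed = 0; // matched characters that appear in a different order
        for (int i = 0, j = 0; i < a.length(); i++) {
            if (!aHit[i]) continue;
            while (!bHit[j]) j++;
            if (a.charAt(i) != b.charAt(j)) transposed++;
            j++;
        }
        double m = matches;
        return (m / a.length() + m / b.length() + (m - transposed / 2.0) / m) / 3.0;
    }

    // Jaro-Winkler: boosts the Jaro score for a shared prefix of up to four characters.
    static double jaroWinkler(String a, String b) {
        double j = jaro(a, b);
        int max = Math.min(4, Math.min(a.length(), b.length()));
        int prefix = 0;
        while (prefix < max && a.charAt(prefix) == b.charAt(prefix)) prefix++;
        return j + prefix * 0.1 * (1.0 - j);
    }

    // Weighted average of the two key scores; the weights are hypothetical (LastName counts double).
    static double score(String first1, String last1, String first2, String last2) {
        double wFirst = 1.0, wLast = 2.0;
        double firstSim = soundex(first1).equals(soundex(first2)) ? 1.0 : 0.0;
        double lastSim = jaroWinkler(last1.toUpperCase(Locale.ROOT), last2.toUpperCase(Locale.ROOT));
        return (wFirst * firstSim + wLast * lastSim) / (wFirst + wLast);
    }

    // Routes a score to one of the three outputs used in this example.
    static String classify(double score) {
        if (score >= 0.95) return "MATCH";
        if (score >= 0.90) return "POSSIBLE";
        return "NO_MATCH";
    }

    public static void main(String[] args) {
        double s = score("Jon", "Smith", "John", "Smyth");
        System.out.println(s + " -> " + classify(s));
    }
}
```

In a full Job, this comparison would run only between records that share the same blocking key (Country in this example), which keeps the number of pairwise comparisons manageable.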

     

For more information on data matching, see the three-part blog series Data Matching 101: How Does Data Matching Work?, Data Matching 101: What Tools Does Talend Have?, and Data Matching 101: How Do You Tune Data Matching?.

 

Conclusion

This article demonstrated how to create sample demographic data to help build and test a sample data matching Job. You can put the sample data created in this article to a wide variety of uses.

Version history
Revision 15 of 15, last updated 05-22-2019 09:06 AM