How to run Talend Data Profiling analysis on large datasets

Overview

Data profiling is resource intensive, and an analysis run in Talend Studio is limited to the resources of the Studio machine. To profile a large dataset, you can use Talend Data Profiling to build an analysis and a report against sample data, then use Talend Data Integration (DI) to run the report, which in turn runs the analysis, on a JobServer. This article shows you how to run an analysis on sample or full data on a Studio machine, and how to use Talend DI to schedule report runs on the full data.

 

For more information on Reports, see the Reports section of the Talend Data Fabric Studio User Guide, available in the Talend Help Center.

 

Running analysis on sample data

  1. Open Talend Studio and log in to a Local project.

  2. In the Repository tree view, expand Metadata, right-click Db Connection, select Create connection, and create a database connection. Ensure that the connection test is successful.

    dbconn.png

     

  3. Right-click the connection, select Retrieve Schema, and import the table that you want to analyze, in this example, Address_Data.

    Retr_schema.png

     

  4. Switch to the Profiling perspective. The connection you created is listed under Metadata.

    persp.png

     

  5. Expand Data Profiling, right-click Analyses, then select New Analysis. Expand the Column Analysis folder, then select Basic Column Analysis. Click Next.

    NewAnalys.png

    For more information, see the Column analyses section of the Talend Data Fabric Studio User Guide, available in the Talend Help Center.

     

  6. Enter an analysis name in the Name field. (This example uses Address_Data_Column_Analysis.) Click Finish.

    anasy_name.png

     

  7. The Column Analysis window opens. Click Select Columns.

    SelClmns.png

     

  8. Select the table columns that you want to profile. In this example, all of the Address_Data table columns are selected.

    ClmnSel.png

     

  9. Click Select Indicators.

    selInd1.png

     

  10. Select the indicators that apply for the type of analysis.

    selInd2.png

     

  11. Enter the number of rows you want to analyze in the Limit field, then select the Run with sample data check box. The Address_Data table has 10,000,000 records. In this example, however, the analysis runs on only 50 random rows.

    runsamp.png

     

  12. Click Run to run the analysis and return the results within Studio.

    Studiorun.png

     

  13. Right-click Reports, then select New Report. Give your report a name, then click Finish to create a report.

    newrep.png

     

  14. In the Analysis List view, click Select analyses and select the analysis you just created, in this example, Address_Data_Column_Analysis.

    repset.png

     

  15. In the Generate Report Settings view, click the [...] button to the right of the Output Folder text box and browse to the folder where you want to save the generated report. Select the output file type from the File Type pull-down menu. Expand the Database Connection Settings view and enter the connection information of the Talend DQ Portal database.

    repset2.png

     

  16. Click the Check button to the right of the Db Type pull-down menu to ensure the connection is successful.
  17. Save the report.
  18. Run the report, then review the generated report file.

    repgen.png
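
Steps 11 and 12 profile a small random sample instead of the full table. The idea can be sketched outside Studio as follows. This is a minimal illustration in Python, not how Talend Data Profiling is implemented: the reservoir sampling technique, the function names, the indicator names, and the generated stand-in for the Address_Data table are all assumptions made for the example.

```python
import random
from collections import Counter


def reservoir_sample(rows, k, seed=None):
    """Return up to k rows chosen uniformly at random from an iterable."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep the new row with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = row
    return sample


def profile_column(values):
    """Compute a few simple indicators for one column (names are illustrative)."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    return {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(counts),
        "duplicate_count": sum(1 for c in counts.values() if c > 1),
    }


# Hypothetical stand-in for the Address_Data table: (id, city) tuples.
cities = ["Paris", "Berlin", None, "Madrid"]
table = ((i, cities[i % len(cities)]) for i in range(100_000))

# Profile 50 random rows, mirroring the Limit value used in step 11.
sample = reservoir_sample(table, k=50, seed=42)
city_profile = profile_column([city for _, city in sample])
print(city_profile)
```

Because the sampler reads the rows as a stream and keeps only 50 of them in memory, the cost of the profiling step stays small regardless of table size. The indicators describe the sample only, which is why the next section reruns the analysis on the full data.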

     

Running analysis on full data

  1. Clear the Run with sample data check box.

    unchsample.png

     

  2. Save the analysis.
  3. Ensure that the Refresh check box is selected.

    refresh.png

     

  4. Click Run Report. This runs the profiling on the full data.

    runreop.png

     

Operationalizing the analysis report

  1. Switch to the Integration perspective and create a Standard Job.

    rundq.png

     

  2. Drag a tDqReportRun component into the workspace.

  3. In the Component view, click Browse Reports and select the report you just created.

    tdqrun.png

     

  4. Run the Job and verify that the report was generated.

    runresult.png
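
Once the Job runs successfully, it can be triggered without opening Studio. As a rough sketch, assuming the Job has been built into a standalone launcher script, a small Python wrapper could run it and report the exit code to whatever scheduler you use. The function name, the timeout, and the launcher path in the comment are hypothetical, and the example uses a stand-in command so that it runs anywhere.

```python
import subprocess
import sys


def run_report_job(command, timeout_seconds=3600):
    """Run a Job launcher command and return (exit_code, combined_output)."""
    completed = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge error output into the same stream
        text=True,
        timeout=timeout_seconds,
    )
    return completed.returncode, completed.stdout


# Stand-in command for illustration. In practice this would be the launcher
# script of the built Job (hypothetical path), for example:
# command = ["/path/to/run_report_job/run_report_job_run.sh"]
command = [sys.executable, "-c", "print('report generated')"]

exit_code, output = run_report_job(command)
print(exit_code, output.strip())
```

A non-zero exit code indicates that the Job failed, so a scheduler can use it to raise an alert instead of assuming that the report file exists.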

Version history
Revision #: 16 of 16
Last update: 06-18-2019 06:10 AM