Data profiling is resource intensive, and when you run it from Talend Studio it is limited to the resources of the Studio machine. If you need to profile a large dataset, you can use Talend Data Profiling to create an analysis and a report and test them on sample data, then use Talend Data Integration (DI) to run the report (which in turn runs the analysis) on a JobServer. This article shows you how to run an analysis on sample or full data on a Studio machine, and how to use Talend DI to schedule reports that run on the full data.
For more information on Reports, see the Reports section of the Talend Data Fabric Studio User Guide, available in the Talend Help Center.
Open Talend Studio and log in to a Local project.
In the Repository tree view, expand Metadata, right-click Db Connection, select Create connection, and create a database connection. Ensure that the connection test is successful.
Right-click the connection, then select Retrieve Schema and import the table that you need to run analysis on, in this case, Address_Data.
Switch to the Profiling perspective. The connection you created is listed under Metadata.
Expand Data Profiling, click Analysis, then select New Analysis. Expand the Column Analysis folder, then select Basic Column Analysis. Click Next.
For more information, see the Column analyses section of the Talend Data Fabric Studio User Guide, available in the Talend Help Center.
Enter an analysis name in the Name field. (This example uses Address_Data_Column_Analysis.) Click Finish.
The Column Analysis window opens. Click Select Columns.
Select the table columns that you want to profile. In this example, all of the Address_Data table columns are selected.
Select the indicators that apply to the type of analysis.
Enter the number of rows you want to analyze in the Limit field, then select the Run with sample data check box. The Address_Data table has 10,000,000 records. However, in this example, the analysis runs on only 50 random rows.
Click Run to run the analysis and return the results within Studio.
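To make the difference between a sample run and a full run concrete, the following is a minimal Python sketch of the idea, not Talend's implementation. It computes a few basic column indicators (row count, null count, blank count, distinct count) either on a random sample or on every row. The in-memory SQLite table, the city column, the generated test rows, and the profile_column helper are all assumptions made for illustration.

```python
import sqlite3


def profile_column(conn, table, column, limit=None):
    """Return basic indicators for one column, optionally on a random sample.

    Hypothetical helper for illustration only. Table and column names are
    interpolated into the SQL text, so pass trusted identifiers only.
    """
    source = table
    params = ()
    if limit is not None:
        # Sample run: pick `limit` random rows instead of scanning everything.
        source = f"(SELECT * FROM {table} ORDER BY RANDOM() LIMIT ?)"
        params = (limit,)
    sql = f"""
        SELECT COUNT(*),
               SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END),
               SUM(CASE WHEN TRIM({column}) = '' THEN 1 ELSE 0 END),
               COUNT(DISTINCT {column})
        FROM {source}
    """
    row = conn.execute(sql, params).fetchone()
    names = ("row_count", "null_count", "blank_count", "distinct_count")
    return dict(zip(names, row))


# Small stand-in for the Address_Data table (assumed layout, 1,000 rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Address_Data (id INTEGER PRIMARY KEY, city TEXT)")
rows = []
for i in range(1000):
    if i % 100 == 0:
        city = None          # null value
    elif i % 50 == 0:
        city = ""            # blank value
    else:
        city = f"City{i % 10}"
    rows.append((i, city))
conn.executemany("INSERT INTO Address_Data VALUES (?, ?)", rows)

sample = profile_column(conn, "Address_Data", "city", limit=50)  # sample run
full = profile_column(conn, "Address_Data", "city")              # full run
print("sample:", sample)
print("full:  ", full)
```

The sample run touches only 50 rows, so it returns quickly but its counts vary from run to run. The full run is exact but scans the whole table, which is why the full-data run is better suited to a JobServer than to the Studio machine.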
Right-click Reports, then select New Report. Give your report a name, then click Finish to create a report.
In the Analysis List view, click Select analyses and select the analysis you just created, in this example, Address_Data_Column_Analysis.
In the Generate Report Settings view, click the [...] button to the right of the Output Folder text box and browse to the folder where you want to save the generated report. Select the output file type from the File Type pull-down menu. Expand the Database Connection Settings view and enter the connection information of the Talend DQ Portal database.
Clear the Run with sample data check box.
Ensure that the Refresh box is selected.
Click Run Report. This runs the profiling on the full data.
Switch to the Integration perspective and create a Standard Job.
Drag a tDqReportRun component into the workspace.
In the Component view, click Browse Reports and select the report you just created.
Run the Job and verify that the report was generated.
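If the Job runs unattended, the final check can be scripted. The following Python sketch assumes that the report is written as a file under the output folder you chose in the Generate Report Settings view. The newest_report helper, the .pdf extension, and the folder passed on the command line are hypothetical and must be adapted to the file type and location you selected.

```python
import sys
from pathlib import Path


def newest_report(output_dir, extension=".pdf", since=0.0):
    """Return the most recently modified report file, or None.

    Only files whose modification time is at or after `since` (seconds
    since the epoch) are considered, so older reports are ignored.
    """
    root = Path(output_dir)
    if not root.is_dir():
        return None
    candidates = [
        p for p in root.rglob(f"*{extension}")
        if p.is_file() and p.stat().st_mtime >= since
    ]
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


if __name__ == "__main__":
    # Assumed usage: python check_report.py <output folder>
    folder = sys.argv[1] if len(sys.argv) > 1 else "."
    report = newest_report(folder)
    if report is None:
        print(f"No report found under {folder}")
        sys.exit(1)
    print(f"Latest report: {report}")
```

Recording the time just before the Job starts and passing it as since distinguishes a freshly generated report from one left over from an earlier run.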