Data Profiler execution methodology

One Star

Hi,

I am new to Talend and am trying to explore and understand Data Profiler.
I am currently using Profiler version 5.5.1 on a 32-bit Windows machine with 2 GB of RAM.

When I try to profile a Microsoft SQL Server table with a little over 2 million records (Table Analysis -> Match Analysis), either my machine shows a blue screen and shuts down, or the application hangs indefinitely until I force it to close.

With reference to the scenario above, I would like to understand the following aspects of Talend Data Profiler.

1. Does Talend Data Profiler extract all the data from the source database or file to the machine running the profiler and profile it there, or is the data profiled on the database side?

2. I found that Data Profiler in Open Studio is a client application. Is this why the tool cannot profile such a large volume of data?

3. What needs to be done to profile a very large table (say, 25 million records)?

Regards,
NiX
Four Stars

Re: Data Profiler execution methodology

Hi Nix,

1. Does Talend Data Profiler extract all the data from the source database or file to the machine running the profiler and profile it there, or is the data profiled on the database side?
>> Yes, it extracts all the data from the source and puts it in memory for processing.

2. I found that Data Profiler in Open Studio is a client application. Is this why the tool cannot profile such a large volume of data?
>> Often, the memory allocated to the job or the JVM is the root cause when large data sets cannot be processed.

3. What needs to be done to profile a very large table (say, 25 million records)?
>> If the job does not process the data at all, split the data into smaller chunks and check again.
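
The chunking idea in answer 3 could look roughly like the sketch below. It is not Talend code: it uses Python's built-in sqlite3 module as a stand-in for the source database, and the function name profile_in_chunks, the table orders and its columns are all made up for illustration. It assumes the table has a numeric key column that can be used to cut it into ranges.

```python
import sqlite3

def profile_in_chunks(conn, table, column, key, chunk_size):
    """Profile `column` one key range at a time and merge the partial results.

    Hypothetical helper: table and column names are interpolated into the SQL,
    so they must come from trusted input.
    """
    lo, hi = conn.execute(f"SELECT MIN({key}), MAX({key}) FROM {table}").fetchone()
    total = {"row_count": 0, "null_count": 0, "min": None, "max": None}
    if lo is None:  # empty table
        return total
    start = lo
    while start <= hi:
        end = start + chunk_size
        n, non_null, cmin, cmax = conn.execute(
            f"SELECT COUNT(*), COUNT({column}), MIN({column}), MAX({column}) "
            f"FROM {table} WHERE {key} >= ? AND {key} < ?",
            (start, end),
        ).fetchone()
        # Counts add up across chunks; min and max are merged pairwise.
        total["row_count"] += n
        total["null_count"] += n - non_null
        if cmin is not None:
            total["min"] = cmin if total["min"] is None else min(total["min"], cmin)
            total["max"] = cmax if total["max"] is None else max(total["max"], cmax)
        start = end
    return total

# Small made-up table standing in for the real source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(i, None if i == 5 else i * 10) for i in range(1, 11)],
)
result = profile_in_chunks(conn, "orders", "amount", "id", chunk_size=3)
print(result)
```

Row counts, null counts, minimum and maximum merge cleanly across chunks. Distinct counts and duplicate matching do not simply add up, because the same value can appear in more than one chunk, so results of that kind need extra care when the data is split.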

Hope it helps

thanks
vaibhav
One Star

Re: Data Profiler execution methodology

Hi Vaibhav,

Thank you for the quick response and precise answers.

Could anyone please share the largest data set they have profiled, along with the system configuration used?

Regards,
NiX
Employee

Re: Data Profiler execution methodology

Hi,
About 1: Talend does "in-place profiling" by default. This means that the Studio retrieves no data, only the resulting statistics.
This way, several million rows can be profiled easily (because it is the database that runs the SQL queries).
See our documentation: https://help.talend.com/pages/viewpage.action?pageId=39878790
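
To illustrate the in-place idea described above, here is a minimal sketch. It is not the SQL that the Studio generates: it uses Python's built-in sqlite3 module as a stand-in for the database, and the function name profile_column_in_place and the customers table are made up for illustration. The point is that the database evaluates the aggregates and the client receives a single summary row, regardless of how many rows the table holds.

```python
import sqlite3

def profile_column_in_place(conn, table, column):
    """Let the database compute the statistics; only one summary row comes back.

    Hypothetical helper: table and column names are interpolated into the SQL,
    so they must come from trusted input.
    """
    sql = (
        f"SELECT COUNT(*), COUNT({column}), COUNT(DISTINCT {column}), "
        f"MIN({column}), MAX({column}) FROM {table}"
    )
    row_count, non_null, distinct, min_v, max_v = conn.execute(sql).fetchone()
    return {
        "row_count": row_count,
        "null_count": row_count - non_null,  # COUNT(column) skips NULLs
        "distinct_count": distinct,
        "min": min_v,
        "max": max_v,
    }

# Small made-up table standing in for the real source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, city TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Paris"), (2, "Pune"), (3, None), (4, "Paris")],
)
stats = profile_column_in_place(conn, "customers", "city")
print(stats)
```

The memory needed on the client side stays constant here, because only the five aggregate values cross the connection.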

3. 25 million rows can be profiled by a column analysis without any problem.
There may be memory issues with some kinds of analyses, though (a column set analysis, for instance, when there are too many distinct rows). But we're actively working on these issues.
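
As a rough illustration of why many distinct rows can be costly, the sketch below counts row combinations on the client side with a Python Counter. This is an assumption made for illustration only and says nothing about how Talend implements column set analysis; the function name distinct_row_counts and the sample rows are made up.

```python
from collections import Counter

def distinct_row_counts(rows):
    """Count how often each combination of column values occurs.

    Hypothetical helper: one dictionary entry is kept per distinct row,
    so memory use grows with the number of distinct rows, not the total.
    """
    counts = Counter()
    for row in rows:
        counts[tuple(row)] += 1
    return counts

# Made-up sample: three rows, two distinct combinations.
rows = [("Paris", "FR"), ("Pune", "IN"), ("Paris", "FR")]
counts = distinct_row_counts(rows)
print(len(counts), counts[("Paris", "FR")])
```

If nearly every row is unique, the counter ends up holding almost the whole column set in memory, which is the situation the caveat above warns about.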
