Running DQ metrics using hand-written SQL scripts

Highlighted
One Star

Running DQ metrics using hand-written SQL scripts

Hi everyone,
My team and I are trying to run a set of hand-written SQL scripts within Talend Open Studio for Data Quality to produce data quality metrics. Is it possible to feed Talend a bunch of already written SQL scripts to act as metrics? Or do we have to configure all our scripts again within Talend Open Studio? If we redo them in Talend, any tutorials on how to do this?
We have found that our team is able to write SQL scripts by hand to create metrics much faster and accommodate complex business rules than using the UI in Talend Open Studio. Therefore, we're trying to evaluate whether this tool is worth our time.
Appreciate any feedback from the community. Thanks.
Moderator

Re: Running DQ metrics using hand-written SQL scripts

Hi,
We have redirected your issue to our DQ expert and will keep you posted.
Many thanks for your time.
Best regards
Sabrina
--
Don't forget to give kudos when a reply is helpful and click Accept the solution when you think you're good with it.
Employee

Re: Running DQ metrics using hand-written SQL scripts

The way to add indicators in the profiler is to go through the creation of a user defined indicator.
https://help.talend.com/search/all?query=Managing+user-defined+indicators&content-lang=en
But indeed, that’s not exactly what you would like.
 
Unfortunately, we cannot integrate arbitrary SQL script in the profiler.  The reason is that we are currently limited to single column indicators and the UI depends on the returned results.
I don't know how your scripts look like. Do they apply to multiple columns? What are the returned results? How do you want to display the results?
Some of the advantages of using Talend are:
- no need to write SQL queries (of course you have already your queries, so not concerned here)
- drill down into data based on a given indicator results
- define quality thresholds on most indicators (ensure the quality of the data)
- graphical display of the results
- Automatic job creation from results (Only available in Platform Edition)
- store indicators results in a data mart in order to historize the evolution of the quality of data. (Only available in Platform Edition)
- pdf reporting on indicators (Only available in Platform Edition)
- Shared web access to these reports. (Only available in Platform Edition)
One Star kf
One Star

Re: Running DQ metrics using hand-written SQL scripts

Hi both,
Thank you for your time and clarification. it's unfortunate Talend for Data Quality does not offer a way to post raw SQL into it to create metrics. I would imagine having some criteria for the SQL (e.g. has to return a single count value) provided to Talend would suffice. We really do like to leverage the reporting, tracking, script versioning and accessibility of data quality tools such as Talend.

To give you context, we are a small technical team looking at implementing some data quality metrics against an Oracle database. We have already created data quality scripts that we store in a database and we use an ETL tool to run these scripts periodically to get metrics. At this point we realize there are tools out there that can provide these features and probably more. However, we want something that is an easy transition from our current scripts to a data quality platform. Reimplementing all our scripts in the new platform will take a lot of resources and time.

Here's an example of scripts we write for metrics:
Find addresses that have invalid country information that is not found in our reference country view. Only consider address records that have a matching customer or potential customer record.
SELECT 
COUNT(1)
  FROM ADDRESS  ADDR
       LEFT JOIN COUNTRY_REFERENCE_VIEW VW
           ON TRIM (VW.DESCRIPTION) = TRIM (ADDR.COUNTRY)
 WHERE TRIM (ADDR.COUNTRY) IS NOT NULL
       AND VW.DESCRIPTION IS NULL
       AND (   EXISTS
                   (SELECT 1
                      FROM CUSTOMER CUST
                     WHERE CUST.CUSTOMER_ID = ADDR.CUSTOMER_ID)
            OR EXISTS
                   (SELECT 1
                      FROM PROSPECTIVE_CUSTOMER P_CUST
                     WHERE P_CUST.UBC_ID = ADDR.CUSTOMER_ID));