Using User Defined Indicators over multiple columns - duplicate row de

One Star djm
One Star

Using User Defined Indicators over multiple columns - duplicate row de

I've recently downloaded TOP (5.0.0.M3_r66102) and I'm in the process of doing a tools assessment. One of the assessment scenarios is to be able to identify multiple instances of a row where the duplication check has to be made over multiple columns. While I know the "Nominal Correlation Analysis" can be used to identify the rows, (a) it appears to generate 4 distinct SQL statements and (b) the SQL it uses is not necessarily optimal.
I had thought that I'd be able to use a User Defined Indicator, making use of COLUMN_NAMES in the template. However, I can only seem to use a User Defined Indicator against a single column. I'd like to associate a single instance of it with multiple columns and have the associated columns be expanded for the COLUMN_NAMES. Is a User Defined Indicator only able to be associated with a single column or have I missed something in the interface?
By means of an example, lets assume I have a table named "table1" populated as follows:
col1 col2 col3
---- ---- ----
A B 10
A B 12
A C 42
Oracle's analytical functions can be used to to detect duplicates reasonably effectively. I'd like to define my (Oracle) SQL along these lines:
SELECT full_set.row_count AS multiple_rows_count,
COUNT(*)
FROM
(
SELECT col1,
col2,
COUNT(1)
OVER
(
PARTITION BY col1,
col2
) AS row_count,
ROW_NUMBER()
OVER
(
PARTITION BY col1,
col2
ORDER BY 1
) AS row_instance
FROM table1
) full_set
WHERE
(
-- Restrict the rows returned to only those with duplicates.
(full_set.row_count > 1)
-- Restrict the rows returned to the first instance of duplicated rows.
AND (full_set.row_instance = 1)
)
with col1 and col2 collectively forming the COLUMN_NAMES values. If I was able to specify a User Defined Indicator thusly:
SELECT full_set.row_count AS multiple_rows_count,
COUNT(*)
FROM
(
SELECT <%=__COLUMN_NAMES__%>,
COUNT(1)
OVER
(
PARTITION BY <%=__COLUMN_NAMES__%>
) AS row_count,
ROW_NUMBER()
OVER
(
PARTITION BY <%=__COLUMN_NAMES__%>
ORDER BY 1
) AS row_instance
FROM <%=__TABLE_NAME__%>
) full_set
WHERE
(
-- Restrict the rows returned to only those with duplicates.
(full_set.row_count > 1)
-- Restrict the rows returned to the first instance of duplicated rows.
AND (full_set.row_instance = 1)
)
I could reuse this duplicate detection logic in multiple places. Moreover, if I was able to associate a single User Defined Indicator with a collection of columns, I could further leverage the capability of analytical functions to do other arbitrary data quality related logic.
Any comments or observations welcomed.
Regards,
David
One Star

Re: Using User Defined Indicators over multiple columns - duplicate row de

Hi David,
Your observation is correct, one cannot have a combination of columns selected in the UDI. It would open the tool to so many more analysis possibilities if that would be available.
I had a similar issue where we wanted to figure out the duplicates based on a combination of columns (in our case 5 columns). After some tweaking we figured we essentially wanted to check on one column (still with the combination of columns).
If I take your example it would mean we would know that col1 had duplicates (all "A") and we wanted to check on col2 how many duplicates we have. Based on that you could change your UDI to (I don't know if I use the correct syntax, but I'm sure you will understand what I mean):
SELECT full_set.row_count AS multiple_rows_count,
COUNT(*)
FROM
(
SELECT col1, <%=__COLUMN_NAMES__%>,
COUNT(1)
OVER
(
PARTITION BY col1, <%=__COLUMN_NAMES__%>
) AS row_count,
ROW_NUMBER()
OVER
(
PARTITION BY col1, <%=__COLUMN_NAMES__%>
ORDER BY 1
) AS row_instance
FROM <%=__TABLE_NAME__%>
) full_set
WHERE
(
-- Restrict the rows returned to only those with duplicates.
(full_set.row_count > 1)
-- Restrict the rows returned to the first instance of duplicated rows.
AND (full_set.row_instance = 1)
)

In your column analysis you can then just select col2 as the field to be analyzed. It's basically a count of duplicates grouped by col1.
Keep in mind that you cannot drill into the data (show rows) of a UDI (at least not in 4.2.2). This is one thing I hope Talend will add soon.
I'm curious to see more feedback on this topic.
One Star djm
One Star

Re: Using User Defined Indicators over multiple columns - duplicate row de

Hi Manfred,
While your suggestion for hard-coding "col1" into the SELECT and then apply the UDI to "col2" would work for the sample table, I'm assessing the tool for the ability to define a pattern that can be reapplied many times. I'd hope to be able to define a suitably generic template that can be used across multiple tables. Hard-coding "col1" would defeat that purpose. Nonetheless, it's helpful to know that UDI's can only be associated to a single column and it wasn't that I had missing something.
Until you pointed it out, I hadn't appreciated that using a UDI prevented a drill-down into the rows. Doing some scanning of the forum suggests that it used to be possible but the capability was removed. Something to do with being too hard to interpret the SQL to obtain the result set from the summarised information. I'd suggest that this difficulty could be overcome with (for the SQL option) a minor tweak to the UDI "Indicator Definition" window, as described below.
Along the lines of
1) Move "Indicator Category" prior to the "Indicator Definition". The Indicator Category defines the columns expected in the result set.
2.1) Within the "Edit Expression" window, extra input fields should be available, one for each column expected per the "Indicator Category".
2.2) Each extra field would then contain the expression to be used for the given column
2.3) The existing expression field would then be used for just the SQL required to obtain the result set.
2.4) Probably need to have a separate (though optional) input field for a "GROUP BY" clause, which would be applied as a suffix to 2.3)
3) The current Open Profiler functionality would be achieved by it building the necessary SQL from the constituent elements in the window. It would need to wrap an outermost "SELECT" and "FROM" around the expression provided within 2.2) and use the "group by" clause field from 2.4).
4) The "view rows" capability would then be able to use the logic within 2.3) to retrieve the underlying data.
Of course, it's is always simpler to conceptualise a solution than it is to code it!
Regards,
David
One Star djm
One Star

Re: Using User Defined Indicators over multiple columns - duplicate row de

I created two new requests to cover these features;
1) Amend UDI definition to allow drill-down into result sets http://jira.talendforge.org/browse/TDQ-3490
2) Allow a UDI to be applied to a group of columns within a table http://jira.talendforge.org/browse/TDQ-3519
Feel free to follow the link and vote for the features if you think they are worthwhile.
Regards,
David
One Star

Re: Using User Defined Indicators over multiple columns - duplicate row de

Thanks David,
you got my votes Smiley Happy