How to filter the row data by comparing with each row

Five Stars

How to filter the row data by comparing with each row

Hi there,

 

I’d like to update original data with delta data.
(The original data is named “sample_org.tsv”, the delta is “sample_delta.tsv”)

 

Original Data

id name timestamp
------------------------------------------
1 Jorn 2019-05-11 12:49:01
2 Ken 2019-05-11 13:39:11
3 Smith 2019-05-11 17:53:33
4 Taro 2019-05-12 13:42:21
5 Hanako 2019-05-12 17:55:06

 

Delta Data
2 Kana 2019-05-13 11:54:21
3 Kenji 2019-05-13 11:54:51
6 James 2019-05-13 11:55:06

First of all, I united the both data files into one file with tUnite component.
After that, I separated the data into two groups using tMatchGroup; Unique and Match(It can be named Duplicate).
(The unique data file is named “sample_unique.tsv”, the duplicated is “sample_dup”)

 

Unique Data
5 Hanako 2019-05-12 17:55:06
6 James 2019-05-13 11:55:06
4 Taro 2019-05-12 13:42:21
1 Jorn 2019-05-11 12:49:01

 

Duplicate Data
2 Kana 2019-05-13 11:54:21
2 Ken 2019-05-11 13:39:11
3 Kenji 2019-05-13 11:54:51
3 Smith 2019-05-11 17:53:33


And then, I want to filter the Duplicate Data on the same ID with timestamp.
I want to delete the old data.

 

Duplicate Data
2 Kana 2019-05-13 11:54:21
2 Ken 2019-05-11 13:39:11  deleted because this row data is older than the row data above.
3 Kenji 2019-05-13 11:54:51
3 Smith 2019-05-11 17:53:33 deleted because this row data is older than the row data above

 

After this filtering, I want to unite the Unique Data with the processed data above.


What component and how can I realize this on Talend Studio??

 

Employee

Re: How to filter the row data by comparing with each row

Hi,

 

    Your intention is good and interesting one. But if you are trying to do it in a single subjob, it may not be possible as it will become cyclic dependency.

 

     I hope you are pretty clear till segregating Unique and Duplicate data and I would advise to stop your subjob at this logical point. You can capture the output either to interim tHashOutput or interim files for futher processing.

 

    The next step is to filter the old data in duplicate data set. You can pass the data to taggregaterow to identify the oldest or latest record using min and max functions. Once you identify this dataset, you can use it as lookup where you can join with duplicate dataset (main flow) using inner join condition. In this way, you will be able to get the unique record from duplicate dataset. 

 

    After that you can use tUnite to join according to your requirement.

 

Warm Regards,
Nikhil Thampi

Please appreciate our Talend community members by giving Kudos for sharing their time for your query. If your query is answered, please mark the topic as resolved :-)

 


Warm Regards,
Nikhil Thampi
Please appreciate our members by giving Kudos for spending their time for your query. If your query is answered, please mark the topic as resolved :-)
Five Stars

Re: How to filter the row data by comparing with each row

Thank you for quick reply and useful advise!

 

By using tAggregationRow, I could filter the duplicate data set into unique date set with the newest rows.

 

2 Ken 2019-05-13 11:54:21
3 Smith 2019-05-13 11:54:51

 

After this filtering, I'm going to do unite the the unique data set created by the tMatchGroup and the unique data set created by tAggregationRow as I mentioned above.

 

If there is any problem with this process, please give me advise again.

Five Stars

Re: How to filter the row data by comparing with each row

Hi,

 

Please let me ask one more question.

This is very simple one.

 

After executing the tMatchGroup component, the I found the extra columns added to my output data set like GID, GRP_SIZE, MASTER, SCORE, GRP_QUALITY.

 

How can I trim these colums after tMatchGroup??

What component should I use for this?

 

Thanks in advance.

Highlighted
Employee

Re: How to filter the row data by comparing with each row

Hi,

 

    You can pass the columns to a tMap and in the output section, you can ignore the unwanted columns.

 

Warm Regards,
Nikhil Thampi

Please appreciate our Talend community members by giving Kudos for sharing their time for your query. If your query is answered, please mark the topic as resolved :-)

 


Warm Regards,
Nikhil Thampi
Please appreciate our members by giving Kudos for spending their time for your query. If your query is answered, please mark the topic as resolved :-)
Seven Stars

Re: How to filter the row data by comparing with each row

Another simple way to Implement the SCD Type 1TalendImage.JPG

 

 Merge Main and Delta --> Sort on ID and TimeStamp --> aggregate --> Output

Regards
Abhishek KUMAR
( Please mark the post as resolved if you found your solution )

Cloud Free Trial

Try Talend Cloud free for 30 days.

Tutorial

Introduction to Talend Open Studio for Data Integration.

Definitive Guide to Data Integration

Practical steps to developing your data integration strategy.

Definitive Guide to Data Quality

Create systems and workflow to manage clean data ingestion and data transformation.