How to filter the row data by comparing with each row

Six Stars

How to filter the row data by comparing with each row

Hi there,

 

I’d like to update original data with delta data.
(The original data is named “sample_org.tsv”, the delta is “sample_delta.tsv”)

 

Original Data

id name timestamp
------------------------------------------
1 Jorn 2019-05-11 12:49:01
2 Ken 2019-05-11 13:39:11
3 Smith 2019-05-11 17:53:33
4 Taro 2019-05-12 13:42:21
5 Hanako 2019-05-12 17:55:06

 

Delta Data
2 Kana 2019-05-13 11:54:21
3 Kenji 2019-05-13 11:54:51
6 James 2019-05-13 11:55:06

First of all, I united the both data files into one file with tUnite component.
After that, I separated the data into two groups using tMatchGroup; Unique and Match(It can be named Duplicate).
(The unique data file is named “sample_unique.tsv”, the duplicated is “sample_dup”)

 

Unique Data
5 Hanako 2019-05-12 17:55:06
6 James 2019-05-13 11:55:06
4 Taro 2019-05-12 13:42:21
1 Jorn 2019-05-11 12:49:01

 

Duplicate Data
2 Kana 2019-05-13 11:54:21
2 Ken 2019-05-11 13:39:11
3 Kenji 2019-05-13 11:54:51
3 Smith 2019-05-11 17:53:33


And then, I want to filter the Duplicate Data on the same ID with timestamp.
I want to delete the old data.

 

Duplicate Data
2 Kana 2019-05-13 11:54:21
2 Ken 2019-05-11 13:39:11  deleted because this row data is older than the row data above.
3 Kenji 2019-05-13 11:54:51
3 Smith 2019-05-11 17:53:33 deleted because this row data is older than the row data above

 

After this filtering, I want to unite the Unique Data with the processed data above.


What component and how can I realize this on Talend Studio??

 

Employee

Re: How to filter the row data by comparing with each row

Hi,

 

    Your intention is good and interesting one. But if you are trying to do it in a single subjob, it may not be possible as it will become cyclic dependency.

 

     I hope you are pretty clear till segregating Unique and Duplicate data and I would advise to stop your subjob at this logical point. You can capture the output either to interim tHashOutput or interim files for futher processing.

 

    The next step is to filter the old data in duplicate data set. You can pass the data to taggregaterow to identify the oldest or latest record using min and max functions. Once you identify this dataset, you can use it as lookup where you can join with duplicate dataset (main flow) using inner join condition. In this way, you will be able to get the unique record from duplicate dataset. 

 

    After that you can use tUnite to join according to your requirement.

 

Warm Regards,
Nikhil Thampi

Please appreciate our Talend community members by giving Kudos for sharing their time for your query. If your query is answered, please mark the topic as resolved :-)

 

Six Stars

Re: How to filter the row data by comparing with each row

Thank you for quick reply and useful advise!

 

By using tAggregationRow, I could filter the duplicate data set into unique date set with the newest rows.

 

2 Ken 2019-05-13 11:54:21
3 Smith 2019-05-13 11:54:51

 

After this filtering, I'm going to do unite the the unique data set created by the tMatchGroup and the unique data set created by tAggregationRow as I mentioned above.

 

If there is any problem with this process, please give me advise again.

Six Stars

Re: How to filter the row data by comparing with each row

Hi,

 

Please let me ask one more question.

This is very simple one.

 

After executing the tMatchGroup component, the I found the extra columns added to my output data set like GID, GRP_SIZE, MASTER, SCORE, GRP_QUALITY.

 

How can I trim these colums after tMatchGroup??

What component should I use for this?

 

Thanks in advance.

Highlighted
Employee

Re: How to filter the row data by comparing with each row

Hi,

 

    You can pass the columns to a tMap and in the output section, you can ignore the unwanted columns.

 

Warm Regards,
Nikhil Thampi

Please appreciate our Talend community members by giving Kudos for sharing their time for your query. If your query is answered, please mark the topic as resolved :-)

 

Eleven Stars

Re: How to filter the row data by comparing with each row

Another simple way to Implement the SCD Type 1TalendImage.JPG

 

 Merge Main and Delta --> Sort on ID and TimeStamp --> aggregate --> Output

Regards
Abhishek KUMAR

15TH OCTOBER, COUNTY HALL, LONDON

Join us at the Community Lounge.

Register Now

2019 GARNER MAGIC QUADRANT FOR DATA INTEGRATION TOOL

Talend named a Leader.

Get your copy

OPEN STUDIO FOR DATA INTEGRATION

Kickstart your first data integration and ETL projects.

Download now

What’s New for Talend Summer ’19

Watch the recorded webinar!

Watch Now

Best Practices for Using Context Variables with Talend – Part 4

Pick up some tips and tricks with Context Variables

Blog

How Media Organizations Achieved Success with Data Integration

Learn how media organizations have achieved success with Data Integration

Read

6 Ways to Start Utilizing Machine Learning with Amazon We Services and Talend

Look at6 ways to start utilizing Machine Learning with Amazon We Services and Talend

Blog