Data duplication handling - scenario

Seven Stars


Hi,

 

Can I get a solution for the scenario below?

I am getting two records for a single product. The only difference between them is one column value, but I need both values loaded into the target in different columns, so duplicate removal won't help me.

It would be great to get an idea of the optimal solution, because a huge number of records is coming from the source.

 

Talend version: Open Studio for Big Data 7.0.1

Target: Salesforce

 

source:

ProductName  ImageType  imageurl
Nokia 610    LARGE      /devices/generic-phone.png
Nokia 610    SMALL      /devices/5145.jpg

 

Required data on target (based on ImageType above):

ProductName  main_image_URL              thumbnail_image_URL
Nokia 610    /devices/generic-phone.png  /devices/5145.jpg

 

I tried writing an expression in tMap, but the output is not as expected (see below); it creates duplicate records in the target:

 

ProductName  main_image                  thumbnail_image
Nokia 8110   /devices/generic-phone.png  (empty)
Nokia 8110   (empty)                     /devices/5145.jpg

Accepted Solutions
Employee

Re: Data duplication handling - scenario

Hi,

 

Why don't you take it as two data sets at the beginning and then do an inner join?

 

Dataset one: where ImageType = "LARGE"

ProductName  ImageType  imageurl
Nokia 610    LARGE      /devices/generic-phone.png

Dataset two: where ImageType = "SMALL"

ProductName  ImageType  imageurl
Nokia 610    SMALL      /devices/5145.jpg

 

Now do an inner join on ProductName in tMap and map the two imageurl values to the output flow as two separate columns.

 

Mapping in the tMap

 

ProductName -> ProductName

imageurl (SMALL dataset) -> thumbnail_image_URL

imageurl (LARGE dataset) -> main_image_URL

 

This should give the desired output.
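The same split-and-join logic can be sketched in plain Java to make the expected result concrete. This is only an illustration of the idea, not the Talend job itself: the class and method names (`ImagePivot`, `urlsByProduct`, `pivot`) are made up for this example, and it assumes at most one LARGE and one SMALL row per product (if there are more, the last one wins).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of "filter into two datasets, then inner join on ProductName".
class ImagePivot {

    // One source record: ProductName, ImageType, imageurl.
    static final class SourceRow {
        final String productName;
        final String imageType;
        final String imageUrl;

        SourceRow(String productName, String imageType, String imageUrl) {
            this.productName = productName;
            this.imageType = imageType;
            this.imageUrl = imageUrl;
        }
    }

    // One target record: ProductName, main_image_URL, thumbnail_image_URL.
    static final class TargetRow {
        final String productName;
        final String mainImageUrl;
        final String thumbnailImageUrl;

        TargetRow(String productName, String mainImageUrl, String thumbnailImageUrl) {
            this.productName = productName;
            this.mainImageUrl = mainImageUrl;
            this.thumbnailImageUrl = thumbnailImageUrl;
        }
    }

    // Keeps only the rows of one ImageType, keyed by ProductName (the "dataset" filter).
    static Map<String, String> urlsByProduct(List<SourceRow> rows, String imageType) {
        Map<String, String> urls = new LinkedHashMap<>();
        for (SourceRow row : rows) {
            if (imageType.equals(row.imageType)) {
                urls.put(row.productName, row.imageUrl); // last row wins on repeats
            }
        }
        return urls;
    }

    // Inner join of the LARGE and SMALL datasets on ProductName.
    static List<TargetRow> pivot(List<SourceRow> rows) {
        Map<String, String> large = urlsByProduct(rows, "LARGE");
        Map<String, String> small = urlsByProduct(rows, "SMALL");
        List<TargetRow> result = new ArrayList<>();
        for (Map.Entry<String, String> entry : large.entrySet()) {
            String thumbnail = small.get(entry.getKey());
            if (thumbnail != null) { // products missing either image are dropped
                result.add(new TargetRow(entry.getKey(), entry.getValue(), thumbnail));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<SourceRow> rows = Arrays.asList(
                new SourceRow("Nokia 610", "LARGE", "/devices/generic-phone.png"),
                new SourceRow("Nokia 610", "SMALL", "/devices/5145.jpg"));
        for (TargetRow t : pivot(rows)) {
            System.out.println(t.productName + " | " + t.mainImageUrl + " | " + t.thumbnailImageUrl);
        }
    }
}
```

Because it is an inner join, a product that has only one of the two image types produces no output row; a left outer join would be the alternative if such products must still be loaded.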

 

Warm Regards,
Nikhil Thampi

Please appreciate our Talend community members by giving Kudos for sharing their time for your query. If your query is answered, please mark the topic as resolved :-)

 

 

 

 


All Replies
Fourteen Stars

Re: Data duplication handling - scenario

@Vibin_CT, check the job below.

[Four screenshots of the job: Untitled.png]

Manohar B
Don't forget to give kudos/accept the solution when a reply is helpful.
Six Stars

Re: Data duplication handling - scenario

Have you tried using the tUniqueRow component before the tMap?

I think what you need is something like this:

[Screenshot: image.png]

 

The desired output should be something like this

 

Seven Stars

Re: Data duplication handling - scenario

Hi @nikhilthampi ,

 

Thank you very much for your solution.

To use your method, I would need to load the data into an intermediate database table, because the data I described does not come directly from the source. It is intermediate data produced after many transformations, and I was unable to load it into the MySQL staging database because of a MySQL table size limitation (some columns contain appended data and are very large). So I am using the tHashOutput component.

I am also planning to use tFileOutputDelimited instead of tHashOutput because of the huge number of records and the record size. Can you suggest which component is better in terms of memory and performance?

Employee

Re: Data duplication handling - scenario

Hi,

 

Considering your use case, park the data as an interim file using tFileOutputDelimited, and read it back with tFileInputDelimited.

 

Also increase the memory parameters (-Xms and -Xmx) of the job for better job performance.
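In Talend Studio these JVM arguments are typically set in the Run view under Advanced settings ("Use specific JVM arguments"), for example `-Xms1024M -Xmx4096M`; those values are only illustrative and should be sized to the machine. As a minimal sketch for checking that the new limit is actually in effect, the following Java snippet prints the maximum heap the JVM will use. The class name `HeapCheck` is hypothetical; the same one-line call could also be placed in a tJava component.

```java
// Hypothetical helper: reports the maximum heap available to the running JVM.
class HeapCheck {

    // Runtime.maxMemory() reflects the -Xmx setting (or the JVM default if unset).
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap (MB): " + maxHeapMb());
    }
}
```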

 

Warm Regards,
Nikhil Thampi


Seven Stars

Re: Data duplication handling - scenario

Thanks @nikhilthampi !!
