Performance issue with below design

rm

Gurus,
I'm new to Talend and I'm stuck on a performance issue. Kindly help me fix it.
I have millions of records. There is no option to extract the data from a database; everything comes from files.

>>Removing duplicates and retaining the record with the max date (using the tSortRow and tUniqRow components)
>>Applying different filter conditions in tMap and tFilterRow
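For context, the dedupe-by-max-date step can also be done in a single hash-based pass instead of sort-then-unique. This is a minimal standalone sketch (field names are illustrative, not your actual schema); memory here is bounded by the number of distinct keys rather than total rows:

```java
import java.time.LocalDate;
import java.util.HashMap;
import java.util.Map;

public class DedupMaxDate {
    // Hypothetical record: a key column plus a date column.
    public record Row(String key, LocalDate date) {}

    // Keep, per key, the row with the latest date -- the same result the
    // tSortRow + tUniqRow pair produces, but without a full sort.
    public static Map<String, Row> dedup(Iterable<Row> rows) {
        Map<String, Row> latest = new HashMap<>();
        for (Row r : rows) {
            // merge() passes (existing, incoming); keep whichever is later.
            latest.merge(r.key(), r, (a, b) -> a.date().isAfter(b.date()) ? a : b);
        }
        return latest;
    }

    public static void main(String[] args) {
        var rows = java.util.List.of(
            new Row("A", LocalDate.of(2020, 1, 1)),
            new Row("A", LocalDate.of(2021, 5, 3)),
            new Row("B", LocalDate.of(2019, 7, 7)));
        System.out.println(dedup(rows).get("A").date()); // 2021-05-03
    }
}
```

This trades sort cost for a hash map sized by distinct keys, which may or may not fit memory depending on your key cardinality.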

The job failed with: Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
>>Increased the VM argument to -Xmx4096M
Still got the same error.
>>Wrote to a temp file in tMap and sorted on disk in tSortRow.
Got the same error.

My questions:
-->Sort is the main culprit. Are there other possible ways to sort the data (I don't have a staging database to sort in)?
-->I'm reading the same reference file twice, because I cannot connect a single tFileInputDelimited to two tMap lookups. Is there any way to read the file only once?
-->How can the overall design be improved?
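On the first question: sorting millions of rows without a staging database is normally done with an external merge sort, which is essentially what tSortRow's sort-on-disk option does. A minimal sketch (illustrative, not Talend's generated code): sort runs that fit in memory, spill each run to a temp file, then k-way merge with a priority queue.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

public class ExternalSort {
    // One cursor per sorted run: its current head line plus the reader behind it.
    private record Head(String line, BufferedReader reader) {}

    public static void sort(Path in, Path out, int runSize) throws IOException {
        // Phase 1: split the input into sorted runs spilled to temp files.
        List<Path> runs = new ArrayList<>();
        try (BufferedReader r = Files.newBufferedReader(in)) {
            List<String> buf = new ArrayList<>();
            for (String line = r.readLine(); line != null; line = r.readLine()) {
                buf.add(line);
                if (buf.size() >= runSize) { runs.add(spill(buf)); buf.clear(); }
            }
            if (!buf.isEmpty()) runs.add(spill(buf));
        }
        // Phase 2: k-way merge of the runs via a priority queue.
        PriorityQueue<Head> pq = new PriorityQueue<>(Comparator.comparing(Head::line));
        for (Path run : runs) {
            BufferedReader r = Files.newBufferedReader(run);
            String first = r.readLine();
            if (first != null) pq.add(new Head(first, r));
        }
        try (BufferedWriter w = Files.newBufferedWriter(out)) {
            while (!pq.isEmpty()) {
                Head h = pq.poll();
                w.write(h.line());
                w.newLine();
                String next = h.reader().readLine();
                if (next != null) pq.add(new Head(next, h.reader()));
                else h.reader().close();
            }
        }
    }

    private static Path spill(List<String> buf) throws IOException {
        Collections.sort(buf);                     // in-memory sort of one run
        Path tmp = Files.createTempFile("run", ".txt");
        Files.write(tmp, buf);                     // sorted run to disk
        return tmp;
    }
}
```

Memory use is bounded by runSize plus one buffered line per run during the merge, so heap no longer needs to hold the full data set.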
Any guidance would be greatly helpful.
Thanks
Seventeen Stars

Re: Performance issue with below design

I am struggling with your job design: tMap_3 and tMap_4 have no output at all and are therefore useless.
rm

Re: Performance issue with below design

The reference files have 8 columns. I'm applying some filters and removing a few columns in tMap_3 and tMap_4 (I cannot hold all the unwanted columns in the lookup buffer). This could be integrated into tMap_1 and tMap_2 as well, but for the sake of debugging (trace counts on each link) I used tMap_3 and tMap_4. Do you think that will affect the performance?
Thanks
One Star

Re: Performance issue with below design

Hi rajmhn,
You don't need tMap_3 and tMap_4; you can filter those columns in tMap_1 and tMap_2, and enable the sort-on-disk option in the tSortRow advanced settings.
You can also try removing the tFilterRow step and filtering in tMap instead (I don't expect much of a performance gain, but it's worth trying once).
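For reference, a tMap filter expression is just a Java boolean over the input row's columns. A standalone sketch of that kind of predicate (column names taken from this thread; the values tested are made up):

```java
public class TmapFilterSketch {
    // Hypothetical subset of the reference row's columns.
    public record Row(String c, Integer d, String e) {}

    // Equivalent in spirit to a tMap expression filter such as:
    //   row1.C != null && row1.C.equals("ACTIVE") && row1.D > 0
    // (the "ACTIVE" value and the > 0 check are illustrative assumptions).
    public static boolean keep(Row row) {
        return row.c() != null && row.c().equals("ACTIVE")
            && row.d() != null && row.d() > 0;
    }
}
```

Moving the predicate into tMap removes one component from the flow but evaluates the same logic per row, which is why the gain, if any, is small.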
Thanks,
Siva.
rm

Re: Performance issue with below design

Thanks, Siva.
You don't require tmap3 and tmap4 as well you can filter those columns in tmap1 & 2
>>The reference has 8 columns, and only a few are required. I cannot load all the columns into the tMap buffer. That's why I used the extra tMaps, and I'm also filtering records on a few conditions (though that could be implemented in tMap_1 and tMap_2). So I incorporated both functions in tMap_3 and tMap_4.
Do you think it can be accomplished without tMap_3 and tMap_4?
sort on disk option in tsort advanced settings
>>I already enabled it.
The job was very resource-consuming. I'm getting 3 million records from the source and 2.5 million from each reference. With -Xmx16384M allocated, the job completed in 6 minutes.
One general question: what sort algorithm does tSortRow use?
Thanks
rm

Re: Performance issue with below design

Someone please help me out.
Moderator

Re: Performance issue with below design

Hi rajmhn,
In reference, I have 8 columns. Only few required. I cannot take all the columns into tmap buffer. That's the reason, why I used tmap and also I'm filtering records based on few conditions(though can be implemented in tmap1 & tmap2).So I incorporated both functionalities in tmap3 & tmap4. 

That can be achieved in tMap_1 and tMap_2 without using tMap_3 and tMap_4.
What is the current row rate (rows/s) during data processing?
Best regards
Sabrina
--
Don't forget to give kudos when a reply is helpful and click Accept the solution when you think you're good with it.
rm

Re: Performance issue with below design

Thanks Sabrina.
It can be achieved in tMap_1 and tMap_2 without using tmap3 and tmap4.
>>I have 8 columns: A, B, C, D, E, F, G, H. I'm filtering records on C, D, E, F, G (tMap_3 and tMap_4) and taking only A, B, H into the reference buffer (tMap_1 and tMap_2). It can be accomplished without tMap_3 and tMap_4, but at the cost of taking all the columns A through H into the reference buffer. Correct me if I'm wrong.
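The column pruning described above can be pictured outside Talend like this. A sketch of a pruned lookup buffer, under assumed conditions (comma-delimited file, columns A..H in positions 0..7, an illustrative filter value on column C): read the reference file once, filter rows, and keep only the join columns in memory.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

public class PrunedLookup {
    // Only the columns the join needs are retained (B and H), keyed by A.
    public record Lookup(String b, String h) {}

    public static Map<String, Lookup> load(Path refFile) throws IOException {
        Map<String, Lookup> buffer = new HashMap<>();
        try (BufferedReader r = Files.newBufferedReader(refFile)) {
            String line;
            while ((line = r.readLine()) != null) {
                String[] f = line.split(",", -1);     // A..H = f[0]..f[7]
                if (!"KEEP".equals(f[2])) continue;    // filter on column C (value is made up)
                buffer.put(f[0], new Lookup(f[1], f[7])); // key A -> (B, H)
            }
        }
        return buffer;
    }
}
```

This is the behavior you want from the lookup flow: filter and prune before anything is buffered, so only 3 of the 8 columns (and only the surviving rows) ever occupy memory.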
What's the current rows/s during the data processing(row rate)?
>>It was around 5000 rows/sec.
Solutions to consider:
>>Split the job into two: one up to tMap_2, and another job for sorting and removing duplicates.
>>Write temp data to disk and assign less JVM Xmx memory.
>>Assign more JVM Xmx memory.
Which one would be the most feasible?
Thanks
rm

Re: Performance issue with below design

I am struggling with your job design. tMap_3 and tMap_4 have no output at all and therefore useless.

Thanks. I have 8 columns: A, B, C, D, E, F, G, H. I'm filtering records on C, D, E, F, G (tMap_3 and tMap_4) and taking only A, B, H into the reference buffer (tMap_1 and tMap_2). It can be accomplished without tMap_3 and tMap_4, but at the cost of taking all the columns A through H into the reference buffer. Correct me if I'm wrong.
Moderator

Re: Performance issue with below design

Hi,
tMap is a cache component that consumes a lot of memory. For a large data set, try storing the data on disk. Did you get an OutOfMemory error on your end?
Have you already checked the documentation: TalendHelpCenter:Exception outOfMemory?
Would you mind uploading screenshots of your tMap_3 and tMap_4 map editors to the forum?
Best regards
Sabrina
rm

Re: Performance issue with below design

Thanks.
Yes, initially I got the OutOfMemory error. I tried two scenarios.
Scenario 1:
>>Increased Xmx to 16 GB, and it worked. Performance was very good (6 minutes). Is it a good idea to use this much memory?
Scenario 2:
>>Reduced Xmx to 8 GB and used the store-on-disk option in tMap_1 and tMap_2, but performance was not good. With this option, tMap sorts the data and stores it on disk before the join.
I didn't apply store-on-disk to tMap_3 and tMap_4. Do you think that would be a good idea?
I cannot upload the screenshot. I'm getting this error:
Error : The server was unable to save the uploaded file. Please contact the forum administrator at

In tMap_3 and tMap_4, I remove the unwanted columns (from 9 down to 3) and filter the records on a few conditions.
Thanks
One Star

Re: Performance issue with below design

You could split this into multiple jobs that do the filtering and deduplication first, then pass the cleaned data into the job above without tMap_3 and tMap_4, and remove the tSortRow, tUniqRow, and tFilterRow components.