Four Stars

Source loads as many times as tReplicate

Hi,
I'm using TOS 4.2.2 on Win2008 64bits.
With the job attached below, the source dwh_D_Produit loads 3 times : I can see in the GUI, in the row1 component, the number of lines increasing to the max (2867408) 3 times in a row before the rest of the job can carry on.
Can anybody explain why this is happening ?
Because of this loading three times, I got a java.lang.OutOfMemoryError: GC overhead limit exceeded that I resolved increasing the Xmx to 4096M (previously at 1024M). Still
the source is loading 3 times in the job.
9 REPLIES
One Star

Re: Source loads as many times as tReplicate

Do you have an iterate in the job?
Can you post a clear description of your job?
Four Stars

Re: Source loads as many times as tReplicate

Sorry I hadn't realised the image didn't upload.
There is no iteration, but a tReplicate.
Four Stars

Re: Source loads as many times as tReplicate

Please has someone got an idea on how to correct this issue ?
My original issue was bypassed by increasing the Java memory on the machine, but as the source table grows in size (now reached millions of lines), loading it three times causes java.lang.OutOfMemoryError quite often.
Thanks.
Moderator

Re: Source loads as many times as tReplicate

Hi,
Why did you connect tMap components using main row? I mean there is circles in your job design. Did you have any special requirement on it ?
Have you checked article about TalendHelpCenter:Can I create a Job with multiple paths from a single source to the same target?
Best regards
Sabrina
--
Don't forget to give kudos when a reply is helpful and click Accept the solution when you think you're good with it.
Four Stars

Re: Source loads as many times as tReplicate

Hi,
Thanks for replying.
Yes I've found and read your article just after editing my post. From memory I had tried a similar solution last year and although it loads only once, it failed loading in memory millions of data.
The aim here is the following, maybe there is a better way of doing it ?
"Staging_tbl_Temp_Mouvement" contains all yesterday's shop transactions, which is really a detail of all the products sold. It usually contains between 200k and 500k of records.
"dwh_D_Produit" is our main product reference database and contains about 7.5M of records (for now).
The trouble is that the shops don't always use our main product reference table to create the products on their retail database. Therefore the daily transactions come with all kind of identification for the same product.
I'm trying to match that product the best I can from different maners in this job.
First step with tMap1 is to match the product from the transaction with the main table using the barre code 1. If match is found then that product goes into a temp table, otherwise it goes into tMap2 where I try matching with barre code 2. If that also fails the I try in tMap3 to match with brand, product reference, color and size.
While writing this reply I was trying out 2 solutions based on your article, see pictures attached. They both failed because they're loading once in the tHashOutputs and again in the tHashInputs, the memory usage keeps increasing during these loads and eventuall crashes before even trying to read the transaction table!
Employee

Re: Source loads as many times as tReplicate

Hi,
Here is my solution to your use case.
1st point : performance and memory : My solution is oriented to optimize performance and memory.
2nd point : You want to compare all rows from Staging_tbl_Temp_Mouvement with any potential rows from your product catalog.
In order to make sure, that we join all rows, I have added fake columns with value 1 in both input and lookup and join datas on this fake column.
Then, from this starting point I will compare datas in the column that might match.
3rd point : I guess that you will probably want to write the results of the 2 options to the same table at the end.
I'm using the join output feature in order to keep a single output with the same schema, but with different content.
Hope this helps.
Four Stars

Re: Source loads as many times as tReplicate

Hi, and happy new year !
Many thanks for this answer, I didn't know about the join output inside the tMap, it's a great feature. It makes the solution very simple in logic and design.
I've tried your suggestion and the only change I made is that I had to set the match model in the lookup to "All matches" in order to have all the desired results.
If I leave the match model to "unique match", only the first row from main is matched and the other are discarded.
I've first tried with some sample tables containing a dozens of rows only to validate the model and results.
The trouble of having set the match model to "All matches" is that it is doing a cross join and when I tried on the real database (7Millions in lookup x 300000 in main) , the performances are much worse than my original model.
My original model runs in about 15 min. The new one I stopped after 2h, it had only done 10%.
I attached some screen dumps in case I made something wrong.
In terms of memory, both model (original and new) end up using all what's available.
Any more advise appreciated ?
Thanks
Employee

Re: Source loads as many times as tReplicate

Match model doesn't impact memory or performance.
Are you still working with a tLogRow in output ?
This is far to be a performant component to print 7M*300 000 rows.
I would suggest to use a tFileOutputDelimited, or any other kind of component.
About memory, if your job consums too much memory for your system, you can check the "Store on disk option" in the tMap. But this option will necessarly have some impact on performances.
Kind Regards
Four Stars

Re: Source loads as many times as tReplicate

I've run the job this time in real conditions, taking the real tables as input and writing out into a copy of the table it would normally write into. You're right : removing the tlogRow did improve somewhat the perfomances. Still I had to stop the job after an hour, it had only treated 2000 rows from the main input out of 110800.
Having joined the tables on a fake columns is similar to doing a cross join and then filtering the results on the conditions placed in the tMaps outputs. I think this might be where the performance issue is. In my original model the filtering is made directly at the source, on the join between table. I hit a memory problem only because the biggest table is loaded 3 times in memory before the job can actually start.