optimization of Talend job with tnormalize


Hello,
I have a field in one table which is a list of ids separated by commas (e.g. '1,3,5,8').
Another table holds the label for each of these ids, and I would like my final field to contain the list of labels (e.g. 'Test, Acc, Prod, Finish').
I have found one way to solve this, using tNormalize and then tDenormalize. The problem is that my Talend job is very, very slow if I put this part as a lookup on my main flow, and it does not work at all if I put it in my main flow (I get a Java memory error), because it handles more than 75,000 rows with a lot of columns.
Anyway, my question is: how can I handle this case? Is there another way that could make my job faster?
Thank you for your help
Regards
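For what it's worth, the whole transformation can also be done without tNormalize/tDenormalize, e.g. in a Talend routine or a tJavaRow. Below is a minimal plain-Java sketch of the logic only, not Talend code; the label values are invented for illustration, and in a real job the map would be filled from the label table:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringJoiner;

public class LabelMapper {
    // Label lookup loaded once (in a real job this would come from the label table).
    static final Map<String, String> LABELS = new HashMap<>();
    static {
        LABELS.put("1", "Test");
        LABELS.put("3", "Acc");
        LABELS.put("5", "Prod");
        LABELS.put("8", "Finish");
    }

    // Turn "1,3,5,8" into "Test,Acc,Prod,Finish" without normalize/denormalize.
    static String toLabels(String idList) {
        StringJoiner out = new StringJoiner(",");
        for (String id : idList.split(",")) {
            String key = id.trim();
            // Fall back to the raw id if no label exists for it.
            out.add(LABELS.getOrDefault(key, key));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toLabels("1,3,5,8")); // Test,Acc,Prod,Finish
    }
}
```

Because the label table is loaded into a HashMap once, each of the 75,000 rows then only costs a string split and a few hash lookups.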

Re: optimization of Talend job with tnormalize

could you not preprocess the dataset that contains your "array" field *before* trying to join it?
That way you could run a 'select ID, col_ARRAY from urTable', load the result into Talend and denormalise it,
save it into a temp table/file and then use that.
See if that helps.

Re: optimization of Talend job with tnormalize

Hi
Save the temporary data from tNormalize and tDenormalize into a table/file; don't look it up directly.
Index the key columns of your tables when you update or insert data.
If memory is insufficient, change the JVM parameters as follows:
set JAVA_OPTS=-Xms500m -Xmx500m -XX:MaxNewSize=280m -XX:MaxPermSize=280m

With these settings, it may work even if you put it in the main flow.
Best regards!
Pedro

Re: optimization of Talend job with tnormalize


Thank you for your suggestion; I had already considered this option. The problem is that it would also be slow, because I would have to load all my IDs twice! I have more than 75,000 IDs in Dev and more than 100,000 in the Prod environment, so loading every row of my tables twice would also be slow.
Do you see what I mean?

Re: optimization of Talend job with tnormalize

Thank you both for your answers. You seem to be saying the same thing: that it could be faster to put this information in another table. But I am not sure I understand exactly.
Let me add some more information:
before the tNormalize I have this kind of data:
ID | col_array
1234 | 1,2,5
1235 | 4,7
...
after the tNormalize I have:
ID | col_array
1234 | 1
1234 | 2
1234 | 5
1235 | 4
1235 | 7
...
then I join col_array with my label table and I obtain:
ID | col_array
1234 | Label1
1234 | Label2
1234 | Label5
1235 | Label4
1235 | Label7
...
then I use my tDenormalize and I finally obtain what I was expecting:
ID | col_array
1234 | Label1,Label2,Label5
1235 | Label4,Label7
...
The problem is that my ID column has more than 75,000 rows, so it is slow, and it may grow further in a few months. That's why I need to optimize my job.
So which part of this data are you both recommending I put in another table: the part before or after the tNormalize/tDenormalize?
And would I have to load all my IDs again?
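For reference, the three steps described above (tNormalize, the label join, tDenormalize) can be sketched in plain Java on the sample data. This only illustrates the data flow, it is not Talend code; the rows and label values are taken from the example:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PipelineSketch {
    static Map<String, String> run(Map<String, String> rows, Map<String, String> labels) {
        // Step 1 - tNormalize: one (ID, element) pair per array element.
        List<String[]> normalized = new ArrayList<>();
        for (Map.Entry<String, String> r : rows.entrySet())
            for (String id : r.getValue().split(","))
                normalized.add(new String[]{r.getKey(), id});

        // Step 2 - lookup join: replace each element by its label.
        for (String[] p : normalized)
            p[1] = labels.getOrDefault(p[1], p[1]);

        // Step 3 - tDenormalize: regroup the labels per ID, comma-separated.
        Map<String, String> out = new LinkedHashMap<>();
        for (String[] p : normalized)
            out.merge(p[0], p[1], (a, b) -> a + "," + b);
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> rows = new LinkedHashMap<>();
        rows.put("1234", "1,2,5");
        rows.put("1235", "4,7");
        Map<String, String> labels = Map.of("1", "Label1", "2", "Label2",
                "4", "Label4", "5", "Label5", "7", "Label7");
        run(rows, labels).forEach((id, l) -> System.out.println(id + " | " + l));
        // 1234 | Label1,Label2,Label5
        // 1235 | Label4,Label7
    }
}
```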

Re: optimization of Talend job with tnormalize

Hi
It's hard to optimize performance based only on a few words, especially since we don't have the details of your input data or your current system environment.
The easiest approach is to separate your job into several parts, one per tMap, and avoid doing all the lookups at the same time.
Save the temporary results into files or tables.
Use tRunJob to combine all these subjobs.
Hope this helps.
Best regards!
Pedro

Re: optimization of Talend job with tnormalize

so you want to create a 'col_array_label' using your 'col_array'
may I suggest:
get the distinct values of your 'col_array':
select distinct col_array from urTable

then denormalise all of these,
look up their labels and create your 'col_array_label' output,
*then*
inner join that output with your dataset.
this should work pretty well if you have a small universe (few combinations) in your col_array
*suggestion* - you may want to review this data structure and keep these 'arrays' in another table with a PK, and refer to that instead of storing them multiple times in your table; even in a reporting DB.
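The idea above can be illustrated with a small caching sketch in plain Java (the method name and label values are invented for illustration): each distinct col_array value is translated only once and then reused, so the expensive translation runs for the few hundred distinct combinations rather than for all 75,000 rows.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringJoiner;

public class DistinctArrayLookup {
    // Translate one col_array value, caching the result so that every row
    // carrying the same combination reuses it instead of re-translating.
    static String toLabels(String colArray,
                           Map<String, String> labels,
                           Map<String, String> cache) {
        return cache.computeIfAbsent(colArray, arr -> {
            StringJoiner out = new StringJoiner(",");
            for (String id : arr.split(","))
                out.add(labels.getOrDefault(id, id));
            return out.toString();
        });
    }

    public static void main(String[] args) {
        Map<String, String> labels = new HashMap<>();
        labels.put("1", "Label1");
        labels.put("2", "Label2");
        labels.put("5", "Label5");
        Map<String, String> cache = new HashMap<>();
        // 75,000 rows may share only a few hundred distinct combinations:
        System.out.println(toLabels("1,2,5", labels, cache)); // Label1,Label2,Label5
        System.out.println(toLabels("1,2,5", labels, cache)); // served from the cache
    }
}
```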

Re: optimization of Talend job with tnormalize


Thank you for your proposition. The process is actually more or less what I have developed, but the difference is that you propose to take the distinct values and, after the denormalize, join on this field rather than on the ID of my main flow. That means my normalize lookup only handles a few hundred rows instead of 75,000. This is a great idea!
Thank you for your help; until a better way comes along, I think this is the best one for the moment :)