tUniqRow memory consumption (input data cannot fit in memory)

Hi everybody!
In our current project, we're checking the uniqueness of a fairly large amount of data (6 million rows). The tUniqRow component which we use for this task fails with an OutOfMemoryError when processing this quantity of data.
We've been investigating this issue and found that the tUniqRow component uses a HashSet under the hood to check for uniqueness. The amount of memory consumed by the HashSet therefore grows linearly with the number of distinct input rows.
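To illustrate what we mean, here is a minimal sketch of HashSet-based de-duplication in plain Java. It is not the actual tUniqRow source; the class and method names are made up for illustration, and rows are reduced to a single String key.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class HashSetUniqSketch {
    // Keeps the first occurrence of each key.
    // The set ends up holding every distinct key, so its size
    // (and its memory use) grows with the number of distinct rows.
    static List<String> uniq(List<String> keys) {
        Set<String> seen = new HashSet<>();
        List<String> firstOccurrences = new ArrayList<>();
        for (String key : keys) {
            // add() returns false when the key is already in the set
            if (seen.add(key)) {
                firstOccurrences.add(key);
            }
        }
        return firstOccurrences;
    }

    public static void main(String[] args) {
        System.out.println(uniq(List.of("b", "a", "b", "c", "a"))); // prints [b, a, c]
    }
}
```

With 6 million mostly distinct rows, the set has to hold close to 6 million keys at once, which is where the heap runs out.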
Obviously, we've tried increasing the maximum Java heap size, but this doesn't help because the number of rows that we need to process is far too large to fit in physical memory.
We've been thinking about delegating the uniqueness check to a temporary database instance (such as MySQL), but moving the data in and out of it would obviously be very time-consuming.
Isn't there an alternative Talend component that can deal with a quantity of data that cannot fit in memory, for example one that stores the uniqueness index on disk instead of in memory? As far as I know, the tMap component can already do that by storing intermediate data in a file on disk.
Looking forward to your answer.
Regards,
Olivier
3 REPLIES
Re: tUniqRow memory consumption (input data cannot fit in memory)

Hi Olivier,
Is your data sorted? If so, you could remember the last row and compare the current row with it (and suppress the current row if they are equal).
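A minimal sketch of that idea in plain Java, outside Talend. The class and method names are hypothetical, and each row is reduced to a single String key. It assumes the input is already sorted on that key, so that duplicates are always adjacent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

class SortedDedupSketch {
    // Assumes sortedKeys is sorted, so equal keys sit next to each other.
    // Only the previous key is remembered, whatever the input size.
    // (The result list exists only for the demo; a real job would
    // stream each kept row straight to its output.)
    static List<String> dropAdjacentDuplicates(Iterable<String> sortedKeys) {
        List<String> kept = new ArrayList<>();
        String last = null;
        boolean first = true;
        for (String current : sortedKeys) {
            if (first || !Objects.equals(current, last)) {
                kept.add(current);
            }
            last = current;
            first = false;
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> sorted = List.of("a", "a", "b", "c", "c");
        System.out.println(dropAdjacentDuplicates(sorted)); // prints [a, b, c]
    }
}
```

If the input is not sorted, duplicates that are not adjacent will slip through, so the sort (for example an ORDER BY in the source query) is a hard requirement.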
Bye
Volker
Re: tUniqRow memory consumption (input data cannot fit in memory)

Hi Volker,
How do you do that? I can order my data in my SQL query, but after that I don't know how to keep only the last row.
Cheers
Re: tUniqRow memory consumption (input data cannot fit in memory)

Hi Olivier,
I hope my answer isn't too late.
I think you could use a context variable to save the key of the last row (the key being the columns that should be unique). Then compare the key of the current row with this saved key. If they are the same, set a flag (define an additional boolean variable) and filter the row out with tFilterRow.
I didn't test it but I think it should not use additional memory.
tInput -> tJavaRow -> tFilterRow
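The tJavaRow code should be something like the following:
// Build a composite key from the columns that define uniqueness.
String actualKey = input_row.keyPart1 + "###A DELIMITER###" + input_row.keyPart2;
// The row is a duplicate when its key equals the previous row's key
// (this only works if the input is sorted on the key columns).
// equals(null) is false, so the very first row is never flagged.
output_row.filterThisRow = actualKey.equals(context.lastKey);
// Remember this key for the next row.
context.lastKey = actualKey;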
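A note on the delimiter: it is there so that two different pairs of key parts cannot produce the same concatenated key. Here is a small plain-Java sketch of the problem. The class and method names are made up for illustration, and it assumes the delimiter string never appears in the data itself.

```java
class CompositeKeySketch {
    // Assumption: this string never occurs inside a key part.
    static final String DELIMITER = "###A DELIMITER###";

    // Plain concatenation: ("ab", "c") and ("a", "bc") both give "abc".
    static String naiveKey(String part1, String part2) {
        return part1 + part2;
    }

    // Concatenation with a delimiter keeps the boundary between the parts.
    static String delimitedKey(String part1, String part2) {
        return part1 + DELIMITER + part2;
    }

    public static void main(String[] args) {
        // Two different rows that collide without a delimiter
        System.out.println(naiveKey("ab", "c").equals(naiveKey("a", "bc")));         // prints true
        // The same two rows stay distinct with the delimiter
        System.out.println(delimitedKey("ab", "c").equals(delimitedKey("a", "bc"))); // prints false
    }
}
```

Without the delimiter, the second of those two rows would wrongly be flagged as a duplicate of the first.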

Bye
Volker