[resolved] tUniqueRow java heapspace issue

One Star

Hi All,
I am essentially trying to do a select distinct to get unique rows from a relatively small data set - 420,000 rows x 55 columns. I am using the tUniqueRow component and persistently getting Java heap space errors.
I have tried a number of options: increasing the jvm parameter up to 2048; increasing the page file; using tHashOutput and tHashInput components; doing the unique on a single column (ideally I would like to do it across all columns); and writing the data set out to a delimited file in my parent job, then moving the tUniqueRow into a separate job that reads the delimited file back in.
I tried all of the above first with the tUniqueRow component at its standard settings, and then again with the component set to use disk with a buffer size of 1000. It seems to make little difference to the final outcome.
With the disk and buffer size settings, the job manages to load all rows into the tUniqueRow component, but then fails with the Java heap space error before outputting any results. I have tried output to a delimited file (preferred), to tHashOutput and even to tLogRow, in case writing to the delimited file was causing the error.
I suspect the large number of columns is the problem, but I am not sure how to remedy it easily.
Any ideas???

Error as follows -
Starting job BMD01_UniqueRow at 11:49 05/12/2012.

connecting to socket on port 3550
connected
Exception in thread "main" java.lang.Error: java.lang.OutOfMemoryError: Java heap space
disconnected
at moscow1.bmd01_uniquerow_0_1.BMD01_UniqueRow.tFileInputDelimited_1Process(BMD01_UniqueRow.java:5212)
at moscow1.bmd01_uniquerow_0_1.BMD01_UniqueRow.runJobInTOS(BMD01_UniqueRow.java:5393)
at moscow1.bmd01_uniquerow_0_1.BMD01_UniqueRow.main(BMD01_UniqueRow.java:5258)
Caused by: java.lang.OutOfMemoryError: Java heap space
at moscow1.bmd01_uniquerow_0_1.BMD01_UniqueRow$1FileRowIterator_tUniqRow_1.load(BMD01_UniqueRow.java:4214)
at moscow1.bmd01_uniquerow_0_1.BMD01_UniqueRow$1FileRowIterator_tUniqRow_1.next(BMD01_UniqueRow.java:4239)
at moscow1.bmd01_uniquerow_0_1.BMD01_UniqueRow.tFileInputDelimited_1Process(BMD01_UniqueRow.java:4320)
... 2 more
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Thread-0"
Job BMD01_UniqueRow ended at 11:53 05/12/2012.

Accepted Solutions
One Star

Re: [resolved] tUniqueRow java heapspace issue

Thanks again for your help with this jlolling.
I have tried the tAddCRCRow and tAggregateRow job design you have suggested and it looks like it will work for my requirements - I assume I need to join the output from the tAggregateRow to the original file on CRC to get all of the non-aggregated columns back into my output. Sorry, not too familiar with the workings of the tAggregateRow component... Seems to work though.

All Replies
Community Manager

Re: [resolved] tUniqueRow java heapspace issue

Hi
As you did, store the data on disk in the tUniqRow component. Don't use any hash components or tLogRow in the job; they consume memory during job execution.
Shong
One Star

Re: [resolved] tUniqueRow java heapspace issue

Hi Shong,
Thanks for your reply, but I have tried this exact method. My job has only 3 components - tFileInputDelimited, tUniqueRow, tFileOutputDelimited.
Any other ideas / settings???
Moderator

Re: [resolved] tUniqueRow java heapspace issue

Hi,
What about your JVM settings? I saw that you mentioned:
increasing the jvm parameter up to 2048

I mean both -Xms and -Xmx, for example:
-vmargs
-Xms256m
-Xmx1024m
-XX:MaxPermSize=256m
This exception is thrown when less than 2% of the heap is available. Are there any other programs running on your computer?
Best regards
Sabrina
One Star

Re: [resolved] tUniqueRow java heapspace issue

Hi Sabrina,
Thanks for your reply. My config is:
-vmargs
-Xms64m
-Xmx2048m
-XX:MaxPermSize=256m
-Dfile.encoding=UTF-8
Should I change -Xms as well? I have tried running the job with no other programs running, but sadly it didn't fix the issue.
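One thing worth checking at this point: the `-vmargs` block above has the format of the Studio's own .ini file, and those values do not necessarily apply to the JVM that runs the job (in Talend Studio, job JVM arguments are normally set separately in the Run view). A minimal sketch for confirming the heap limit a JVM actually received is below. The class name `HeapCheck` is hypothetical; the same `Runtime` call could be placed in a tJava component at the start of the job.

```java
public class HeapCheck {
    // Maximum heap this JVM will attempt to use, in MiB (reflects the effective -Xmx).
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap available to this JVM: " + maxHeapMiB() + " MiB");
    }
}
```

If the printed value is far below 2048, the setting is not reaching the job's JVM, and changing the Studio configuration further will not help.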
Seventeen Stars

Re: [resolved] tUniqueRow java heapspace issue

I don't think more RAM will solve your problem. You need a different approach.
As a first step, I suggest building an MD5 or SHA-1 hash value over the columns that should be distinct, and writing the result (all columns plus the additional checksum) to a new file or database table.
After that, use the tAggregateRow component: group by the checksum column to detect uniqueness, and carry all other columns through with the "first" function in the operations area.
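The idea behind the checksum can be sketched in plain Java: instead of remembering every 55-column row to detect duplicates, only a fixed-size digest per row is remembered. This is an illustration of the concept, not the code Talend generates. The class `DigestDistinct`, its method names and the choice of separator byte are assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DigestDistinct {
    // Hex-encoded MD5 over all columns. A separator byte between columns keeps
    // ("ab", "c") distinct from ("a", "bc"). Note: null and "" hash identically here.
    static String rowDigest(String[] columns) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            for (String col : columns) {
                md5.update((col == null ? "" : col).getBytes(StandardCharsets.UTF_8));
                md5.update((byte) 0x1F); // assumed not to occur inside the data
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Keeps the first occurrence of each row; only the digests are retained for comparison.
    // A real job would write each kept row to the output file instead of collecting a list.
    static List<String[]> distinct(Iterable<String[]> rows) {
        Set<String> seen = new HashSet<>();
        List<String[]> kept = new ArrayList<>();
        for (String[] row : rows) {
            if (seen.add(rowDigest(row))) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[] {"a", "1"},
                new String[] {"b", "2"},
                new String[] {"a", "1"});
        System.out.println(distinct(rows).size() + " distinct rows");
    }
}
```

For 420,000 rows the set holds 420,000 short digest strings, which is far less than 420,000 full rows of 55 columns each.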
One Star

Re: [resolved] tUniqueRow java heapspace issue

Hi jlolling,
Thanks for your reply and I think you are right about requiring a redesign, however I have no experience in building a solution like you have suggested. I will do some research and see what I can work out.
Seventeen Stars

Re: [resolved] tUniqueRow java heapspace issue

To build the hash values, Talend has a dedicated component: tAddCRCRow. This component adds a CRC checksum over selected columns of your flow to the output flow as an additional column.
tFileInputDelimited (source) --> tAddCRCRow --> tFileOutputDelimited (temp file)
OnSubjobOk
tFileInputDelimited (temp file) --> tAggregateRow --> tFileOutputDelimited (target)
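What the second subjob does can be imitated in a few lines of Java: group on the checksum and keep the first row seen for each group. Because every other column is carried through with "first", the output already contains complete rows and no join back to the original file should be needed. The sketch assumes a 32-bit CRC computed with `java.util.zip.CRC32`; what tAddCRCRow computes internally is not stated in this thread, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.CRC32;

public class CrcGroupFirst {
    // 32-bit CRC over all columns, with a separator byte between columns.
    static long crcOf(String[] columns) {
        CRC32 crc = new CRC32();
        for (String col : columns) {
            crc.update((col == null ? "" : col).getBytes(StandardCharsets.UTF_8));
            crc.update(0x1F); // assumed not to occur inside the data
        }
        return crc.getValue();
    }

    // Group by checksum and keep the first row of each group ("first" aggregation).
    // The map holds whole rows only to keep the illustration short.
    static Collection<String[]> firstPerChecksum(Iterable<String[]> rows) {
        Map<Long, String[]> firstSeen = new LinkedHashMap<>();
        for (String[] row : rows) {
            firstSeen.putIfAbsent(crcOf(row), row);
        }
        return firstSeen.values();
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[] {"a", "1"},
                new String[] {"b", "2"},
                new String[] {"a", "1"});
        for (String[] row : firstPerChecksum(rows)) {
            System.out.println(String.join(";", row));
        }
    }
}
```

One caveat with a 32-bit checksum: two different rows can share the same value, and the second one would then be dropped as a false duplicate. If all 420,000 rows were distinct and the checksum behaved uniformly, the expected number of colliding pairs would be roughly 420,000² / (2 × 2³²), which is about 20. A longer digest such as MD5 or SHA-1 makes this negligible.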
Seventeen Stars

Re: [resolved] tUniqueRow java heapspace issue

In the enterprise edition there are dedicated hash components.
You can also look for useful components on Talend Exchange. Here is an example:
http://www.talendforge.org/exchange/index.php?eid=137&product=tos&action=view&nav=1,1,1