One Star

tHashOutput and memory

Hi,
I am having some issues with memory usage when using tHash components.
Environment:
All components run on the same physical machine:
- Talend 3.2.0
- Windows XP SP2
- MySQL 5.1.41
- 2 GB physical RAM
- Java VM max heap 1024M
Job:
The attached job is an example case to illustrate what I am seeing. The standardisedObservations MySQL table contains about 5 million rows, each consisting of 16 strings, 2 integers and a long (primary key). The tMysqlInput_1 component that reads this table has streaming turned on. The tHashOutput is set to persist to file. The tHashInput is set to read its data from the hash file output by the tHashOutput.
Expected:
- Very little memory usage in the Talend job, as the rows should be persisted to file. I need to load an arbitrarily large row set into this hash, not limited by VM or physical memory. This is why the persist option was chosen.
Actual Observations:
- The job fails with an OutOfMemoryError (see attached) while loading the tHashOutput. No rows are processed past this point.
- It always fails at approximately the same point, around 1.55 million rows.
- The hash file "in.hash" is never written to disk at any stage. I checked at various times during execution.
- Physical memory usage is large. The job seems to fail when the Java VM max memory setting is reached (see attached).
- The Windows page file doesn't seem to be full. Other applications still operate OK, if a little slowly.
Things I have tried:
- Changing the heap size to various levels between 2 MB (default) and 500 MB. No change observed.
- Setting tHashInput_1 to link with the tHashOutput. No change observed.
- Smaller row sets work if you stay under the 1.5 million row threshold.
- Closing all other apps except for MySQL and Talend Open Studio. No change observed.
- Using tArray components instead, but these have no memory options.
- Using the default hash file location.
I have searched the bug tracker without success. I have also searched the forum and site.
This page is from a guy who was having a similar problem:
http://www.talendforge.org/forum/viewtopic.php?id=4152
I am not sure if this is a design issue, if I am misunderstanding the way this component is supposed to work, or if there is some bug in the tHash component. If anyone can shed any light I would be appreciative.
Cheers,
Danny
4 REPLIES
One Star

Re: tHashOutput and memory

Can anyone tell me what's going on here?
I have an alternate solution using flat files but I would really like to know why I am seeing the memory usage. I have done another scan of the documentation and other than the link above this component seems undocumented.
Cheers,
Danny
Six Stars

Re: tHashOutput and memory

As far as I know, tHash is a memory-only component; it may persist to file only once the job completes successfully.
I think you simply hit VM limits (the exception is raised in Java library code during string handling); maybe on a 64-bit platform (more than 2 GB of address space per process) you could achieve all-in-memory processing.
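
As a rough sanity check on the VM-limit theory, the figures from the first post (1024M heap, failure near 1.55 million rows, about 5 million rows in total) can be turned into a back-of-the-envelope estimate. This is only a sketch: it assumes the whole heap was consumed by cached rows, which overstates the per-row cost, and the class and method names are made up for illustration.

```java
public class HeapEstimate {
    // Average heap bytes per cached row, inferred from the row count at failure.
    // Assumes the entire heap was filled by rows (an upper bound on the real cost).
    static long bytesPerRow(long heapBytes, long rowsAtFailure) {
        return heapBytes / rowsAtFailure;
    }

    // Heap that would be needed to hold every row at that average cost.
    static long heapNeeded(long bytesPerRow, long totalRows) {
        return bytesPerRow * totalRows;
    }

    public static void main(String[] args) {
        long heapBytes = 1024L * 1024 * 1024; // 1024M heap, from the first post
        long rowsAtFailure = 1550000L;        // approximate failure point
        long totalRows = 5000000L;            // approximate table size

        long perRow = bytesPerRow(heapBytes, rowsAtFailure);
        long needed = heapNeeded(perRow, totalRows);
        long mb = 1024L * 1024;

        System.out.println("approx bytes per row: " + perRow);
        System.out.println("approx heap for all rows (MB): " + needed / mb);
        // Heap ceiling of the JVM running this check, for comparison.
        System.out.println("max heap of this JVM (MB): " + Runtime.getRuntime().maxMemory() / mb);
    }
}
```

Under those assumptions the estimate comes to roughly 700 bytes per row and a little over 3 GB for the full table, which would not fit in a 2 GB per-process address space.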
One Star

Re: tHashOutput and memory

Thanks mate for the reply.
That seems very limiting with regard to Talend's ability to handle truly large data sets, if that is the case.
Cheers,
Danny
Six Stars

Re: tHashOutput and memory

Talend can handle big volumes when designs are all in-flow (no caching of big recordsets)...
Anyway, I completely agree that Talend should provide better all-in-memory components (e.g. a caching component that automatically compresses data when storing it in RAM, because DB data are usually highly redundant).
That said, when needed I usually use an embedded Java database such as Derby or HSQLDB to cache the main database tables (I don't trust typeless flat files...) if I have to avoid many re-reads of big networked databases; and if very fast performance is needed, I place such databases on a RAM disk.
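
To illustrate the general idea of caching rows on disk rather than on the heap, here is a minimal sketch using only java.io. It is a stand-in, not Derby or HSQLDB and not what tHashOutput does internally; the class name, the row shape (a long key plus one string) and the file layout are all hypothetical. Rows are written with typed fields and read back one at a time, so memory use stays flat regardless of row count.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class DiskRowCache {
    // Stream rows to a binary file; only the current row is held in memory.
    static void writeRows(File file, long count) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            for (long id = 0; id < count; id++) {
                out.writeLong(id);                  // typed key column
                out.writeUTF("observation-" + id);  // typed string column
            }
        }
    }

    // Read the rows back one by one and return how many were found.
    static long countRows(File file) throws IOException {
        long rows = 0;
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            while (true) {
                try {
                    in.readLong();   // key
                } catch (EOFException endOfFile) {
                    break;           // clean end of file at a row boundary
                }
                in.readUTF();        // string column
                rows++;
            }
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        File cache = File.createTempFile("rows", ".bin");
        try {
            writeRows(cache, 100000L);
            System.out.println("rows read back: " + countRows(cache));
        } finally {
            cache.delete();
        }
    }
}
```

An embedded database adds indexing and SQL lookups on top of this, which a sequential file like the one above does not provide.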

bye