Heap space / GC balancing act trying to get any job to run

One Star

Heap space / GC balancing act trying to get any job to run

I have a job I've built at Talend support's suggestion to test their product out, and neither they nor I seem to be able to find any way to get it to run.
It has two components, a tPostgresqlInput and a tFileOutputDelimited. It reads in one table (properly defined in the metadata), and it's supposed to dump it to the output file.
I'm using TOS 4.0.2.r43696 and am working with this as a Java project rather than Perl (I prefer Perl, but I can't seem to do anything on the Perl side of Talend that doesn't completely cripple my workstation without producing any output). My workstation is a Core i5 2.5GHz with 2GB of RAM and 380GB of free hard drive space.
The table being read has 240,380 rows and 255 columns.
At the default heap space settings, the job almost immediately dies with an OOM error saying it's run out of heap space.
I increased the heap settings to a starting value of 768M and a max of 2G and ran the job again; instead of the heap space error, I got this one:
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded

Searching for a solution to that error (which, as I understand it, means the JVM is spending nearly all of its time in garbage collection while reclaiming almost no memory), I found a thread somewhere on these forums suggesting I might be able to alleviate it by adding these parameters to the job's JVM settings:
-XX:+UseConcMarkSweepGC
-XX:+CMSIncrementalMode
-XX:SurvivorRatio=16
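So the full set of JVM arguments I'm running the job with is roughly this (as entered in the job's JVM settings):
-Xms768m
-Xmx2048m
-XX:+UseConcMarkSweepGC
-XX:+CMSIncrementalMode
-XX:SurvivorRatio=16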
With those added, I'm now getting heap space errors again.
I know I can load all of those rows into RAM without a problem, because I've done it in my own Perl script, in CloverETL, and in Pentaho Data Integration without any issues, so I'm left without any explanation as to what's actually preventing me from running even this very basic job in TOS.
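In case it's relevant, here's my back-of-envelope estimate of what this table could cost if it's all buffered as Java strings (the 64 bytes per value is purely my guess for short values plus String/char[] object overhead, which is far larger than the raw data):

// Rough estimate of the heap needed to hold the full table at once.
// Assumption: ~64 bytes per column value as a java.lang.String.
public class MemEstimate {
    public static void main(String[] args) {
        long rows = 240380L;
        long cols = 255L;
        long bytesPerValue = 64L; // my assumption, not a measurement
        long mb = rows * cols * bytesPerValue / (1024L * 1024L);
        System.out.println(mb + " MB"); // prints 3741 MB -- well past a 2G heap
    }
}

If that guess is anywhere near right, buffering the whole table in the JVM was never going to fit, which makes me think the component must not be streaming rows the way I'd expect.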
Can anyone suggest somewhere else I can look to find the cause of these errors, or a potential work-around? Talend's best suggestion so far is that the JDK I'm using (the OpenJDK 1.6.0_18 that installs from Ubuntu 10.04's repositories, rather than Sun/Oracle's JDK) is the cause of the problem, but I'd assume that if that were the case, I'd see similar behavior from the other tools I've tested, since they're also big Java apps running their jobs through Java. I'm willing to be proven wrong, so feel free to correct me...
I'm really running out of ideas here, and Talend's pre-sales support hasn't been able to track down a cause for this or for my Perl project's performance troubles so far, so if anyone could help out in some way I'd really appreciate it.
One Star

Re: Heap space / GC balancing act trying to get any job to run

UPDATE: So, while waiting I swapped out OpenJDK for Sun's Java, and if you can believe it, the performance got WORSE.
Now, while the job is running, I can't use my workstation because Ubuntu greys out every window I switch to (its behavior when it decides an app isn't responding), and I STILL run out of heap space.
Also, now when the error is thrown, Talend can't (or doesn't) kill the job, and I have to kill it manually to free up the chunk of RAM it's using. It's not that the job keeps running; my traces and monitors show no further network activity from my box after the heap space error.
Employee

Re: Heap space / GC balancing act trying to get any job to run

Did you increase the Xmx parameter for your job?
Can you post a full screenshot of your job?
If you use a tMySqlInput component, check the "stream" option in the advanced settings.
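To be clear about what that option does: with MySQL, "stream" makes the generated code ask the driver to return rows one at a time instead of buffering the whole result. Roughly like this plain-JDBC sketch (not the actual generated code):

import java.sql.*;

class StreamSketch {
    static ResultSet streamAll(Connection conn, String query) throws SQLException {
        // Forward-only + read-only + fetch size Integer.MIN_VALUE is the
        // MySQL Connector/J convention for row-by-row streaming.
        Statement st = conn.createStatement(
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
        st.setFetchSize(Integer.MIN_VALUE);
        return st.executeQuery(query);
    }
}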
One Star

Re: Heap space / GC balancing act trying to get any job to run

I already stated that I increased the Xmx. I've tried it at 2G and 1G. Currently I have it at 1G so my machine swaps less.
I've attached a screenshot of the job. Like I said before, it's just a postgres table input and the delimited file output. No other components, at all.
I don't see a "stream" option in the advanced settings of the Postgres table input. I've tried checking the "Use cursor" checkbox per tech support's suggestion, but it didn't change the results.

I put together a table with only 100 rows to test with, and my job DID run to completion, but obviously I need to deal with more than 100 rows. We have at least 10 tables of 20 million rows or more that will need to be handled by jobs in whichever ETL tool we decide to purchase, and of the 150 or more tables that will be managed by our jobs, only 5 have fewer than 100,000 rows.
Note: the warning icon on the table input says:
Parameter (Query): schema is different from the query.

This could be because, when I retrieved the schema to build the metadata for this table, I only selected certain columns rather than all 255 columns in the table. If that wouldn't cause the warning, I wouldn't begin to know what would, as everything beyond that was 100% Talend's own code.
Employee

Re: Heap space / GC balancing act trying to get any job to run

Strange issue: the cursor option on such a job should be enough to keep it from consuming so much memory.
After how many rows does your job start to reach the memory limit?
One Star

Re: Heap space / GC balancing act trying to get any job to run

Assuming Talend displays the number of rows read while it's working with large data sets, like it did when I ran the 100-row data set, then it doesn't read a single row.
Nothing ever gets displayed on the UI showing that any rows have been read. Is there another way I can run my job so I can see that information?
EDIT: I just double-checked. When it runs out of heap space, the text it displays is "Starting".