Five Stars

Writing 260 columns to Excel

Hello,

 

I'm extracting 260 columns of data from a SQL Server database and trying to write them to an existing Excel file with headers.

 

I am using tFileOutputExcel component.

 

My Run Parameters are like this:

-Xms256M

-Xmx2048M

 

and in Window -> Preferences -> Talend -> Run/Debug, I set:

-XX:-UseGCOverheadLimit
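
For reference, JVM options are case-sensitive, and the GC overhead flag takes a leading minus sign to disable the check. The standard forms of these settings (generic JVM flags, not anything Talend-specific) are:

```
-Xms256M
-Xmx2048M
-XX:-UseGCOverheadLimit
```

Note that `-XX:-UseGCOverheadLimit` only suppresses the "GC overhead limit exceeded" variant of the error; if the job genuinely needs more memory than the heap allows, it will still fail, just with a plain OutOfMemoryError.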

 

What else can I do to make this work? I have to write to an existing file since other tabs are filled with data from other processes.

 

Thanks,
Bee

 

The job fails with the error message:

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at com.sun.org.apache.xerces.internal.dom.AttributeMap.setNamedItem(AttributeMap.java:121)
at com.sun.org.apache.xerces.internal.dom.DeferredElementNSImpl.synchronizeData(DeferredElementNSImpl.java:138)
at com.sun.org.apache.xerces.internal.dom.ElementNSImpl.getNamespaceURI(ElementNSImpl.java:250)
at org.apache.xmlbeans.impl.store.Locale.loadNode(Locale.java:1420)
at org.apache.xmlbeans.impl.store.Locale.loadNodeChildren(Locale.java:1403)
at org.apache.xmlbeans.impl.store.Locale.loadNode(Locale.java:1445)
at org.apache.xmlbeans.impl.store.Locale.loadNodeChildren(Locale.java:1403)
at org.apache.xmlbeans.impl.store.Locale.loadNode(Locale.java:1445)
at org.apache.xmlbeans.impl.store.Locale.loadNodeChildren(Locale.java:1403)
at org.apache.xmlbeans.impl.store.Locale.loadNode(Locale.java:1445)
at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1385)
at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1370)
[statistics] disconnected
at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:370)


Employee

Re: Writing 260 columns to Excel

What part of the job is taking the most time and loading the most data into memory? Post a screenshot of your job.

Five Stars

Re: Writing 260 columns to Excel

The job is very simple. It has a tMSSQLInput component which does a select <columns> from table and outputs to Excel.

 

 

Seven Stars

Re: Writing 260 columns to Excel

Does it run out of memory if you overwrite the spreadsheet file instead of appending to an existing one?

Six Stars

Re: Writing 260 columns to Excel

Basically it's a memory issue. You need to increase your runtime parameters to the highest extent that your computer's resources allow!

Seven Stars

Re: Writing 260 columns to Excel

Pretty certain this will be down to the Excel output component loading the entire existing spreadsheet into memory in order to append the new rows.

 

The database input component will only be operating on relatively small batches of records (probably 1,000) at a time, so that won't be the problem. The basic file output components can write their output directly to the file on disk, again requiring very little memory. Excel files, especially multi-sheet ones, are much more structured than this: rows can't simply be appended to the end of the file, they need to be added within the body of the file, and that is done in memory.
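
To illustrate the difference with a plain-Java sketch (ordinary file I/O here, not Talend's actual internals): appending to a flat file streams each row straight to disk with constant memory, while inserting rows into the body of an existing file forces the whole file plus the new data into memory first.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class StreamVsMerge {

    // Streaming: each row goes straight to disk and is then forgotten, so
    // memory use is constant regardless of row count. This is analogous to
    // writing a brand-new delimited file.
    static void streamRows(Path out, Iterator<String> rows) throws IOException {
        try (BufferedWriter w = Files.newBufferedWriter(out)) {
            while (rows.hasNext()) {
                w.write(rows.next());
                w.newLine();
            }
        }
    }

    // Merging: to insert rows *inside* an existing file, the whole file plus
    // all of the new rows must be held in memory before it can be rewritten.
    // Memory use grows with both the existing file and the new data set.
    static void mergeRows(Path file, List<String> newRows) throws IOException {
        List<String> all = new ArrayList<>(Files.readAllLines(file)); // entire file in memory
        all.addAll(1, newRows); // insert after the header line
        Files.write(file, all);
    }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("demo", ".txt");
        streamRows(f, List.of("header", "row1").iterator());
        mergeRows(f, List.of("rowA", "rowB"));
        System.out.println(Files.readAllLines(f)); // [header, rowA, rowB, row1]
    }
}
```

An Excel file is far more structured than the line-oriented file above, so the "merge" path is the only one available when appending to an existing spreadsheet.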

 

At a basic level, in the short term, increasing the memory may well help, but if this Excel file is going to keep getting bigger, then the problem is just going to rear its head again soon, and you'll then be out of options.

 

Excel files aren't particularly well suited to massive volumes of data, and if you're exhausting the 2 GB+ of memory your Java config is allowing, that suggests a lot of data and/or a big spreadsheet.

 

If I were you, I'd be looking for an alternative to embedding the data in this spreadsheet. I'm sure if you were to provide a little more background to your requirements, the community members will be able to offer suggestions.

Five Stars

Re: Writing 260 columns to Excel

Thank you. The requirement is pretty straightforward and simple, except it is not.

 

I am executing a stored procedure which generates a very wide (260+ columns) dataset. This needs to be written to a pre-formatted Excel template file (which contains 3 other sheets) and which also contains the header row.

 

A direct tMSSQLInput to tFileOutputExcel fails with memory issues, so I am trying to write to a CSV first, hoping to push the data to Excel from there.

But it looks like there is no luck at all.

 

Any idea how I can do this? The number of columns is 260. The number of rows is around 2,000.

Seven Stars

Re: Writing 260 columns to Excel

If you're only outputting around 2,000 rows, then I'm very surprised that you're running out of memory.

 

How big (both in terms of overall file size, and number of rows/columns per sheet etc.) is the template spreadsheet which you're adding the data to?

 

Also, is it an XLSX file, as opposed to an old XLS format one?

Five Stars

Re: Writing 260 columns to Excel

Yes, it is an old xlsx file.

 

The number of columns is 260, and there are around 3,000 rows right now, but that can grow.

 

Are there any other alternatives?

The file with the empty template is around 10 MB.

Ten Stars

Re: Writing 260 columns to Excel

There are some Excel-related components on Talend Exchange (https://exchange.talend.com).

 

Jan Lolling created a set of Excel components and will often respond here if you have questions about their use.  If you can't get the job to work with tFileOutputExcel, maybe you'll have better luck with another one?

Five Stars

Re: Writing 260 columns to Excel

An update on this:

 

I tried to write it to a new xls file without the header - it worked ok.

 

Wrote the complete data to csv and tried to write to a new excel file - that was ok too.

 

Wrote the CSV to Excel without the header - worked fine. Then tried to append the header to said Excel file - failed again.

 

Then tried to write the CSV to the Excel template and increased the Xmx and Xms parameters to 1600M - failed with an out of memory error.

Then increased to 11553M - it loaded the entire data set into memory and just hung - no output to Excel.

 

This is getting very frustrating - and feel there has to be some other solution. 

Rows can vary from 3,000 to 150K. The number of columns is static - 260.

 

 

Seven Stars

Re: Writing 260 columns to Excel

Spent some time investigating over the weekend, and it's essentially as I thought in my original reply - that Talend is trying to load your entire data set into memory in order to merge it with the existing Excel spreadsheet.

 

If you're creating new files/completely overwriting existing ones, then the data is just streamed, and you'll notice that it flows through a relatively small number of records at a time, and you'll see the output file being built on the disk as the job proceeds. The key thing here is that at no point is the entire data set in memory.

 

If however, you try to merge files, Talend needs to load the entire file, and all of the data to be inserted into memory first in order to insert the new content.

 

The number of rows alone isn't the problem; the combination of this and the amount of data in those 260 columns is, and it's exceeding the available memory.

 

In my tests I generated the data using a tRowGenerator with 260 columns, each containing just a six-character random string; this was fine for 3,000 records but failed with 150,000. Bear in mind that all of the columns in each of my rows contained only a small amount of data, and in your case the amount of memory for each row is likely to be much larger.

 

That you've increased the memory available to almost 12 GB (assuming those changes were successful) and it's still being exhausted must mean that you have a lot of data, and so this approach simply isn't going to be an option.
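
A rough back-of-the-envelope calculation suggests why the heap disappears so quickly (the ~100 bytes per in-memory cell is an assumed figure for illustration, not a measured one; real per-cell overhead varies by library and JVM):

```java
public class HeapEstimate {

    // Estimate the heap needed just to materialise the cells of a data set.
    static long estimateBytes(long rows, long cols, long bytesPerCell) {
        return rows * cols * bytesPerCell;
    }

    public static void main(String[] args) {
        // Assumed figure: ~100 bytes per in-memory cell (object header,
        // cell value, style reference, etc.).
        long totalBytes = estimateBytes(150_000, 260, 100);
        System.out.printf("~%.1f GB just for the cells%n",
                totalBytes / (1024.0 * 1024 * 1024));
    }
}
```

At that assumed overhead, 150K rows of 260 columns comes to roughly 3.9 billion bytes of cell objects alone, before counting the template, the XML parsing, or anything else, so even a 12 GB heap is not implausible to exhaust.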

 

I did try outputting the data to an Excel file (not merging), which generated a file of around 330 MB, then reading in the first page of the original template and trying to append this to the newly created Excel file with all the data in it, but it still has to load the entire spreadsheet into memory to do the merge, and I get the same issue. In my case, for testing, I'd stuck with the memory defaults, as I was more interested in the differences between the approaches I was trying.

 

I can confirm that there are no issues with the number of columns being output, I've tested this well beyond the 260 you need.

 

For what it's worth, if you try to open a 330 MB spreadsheet with 150K rows and 260 columns in Excel, even if all of those cells just contain a six-character random string, it will pretty much lock up, so I'm not sure how useful this would be even if it were possible to somehow get all of the data into your template.

 

How would the spreadsheet you're looking to generate be used? I wonder is there an alternative approach for making the data available to your end users?

Five Stars

Re: Writing 260 columns to Excel

The data in the SQL tables is generated by the Talend job and this is the final step in the process - create Excel and email it to clients automatically.

There are around 31 of these reports.

 

This Talend process is replacing a very manual and tedious SAS process which writes to the Excel file without a problem. So, unfortunately, there is no way around the Excel file itself.

 

One workaround I thought of over the weekend is to have a VBScript automatically refresh the Excel file with a prebuilt query after the table is loaded, and to call the VBScript from the command component.

 

However, the challenge is that the dates are dynamic - meaning I should be able to run this for any range of dates. Given that, any idea whether I can pass the date parameters to the query inside Excel?

 

Thanks in advance.

 

 

 

Seven Stars

Re: Writing 260 columns to Excel

It sounds like you may be able to get away with using the Get External Data options on the Excel Data tab to populate the data on your sheet, using a parameterized query with parameters taken from cells within your spreadsheet.

 

Would that be an option?

 

I seem to recall that there are problems (or maybe it was just me!) if you try to add parameters to queries which have been created using Excel's wizards, in that the option to add these simply isn't available, but Excel really isn't my forte, and it's been a while since I've tried this.

 

If the dates used for filtering aren't something the user will be entering manually themselves, and they could be calculated automatically, either based on the system date or contents of the database, then consider creating a Stored Procedure to return the data, and just calling this using Get External Data functionality, which would keep everything nice and simple.

Seven Stars

Re: Writing 260 columns to Excel

Something else you might like to consider, is using Jasper Reports Server or a similar reporting tool to generate the spreadsheets on demand.

 

Your options would then include:

 

  • Giving the clients a portal login, from where they can run the reports on demand, exporting to Excel as required. I often find that users insist they need everything, but rarely read/use it, so you may well find they just browse the output as an on-screen report to get what they need, rather than saving it down as an Excel or PDF file.
  • Scheduling reports to run regularly with dynamic parameters such as rolling date ranges, and the output (Excel, PDF etc.) automatically emailed to a standard recipient list, or better still, have the output files stored in the built-in online repository and a link sent to the user. Benefits with this approach include not having to send out potentially huge files unnecessarily - unless it's internal, most mail servers will reject anything over a certain size (often 10 MB or 20 MB) by default - and having a full history of the exports available to the clients.
  • As above, but triggering report generation from your Talend job. There are tJasper* components under Business intelligence, but I'll be honest, I've never used them. Or you could just use a tHttpRequest to trigger the report instead.

 

Jasper in particular is a pretty robust solution, which I've used previously in similar situations, and the free version will likely do what you need.

 

They have a fully featured Eclipse based report design tool, Jasper Studio, for authoring reports. It has a bit of a learning curve (and is unlikely to be much use for non-technical users) but is similar to Crystal Reports and other band-based reporting tools, and once you get over that curve, it's actually pretty good.

 

One of the things I've found really powerful using Jasper (and SQL Server Reporting Services for that matter) is linking reports to provide drill-down (by passing parameters to sub-reports), which might be useful in this case, considering the potential volumes of data you're talking about. Just create a high level report with totals for periods/departments etc., and set these totals up as hyperlinks to more detailed reports showing the actual individual data rows. It's not difficult to implement, and I've seen finance department staff get ridiculously excited by such things.

Five Stars

Re: Writing 260 columns to Excel

Thank you Chris for taking the time to review it and your responses. Yes, Portal is definitely the way to go and we are heading that way, but in the interim we need automated solutions for generating and delivering these reports.

 

I didn't get a chance to play around with refreshing the Excel file yet. I did convince my higher-ups that writing to a template would be more challenging than anticipated, so we decided to write to an independent Excel file instead. It was a hard sell, considering that the current process can do it in SAS fairly easily.

 

However, I ran into another issue in Talend.

Previously, I could write a 4,000-row, 260-column dataset to a completely new Excel sheet with headers.

 

But with 150K rows and 260 columns, the job fails with OutOfMemory - and increasing the memory just makes it hang.

So it looks like the behavior is the same for a brand-new Excel file as for an existing one?

 

Seven Stars

Re: Writing 260 columns to Excel

No problem.

 

Are you exporting to an existing spreadsheet which just has the headers in it? If so, then you're going to have the same problem. Loading the existing file into memory may only take a few hundred KB, but in order to merge it with the data, the full dataset (which is likely to be huge) must also be loaded into memory, and that is what causes the failure.

 

If you just output to a brand new spreadsheet or overwrite an existing one, then you'll see that the job only processes a few hundred records at a time and outputs these to disk, so very little memory is consumed, and most importantly, this will work no matter how many rows you are processing. Of course, at some point you'll run out of disk space, but that's not a Talend issue.

Five Stars

Re: Writing 260 columns to Excel

New Excel file. I increased the memory to 3000M and it seemed to have worked. It took about 4 minutes to write to the file with headers. I think we can live with this. I'll have to handle some formatting of the columns, but the header will lose its prettiness. The other sheets of the template file will have to be delivered separately.

 

I'll also have to build in some logic to detect the number of rows and, if it exceeds a million, split the data at a logical break and write to separate files. I'm sure we'll run into issues there.
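
A minimal sketch of the splitting arithmetic (the 1,048,576 figure is the XLSX per-sheet row limit; reserving one row per file for a header is an assumption here, and any "logical break" logic would sit on top of this):

```java
public class FileSplitter {

    // Excel's hard per-sheet row limit for the XLSX format.
    static final int EXCEL_ROW_LIMIT = 1_048_576;

    // Reserve one row per file for the header.
    static final int ROWS_PER_FILE = EXCEL_ROW_LIMIT - 1;

    // Which output file (0-based) a given 0-based data row belongs in.
    static int fileIndex(long rowNumber) {
        return (int) (rowNumber / ROWS_PER_FILE);
    }

    // How many files a data set of the given size needs (ceiling division).
    static int filesNeeded(long totalRows) {
        return totalRows == 0 ? 1 : (int) ((totalRows + ROWS_PER_FILE - 1) / ROWS_PER_FILE);
    }

    public static void main(String[] args) {
        System.out.println(filesNeeded(150_000));   // 1
        System.out.println(filesNeeded(2_000_000)); // 2
        System.out.println(fileIndex(1_048_575));   // 1 (first row of the second file)
    }
}
```

In a Talend job this kind of counter could drive a dynamic file name, so each chunk streams to its own new spreadsheet and the merge problem never arises.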

 

I dabbled with Jasper Reports a while back - I'll take another look. But won't we run into similar issues, considering they all use the same Excel engine? Is BIRT any better?

 

Thank you very much for all your help.


Seven Stars

Re: Writing 260 columns to Excel

 

I suggested an approach to splitting large numbers of database records into more manageable batches for processing on another thread, which may be useful when you get around to splitting your data into multiple files:

 

https://community.talend.com/t5/Design-and-Development/tFileOutputDelimited-split-output-in-multiple...

 

Regarding Jasper - it would be generating the entire spreadsheet, rather than merging into an existing template, and so you shouldn't have memory issues like this. As far as I'm aware, it's only the Excel output library which is common between Talend and the Jasper engine, and the issues are down to Talend loading the entire data set at once, rather than problems with the actual output side of things.

 

That said, if you were considering using Jasper Report Server for this, it would certainly make sense to try a very basic example with your full dataset before going any further.