Cannot read xlsx files containing close to 1m rows of data

One Star

Cannot read xlsx files containing close to 1m rows of data

Hi Guys,
I created a job where I have to read an xlsx file (containing 987479 rows of data) and upload the same to the database.
Following is the error I receive when I run the job:
Starting job tet at 13:15 28/04/2011.
 connecting to socket on port 3548
connected
Exception in thread "main" java.lang.Error: java.lang.OutOfMemoryError: Java heap space
at test.tet_0_1.tet.tFileList_1Process(tet.java:1833)
at test.tet_0_1.tet.runJobInTOS(tet.java:2007)
at test.tet_0_1.tet.main(tet.java:1878)
Caused by: java.lang.OutOfMemoryError: Java heap space
at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.attr(Cur.java:3039)
at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.attr(Cur.java:3060)
at org.apache.xmlbeans.impl.store.Locale$SaxHandler.startElement(Locale.java:3250)
at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.reportStartTag(Piccolo.java:1082)
at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseAttributesNS(PiccoloLexer.java:1802)
at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseOpenTagNS(PiccoloLexer.java:1521)
at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseTagNS(PiccoloLexer.java:1362)
at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXMLNS(PiccoloLexer.java:1293)
at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXML(PiccoloLexer.java:1261)
disconnected
at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.yylex(PiccoloLexer.java:4808)
at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yylex(Piccolo.java:1290)
at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yyparse(Piccolo.java:1400)
at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.parse(Piccolo.java:714)
at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3439)
at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1270)
at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1257)
at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345)
at org.openxmlformats.schemas.spreadsheetml.x2006.main.WorksheetDocument$Factory.parse(Unknown Source)
at org.apache.poi.xssf.usermodel.XSSFSheet.read(XSSFSheet.java:127)
at org.apache.poi.xssf.usermodel.XSSFSheet.onDocumentRead(XSSFSheet.java:119)
at org.apache.poi.xssf.usermodel.XSSFWorkbook.onDocumentRead(XSSFWorkbook.java:222)
at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:200)
at org.apache.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:172)
at org.apache.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:188)
at test.tet_0_1.tet.tFileList_1Process(tet.java:1225)
... 2 more
Job tet ended at 13:16 28/04/2011.

Files with less no of rows and less in size can process data with ease.
Please let me know as to how I can read files with large amount of data in it. Also FYI I have tried increasing the heap size, but it dint help.
Help will be much appreciated.
Thanks
One Star

Re: Cannot read xlsx files containing close to 1m rows of data

Please check the attached image, to get a better understanding of my job. Thanks

Re: Cannot read xlsx files containing close to 1m rows of data

sweet jesus. While I understand you may have no choice, storing and transporting data in Excel is a road to pain and suffering for the ETL developer. If possible, get your data in a different format.
It looks like the error is arising as you read the file. This is done by an external library, so you may have limited control on how this is done. Have you tried increasing the heap space of the job from the default of 1G?
One Star

Re: Cannot read xlsx files containing close to 1m rows of data

Hi John,
Thanks for your reply. I had tried increasing the heap space, but it dint help. Smiley Sad

Re: Cannot read xlsx files containing close to 1m rows of data

If you dont need to load files like that in an automated fashion, try opening it in Excel and saving it as a CSV. Then use a tFileInputDelimited to read the file.
One Star

Re: Cannot read xlsx files containing close to 1m rows of data

I had tried that also. But practically I ll be wasting a lot of time, as I have atleast 400+ files with similar kind of data..! Smiley Happy

Re: Cannot read xlsx files containing close to 1m rows of data

outch. Keep upping the memory and eventually you should be able to read the file. If not, you'll have to collect the files that you cannot read and do something special for them (like converting to CSV or using a different excel library and reading with custom code)
One Star

Re: Cannot read xlsx files containing close to 1m rows of data

essayes avec cette option
-Xmx250m
Specify the maximum size, in bytes, of the memory allocation pool. This value must a multiple of 1024 greater than 2MB. Append the letter k or K to indicate kilobytes, or m or M to indicate megabytes. The default value is 64MB. Examples:
-Xmx83886080
-Xmx81920k
-Xmx80m
One Star

Re: Cannot read xlsx files containing close to 1m rows of data

have you found a solution?
Same problem for me
thank in advance
bye