One Star

Unable to read Excel files generated by another tool

I have a collection of Excel (XLS) files that were exported by a Researcher's statistical package. This data contains data that is protected by HIPPA (so I am not able to share an original copy of a file).
At present, altering their original form (or the process that created them) is not an available solution.
I have tried a very simple job...
 -- row1 (main) --> 

...I'm given the following error/stack output while trying to execute the job in TOS_DI-Win32-r97954-V5.3.0M2
Exception in component tFileInputExcel_1
java.lang.ArrayIndexOutOfBoundsException
at java.lang.System.arraycopy(Native Method)
at jxl.biff.StringHelper.getUnicodeString(StringHelper.java:189)
at jxl.read.biff.WriteAccessRecord.<init>(WriteAccessRecord.java:50)
at jxl.read.biff.WorkbookParser.parse(WorkbookParser.java:820)
at jxl.Workbook.getWorkbook(Workbook.java:271)
at lpc_etl.testreadchartdata_0_1.TestReadChartData.tFileInputExcel_1Process(TestReadChartData.java:909)
at lpc_etl.testreadchartdata_0_1.TestReadChartData.runJobInTOS(TestReadChartData.java:1647)
at lpc_etl.testreadchartdata_0_1.TestReadChartData.main(TestReadChartData.java:1515)

I tried following the recommendation mentioned in 0014919 (updating the library for jxl). The existing library appeared to match the latest available on the jxl website, http://sourceforge.net/projects/jexcelapi/files/jexcelapi/, 2.6.12 (judging by file size). This did not fix the problem.
I have absolutely no problems opening the file in MS Excel (2010).
I have noticed that the file received from the statistical process output ("Output.xls") is about double the file size of the same file that results when doing a save as to "Book1.xls", and saving it as the same format (XLS, 97-2003)... what should essentially be just making a copy of the file. The new "Book1.xls" file can be read by TOS without issue (same schema, just simply pointing it at the original and getting the error, or this new copy and no error).
Additionally, choosing the same "sheet1" from the original (Output.xls) and saving it to CSV, and doing the same with the saved-as copy (Book1.xls) and the same "sheet1" and saving it to CSV as well... it results in files of identical size (to the byte). This tells me the actual data content in both are identical, but that there is some additional data in the file that is the cause for the error.
Does anyone have any thoughts or solutions as to how I can read these files with TOS DI? Is there another library I can swap out? Another way of reading the data? These files are around 50MB and I would hate to require a data specialist to have to open and save-as each before running an processing job for a collection of them each time.
I thought it might have something to do with the encoding (double-byte character encoding causing the file size to be almost doubled)... but it seems hard to believe that the jxl library is unable to read that. I played with the encoding under the advanced settings of the tFileInputExcel_1 file (to no avail).
Any assistance would be greatly appreciated.
9 REPLIES
Seventeen Stars

Re: Unable to read Excel files generated by another tool

The jxl library is not the best for reading Excel files. I recommend Apache POI.
I have created a bunch of components tFileExcelWorkbook, tFileExcelSheetInput, tFileExcelSheetOutput and tFileExcelSheetList. All of them are available in Talend Exchange.
Try this components, they alsways use the Apache POI library.
Apache claims for it, they can read nearly everything and keep untouched what they cannot parse.
One Star

Re: Unable to read Excel files generated by another tool

Thank you so much, that is a great idea... those components (assuming they'll be able to read the files) should work perfect!
That being said, I went through and downloaded and installed them, but now TOS is complaining about the following file being missing...
FileExcelPOI-1.7.0-jar-with-dependencies.jar

That file was in the zip, and I found that when TOS processed the component folder it did place the file in the same folder as the rest of the component's files.
C:\Talend\TOS_DI-Win32-r97954-V5.3.0M2\plugins\org.talend.designer.components.localprovider_5.3.0.M2_r97954\components\ext\user\tFileExcelWorkbookOpen

I wonder why it is still complaining about this and giving me an error. Any ideas?
One Star

Re: Unable to read Excel files generated by another tool

Info from log...
!ENTRY org.talend.platform.logging 1 0 2013-03-01 11:52:52.136
!MESSAGE 2013-03-01 11:52:52,136 INFO org.talend.commons.exception.CommonExceptionHandler - The download URL for FileExcelPOI-1.7.0-jar-with-dependencies.jar is not available

!ENTRY org.talend.platform.logging 1 0 2013-03-01 11:58:15.462
!MESSAGE 2013-03-01 11:58:15,462 INFO org.talend.commons.exception.CommonExceptionHandler - The download URL for FileExcelPOI-1.7.0-jar-with-dependencies.jar is not available

!ENTRY org.talend.platform.logging 1 0 2013-03-01 11:58:44.758
!MESSAGE 2013-03-01 11:58:44,758 INFO org.talend.commons.exception.CommonExceptionHandler - Missing jars:FileExcelPOI-1.7.0-jar-with-dependencies.jar
!STACK 0
org.talend.commons.exception.BusinessException: Missing jars:FileExcelPOI-1.7.0-jar-with-dependencies.jar
at org.talend.designer.runprocess.java.JavaProcessorUtilities.handleMissingJarsForProcess(JavaProcessorUtilities.java:546)
at org.talend.designer.runprocess.java.JavaProcessorUtilities.sortClasspath(JavaProcessorUtilities.java:470)
at org.talend.designer.runprocess.java.JavaProcessorUtilities.computeLibrariesPath(JavaProcessorUtilities.java:323)
at org.talend.designer.runprocess.DefaultRunProcessService.updateLibraries(DefaultRunProcessService.java:191)
at org.talend.designer.runprocess.RunProcessService.updateLibraries(RunProcessService.java:167)
{snip}
Seventeen Stars

Re: Unable to read Excel files generated by another tool

Yes I have an idea. It is a very annoying bug in Talend Studio. Please try to install the component again. Sometimes the Generation Engine does not work well.
Please delete the file configuration/ComponentCache.javacache and restart studio.
I have always this problem and my solution is to delete this file every time I start Studio (I tweaked the start script in Linux).
Did you have installed the component over Talend Exchange or by copy into the user components folder?
One Star

Re: Unable to read Excel files generated by another tool

I've tried several times now, closed TOS, deleted the custom components folders, deleted the cache file you mentioned, restarted TOS. As soon as I drag a tFileExcelWorkbookOpen component onto the designer, it shows up with a red x and the error stating that the jar file is missing.
I even tried deleting the temporary job I had created to try these new components, emptied TOS's recycle bin, and recreated the simple job... still getting the error.
One Star

Re: Unable to read Excel files generated by another tool

I followed the instructions given here, https://help.talend.com/search/all?query=Installing+a+custom+component&content-lang=en (copied to the user components folder I specified and let TOS handle migrating them into the application structure).
One Star

Re: Unable to read Excel files generated by another tool

For any others reading this, my solution thus far (since I'm also creating a c# front end for this job) is to convert the XLS files into CSV files using this library, http://exceldatareader.codeplex.com/
Then just run the job with a delimited file input instead (and avoid the Excel/Talend issue altogether).
One Star

Re: Unable to read Excel files generated by another tool

Any update/insight as to how I might fix this?
One Star

Re: Unable to read Excel files generated by another tool

I just came across this blog...
http://bekwam.blogspot.com/2011/01/java-libraries-in-talend-open-studio.html
...might it help with the missing library issue?