One Star

Reading multiple Worksheets from the same Excel file

Hello,
Here is another newbie issue.
I've a spreadsheet with 4 worksheets that have exact column layouts. It has 4 sheets mainly because of data volume exceeding what normal excel (pre-2007) can handle. I want to read all 4 sheets and combine them in a csv file.
I tried this with 2 excel inputs reading the same file but different sheets and used a tUnite. TOS takes a long time and then reads one input. Then it tries to read the next one and fails. I also see that perl.exe at this point is using almost 2 GB of memory (I've 3.25 GB on the machine). It fails with following error:
Starting job m_FL_Combine_BRC_Layouts at 23:07 11/03/2008.
connecting to socket on port 4127 ...
connected
Job m_FL_Combine_BRC_Layouts ended at 23:12 11/03/2008.
Each sheet has 20K and 30K rows. I can read them individually OK. May be tUnite is not the way to go but I am not sure what else. I cannot use iterate as it is the same physical file. I am running on WinXP, Office XP, PERL 5.8 and TOS 2.3.1.
Here is picture attached also.
Regards,
Sean
Tags (1)
14 REPLIES
One Star

Re: Reading multiple Worksheets from the same Excel file

I guess I do not know how to add an image. It is too late to figure this out today.
Community Manager

Re: Reading multiple Worksheets from the same Excel file

Hi
I guess I do not know how to add an image. It is too late to figure this out today.

When you edit your post, there is a group' Image Upload', you can upload some screenshot from local machine.(see the screenshot)
Best regards
shong
----------------------------------------------------------
Talend | Data Agility for Modern Business
One Star

Re: Reading multiple Worksheets from the same Excel file

Hello Sean,
you can try to read one sheet and write an temporary outputfile. Than do this again with the next sheet. At output use option append.
Bye
Volker
Employee

Re: Reading multiple Worksheets from the same Excel file

I've investigated on your topic and I have several conclusions:
- don't use N tFileInputExcel, but a single one with an iterate + tUnite (see screenshots)
- with a sheet of 50,000 rows with 14 columns (a 3 sheets file weights 32MB), tFileInputExcel uses 1.1GB in memory (that's huge). The whole file is loaded even if you just want to read the first sheet. As you want to read 3 sheets, the whole file is loaded 3 times. Total execution time is 743s (12 minutes).
- if you could export each sheet in a CSV file, the whole job would only take 43 seconds.
An improvement would be to read a list of sheets instead of a single sheet. May I ask you to create a feature request?
One Star

Re: Reading multiple Worksheets from the same Excel file

Thanks for your replies. I'll try both the plegall's solution and Volker's workaround.
It is coincidental that the worksheet I am working with is also 32MB. It has many more sheets but I want to read only first 4 or so.
I can convert this excel to a CSV but this is not an option with these large files. Each CSV will have to be saved separately which is too cumbersome.
One more quick question: Can one create an XL meta data same as for a CSV file or a table?
Thanks and I'll report back on this thread my findings. I'll also create a feature request for reading multiple worksheets within an excel file.
Regards,
Sean
Employee

Re: Reading multiple Worksheets from the same Excel file

One more quick question: Can one create an XL meta data same as for a CSV file or a table?

507
One Star

Re: Reading multiple Worksheets from the same Excel file

I tried plegall's method on my 32MB Excel file and it worked fine. It read all the 5 sheets for a total of 89K rows. It took 25 minutes but it is done.
Starting job m_FL_Combine_BRC_Layouts at 22:25 12/03/2008.
connecting to socket on port 3855 ...
connected
disconnected
Job m_FL_Combine_BRC_Layouts ended at 22:49 12/03/2008.
Thank you as always.
Regards,
Sean
Employee

Re: Reading multiple Worksheets from the same Excel file

I have added 3359
Still on my 3 sheets Excel file, the total duration goes from 12 minutes to 4 minutes (so we understand that loading the file takes the biggest part of the job time).
One Star

Re: Reading multiple Worksheets from the same Excel file

Really really nice! This will help me a lot. Thanks for doing this as I got into something else and did not get a chance to open this enhancement request.
Do you know around when you will be releasing 2.4? I am going to put the current solution in production but I'll do it in such a way that I can upgrade the moment we have 2.4 GA. The other issue I am having is that the Excel output loses the leading zeros on numbers formatted as strings (yes, I am getting greedy here :-)
Again, thank you for taking care of this.
Regards,
Sean
Employee

Re: Reading multiple Worksheets from the same Excel file

Do you know around when you will be releasing 2.4? I am going to put the current solution in production but I'll do it in such a way that I can upgrade the moment we have 2.4 GA.

Our release period currently last 4 months. 2.3.0 was out in 2008, February 2.4.0 should be out on 2008, June. Nothing is official, we may change this date.
I'm sure you'll find it's a long time to wait, so I suggest you to copy tFileInputExcel files from 12457 and relaunch TOS.
The other issue I am having is that the Excel output loses the leading zeros on numbers formatted as strings (yes, I am getting greedy here :-)

One question, one topic. Please create a new topic :-)
One Star

Re: Reading multiple Worksheets from the same Excel file

I'll copy the tFileInputExcel from your link. 300% speed-up is too compelling. I'll post my experience here.
For the other issue, point taken. I do have a bug tracker open on that!
Regards,
Sean
One Star

Re: Reading multiple Worksheets from the same Excel file

Sorry for this newbie question.
I tried to update my TOS install with the new files on TRAC. . It was clear as to where to install 4 of the files except for:
plugin.xml and
MoveExcelSheetnameToSheetList.java.
So I put them in the same directory as the other files which for me is:
J:\Talend\TOS-All-r11699-V2.3.1\plugins\org.talend.designer.components.localprovider_2.3.1.r11699\components\tFileInputExcel\
It obviously is wrong as I could not find any pre-existing files of this name there and I could not load this component in TOS. Would you please tell me what I am doing incorrect? Is the Java file to fix earlier mappings that use the older component? I am still getting familiar with TOS structure.
Thanks
Sean
Employee

Re: Reading multiple Worksheets from the same Excel file

I tried to update my TOS install with the new files on TRAC. . It was clear as to where to install 4 of the files except for:
plugin.xml and
MoveExcelSheetnameToSheetList.java.

You don't need them.
It is the migration task that moves the previous "sheetname" to the first line of the "sheetlist". You'll simply have to do it manually.
One Star

Re: Reading multiple Worksheets from the same Excel file

I was able to run with this change. And my results are fantastic (almost 7 to 8 times improvement).
Starting job m_FL_Combine_BRC_Layouts_New at 22:10 20/03/2008.
Job m_FL_Combine_BRC_Layouts_New ended at 22:13 20/03/2008.
This is the same job that took 25 minutes. Now it is done in about 3 minutes.
Again, fantastic product and a great support team/community. Thank you!
Regards,
Sean