tFileInputExcel with more than approx. 2500 rows gives stackoverflow

One Star

tFileInputExcel with more than approx. 2500 rows gives stackoverflow

Does anyone else have problems with excel as input? It seems to be the Pattern (regex) which gives a stackoverflow when there is a bit more than 2500 rows. I have 7000 rows, but the error always comes after approx. 2500 rows have been processed. Exporting to CSV and doing the same parsing gives me no problems. As a sidenote, it seems much faster to process CSV.
Moderator

Re: tFileInputExcel with more than approx. 2500 rows gives stackoverflow

Hi,
How do you set your component tFileInputExcel? The purpose of Using Regex is that: select this check box if you want to use a regular expression to filter the sheets to process. Would you give us your screenshot for your job?
Best regards
Sabrina
--
Don't forget to give kudos when a reply is helpful and click Accept the solution when you think you're good with it.
One Star

Re: tFileInputExcel with more than approx. 2500 rows gives stackoverflow

Hello
I dropped it again due to the problem, and went for CSV.
I was of course aware of not enabling or using regexp anywhere in this test.
The scenario is simple: Make an excel with 7000 rows, and lets say, five columns. Content can be anything, and even the same for each row. Read it in and just use logRow.
Regarding patterns it is specific to this component. I have since used the same data from CSV, where I converts it through tReplace, through double regexp's in a tMap, and have also used tAggregateRow. No problems in getting through all records.
Moderator

Re: tFileInputExcel with more than approx. 2500 rows gives stackoverflow

Hi,
Regarding patterns it is specific to this component. I have since used the same data from CSV, where I converts it through tReplace, through double regexp's in a tMap, and have also used tAggregateRow

You must be designed a job, would you minding uploading screenshot to us(especially the tMap). From your description, is the job flow tFileinputdelimited-->tReplace-->tMap-->tAggregateRow-->tLogrow, right? All is fine in .csv file but not excel? Need more info from you, thanks alot!
Best regards
Sabrina
--
Don't forget to give kudos when a reply is helpful and click Accept the solution when you think you're good with it.
One Star

Re: tFileInputExcel with more than approx. 2500 rows gives stackoverflow

I am seeing a similar issue. I am loading a date dimension from an excel spreadsheet and the tFileInputExcel fails after about 2600 records.
One Star

Re: tFileInputExcel with more than approx. 2500 rows gives stackoverflow

Thank you jmagana
To xdshi. There is nothing more to it than I stated. No advanced designs needed. Just try it.
Community Manager

Re: tFileInputExcel with more than approx. 2500 rows gives stackoverflow

Hi Jojs
For testing, I am reading 10000 rows from excel file on v5.2.1 and it works, which version are you using? Do I miss something to reproduce the problem? What do you mean "Pattern (regex) which gives a stackoverflow"?
Shong
----------------------------------------------------------
Talend | Data Agility for Modern Business
One Star

Re: tFileInputExcel with more than approx. 2500 rows gives stackoverflow

Hello
It is excel 2007 (xlsx). I attach a job which fails, including test-data "test material.xlsx". (Ups, I can't attach zip-files?, well, the explanation below should be enough)
This test was run on mac OSX 10.8.2 with TOS 5.2.1.r95162
I did one additional test, where I copied only formats, numbers, dates and text into a new sheet, and deleted the original sheet. Now I can read all records, and it is much faster. This sheet is named "test material2.xlsx"
The sheet have the following columns:
id, Type, Color, Type title, Date, Revision, Title, Collection
I just realize that the first column with ids is made like this
1
=+A2+1
=+A3+1
...
When I replace that with pure numbers, it can be read with no problems.
Regarding the observation about regular expressions, I would like to quote from the documentation:
tFileInputExcel opens a file and reads it row by row to split data up into fields using regular expressions.
Now I tried to change the advanced setting for "Generation mode" from "Memory-consuming", to "Less memory consumed". That will also do the trick, and actually "Less memory consumed" reads all sheets faster than when using "Memory-consuming"
So I guess that formulas in the sheet and "Memory-consuming" do not work so well together.
And based on tests, it seems that "Less memory consumed" is faster anyway.
Best regards - Jojs
Moderator

Re: tFileInputExcel with more than approx. 2500 rows gives stackoverflow

Hi Jojs,
It suggested that you should open a new topic for your issue so that more persons in forum will see it. In addition, could you upload your job screenshots into forum to help us to address your issue.
Best regards
Sabrina
--
Don't forget to give kudos when a reply is helpful and click Accept the solution when you think you're good with it.
One Star

Re: tFileInputExcel with more than approx. 2500 rows gives stackoverflow

As simple as this attached. Does it tell you someting new?
I am not sure what you mean about starting a new topic? About what, and where, and for whom?
To me it sounds like a bug for "Memory-consuming" "Generation mode". But I would question the reason to even have that mode, since it is slower even for simple files. I suggest that Talend either remove that generation mode, fix the bug or document it with the component.
(On another note, there is a similar option when parsing XML-files, which also is questionable if it is about performance)

Best regards - Jojs
http://www.talendforge.org/forum/img/members/60190/mini_103940_Sk
Moderator

Re: tFileInputExcel with more than approx. 2500 rows gives stackoverflow

Hi,
Thank you a comprehensive testing and summary.
Available when Read excel2007 file format (xlsx) is selected in the Basic settings view. Select the mode used to read the Excel 2007 file.
Generation mode:
Less memory consumed for large excel(Event mode): used for large file. This is a memory-saving mode to read the Excel 2007 file as a flow.
Memory-consuming (User mode): used for small file. It needs much memory. That is the reason
Now I tried to change the advanced setting for "Generation mode" from "Memory-consuming", to "Less memory consumed". That will also do the trick, and actually "Less memory consumed" reads all sheets faster than when using "Memory-consuming"
So I guess that formulas in the sheet and "Memory-consuming" do not work so well together.
And based on tests, it seems that "Less memory consumed" is faster anyway.

For more details, see the component reference tFileInputExcel
Best regards
Sabrina
--
Don't forget to give kudos when a reply is helpful and click Accept the solution when you think you're good with it.
One Star Jsi
One Star

Re: tFileInputExcel with more than approx. 2500 rows gives stackoverflow

Memory-consuming (User mode): used for small file. It needs much memory. That is the reason

Yes, I read that. But that documentation does not propose any benefit of using that mode. I don't think that "It needs much memory" qualifies as a benefit. And what is a "small" file anyway?
Based on my small test, I see no benefit in that "small but memory-consuming" mode at all. In my test, it is not any faster (which would be a benefit).
My suggestion to remove it was more like a suggestion for an easy fix and less clutter at the same time. But I will leave that up to the Talend team.
(I apologise for mixing in a comment about xml. That DOES have some special purpose, although I have yet to finde examples taking advantage of it)
Community Manager

Re: tFileInputExcel with more than approx. 2500 rows gives stackoverflow

Hi Jsi,
Could you open an issue for the Doc team, in https://jira.talendforge.org/browse/DOCT, please?
We need to be more accurate in the documentation, it seems, and we also need to clarify with the dev team in which cases it is interesting to use the more "memory-consuming" mode. We missed that point, it seems.
Cheers,
Elisa