One Star

File size

I am trying to run some CSV data through the tool, and it keeps crashing at the end of the add-dataset process.  The file itself is about 2.8 GB, which is the kind of size I may end up dealing with if I pull from an HDFS datastore.  Is there a maximum file size that the tool can ingest, and is there a way to tune the Java heap?

Community Manager

Re: File size

What's the error message? There is no hard limit on file size; normally, you might get an OutOfMemory exception when processing a large data set. Can you please provide more details about your job?
----------------------------------------------------------
Talend | Data Agility for Modern Business
One Star

Re: File size

When the file load gets to 100%, the application just throws a generic error. 
"Server error
 An error occurred"


There's an 'x' to close the message and that's it.  The file is being loaded locally; it can be opened in a text application and iterated through in other applications just fine.  The system being used is Windows 7 64-bit with an i7 processor and 16 GB of RAM.
Employee

Re: File size

Can you please do the following:

1) Quit Data Prep
2) Delete the app.log file
3) Run Data Prep and reproduce your scenario
4) Attach your app.log here

So where is app.log? On Windows: C:\Users\\AppData\Roaming\talend\dataprep\logs\app.log
Let me know if you are on Mac (the file is hidden but there is a trick).
Don't worry: app.log only captures the software's activity and detailed error logs; there is no information about your data in it. Feel free to check it out in a text editor too.

Free Desktop is designed to load the entire dataset in memory; as a safeguard, however, it arbitrarily limits a preparation to 10,000 rows. This is of course not a hard limit (our code being open source, that would be silly, wouldn't it?), just a measure to prevent users from crashing the app... Well, obviously that didn't work well in your case :-)  The app.log should tell us what went wrong.
Thanks!
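For what it's worth, here is a minimal sketch of the kind of safeguard described above. It is purely my own illustration (not Data Prep's actual code): stream the file and keep only the first N rows in memory, never the whole dataset.

    import java.io.BufferedReader;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.List;

    public class SampleLoader {

        // Keep at most 'limit' rows in memory; anything beyond that is never read.
        static List<String[]> loadSample(String path, int limit) throws Exception {
            List<String[]> rows = new ArrayList<>();
            // Latin-1 accepts any byte sequence, so decoding never fails here;
            // guessing the real encoding is a separate concern.
            try (BufferedReader reader = Files.newBufferedReader(
                    Paths.get(path), StandardCharsets.ISO_8859_1)) {
                String line;
                while (rows.size() < limit && (line = reader.readLine()) != null) {
                    rows.add(line.split(",", -1)); // naive split; real CSV parsing must handle quotes
                }
            }
            return rows;
        }

        public static void main(String[] args) throws Exception {
            System.out.println(loadSample(args[0], 10_000).size() + " rows kept in memory");
        }
    }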
One Star

Re: File size

The log file should be attached, but from a cursory review it looks like it is a Java heap space error.
app.log.log
Employee

Re: File size

I can't seem to see the attachment. Can you try to zip it perhaps?
I am looking forward to digging into it because, again, DP Free Desktop cuts off at 10K rows precisely to avoid this very error, so I'd love to understand why it didn't do so in your case.
One Star

Re: File size

We'll see if this works.  I noticed when I uploaded the file it added the extension twice.

appLog.zip_20160216-1430.zip
The link should be: www.talendforge.org/forum/img/members/314254/appLog.zip_20160216-1430.zip
Employee

Re: File size

Hi Brian
How many rows does the file contain (roughly)?
Thanks
One Star

Re: File size

There are about 8.8 million lines in the file, which isn't atypical for the data I work with.
Employee

Re: File size

To add to the previous question (sorry for the hassle):
1) What is the actual content and format of the file? Is it a list of transactions, a log file, etc.?
2) What is the delimiter? How many columns do you see in a text editor?
3) Where is your file located? A local drive?
Any chance you can share the first row, or the first 2, 3 or even 10 rows, after masking what you (understandably!) wouldn't share on a public forum?
Just trying to understand what is so special about this file. There is indeed an out-of-memory error, and this is pretty unexpected.
One Star

Re: File size

The file is being loaded from a local drive.  I would characterize the file contents as 72 columns wide (not my data) x 8.8M rows, comma-delimited with quoted text; the widest column has a max length of 896 characters (avg is 44).  Data Prep reads the file as UTF-8, though it's actually ISO-8859-1 (aka Latin-1).
A slightly different problem I encountered with a smaller, less complicated CSV was with exporting prepped data. That file has about 400k lines, but I could only export 10k lines to a new file.
Employee

Re: File size

Thanks Brian. I will generate a file (with Talend Open Studio ;-) ) similar to yours and have it looked into. The Free Desktop version of Talend DP is meant for personal use on reasonably sized files, but I still want the behavior of the software to be sound and predictable on larger files. I need to understand why that isn't the case with yours.
We use heuristics to guess the character encoding. Unfortunately there is no deterministic method, so it may or may not guess correctly. You can manually override the automatic encoding by clicking the little "gear" icon next to the dataset name, in the upper left of the preparation screen.
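To illustrate (this is just a generic sketch using ICU4J's CharsetDetector; I am not claiming this is what Data Prep uses internally), such a heuristic guess only needs a bounded sample of the file, and it returns a best guess with a confidence score rather than a certainty:

    import com.ibm.icu.text.CharsetDetector;
    import com.ibm.icu.text.CharsetMatch;
    import java.io.FileInputStream;
    import java.util.Arrays;

    public class GuessEncoding {
        public static void main(String[] args) throws Exception {
            // Read a bounded sample; the whole file is not needed for the guess.
            byte[] buffer = new byte[64 * 1024];
            int n;
            try (FileInputStream in = new FileInputStream(args[0])) {
                n = in.read(buffer);
            }
            CharsetDetector detector = new CharsetDetector();
            detector.setText(Arrays.copyOf(buffer, Math.max(n, 0)));
            CharsetMatch match = detector.detect();
            // ISO-8859-1 vs UTF-8 is a classic ambiguity for mostly-ASCII data.
            System.out.println(match.getName() + " (confidence " + match.getConfidence() + ")");
        }
    }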
Regarding the 10K limit in export: yes, we cut off automatically at 10K rows to avoid out-of-memory errors, so you can only prepare 10K rows out of your 400K original file, and therefore that is all you have for export too. This is, by the way, what should have happened with your 8.8M file as well.
You can increase or decrease the 10K threshold by editing this file: \config\application.properties
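For example, assuming the relevant property looks something like the line below (the exact property name is a guess on my part, so check the comments inside your own application.properties before editing), raising the cutoff to 400K rows would be a one-line change:

    # <install dir>\config\application.properties -- property name below is an assumption
    dataset.records.limit=400000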
Note: the commercial, server version of the software, due in June this year, will have no volume limitation. NOT because we put an artificial limit in Free Desktop; the limit is freely configurable and our code is open source (not to mention that an artificial limit would be contrary to our code of conduct). It is only because we have invested in more sophisticated scalability techniques in the server version.
Employee

Re: File size

I confirm we have an issue: we indeed consume too much memory when assessing the format of the file in certain circumstances. This stage happens before the actual import of the file, which cuts off at 10K rows. A fix is already in the works.
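For readers hitting something similar: one way to keep that kind of format-assessment stage memory-bounded is to sniff only the first few lines of the file. A minimal, purely illustrative sketch (again, not the actual Data Prep code) that guesses the delimiter from a bounded sample:

    import java.io.BufferedReader;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class GuessDelimiter {
        public static void main(String[] args) throws Exception {
            char[] candidates = {',', ';', '\t', '|'};
            long[] counts = new long[candidates.length];
            try (BufferedReader reader = Files.newBufferedReader(
                    Paths.get(args[0]), StandardCharsets.ISO_8859_1)) {
                String line;
                int seen = 0;
                // Only a fixed number of lines is inspected, so memory use stays flat.
                while ((line = reader.readLine()) != null && seen++ < 100) {
                    for (int i = 0; i < candidates.length; i++) {
                        final char candidate = candidates[i];
                        counts[i] += line.chars().filter(c -> c == candidate).count();
                    }
                }
            }
            int best = 0;
            for (int i = 1; i < counts.length; i++) {
                if (counts[i] > counts[best]) best = i;
            }
            System.out.println("Most frequent candidate delimiter: '" + candidates[best] + "'");
        }
    }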
One Star

Re: File size

Good to hear that you found what is causing the problem.  I was able to adjust the input/output to 400k without any problems.
One Star

Re: File size

Hello,
If this is still an active topic: I have DP 1.3, and on a PC with 8 GB of RAM I see a serious slowdown when I open a 30k-row/20-column CSV.
I also tried to open a 700k-row file after adjusting the sample size in the config to 700k, and it is not possible to show that amount of data.
Should I, or can I, increase the Java heap in the DP config to handle bigger samples of data?
Thank you