I have created a simple job:
tFileInputJSON --> (main) --> tLogRow
The source JSON file is around 2GB in size. I even increased the JVM settings in the Run view to Xms2G, Xmx4G, but it always fails with a memory issue.
NOTE: the job works fine if the JSON is a simple file.
Is there a way to extract a big file, or is it a product limitation? I have seen some articles for CSV files, but nothing for JSON.
Looking forward to hearing some valuable answers. Thanks.
@rvkiruba, what is your system's RAM size? And if you are using the Enterprise edition, you can execute the Job on a remote system.
@manodwhb, my PC has 8GB RAM. Is that not sufficient?
It runs Win 7 Enterprise. Can you please brief me on the remote system option?
Could you please let us know if this article helps?
Thanks for your reply. The article did not help me; I now get the error below.
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
I also followed some other instructions and updated the JVM arguments as below, but no luck.
With 8 GB of memory available on a 64-bit system, the optimal settings can be:
-Xms1024m -Xmx4096m -XX:MaxPermSize=512m -Dfile.encoding=UTF-8
Could you please open the Job to which you want to allocate more memory, go to the Advanced settings tab of the Run view, and select the Use specific JVM arguments check box. Then allocate more memory to the active Job by double-clicking the default JVM arguments and editing them.
This change only applies to the active Job. The JVM settings will persist in the job script and take effect when the job is exported and executed outside Talend Studio.
Let us know if this works for you.
As I said in my previous post, I made the necessary changes to the JVM arguments in Run --> Advanced settings and ran the job on my PC. Screenshot attached for your reference.
Are you saying that the JVM setting change works only when the job runs in the cloud? Is that correct?
It means the change only applies to your current Job, not to the whole Studio. The JVM settings will persist in the job script and take effect when your job is exported and executed outside Talend Studio (as a .bat or .sh file).
Thanks for sharing your solution with us on the forum.
I am having the same issue with JSON processing. My file size is 3GB. Did you split the input at the source?
Or are you processing the 2GB input file and splitting it inside the job?
If you have a complex JSON, it would be a good idea to produce smaller files at the source stage itself, since that avoids the overhead of splitting the big files into smaller ones.
Check your PC's RAM size (I would suggest 16GB as a minimum). Based on that, increase your JVM argument sizes:
-Xms (for example, -Xms2048m)
-Xmx (for example, -Xmx4096m)
1. Go to the Run view in your job.
2. Click "Advanced settings".
3. Enable "Use specific JVM arguments".
4. Change the values.
5. Run the job.
If that still fails, split the file into smaller files and run them; see the sketch below.
Let me know how it goes.
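If you want to do the split in code rather than by hand, here is a minimal sketch using the Jackson streaming API, which ships with Talend Studio. It assumes the input is one top-level JSON array of records; the file names and the chunk size are made up, so adjust them to your case:

import com.fasterxml.jackson.core.JsonEncoding;
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonGenerator;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import java.io.File;

public class JsonSplitter {
    private static final int RECORDS_PER_FILE = 100_000; // tune to your record size

    public static void main(String[] args) throws Exception {
        JsonFactory factory = new JsonFactory();
        int chunk = 0;
        long count = 0;
        JsonGenerator out = null;

        try (JsonParser in = factory.createParser(new File("big_input.json"))) {
            if (in.nextToken() != JsonToken.START_ARRAY) {
                throw new IllegalStateException("expected a top-level JSON array");
            }
            while (in.nextToken() == JsonToken.START_OBJECT) {
                if (out == null) {
                    // open the next chunk file and start a fresh array
                    out = factory.createGenerator(
                            new File("chunk_" + (chunk++) + ".json"), JsonEncoding.UTF8);
                    out.writeStartArray();
                }
                out.copyCurrentStructure(in); // stream one record straight through
                if (++count % RECORDS_PER_FILE == 0) {
                    out.writeEndArray();
                    out.close();
                    out = null;
                }
            }
            if (out != null) {
                out.writeEndArray();
                out.close();
            }
        }
    }
}

Because the parser and generator only ever hold the current record, this runs in a small, constant amount of heap no matter how large the input file is.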
Option 1: Increase memory
I have tried increasing memory and it still fails. The server has a total of 16GB of memory, and we have other processes running on the same server, so I was able to allocate a maximum of 4GB; anything above that failed to allocate. My input file is around 2.9GB. Based on discussion with Talend support, the job execution itself takes some memory, and apparently the JSON is read as a single 2.9GB object and loaded into memory. That is where it fails.
Option 2: Split the input file into multiple smaller files
Our POC input is 2.9GB in size. We can do the split, but we need to use Talend components only. Is there any way we can do it in Talend?
My thought is that we would need to read the file at least once to process it, and it would fail right there. Correct me if I am wrong here.
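To illustrate what support described: a DOM-style read materializes the whole document on the heap, while a streaming read holds only the current token. A rough sketch with the Jackson library (this is just an illustration, not Talend's actual generated code, and the file path is made up):

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.File;

public class DomVsStreaming {
    public static void main(String[] args) throws Exception {
        File input = new File("big_input.json"); // hypothetical path

        // DOM-style read: builds the entire 2.9GB document as one tree on
        // the heap -- this is where the OutOfMemoryError hits.
        // new ObjectMapper().readTree(input);

        // Streaming read: only the current token is in memory, so heap
        // usage stays flat regardless of file size.
        long records = 0;
        try (JsonParser parser = new JsonFactory().createParser(input)) {
            while (parser.nextToken() != null) {
                if (parser.currentToken() == JsonToken.START_OBJECT) {
                    records++;             // handle one record here
                    parser.skipChildren(); // jump past the rest of this object
                }
            }
        }
        System.out.println("records seen: " + records);
    }
}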
You could try using the component tJSONDocInputStream by Jan Lolling from Talend Exchange. It is designed specifically for reading very large files.
Thanks, Fred, for the suggestion. In the end we went ahead with splitting the file outside Talend and then processing it, and that worked fine for us. The problem with the JSON file generated in Talend was that it was stored as a single JSON object, and Talend tries to load the whole object into memory. So we created a file with each record as a separate JSON object and processed that to avoid the GC issue.
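For anyone who lands here later: with one record per line, each line can be parsed on its own, so only one record is ever on the heap. A minimal sketch with Jackson (the file name and the "id" field are made up for illustration):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.BufferedReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class RecordPerLineReader {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        try (BufferedReader reader = Files.newBufferedReader(
                Paths.get("records.json"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) continue;
                JsonNode record = mapper.readTree(line); // only this record on the heap
                System.out.println(record.path("id").asText()); // hypothetical field
            }
        }
    }
}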