I want to process a CSV file consisting of some 80 million rows. Currently I do that by limiting the number of rows read per run to 8 million (the most my machine can handle), using the tFileInputDelimited header and limit settings. I then use a batch file to call the job 12 times, with context parameters specifying the header and limit for each run.
Is there an easier way to do this?
First of all, I like the way you have approached this problem, but there are ways to make it a little easier. Have you tested adjusting the JVM settings in the "Advanced" option on the "Run" tab? I assume you know about these. If not, try upping the Xmx value. Remember that you can only assign as much memory as your machine has available.
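If you want to confirm that a changed Xmx value is actually being picked up, a small check like the one below may help. It is only a sketch: the class name HeapCheck is made up for illustration, and it relies solely on the standard Runtime.maxMemory() call, which reports the maximum heap the JVM will try to use.

```java
public class HeapCheck {
    // Convert a byte count to whole mebibytes.
    public static long bytesToMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    // Maximum heap the current JVM will attempt to use, in MiB.
    // This roughly reflects the -Xmx setting the JVM was started with.
    public static long maxHeapMiB() {
        return bytesToMiB(Runtime.getRuntime().maxMemory());
    }

    public static void main(String[] args) {
        System.out.println("Max heap available to this JVM: " + maxHeapMiB() + " MiB");
    }
}
```

Running it with, say, -Xmx4096M (an example value, not a recommendation) should report a figure close to 4096; the exact number can be slightly lower depending on the JVM.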
Assuming you have done this and the 8 million rows value is set according to that maximum, you can use a parent job to run your current job (as a child job) and loop through it using a tLoop component. This would save having a batch file and would make configuring it a lot easier. You will need to use a tRunJob component for this.
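To illustrate the arithmetic the loop has to do on each iteration, here is a rough sketch in plain Java. The class ChunkPlanner and its method names are hypothetical, not part of Talend. It also assumes the file has a fixed number of header lines to skip and that the total row count is known up front. In the parent job, the two values for each chunk would be passed to the child job as the context parameters you already use for header and limit.

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkPlanner {
    // One run of the child job: rows to skip (Header) and rows to read (Limit).
    public static final class Chunk {
        public final long header;
        public final long limit;

        Chunk(long header, long limit) {
            this.header = header;
            this.limit = limit;
        }
    }

    // Split totalRows data rows into runs of at most chunkSize rows.
    // headerLines is the number of non-data lines at the top of the file (assumed).
    public static List<Chunk> plan(long totalRows, long chunkSize, long headerLines) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<Chunk> chunks = new ArrayList<>();
        for (long offset = 0; offset < totalRows; offset += chunkSize) {
            // The last chunk may be shorter than chunkSize.
            long limit = Math.min(chunkSize, totalRows - offset);
            chunks.add(new Chunk(headerLines + offset, limit));
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Example figures: exactly 80 million data rows, 8 million per run, 1 header line.
        for (Chunk c : plan(80_000_000L, 8_000_000L, 1L)) {
            System.out.println("header=" + c.header + " limit=" + c.limit);
        }
    }
}
```

Deriving the header from the loop counter rather than hard-coding it in a batch file means the chunk size only has to be changed in one place.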
Thanks guys, I tried both approaches but ultimately went with the tFileList approach. I like how I can track the progress in the IDE, which isn't possible with a separate parent job.
The tRun approach looks nice because it can execute child jobs in different threads, though. I would imagine there to be efficiency benefits to using this approach, but I didn't notice any significant changes on my test runs.