Four Stars

How to design a job with tens of millions of rows

I want to process a CSV file consisting of some 80 million rows. Currently I do that by limiting the number of rows read per run to 8 million (the maximum my machine can handle) using the tFileInputDelimited header and limit settings. I then use a batch file to call the job 12 times, using context parameters to set the header and limit for each run.
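For what it's worth, the batch-file logic can be sketched outside Talend. In this Python sketch the job launcher name and context-parameter syntax are hypothetical placeholders, and the run count simply falls out of the row count:

```python
# Compute the (header, limit) context parameters for each run.
# TOTAL_ROWS is approximate; the launcher call below is hypothetical.
TOTAL_ROWS = 80_000_000   # rows in the CSV (roughly)
CHUNK = 8_000_000         # max rows one run can handle

def run_params(total_rows, chunk):
    """Yield (rows_to_skip, rows_to_read) pairs covering the whole file."""
    for start in range(0, total_rows, chunk):
        yield start, min(chunk, total_rows - start)

params = list(run_params(TOTAL_ROWS, CHUNK))
for header, limit in params:
    # Each pair would be handed to the job launcher, e.g.:
    # subprocess.run(["myJob_run.sh", f"--context_param header={header}",
    #                 f"--context_param limit={limit}"], check=True)
    pass
```

With an exactly-80-million-row file this yields 10 runs rather than 12; the extra calls in the original setup just read zero rows.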

 

Is there an easier way to do this?


1 ACCEPTED SOLUTION

Twelve Stars TRF

Re: How to design a job with tens of millions of rows

I agree with @rhall_2_0.
To complete that, I suggest you try this.
Use the parent job to split the input file into "small" pieces of 8 million rows or less (see the tFileOutputDelimited advanced settings).
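As a side note, that split option amounts to something like this plain-Python sketch; the file naming here is illustrative, not what Talend produces:

```python
import csv

def split_csv(src, max_rows, prefix="part"):
    """Write the rows of src into files of at most max_rows rows each."""
    written = []
    with open(src, newline="") as f:
        reader = csv.reader(f)
        part, rows, out, writer = 0, 0, None, None
        for row in reader:
            if out is None:
                # Start a new piece (illustrative naming: part_0.csv, part_1.csv, ...)
                out = open(f"{prefix}_{part}.csv", "w", newline="")
                writer = csv.writer(out)
            writer.writerow(row)
            rows += 1
            if rows == max_rows:
                out.close()
                written.append(f"{prefix}_{part}.csv")
                part, rows, out, writer = part + 1, 0, None, None
        if out is not None:  # close the final, possibly short, piece
            out.close()
            written.append(f"{prefix}_{part}.csv")
    return written
```

Each generated piece then becomes one iteration of the tFileList loop.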
Using a tFileList, iterate over the list of generated files, then pass the current filename to the child job.
Each file will be processed by the child job as proposed by @rhall_2_0.
With this method you'll have fewer records to read (2 x 80 million instead of 440 million, because rows skipped with the header setting still have to be scanned), but you'll also have to write 80 million records.
You'll have to test it to estimate whether the trade-off pays off.
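To make that arithmetic explicit, here is a rough model assuming 10 runs of 8 million rows each, and that rows skipped via the header setting must still be scanned:

```python
TOTAL = 80_000_000
CHUNK = 8_000_000
runs = TOTAL // CHUNK  # 10 runs

# Header/limit approach: run k must scan the k previously skipped chunks
# plus its own, so the totals add up quickly.
scanned_header_limit = sum((k + 1) * CHUNK for k in range(runs))  # 440 million

# Split-first approach: one full read to split the file, then one full read
# of the pieces, at the cost of writing the whole file out once.
scanned_split = 2 * TOTAL   # 160 million rows read
written_split = TOTAL       # 80 million rows written
```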
Be careful with this technique: pay attention to the order in which tFileList will access the generated files.
Hope this helps.

TRF
3 REPLIES
Twelve Stars

Re: How to design a job with tens of millions of rows

First of all, I like the way you have approached this problem, but there are ways to make it a little easier. Have you tried adjusting the JVM settings in the "Advanced" option on the "Run" tab? If not, try increasing the Xmx value. Remember you can only assign the memory your machine actually has available.

Assuming you have done this and the 8 million row limit reflects that maximum, you can use a parent job to run your current job as a child job, looping over it with a tLoop component. This would save you the batch file and make configuration a lot easier. You will need a tRunJob component for this.

 

Rilhia Solutions
Four Stars

Re: How to design a job with tens of millions of rows

Thanks guys, I tried both approaches but ultimately went with the tFileList approach. I like how I can track the progress in the IDE, which isn't possible with a separate parent job.

 

The tRunJob approach looks nice because it can execute child jobs in different threads, though. I would expect efficiency benefits from that, but I didn't notice any significant difference in my test runs.