How to design a job with tens of millions of rows

Four Stars

I want to process a CSV file of some 80 million rows. Currently I do this by limiting the number of rows read per run to 8 million (the maximum my machine can handle) using the tFileInputDelimited header and limit settings, then using a batch file to call the job 12 times, with context parameters specifying the header and limit for each run.

Is there an easier way to do this?
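For reference, a sketch of the setup described above; the context-parameter names "header" and "limit" are illustrative labels, and the launcher flags assume a standard exported Talend job:

    // Child job: tFileInputDelimited basic settings, driven by two
    // context parameters (hypothetical names):
    //   Header = context.header   // rows to skip before reading starts
    //   Limit  = context.limit    // rows to read in this run (8,000,000)
    //
    // The batch file launches the exported job 12 times, shifting the
    // offset by 8,000,000 each run, e.g.:
    //   --context_param header=1        --context_param limit=8000000
    //   --context_param header=8000001  --context_param limit=8000000
    //   ...
    // (header=1 on the first run assumes one header line in the CSV)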

Accepted Solutions
Fifteen Stars TRF

Re: How to design a job with tens of millions of rows

I agree with @rhall_2_0.
To complement that suggestion, try this:
Use the parent job to split the input file into "small" pieces of 8 million rows or less (see the tFileOutputDelimited advanced settings).
Using a tFileList, iterate over the list of generated files, then pass the current filename to the child job.
Each file will be processed by the child job as proposed by @rhall_2_0.
With this method you'll read far fewer records: 2 x 80 million = 160 million (one pass to split, one to process), instead of roughly 440 million with the header/limit approach, where each run must still scan past every row it skips (8 million x (1 + 2 + ... + 10) = 440 million). On the other hand, you'll also have to write 80 million records, so you'll need to test whether the trade-off actually pays off.
Be careful with this technique: the order in which tFileList accesses the generated files is not necessarily the original row order, so mind it if processing order matters. A sketch of the wiring follows.
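A minimal sketch of that wiring, assuming component names tFileOutputDelimited_1, tFileList_1 and tRunJob_1, and a child context parameter named "inputFile" (all names illustrative; adjust to your own):

    // Parent job, step 1 - split the 80M-row file:
    //   tFileOutputDelimited_1, Advanced settings:
    //     check "Split output in several files",
    //     set "Rows in each output file" = 8000000
    //
    // Parent job, step 2 - iterate over the pieces and delegate:
    //   tFileList_1 --(Iterate)--> tRunJob_1
    //   tRunJob_1 context parameter "inputFile" set to:
    ((String) globalMap.get("tFileList_1_CURRENT_FILEPATH"))
    //
    // Child job: point tFileInputDelimited's "File name/Stream" at
    // context.inputFile and drop the header/limit arithmetic entirely.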
Hope this helps.

TRF

All Replies
Community Manager rhall_2_0

Re: How to design a job with tens of millions of rows

First of all, I like the way you have approached this problem. But there are ways you can make it a little easier. Have you tested adjusting the JVM settings in the "Advanced" option on the "Run" tab? I assume you know about these; if not, try upping the Xmx value. Remember you can only assign memory that your machine actually has available.

Assuming you have done this and the 8 million row figure is set according to that maximum, you can use a parent job to run your current job (as a child job) and loop over it using a tLoop component. You will need a tRunJob component for this. It saves you the batch file and makes configuration a lot easier; a sketch follows.
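A minimal sketch of that parent job, assuming component names tLoop_1 and tRunJob_1, twelve runs of 8 million rows, and child context parameters named "header" and "limit" (all names illustrative):

    // Parent job flow:  tLoop_1 --(Iterate)--> tRunJob_1 (runs the child job)
    //
    // tLoop_1, Loop Type "For":  From = 0, To = 11, Step = 1
    //
    // tRunJob_1 context parameters (hypothetical names):
    //   limit  -> 8000000
    //   header -> 8000000 * (Integer) globalMap.get("tLoop_1_CURRENT_VALUE") + 1
    //             // "+ 1" assumes one header line in the CSV; drop it if there is none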
Four Stars

Re: How to design a job with tens of millions of rows

Thanks guys, I tried both approaches but ultimately went with the tFileList approach. I like that I can track progress in the IDE, which isn't possible with a separate parent job.

The tRunJob approach looks nice because it can execute child jobs in different threads, though. I would expect efficiency benefits from that, but I didn't notice any significant difference in my test runs.
