TOS handling large volume of data


Hi,

I am using Talend Open Studio for Data Integration version 5.4.1.

My requirement is to load large volumes (20 to 70 million records) of data from flat files into an Oracle DB.
Once loaded, process the data (transformations etc.) and load it into another set of tables in the Oracle DB as the target.

My questions are:
Q1: The plan is to break the large volume of data into smaller chunks and then load and process them.
I would like to know the experts' opinion on how best this can be achieved.
Also, please let me know which settings need to be taken care of in Talend Studio for handling such high volumes.

Q2: If one of the smaller chunks fails to load, is there a mechanism to restart the job from the point of failure? If no built-in components or mechanisms are available, how best can this scenario be handled?

Please suggest.
NN.

Re: TOS handling large volume of data

Hi NN,

Is this activity one-time or recurring?
Is the data distributed across multiple flat files or a single file?
What transformations do you have?
How many tables are going to be generated from those files?
What is the relationship among the tables, if any?

Presently there is no built-in mechanism in TOS to restart from the point of failure, but if you plan your design accordingly, you can implement one yourself.

Please provide the details so that different options can be considered. ELT components and bulk-execute components are a better way to load large data volumes faster.

Thanks
Vaibhav

Re: TOS handling large volume of data

Thanks for the reply, sanvaibhav.


Is this activity one-time or recurring?
--> One-time activity.

Is the data distributed across multiple flat files or a single file?
--> Multiple files.

What transformations do you have?
--> Simple transformations (replace, adding default values).

How many tables are going to be generated from those files?
--> Approximately 60 tables.

What is the relationship among the tables, if any?
--> Parent-child.


Presently there is no built-in mechanism in TOS to restart from the point of failure, but if you plan your design accordingly, you can implement one yourself.
--> Can you please give us more input on the above statement?

Re: TOS handling large volume of data

TalendNN wrote:
Presently there is no built-in mechanism in TOS to restart from the point of failure, but if you plan your design accordingly, you can implement one yourself.
--> Can you please give us more input on the above statement?

You'd have to set up logging in your ETL job that keeps track of the DI job and records starts, ends, and states (success, failure), so you can monitor the progress of the job. You'd then design the job to take a rerun flag and a particular job id (or the last-run job id), read through your ETL log, and re-execute any failed steps that were logged. Each step in the job would check the rerun parameter and the last-run job id and run again only if it did not complete in that run.
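A minimal sketch of that log lookup, assuming a hypothetical run-log table ETL_JOB_LOG(JOB_RUN_ID, STEP_NAME, STATUS, ...) — the table, columns, and connection details below are invented for illustration; in a Talend job the same query would typically sit in a tDBInput or tJavaRow step, with the rerun flag and job id passed in as context variables:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

public class RerunCheck {

    // Returns the names of steps that did NOT finish successfully for the
    // given job run, so only those steps are re-executed on a rerun.
    // Assumes the hypothetical log table ETL_JOB_LOG(JOB_RUN_ID, STEP_NAME, STATUS, ...).
    public static List<String> failedSteps(Connection conn, long lastRunJobId) throws Exception {
        String sql = "SELECT STEP_NAME FROM ETL_JOB_LOG "
                   + "WHERE JOB_RUN_ID = ? AND STATUS <> 'SUCCESS'";
        List<String> steps = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, lastRunJobId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    steps.add(rs.getString("STEP_NAME"));
                }
            }
        }
        return steps;
    }

    public static void main(String[] args) throws Exception {
        boolean rerun = Boolean.parseBoolean(args[0]); // rerun flag (context variable)
        long lastRunJobId = Long.parseLong(args[1]);   // id of the run to resume

        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", "etl_user", "etl_pwd")) {
            if (rerun) {
                // Re-execute only the steps logged as failed or never finished.
                for (String step : failedSteps(conn, lastRunJobId)) {
                    System.out.println("Re-executing step: " + step);
                    // ... trigger the corresponding subjob here ...
                }
            } else {
                System.out.println("Fresh run: execute all steps and log each one.");
            }
        }
    }
}
```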

Re: TOS handling large volume of data

In addition to what willm has said, I have a few more questions and ideas.

Do you want to roll back the transactions for all the tables, or do only the failed ones need to be rolled back? If you design your job around the input data and the output data, you can create some system tables. When you re-run the job, it would check the target for the last loaded ID and then start inserting the input data that comes after that ID.

Say you have IDs 1-10 in the source; rows 1-5 were inserted, then an error occurred and the job failed. Through a query, identify the last ID in the target table and use that ID to filter the input data with a tFilterRow component. Rows with an ID greater than that value are allowed through the filter to the output, and earlier rows are filtered out, as sketched below.
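A rough illustration of that resume logic in plain JDBC, under the assumption of invented table and column names (SOURCE_STAGE, TARGET_TABLE, REC_ID) and connection details — in a real Talend job the lookup would feed the last ID into a context variable and the comparison would live in the tFilterRow condition:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ResumeByLastId {

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", "etl_user", "etl_pwd");
             Statement st = conn.createStatement()) {

            // 1) Find the last ID that already made it into the target table.
            long lastLoadedId = 0;
            try (ResultSet rs = st.executeQuery(
                    "SELECT NVL(MAX(REC_ID), 0) FROM TARGET_TABLE")) {
                if (rs.next()) {
                    lastLoadedId = rs.getLong(1);
                }
            }

            // 2) Re-read the source, letting through only rows with a greater ID
            //    (this is the condition a tFilterRow component would apply).
            try (ResultSet rs = st.executeQuery(
                    "SELECT REC_ID, PAYLOAD FROM SOURCE_STAGE ORDER BY REC_ID")) {
                while (rs.next()) {
                    long id = rs.getLong("REC_ID");
                    if (id > lastLoadedId) {
                        // Row was not loaded before the failure: process / insert it.
                        System.out.println("Loading row " + id);
                    }
                    // Rows with id <= lastLoadedId are skipped, like the filter's reject flow.
                }
            }
        }
    }
}
```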

You can apply similar logic to more complex scenarios so that when the job re-runs it fetches only the remaining data. You can also consider batch size, commit options, and rollback components if required.

Thanks
Vaibhav

Re: TOS handling large volume of data

I want to write 11 million records to a file. Using the file output delimited component takes 20 minutes. Is there any other component to reduce the time?
