Talend Job taking too much time for 30 million records

Seven Stars

Talend Job taking too much time for 30 million records

Hi,

I am executing a simple Talend job to import 30 million records from a CSV file into an MSSQL database. When we run it through Java Spring Batch jobs, it takes around 4 hours, but when we run it through the Talend Enterprise version on the same server, it takes around 9 hours.

 

We have tried almost every option - Batch Size, parallelization, increasing the initial and maximum heap sizes - but in vain. The structure of the job is below.

 

tFileList --> tFileInputDelimited --> tMap --> tMSSQLOutput

 

The organization's biggest priority when purchasing the license was to decrease the loading time.

Could someone please suggest how we can improve the performance?

 

Best Regards,

Abhishek


Accepted Solutions
Sixteen Stars

Re: Talend Job taking too much time for 30 million records

I've not experienced poor performance like this when using MSSQL components, but I have seen posts from people who have had issues with the latest versions of Talend. Apparently adding "sendStringParametersAsUnicode=false" to the Advanced Settings of your component may help with performance. You will need to tweak the Batch Size and Commit Every options to fine-tune.
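For reference, this setting is a JDBC connection property, so it can also be appended directly to the connection URL. A minimal sketch of what the resulting URL looks like (host, port, and database names below are placeholders, not values from this thread):

```java
public class JdbcUrlExample {
    // Builds a Microsoft SQL Server JDBC URL with sendStringParametersAsUnicode
    // disabled, so string parameters are sent as VARCHAR rather than NVARCHAR.
    static String buildUrl(String host, int port, String db) {
        return "jdbc:sqlserver://" + host + ":" + port
                + ";databaseName=" + db
                + ";sendStringParametersAsUnicode=false";
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("dbserver", 1433, "staging"));
    }
}
```

In Talend, the same string goes into the component's "Additional JDBC Parameters" / Advanced Settings field rather than being built by hand.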

 

Parallel execution will also help, but be sure to set your parallel processes to a maximum of n-1 (where n is the number of cores on your system).

 

You may also want to use the tParallelize component to process more than one file at a time. To do that you will need to add a mechanism to stop the same file from being read twice concurrently. That's not too difficult, but you will need to implement it yourself: a static routine that holds the file names and hands each one out only once will do. Combining this WITH the parallel execution on the DB component needs to be balanced to be optimal. For example, if you have 8 cores and read 4 files concurrently using the tParallelize component, you may want to set 3 of the DB components to 2 parallel executions and leave 1 with just 1, reserving a core for controlling the process. That is just a rule of thumb, though. Have a play and see what you can get.
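The "static routine to hold the file names and use only once" could look something like this. This is a hedged sketch, not Talend-generated code; in a real job you would put it in a Talend routine and call it from the parallel branches before reading a file:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class FileClaim {
    // File names already claimed by some parallel branch.
    // A concurrent set makes claims safe across threads.
    private static final Set<String> claimed = ConcurrentHashMap.newKeySet();

    // Returns true the first time a file name is seen, false afterwards,
    // so each file is processed by exactly one branch.
    public static boolean claim(String fileName) {
        return claimed.add(fileName);
    }
}
```

Each branch would call FileClaim.claim(fileName) and skip the file when it returns false.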

 

If you implement all (or a combination) of the above, you should see performance improvements.


All Replies

Seven Stars

Re: Talend Job taking too much time for 30 million records

Thank you, Rhall.
Let me try all the options that you have suggested.


Best Regards,
Abhishek
Five Stars

Re: Talend Job taking too much time for 30 million records

Will it improve the performance if we set sendStringParametersAsUnicode to false?

Seven Stars

Re: Talend Job taking too much time for 30 million records

Hi Rhall,
Also, I am using the open-source jTDS JDBC provider. Should I use the Microsoft one? I mean, does it have any impact on the job performance?

Best Regards,
Abhishek


Sixteen Stars

Re: Talend Job taking too much time for 30 million records

Have you tried using the Microsoft one? I'm afraid I do not have an MS SQL database to try this on at the moment.

Seven Stars

Re: Talend Job taking too much time for 30 million records

Hi Rhall,
I have implemented the points you suggested and was able to improve the performance. We can now migrate 90 lakh (9 million) records in 20 minutes.

Best Regards,
Abhishek
Sixteen Stars

Re: Talend Job taking too much time for 30 million records

That is some improvement. Glad it worked :-)

Four Stars

Re: Talend Job taking too much time for 30 million records

I have tried what you mentioned but didn't see an improvement in the performance of the job. I want to load 4,000 records from one MSSQL database to another: 2,000 records will be updated and the rest will be inserted. This process has to be completed within 15 minutes.

Seven Stars

Re: Talend Job taking too much time for 30 million records

For info, I experienced this problem with updates and discovered that setting the parameter "sendStringParametersAsUnicode=false" on the MS SQL connection component resolved it.

 

The problem stems from the fact that the index in use was on a VARCHAR column, but the column was being queried as an NVARCHAR, which meant SQL Server ignored the index and did a full table scan.

Regards

David

Don't forget to give Kudos when an answer is helpful or the solution.
Four Stars

Re: Talend Job taking too much time for 30 million records

I added sendStringParametersAsUnicode=false in the Advanced Settings of the destination component. It is processing at a rate of 2 rows/sec. Please suggest a solution.
Seven Stars

Re: Talend Job taking too much time for 30 million records

The parameter needs to go in the "Additional Parameters" field when making the database connection.

Regards

David

Seven Stars

Re: Talend Job taking too much time for 30 million records

Also, are you inserting, updating, or both?

 

Make sure you split the inserts and updates into 2 flows and set each output component to either insert or update, not "insert or update".

Regards

David

Four Stars

Re: Talend Job taking too much time for 30 million records

I do both

Seven Stars

Re: Talend Job taking too much time for 30 million records

So your pattern should be something like:

 

t{DBtype}Input (work out existing rows from incoming data)
        |
t{something}Input -> tMap (join lookup to incoming data to split inserts & updates)
        |- output inserts -> t{DBtype}Output (Insert mode only)
        |- output updates -> t{DBtype}Output (Update mode only)
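The split in the pattern above - one lookup against existing keys, then separate insert and update flows - amounts to the following logic. This is a plain-Java sketch of what the tMap join does, with illustrative names, not code Talend generates:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class InsertUpdateSplit {
    public static class Split {
        public final List<String> inserts = new ArrayList<>();
        public final List<String> updates = new ArrayList<>();
    }

    // Rows whose key already exists in the target go to the update flow;
    // all other rows go to the insert flow.
    public static Split split(List<String> incomingKeys, Set<String> existingKeys) {
        Split s = new Split();
        for (String key : incomingKeys) {
            if (existingKeys.contains(key)) {
                s.updates.add(key);
            } else {
                s.inserts.add(key);
            }
        }
        return s;
    }
}
```

Each output component then runs in a single, simple mode, avoiding the per-row existence check that "insert or update" performs.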

Regards

David

Don't forget to give Kudos when an answer is helpful or the solution.
Sixteen Stars

Re: Talend Job taking too much time for 30 million records

2 rows a second is painfully slow. Is your target database local or in the cloud? How big are your rows? What is your read speed like (from your source system)? Can you carry out an insert with another tool any quicker (from the location that you are running Talend)?