I have created an ETL job in which there are 2 places where log entries are made. First in a text file and the second in a database table. This logging is different from the logging done through Talend itself (Stat, Log or FlowMeter).
For writing to the log file we have created a Talend Routine that gets called through the job itself.
For writing the entries to the database, we have created a separate job that is called throughout the ETL job as a subjob.
This job (let's call it log job) is a simple job as follows:
The logs_Input component gets the necessary data to make the log entry from the calling/main job.
The problem is that since we have added this subjob to the ETL job, the performance of the job has gone down drastically.
If the ETL job, without the subjob, would complete the execution in say 30 seconds then with the subjob added the job execution takes around 1 minute extra to complete. The log job is called multiple times throughout the ETL job so some performance degradation is expected. But we would still like to improve the performance and would like to know if there is another way to improve the performance.
If you use Talend enterprise version, prefere to use a joblet than an external job.
Due to the design, you have to pay a lot for each information you want to log.
Most expensive operations are:
- call the subjob (must start a new process, copy the context variables, etc.)
- connect to the database (if you don't share the connection between parent and child job)
- insert row into the log table
- commit the transaction
- return to the calling job
Multiply by the number of logged rows, you have the total cost.
If you use Talend community, copy/past the sequence into each job (I know, it's not fun) else, to a joblet.
Verify the database connection is open for the job.
If the gain is not sufficient, replace database transactions by tFileOutputDelimited (with buffer size = 1) and add Postjob sequence to write all the records from the CSV file to the database at once.
Also, push the messages from the job using tWarn and catch them using tLogCatcher.
Hope this helps.
without full picture it hard to give a full answer, but few general moments:
1) it is logs, so csv - is fasted way, do You really need database logs? (it just a question)
2) this is more serious - when You call SubJob You each time - create and drop environment, including connect to database. in best case You close connection, in worst - it will be dropped by timeout. it take a time, and it take time seriously
For improve p2:
1) try to stop use SubJob, open connection at the start of Job, use it on each log component, close connection at the end
2) try to play with shared connections between parent and child Job (it other have some restrictions)
3) as test variant - append log to csv file, and write into database at the end of parent Job
We are already maintaining the logs in a file. But as the file is not readily available to everyone, the client has asked for making entries in the database.
If I open a separate connection and use it for logging, then additional time will be taken by the job for creating the connection at the start and closing it at the end of the job.
I am now thinking of writing a Routine method that will perform the same task. Any idea if that will reduce the execution time since there won't be creation of a separate process or creating/dropping context.
You do not understand, what about connection - it was mentioned - 1 for all
but in any case if You already have csv - just insert all of them at the end of parent job into table
single operation, very fast
I would love to do that, unfortunately the client has the support team to tracking the log entries made in the database. So if I make all the entries at the very end they would not know if the job started or not until the very end.
About shared connection, I will look into it.
I tried your suggestion of using shared database connection. So this is what I did:
Now when I run the job, I get the following exception:
Child job returns 1. It doesn't terminate normally.
Exception in component tMSSqlConnection_1
java.sql.SQLException: The syntax of the connection URL 'jdbc:jtds:sqlserver://;appName=ParentJob;' is invalid.
But if I keep the Transmit whole context check box checked then there is no error. Any idea why this could be happening?
It is because you are missing populating a particular Context variable. Transferring the Context is not going to slow you down by any amount you would notice. If you have gone from loading logs in realtime to storing them in memory and dumping them to a DB using a tPostJob, you really have no need to save the milliseconds you might by not transferring the whole context.
My assumption is that your DB connection is configured using Context variables and that it is one or many of those that is not being passed. By all means just pass the relevant values, but I doubt you will see a tremendous difference if I'm honest.