My current project requires that I run several jobs 24/7. I have suggested (and currently implemented) that we bring the jobs down once a day, with a 5-minute gap before the next schedule kicks the jobs off again. My reasoning was that this lets the TAC logs flush and clears any JVM memory used by each job. The TAC logs get huge, so my concern is that we will run into more issues if we never bring the jobs down than if we use a more standard ETL approach. Is my reasoning correct, and what is the risk of letting the jobs run 24/7 with a stop/restart once per week?

My jobs are not set up in an infinite loop, but in a conditional loop based on runflags in a DB table. I have two jobs that set the runflag bit to 0 or 1 to stop/start the jobs respectively. Each job has a subscription name it uses to query the runflag status in this control table. The stop job (which sets all flags to 0) runs at 11 PM, and most jobs (there are currently 8 of them) usually come down within the first two minutes. At 11:05 PM, the 'start' job flips all the flags to 1, and the rest of the jobs are staggered starting at 11:06 PM so I don't cause a huge spike in CPU and memory during initialization. If a long-running DB issue occurs and a job loop misses the 5-minute 'downtime', the next schedule will not get kicked off anyway, since schedules are paused while a job is running, so I don't risk anything there.
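For anyone unfamiliar with the runflag pattern, here is a minimal, hypothetical sketch of the conditional loop described above. All names are invented, and an in-memory map stands in for the DB control table; the real jobs query a database and sleep 60 seconds per cycle:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the runflag pattern; names are illustrative.
// In the real jobs the flag lives in a DB control table keyed by
// subscription name -- a ConcurrentHashMap stands in for it here.
public class RunflagLoop {

    // Stand-in for the control table: subscription name -> runflag (0 or 1).
    static final Map<String, Integer> controlTable = new ConcurrentHashMap<>();

    // Conditional loop, not an infinite one: each pass re-checks the runflag
    // and exits cleanly once the 'stop' job has flipped it to 0.
    // Returns the number of CDC cycles completed before shutdown.
    static int runJob(String subscription, long sleepMillis) throws InterruptedException {
        int cycles = 0;
        while (controlTable.getOrDefault(subscription, 0) == 1) {
            cycles++;                  // one CDC pull cycle would run here
            Thread.sleep(sleepMillis); // the real jobs sleep 60 s between cycles
        }
        return cycles;
    }

    public static void main(String[] args) throws Exception {
        controlTable.put("ods_orders", 1); // the 'start' job flips flags to 1
        Thread stopJob = new Thread(() -> {
            try { Thread.sleep(50); } catch (InterruptedException ignored) {}
            controlTable.put("ods_orders", 0); // the 'stop' job flips flags to 0
        });
        stopJob.start();
        int cycles = runJob("ods_orders", 10);
        stopJob.join();
        System.out.println("job stopped cleanly after " + cycles + " cycles");
    }
}
```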
Any feedback/suggestions to the current process vs the stop/start once per week?
Also, on another note: we are using 6.3.1 and are currently seeing a bug in the TAC where the jobs end successfully (and are shown that way in the logs), but the TAC still sees them as 'running', effectively pausing the schedules until manual intervention. The only way to recover is to restart the TAC's Tomcat, but that causes the file trigger I have for one of my jobs to stop working until I delete the trigger and recreate it. This happens across all environments and usually occurs about once per month. Is anyone else seeing this same issue?
Your use case looks more like a realtime scenario to be implemented with services (routes/web services) that run 24/7. With the services paradigm, especially Mediation Routes, you can have the services running 24/7 without needing restarts.
The logic you described with DI works, but it feels clunky, and you have already elaborated on the issues.
A DI batch job is expected to run for a long time; however, it is also expected to finish at some point. The memory build-up will depend on the components used and the way they are composed into a job.
The challenge, as you put it, is the TAC logs. While your job is running, the logs are written on the JobServer. When the job finishes, i.e. the JVM process ends, the log is copied from the JobServer to the TAC. There is some streaming of logs while the job is running, but if the job never ends, the TAC never receives the completed log, and network fluctuations can cause the socket connections to close; that is why the TAC waits for the job to end, so it can transfer the whole log again. That may explain why you see the jobs finish successfully on the JobServer while they still show as 'running' in the TAC: a job is not considered finished until the complete log has been transferred to the TAC. If your logs are huge, that transfer over the network from your JobServer to your TAC takes time; your job process ends, then the logs get copied, and only when all that housekeeping is done does the job switch to a finished status in the TAC. If the log is huge and there are network fluctuations, the TAC will wait until the log is completely transferred.
Are your tasks in Job Conductor set to WAIT in the 'Unavailable JobServer' setting? If they are, you should set them to RESET, given the way you have set them up to loop every 5 minutes. You can read more about that in the documentation.
My recommendation is to do a proper service design with a Mediation Route and cTalendJob and have a true 24/7 system, instead of building a clunky workaround.
How many records do you process in these jobs every 5 mins? Or even per minute?
Thanks for your response!
7 of the 8 jobs are CDC jobs that read from 6 different sources into our ODS; the other is CDC from the ODS into our EDW. I have built a custom MS SQL Server CDC component that utilizes SQL Server's native CDC rather than the trigger-based CDC option Talend provides with the subscription. It is a very typical CDC setup that I have also implemented in SAP BODS and other ETL tools, so I chose it based on the historical success of this architecture. My clients only have the DI subscription, and I assume the Mediation Routes you are referring to are ESB components, which my clients have already decided they do not want to pay the price of the ESB runtime for. That said, can we migrate a typical DI CDC loop to a Mediation Route (CDC by nature is a pull, so there is nothing to trigger a push service request)? As far as volume: we process up to 5000 records in one CDC cycle for one source (the job sleeps for 60 seconds at the end of every cycle before checking the runflag status). Currently we only have 8 tables with CDC on them in PROD, but we will be adding 10 more in the next migration phase.
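For context, a native SQL Server CDC pull typically bounds a change window with `sys.fn_cdc_get_min_lsn`/`sys.fn_cdc_get_max_lsn` and reads `cdc.fn_cdc_get_all_changes_<capture_instance>`. A minimal sketch of the kind of query a custom component might issue over JDBC (the capture-instance name is illustrative, and this is not the actual component code):

```java
// Hypothetical helper showing the shape of a native SQL Server CDC pull,
// as a custom component like the one described above might issue it.
// The capture-instance name is illustrative, not from the actual project.
public class CdcQuerySketch {

    static String buildCdcQuery(String captureInstance) {
        // sys.fn_cdc_get_min_lsn / sys.fn_cdc_get_max_lsn bound the change
        // window; cdc.fn_cdc_get_all_changes_<instance> returns the changed
        // rows, ordered by LSN here to preserve the change sequence.
        return "DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('" + captureInstance + "');\n"
             + "DECLARE @to_lsn binary(10) = sys.fn_cdc_get_max_lsn();\n"
             + "SELECT * FROM cdc.fn_cdc_get_all_changes_" + captureInstance
             + "(@from_lsn, @to_lsn, N'all') ORDER BY __$start_lsn;";
    }

    public static void main(String[] args) {
        System.out.println(buildCdcQuery("dbo_orders"));
    }
}
```

In practice a component would persist the last LSN it processed and use that as `@from_lsn` on the next cycle, rather than re-reading from the minimum LSN every time.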
Your assessment is correct as far as the jobs being realtime (or, in my client's case, I expressly say 'near-realtime'). My client's customers expect data as close to realtime as they can get, and this is what was promised. As they are a global company, their customers span the globe, so there is no good window for a lengthy recycle of the jobs.
The interesting thing about the TAC bug is that it is not reproducible, yet it almost always occurs once a month. We have been in PROD for a month now, and we noticed this bug affected TEST and PROD on the exact same day (we do not run jobs in DEV on a schedule, except for design testing). This made me curious whether there is some sort of Talend update check for new releases that could be affecting the TAC. I don't believe in coincidences, especially since we started PROD up the same day we reset the TEST boxes, so the two environments seem to be synchronized right now as far as this bug is concerned...
What I would do is as follows:
- Build a route with a timer component that triggers every 60 seconds. Read the changes from the CDC and push each change as a message into a queue or topic. It will be fast and very simple. Once deployed, this route runs by itself, with no connection to the TAC.
- Build routes that read messages from the queue or topic and process one message at a time. With this solution, you can scale horizontally by deploying multiple instances of the routes (be careful, though, to ensure each instance reads different messages from the queue). Because you process one message at a time, you never use much memory, and with multiple instances you can easily scale to your needs.
Again, this is just an idea; there may be more requirements you need to look into, like whether there are dependencies on the order in which messages arrive. However, it is a more scalable architecture, and you can build as many services as you have source/target mappings. If your service fails, the message stays in the queue; you restart your service (or start another instance) and processing picks up where it left off, since you process one message at a time. It is a different paradigm, and it will also scale when you have 20K or 30K changes per minute: the messages will just accumulate, and I know people have built setups where, depending on the number of messages on the queue, they start additional instances of the routes to work through the backlog faster, then shut them down afterwards and keep just one instance. This is also a more cloud-friendly architecture if you ever decide to run in AWS, Azure or GCP.
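The two bullets above can be sketched with plain JDK concurrency primitives. This is an illustration of the paradigm only, under stated assumptions: in a real build the producer would be a Mediation Route with a timer component, and a JMS broker would replace the in-memory `BlockingQueue`:

```java
import java.util.List;
import java.util.concurrent.*;

// Stdlib simulation of the suggested architecture: a timed producer polls for
// CDC changes and enqueues each one; a single consumer processes one message
// at a time, which keeps memory flat and preserves change order.
public class CdcQueuePipeline {

    static List<String> run(int totalMessages) throws Exception {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        List<String> processed = new CopyOnWriteArrayList<>();

        // Producer: fires on a timer (the route's timer component), reads the
        // pending changes (stubbed as 3 per poll), pushes each onto the queue.
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        int[] nextLsn = {0};
        timer.scheduleAtFixedRate(() -> {
            for (int i = 0; i < 3; i++) {
                queue.offer("change-" + nextLsn[0]++);
            }
        }, 0, 20, TimeUnit.MILLISECONDS);

        // Consumer: takes one message at a time; if it dies, unread messages
        // simply stay on the queue for the next instance to pick up.
        Thread consumer = new Thread(() -> {
            try {
                while (processed.size() < totalMessages) {
                    processed.add(queue.take()); // blocks until a message arrives
                }
            } catch (InterruptedException ignored) {}
        });
        consumer.start();
        consumer.join();
        timer.shutdownNow();
        return processed;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(9)); // nine changes, consumed in LSN order
    }
}
```

With a single producer and a single consumer per queue, FIFO order is preserved, which matters for CDC; adding consumer instances trades that ordering guarantee for throughput.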
Your scenario is a realtime one.
I understand that there is a cost to the ESB Runtime, but it is minimal when you consider the cost to the business if you lose a customer because of that system. Also add the cost of restarting the TAC on a regular basis, of having people monitor over the weekend that it has restarted, and of the trouble if some other issue happens during the restart, etc. All of this "extra", unmeasured effort may add up to 3x the cost of an ESB runtime.
Regarding the TAC: since the bug affected both TEST and PROD at the same time, I would look at network monitoring and at any OS updates, Group Policy updates, antivirus updates, or manual activity on the server disks. Talend does not push updates to the TAC automatically; it requires manual intervention. Do you have the Platform edition, which enables you to cluster the TAC scheduler for scheduler HA?
Again, thanks for your time and your suggested solution.
I like the concept that you suggested and obviously with CDC, sequentially processing the records in order is paramount, or you lose the integrity of the change data. And while I appreciate the design of your suggestion, at the end of the day, I am a consultant and, as such, I have to work within the confines of the client's toolset and budget. DI is the backbone of all of their projects and they are not going to add ESB runtimes as overhead for a solution that is operating as designed in the current tool without sufficient reasoning to move. The bug we are seeing in the TAC is not related to the scheduling, but we are setting up a meeting with a few Talend engineers to take a look at the system setup and see what could be causing the issue.
Thanks for your time,