My current project requires that I run several jobs 24/7. I have suggested (and currently implemented) that we bring the jobs down once a day with a 5 minute gap before the next schedule kicks the jobs off again. My reasoning was that this lets the TAC logs be flushed and any JVM memory utilized by each job be reclaimed. The TAC logs get huge, so my concern is that we will run into more issues if we never bring the jobs down than if we use a more standard ETL approach. Is my reasoning correct, and what is the risk of letting the jobs run 24/7 with a stop/restart once per week?

My jobs are not set up in an infinite loop, but in a conditional loop based on runflags in a DB table. I have two jobs that set the runflag bit to 0 or 1 to stop or start the jobs respectively. Each job has a subscription name it uses to query the runflag status in this control table. The stop job (which sets all flags to 0) runs at 11 PM, and most jobs (there are currently 8 of them) usually come down within the first two minutes. At 11:05 PM, the 'start' job flips all the flags to 1, and the rest of the jobs are staggered starting at 11:06 PM so I don't cause a huge spike in CPU and memory during initialization. If a long-running DB issue occurs and a job loop misses the 5 minute 'downtime', the next schedule will not get kicked off anyway, since schedules are paused while a job is running, so I don't risk anything there.
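To make the control-table pattern concrete, here is a minimal sketch of the conditional loop in plain Java, with an in-memory map standing in for the runflag table and a cycle counter standing in for the CDC work. The class, the map, and the subscription name are illustrative assumptions, not the actual job code:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class RunflagLoop {
    // Stands in for the control table: subscription name -> runflag (1 = run, 0 = stop).
    static final ConcurrentMap<String, Integer> controlTable = new ConcurrentHashMap<>();

    // One conditional loop; in the real job each iteration would be a CDC cycle,
    // followed by a 60 second sleep before re-checking the runflag.
    static int runCycles(String subscription, int maxCycles) {
        int cycles = 0;
        while (controlTable.getOrDefault(subscription, 0) == 1 && cycles < maxCycles) {
            cycles++;                               // process one CDC batch here
            // Thread.sleep(60_000) between cycles in the real job
        }
        return cycles;
    }

    public static void main(String[] args) {
        controlTable.put("ods_customers", 1);       // the 'start' job flips flags to 1
        int done = runCycles("ods_customers", 3);
        controlTable.put("ods_customers", 0);       // the 'stop' job flips flags to 0
        System.out.println(done + " cycles ran; flag is now " + controlTable.get("ods_customers"));
    }
}
```

The point of the sketch is only that the loop exits cleanly on its own once the flag is flipped, which is what lets most of the 8 jobs come down within two minutes of the 11 PM stop job.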
Any feedback/suggestions to the current process vs the stop/start once per week?
Also, on another note: we are using 6.3.1 and are currently seeing a bug in the TAC where the jobs end successfully (and are shown that way in the logs), but the TAC still sees them as 'running', effectively pausing the schedules until manual intervention. The only way to recover is to restart the TAC's Tomcat, but that causes the file trigger I have for one of my jobs to stop working until I delete the trigger and recreate it. This happens across all environments and usually occurs about once per month. Anyone else seeing this same issue?
Your use case looks more like a realtime scenario to be implemented with services (routes/web services) that run 24/7. With the services paradigm, especially Mediation Routes, you can have the services running 24/7 without needing a restart.
The logic you have described with DI works, but it feels clunky, and you have already elaborated on the issues.
A DI batch job is expected to run for a long time; however, it is also expected to finish at some point. The memory build-up will depend on the components used and the way they are composed into a job.
The challenge, as you put it, is the TAC logs. While your job is running, the logs are written on the JobServer. When the job finishes, i.e. the JVM process ends, the log is copied from the JobServer to the TAC. There is some streaming of logs while the job is running, but if the job never ends, the TAC never receives the completed log, and any network fluctuation can cause the socket connections to close. That is why the TAC waits for the job to end before transferring the full log. This may explain why you see the jobs finish successfully on the JobServer while the TAC still shows them as running: the job is not considered finished until the complete log has been transferred to the TAC. If your logs are huge, that transfer over the network from your JobServer to your TAC will take some time; your job process ends, then the logs get copied, and only when all that housekeeping is done does the job switch to a finished status in the TAC. If the log is huge and there are network fluctuations, the TAC will wait until the log is completely transferred.
Are your tasks in Job Conductor set to WAIT under the 'On unavailable JobServer' setting? If they are, you should consider setting them to RESET instead, given the way you have set them up with your 5 minute looping. You can read more about that in the documentation.
My recommendation is to do a proper service design with Mediation Route, cTalendJob and have your 24/7 system instead of building a clunky solution.
How many records do you process in these jobs every 5 mins? Or even per minute?
Thanks for your response!
7 of the 8 jobs are CDC jobs that read from 6 different sources into our ODS. The other is CDC from the ODS into our EDW. I have built a custom MS SQL Server CDC component, so it utilizes the native CDC for SQL Server rather than the trigger-based CDC option that Talend provides with the subscription. It is a very typical CDC setup that I have also implemented in SAP BODS and other ETL tools, so I chose it based on the historical success of this architecture.

My clients only have the DI subscription, and I assume the Mediation Routes you are referring to are ESB components; my clients have already decided they do not want to pay the price for the ESB runtime. That being said, can we migrate a typical DI CDC loop to a Mediation Route (CDC by nature is a pull, so there is nothing to trigger a push service request)?

As far as volume: we process up to 5000 records per CDC cycle for one source (the job sleeps for 60 seconds at the end of every cycle before checking the runflag status). Currently, we only have 8 tables with CDC on them in PROD, but we will be adding 10 more in the next migration phase.
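For readers unfamiliar with native SQL Server CDC, the sketch below shows the kind of query a custom component like this might issue per capture instance, using the standard `sys.fn_cdc_get_min_lsn` / `sys.fn_cdc_get_max_lsn` / `cdc.fn_cdc_get_all_changes_<instance>` table functions. The capture instance name is hypothetical, and a real component would persist the last processed LSN between cycles rather than always reading from the minimum:

```java
public class CdcQueryBuilder {
    // Builds a native SQL Server CDC query for one capture instance.
    // cdc.fn_cdc_get_all_changes_<instance> and the sys.fn_cdc_* functions
    // are standard SQL Server CDC objects created when CDC is enabled on a table.
    static String allChangesQuery(String captureInstance) {
        return "DECLARE @from_lsn BINARY(10) = sys.fn_cdc_get_min_lsn('" + captureInstance + "');\n"
             + "DECLARE @to_lsn   BINARY(10) = sys.fn_cdc_get_max_lsn();\n"
             + "SELECT * FROM cdc.fn_cdc_get_all_changes_" + captureInstance
             + "(@from_lsn, @to_lsn, N'all')\n"
             + "ORDER BY __$start_lsn, __$seqval;";   // preserve change order
    }

    public static void main(String[] args) {
        System.out.println(allChangesQuery("dbo_Customers"));
    }
}
```

The ORDER BY on `__$start_lsn, __$seqval` matters here, since applying CDC changes out of order would break the integrity of the change data.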
Your assessment is correct as far as the jobs being realtime (or, in my clients' case, I expressly state 'near-realtime'). My clients' customers expect data that is as close to realtime as they can get, and this is what was promised. With them being a global company, their customers span the globe, so there is no good time for a lengthy recycle of the jobs.
The interesting thing about the TAC bug is that it is not reproducible, yet it almost always occurs once a month. We have been in PROD for a month now, and we noticed this bug affected TEST and PROD on the exact same day (we do not run jobs in DEV on a schedule except for design testing). This made me curious whether there is some sort of Talend update check for new releases that could be having an effect on the TAC. I don't believe in coincidences, especially since we started PROD up the same day we reset the TEST boxes, so the two environments seem to be synchronized right now as it relates to the bug...
What I would do is as follows:
- Build a route with a timer component that triggers every 60 seconds. Read the changes from the CDC and push each change as a message onto a queue or topic. It will be fast and very simple. Once deployed, this route runs by itself, with no connection to the TAC.
- Build routes to read messages from the queue or topic and process 1 message at a time. With this solution, you can scale horizontally by deploying multiple instances of the routes (be careful, though, to have each instance read different messages from the queue). Because you process 1 message at a time, you never use too much memory, and with multiple instances you can easily scale this to your needs.
Again, this is just an idea. There may be more requirements you need to look into, like whether there are dependencies on the order in which messages arrive. However, this is a more scalable architecture, and you can build as many services as you have source/target mappings. If your service fails, the message stays in the queue; you restart your service or start another instance, and processing picks up from where it left off, since you are processing 1 message at a time. It is a different paradigm. It will also scale when you have 20K or 30K changes per minute: your messages will just accumulate, and I know people have developed complex setups where, depending on the number of messages on the queue, they start additional instances of the routes to process the messages faster, then shut them down at the end and keep just 1 instance. This is also a more cloud-friendly architecture if you ever decide one day to run in AWS, Azure or GCP.
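As a rough illustration of the producer/consumer idea above, here is a plain-Java sketch, not Camel route code: a `BlockingQueue` stands in for the queue or topic, a thread stands in for the timer route, and a poison message stands in for shutdown (all of which are demo assumptions):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class QueueSketch {
    // Consumer side: process one message at a time until a shutdown marker arrives.
    static List<String> drain(BlockingQueue<String> queue) throws InterruptedException {
        List<String> processed = new ArrayList<>();
        for (String msg; !(msg = queue.take()).equals("POISON"); ) {
            processed.add(msg);              // apply the change to the target here
        }
        return processed;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();

        // Producer side: stands in for the timer route that polls CDC every
        // 60 seconds and pushes each change onto the queue or topic.
        Thread producer = new Thread(() -> {
            for (int i = 1; i <= 5; i++) queue.add("change-" + i);
            queue.add("POISON");             // shutdown marker for this demo only
        });
        producer.start();

        System.out.println(drain(queue));
    }
}
```

Because the consumer holds only one message at a time, memory stays flat no matter how many changes accumulate on the queue, which is the property that makes the horizontal-scaling story above work.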
Your scenario is a realtime one.
I understand that there is a cost to the ESB runtime, but it is minimal when you consider the cost to the business if you lose a customer because of that system. Also, add the cost of restarting the TAC on a regular basis, having people monitor over the weekend that it has restarted, and the challenge if some other issue happens during the restart, etc. All of this "extra", unmeasured effort may add up to 3x the cost of an ESB runtime.
Regarding the TAC: since it affected both TEST and PROD at the same time, I would look at network monitoring and any OS updates / Group Policy updates / antivirus updates / manual activity on some server disks, etc. Talend does not push updates to the TAC automatically; that requires manual intervention. Do you have the Platform edition, which enables you to cluster the TAC scheduler for scheduler HA?
Again, thanks for your time and your suggested solution.
I like the concept that you suggested and obviously with CDC, sequentially processing the records in order is paramount, or you lose the integrity of the change data. And while I appreciate the design of your suggestion, at the end of the day, I am a consultant and, as such, I have to work within the confines of the client's toolset and budget. DI is the backbone of all of their projects and they are not going to add ESB runtimes as overhead for a solution that is operating as designed in the current tool without sufficient reasoning to move. The bug we are seeing in the TAC is not related to the scheduling, but we are setting up a meeting with a few Talend engineers to take a look at the system setup and see what could be causing the issue.
Thanks for your time,
Concerning your TAC issue: I've seen the same issue in 6.1.1, and we have just upgraded to 6.4.1. I have a file trigger that works great, but after an outage or some other scenario where the TAC is restarted, some of the file triggers stop working. They still fire every 15 minutes as scheduled and the job looks good, but in reality they never see the file sitting there waiting to be processed. The only work-around is, as you said, to delete the trigger and re-create it.
Let me know what the engineers say about this issue. The fact it just stops working and files stack up is very concerning.
I've just seen this and have to agree with what @iburtally said, your client is using the wrong tool for a real time solution. This is essentially the same as using an F1 car to pull a caravan across Europe. It might work for a while, but it won't be reliable. This should be explained to your client.
If you want to prove the solution Irshad suggested without incurring an immediate cost, you can always download the open source ESB solution and try it out. Your existing jobs should import easily into the ESB version (so long as the version is the same or newer). You will obviously not be able to run it on your current TAC, but you will be able to see how much more efficient his solution is.
With regard to the TAC issue, it does seem odd that this happens at the same time in separate environments. I assume that they are completely isolated from each other? As Irshad said, this could be an environment issue where something is happening to trigger the problem; however, I assume you have looked into that. The other thing I have seen is an issue with the default Tomcat memory configuration: the default allocation isn't a great deal at all. I always raise it in my environments, and I have seen Tomcat errors because of it. I'd suggest taking a look into that.
If you have a need to automate the configuration of things in the TAC (your trigger, for example), the TAC provides the MetaServlet API (https://help.talend.com/reader/rJGzSCBb8MvnaZHhs978KQ/PMoHeNdt5qac07VehVViDA), which is priceless for this sort of thing. Its REST interface is very easy to use and enables a completely new way of interacting with the TAC. I currently have a solution where I am using the ESB functionality to dynamically interact with the TAC in real time via this API. Even if you do not use the ESB in the end, the MetaServlet may well help with some of the problems you are experiencing.
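For reference, a MetaServlet call is a single base64-encoded JSON action appended to the servlet URL. The sketch below builds such a URL in plain Java; the host, credentials, action payload and task id are hypothetical placeholders, so check the linked documentation for the exact action names and fields your TAC version supports:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class MetaServletUrl {
    // Builds a MetaServlet call URL: the TAC expects one base64-encoded
    // JSON document appended to the metaServlet endpoint as the query string.
    static String buildUrl(String tacBase, String jsonAction) {
        String encoded = Base64.getEncoder()
                .encodeToString(jsonAction.getBytes(StandardCharsets.UTF_8));
        return tacBase + "/metaServlet?" + encoded;
    }

    public static void main(String[] args) {
        // Hypothetical host, credentials and task id, for illustration only.
        String json = "{\"actionName\":\"runTask\",\"authUser\":\"admin@example.com\","
                    + "\"authPass\":\"secret\",\"taskId\":42,\"mode\":\"synchronous\"}";
        System.out.println(buildUrl("http://tac-host:8080/org.talend.administrator", json));
    }
}
```

Once you can build these URLs, scripting actions like re-creating a broken file trigger after a Tomcat restart becomes a simple HTTP GET from any scheduler or job.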
Thanks for the responses...
Regarding the original error, we found the cause. It turns out the TAC repositories for both PROD and TEST were on the same SQL Server. Not sure how that happened or got overlooked, but we moved them to their respective environment DBs. Apparently the DB the two repos were on was more of a sandbox-type environment and is somewhat volatile. Not exactly where you want your TAC DB residing...
As to the current architecture, I agree for the most part, but I certainly would not call it the wrong tool for the solution. Talend (now that it is configured correctly) handles this easily with no issues. The JobServer memory looks surprisingly good over the course of a 23 hour run. Aside from the one issue, which ended up being a DB issue, it has been rock solid.