We had the following outage:
* On Monday morning, I noticed that we could not connect to the TAC server.
* I restarted the TAC server: /etc/init.d/talend-tac-6.3.1 restart
* No jobs had run for a day, since Sunday.
* I noticed that the CPU load on one job server was high. There was an issue on this job server. Rebooting this job server fixed the issue.
* The cause:
1. Job server ONE had an issue;
2. This caused the TAC server to hang;
In catalina.log, I saw a lot of errors like this one:
"11-Nov-2017 12:49:23.760 WARNING [pool-6-thread-1] org.drools.persistence.jta.JtaTransactionManager.commit Unable to commit transaction
bitronix.tm.internal.BitronixRollbackException: transaction timed out and has been rolled back
at org.drools.persistence.jta.JtaTransactionManager.commit(JtaTransactionManager.java:226) "
3. After that, no jobs were triggered, even though I have other job servers.
* In my setup, the TAC and the job servers run on separate, independent servers, and the job servers are not in cluster mode. (BTW, I'm not sure which product includes cluster mode.)
We have OS-level and DB-level monitoring, but neither picked up this issue. My question is, in general, how do I monitor the health of the TAC server and the job servers?
Have you already checked Talend Activity Monitoring in the Talend Help Center to see if it is what you are looking for?
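If that does not cover it, a small external check run from cron may also help. Below is a minimal sketch in Python, not an official Talend tool. It does two things: it tests whether the TAC and each job server accept a TCP connection, and it counts lines in a log file that mention the BitronixRollbackException from your catalina.log excerpt. The host names, the ports (8080 for TAC, 8000 for the job server) and the names `TARGETS`, `port_open`, `count_marker` and `check_all` are all assumptions for illustration; replace them with the values from your own installation.

```python
import socket
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical hosts and ports (assumptions): replace with your own values.
TARGETS: Dict[str, Tuple[str, int]] = {
    "tac": ("tac.example.com", 8080),
    "jobserver-1": ("js1.example.com", 8000),
}

# Marker taken from the catalina.log excerpt in the question.
ERROR_MARKER = "BitronixRollbackException"


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def count_marker(lines: Iterable[str], marker: str = ERROR_MARKER) -> int:
    """Count the lines that contain the marker string."""
    return sum(1 for line in lines if marker in line)


def check_all(
    targets: Dict[str, Tuple[str, int]],
    log_path: Optional[str] = None,
) -> List[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for name, (host, port) in targets.items():
        if not port_open(host, port):
            problems.append(f"{name}: cannot connect to {host}:{port}")
    if log_path is not None:
        with open(log_path, errors="replace") as fh:
            hits = count_marker(fh)
        if hits:
            problems.append(f"{log_path}: {hits} line(s) mention {ERROR_MARKER}")
    return problems
```

You would call `check_all(TARGETS, log_path)` with the full path to your catalina.log and send an alert whenever the returned list is not empty. Two limits to keep in mind: a hung process can still accept TCP connections, so an open port does not prove the server is healthy, and the log count covers the whole file, so you would need to remember the previous count (or read only new lines) to alert on fresh errors only.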