
[OracleConnection] Issue with 24/7 Webservice

Hi everyone,

 

I'm facing quite a strange issue on Talend OS ESB 6.1.1 when using Webservices in "Keep Listening" mode.

The problem is not related to Oracle, nor to a firewall, hardware, or communication issue. Here is what I've come up with so far. If you're not ready to read a full analysis, you can ignore my post :D

We investigated for around 10 full working days before coming to these conclusions and questions.

 

What is done :

 

In order to have a "safe" Webservice, we protected each and every Oracle connector with an "On Component Error" trigger leading to a sequence that closes and re-opens the connections. So if the Database fails to answer a request or simply crashes, the job renegotiates all connections. It also protects the system against a forced kill coming from Oracle (just in case).
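To make the idea concrete outside the Talend designer, here is a minimal Java sketch of the same close-and-reopen pattern. It is not what Talend generates: `ReconnectSketch`, `Reconnecting`, `Work` and `FakeSession` are hypothetical names I made up, and the fake session only stands in for a real Oracle session so the example runs without a database.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

final class ReconnectSketch {

    /** Work that needs an open resource and may throw (JDBC calls throw SQLException). */
    interface Work<R, T> {
        T apply(R resource) throws Exception;
    }

    /** Holds one resource; on failure it closes it, reopens it, and retries once. */
    static final class Reconnecting<R extends AutoCloseable> {
        private final Supplier<R> opener; // stands in for the "open connection" step
        private R current;
        private int reopenCount;

        Reconnecting(Supplier<R> opener) {
            this.opener = opener;
            this.current = opener.get();
        }

        <T> T call(Work<R, T> work) {
            try {
                return work.apply(current);
            } catch (Exception firstFailure) {
                reopen(); // plays the role of the "On Component Error" branch
                try {
                    return work.apply(current);
                } catch (Exception secondFailure) {
                    throw new IllegalStateException("still failing after reconnect", secondFailure);
                }
            }
        }

        private void reopen() {
            try {
                current.close();
            } catch (Exception ignored) {
                // The old session may already be dead; nothing more to do with it.
            }
            current = opener.get();
            reopenCount++;
        }

        int reopenCount() {
            return reopenCount;
        }
    }

    /** Fake session for the demo only: session number 1 behaves as if it was killed. */
    static final class FakeSession implements AutoCloseable {
        private final int id;

        FakeSession(int id) {
            this.id = id;
        }

        String query(String sql) {
            if (id == 1) {
                throw new IllegalStateException("session killed");
            }
            return "session " + id + " ran: " + sql;
        }

        @Override
        public void close() {
            // Nothing to release in the fake.
        }
    }

    public static void main(String[] args) {
        AtomicInteger ids = new AtomicInteger();
        Reconnecting<FakeSession> db =
                new Reconnecting<>(() -> new FakeSession(ids.incrementAndGet()));
        System.out.println(db.call(s -> s.query("SELECT 1 FROM DUAL")));
        System.out.println("reopened " + db.reopenCount() + " time(s)");
    }
}
```

The single retry is a deliberate simplification; in the real job every Oracle connector has its own error link going to the shared close/re-open sequence.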

You can find images in my original post (I couldn't find it here): https://www.talendforge.org/forum/viewtopic.php?id=54471

 

What is tested :

 

We identified three different scenarios. But why three, you may ask?

The problem is that an OracleConnection component doesn't create a single connection, but a number of sessions that we can't control through the JDBC connector, I assume to load-balance the workload.

So here are the three options we detected :

  • One of the channels is killed (kill immediate in Oracle, the equivalent of kill -9 if you're familiar with shell)
  • The whole Database or node crashes on an error such as :
    • database memory problem (PGA/SGA)
    • disk usage full
    • hardware failure
  • A node of the Database is properly shut down (we have 2 nodes, but it doesn't matter here whether your Database is simpler or more complicated)

 

In the first scenario, suppose there are 5 channels open, Talend is currently using n°2, and we kill channel n°4 to trigger the event.

Talend will use channels 2 and 3 normally, but will immediately trigger "On Component Error" on the 4th channel. In my system, this closes all current OracleConnection components and re-opens all connections right after (first image in my link).

For your information, I'm using "On Component Error" because "On Subjob Error" is not triggered immediately. In the previous scenario, it would have waited for the status of channels 5 and 1. Only then does it sum up the state of the connector (1 of the 5 channels is down) and trigger "On Subjob Error", which is often too late.
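If it helps to picture the timing difference, here is a toy Java model of what I described. It only reproduces the behaviour we observed (round-robin over the channels, one dead channel); it is not Talend's internal logic, and `TriggerTimingSketch` and `visitedBeforeTrigger` are names I invented for the illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class TriggerTimingSketch {

    /**
     * Returns the channels (numbered from 1) that are visited, round-robin from
     * {@code start}, before the error handler fires.
     *
     * perComponent = true  models "On Component Error": fires on the dead channel.
     * perComponent = false models "On Subjob Error": fires after the full round.
     */
    static List<Integer> visitedBeforeTrigger(int channels, int start, int dead, boolean perComponent) {
        List<Integer> visited = new ArrayList<>();
        for (int i = 0; i < channels; i++) {
            int channel = (start - 1 + i) % channels + 1;
            visited.add(channel);
            if (perComponent && channel == dead) {
                return visited; // handler fires as soon as the dead channel is hit
            }
        }
        return visited; // handler fires only once every channel has reported
    }

    public static void main(String[] args) {
        // 5 channels, Talend currently on n°2, channel n°4 killed.
        System.out.println("On Component Error after: " + visitedBeforeTrigger(5, 2, 4, true));
        System.out.println("On Subjob Error after:    " + visitedBeforeTrigger(5, 2, 4, false));
    }
}
```

With the numbers from my scenario, the component-level trigger fires after channels 2, 3 and 4, while the subjob-level one also goes through 5 and 1 first.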

 

This one has been tested so many times that I've lost count.

 

The second scenario is much the same. The only difference is that it triggers "On Component Error" as soon as the Database fails to respond, which is quite understandable.

I won't explain the same scenario twice ;)

 

Last but not least: the proper shutdown.

So that you understand the Oracle configuration: we have the latest version of Oracle 11gR2 on an ODA (Oracle Database Appliance) with 2 nodes and 2 services (so potentially 4 paths). All of that Oracle setup is accessed through a SCAN address, which is like having one single name such as "DB1". Then you ask for a service, "SERV1" or "SERV2", for Production and Non-Production. Each of these services can use the performance of both nodes thanks to the ODA (using Oracle GRID).

To sum up: we have one simple name to address the database, and nothing in the middleware is at fault.
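For reference, this is roughly what such an address looks like from the JDBC side, as a small Java sketch. The `jdbc:oracle:thin:@//host:port/service` form is the thin driver's service-name syntax; the port 1521 is only the usual default and an assumption on my part, "DB1" and "SERV1" are the placeholder names from above, and `ScanUrlSketch` is a made-up helper.

```java
final class ScanUrlSketch {

    /** Builds a thin-driver URL that targets a service through the SCAN name. */
    static String thinUrl(String scanName, int port, String service) {
        return "jdbc:oracle:thin:@//" + scanName + ":" + port + "/" + service;
    }

    public static void main(String[] args) {
        // Port 1521 is assumed here; use whatever the SCAN listener really exposes.
        System.out.println(thinUrl("DB1", 1521, "SERV1"));
        System.out.println(thinUrl("DB1", 1521, "SERV2"));
    }
}
```

The client never names a node: it only knows the SCAN name and the service, and the cluster decides which node serves the session.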

 

Now our real problem: when you shut down a node properly, for maintenance for example, Talend doesn't detect it as a failure and continues to send requests to this channel.

It seems to consider that Oracle is just "late" and doesn't take the request failure into account.
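To show what I mean by "late" versus "failed", here is a minimal Java sketch that turns a late answer into an explicit failure by putting a deadline on the request. It is only an illustration with plain `java.util.concurrent`, not something Talend does or exposes as far as I know; `LateVersusFailedSketch`, `answersInTime` and the timeout values are assumptions of mine.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class LateVersusFailedSketch {

    /** Runs the request; returns false if it throws or does not answer before the deadline. */
    static boolean answersInTime(Callable<?> request, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<?> pending = executor.submit(request);
        try {
            pending.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException late) {
            pending.cancel(true); // "late" is treated exactly like a failure
            return false;
        } catch (ExecutionException failed) {
            return false; // the request itself threw
        } catch (InterruptedException interrupted) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A request that answers at once, and one that hangs like a node being shut down.
        System.out.println("healthy: " + answersInTime(() -> "ok", 1000));
        System.out.println("hanging: " + answersInTime(() -> {
            Thread.sleep(5000);
            return "too late";
        }, 200));
    }
}
```

In this model the hanging request is reported as a failure after the deadline, which is the behaviour I would expect an error trigger to give me on a proper shutdown.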

 

I'm currently handling all the errors triggered by Talend, and I'm out of solutions to catch this one. And most of all, I wonder why it's acting like that...

 

If you survived my whole post, congratulations, and thank you for reading. I hope you'll be able to help me!!

 

 

Sincerely,
