
[OracleConnection] Issue with 24/7 Webservice

Hi everyone,

 

I'm facing a rather strange issue on Talend OS ESB 6.1.1 when using Webservices in "Keep Listening" mode.

The problem is not related to Oracle, nor to a firewall, hardware or communication issue. Here is what I've come up with so far. If you're not ready to read a full analysis, you can ignore my post.

We investigated for around 10 full days of work before coming to these conclusions and questions.

 

What is done :

 

In order to have a "safe" Webservice, we protected each and every Oracle connector with an "On Component Error" trigger leading to a sequence that closes and re-opens the connections. So if the Database fails to answer a request or simply crashes, the job renegotiates all connections. It also protects the system against forced kills coming from Oracle (just in case).
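
For readers who prefer code to screenshots, here is a minimal plain-Java sketch of that "close everything, then re-open everything" pattern. It is not the code Talend generates: the class name `ConnectionGroup`, the `factory` parameter and the fixed group size are assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

/** Hypothetical helper: owns a fixed number of connections and recycles all of them at once. */
final class ConnectionGroup<C extends AutoCloseable> {
    private final Callable<C> factory; // how to open one connection (assumption)
    private final int size;            // how many connections the group holds (assumption)
    private final List<C> open = new ArrayList<>();

    ConnectionGroup(Callable<C> factory, int size) throws Exception {
        this.factory = factory;
        this.size = size;
        openAll();
    }

    private void openAll() throws Exception {
        for (int i = 0; i < size; i++) {
            open.add(factory.call());
        }
    }

    /** Closes every connection, ignoring close failures, then re-opens the full set. */
    void resetAll() throws Exception {
        for (C connection : open) {
            try {
                connection.close();
            } catch (Exception ignored) {
                // A killed session may fail on close; keep closing the others.
            }
        }
        open.clear();
        openAll();
    }

    /** Returns a snapshot of the currently open connections. */
    List<C> connections() {
        return new ArrayList<>(open);
    }
}
```

The point of swallowing the exception inside the loop is that one dead session must not prevent the remaining ones from being closed, otherwise they stay alive and keep occupying processes on the DB.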

You can find the images in my original post (I couldn't find them here): https://www.talendforge.org/forum/viewtopic.php?id=54471

 

What is tested :

 

We identified three different scenarios. But why 3, you may ask?

The problem is that OracleConnection components don't create a single connection, but an uncontrollable number of sessions through the JDBC connector, I assume in order to load-balance the workload.

So here are the three options we detected :

  • One of the channels is killed (kill immediate in Oracle, the equivalent of kill -9 if you're familiar with the shell)
  • The whole Database or node crashes on an error such as :
    • database memory problem (PGA/SGA)
    • disk usage full
    • hardware failure
  • A node of the Database is properly shut down (we have 2 nodes, but the Database could be simpler or more complicated, it doesn't matter here)

 

In the first scenario, suppose 5 channels are open, Talend is currently using channel n°2, and we kill channel n°4 to trigger the event.

Talend will use channels 2 and 3 normally, but will trigger "On Component Error" immediately on the 4th channel. In my system, this closes all current OracleConnection components and re-opens all connections right after (first image in my link).

For your information, I'm using "On Component Error" because "On Subjob Error" is not triggered immediately. In the previous scenario, it would have waited for the status of channels 5 and 1. Only after that does it sum up the state of the connector (1 of the 5 channels is down) and trigger "On Subjob Error", which is often too late.

 

This scenario has been tested more times than I can count.

 

The second scenario is much the same. The only difference is that it triggers "On Component Error" as soon as the Database fails to respond. That's quite understandable.

I won't explain the same scenario twice.

 

Last but not least : the proper shutdown.

To make sure you understand the Oracle configuration: we have the latest version of Oracle 11gR2 on an ODA (Oracle Database Appliance) with 2 nodes and 2 services (so potentially 4 paths). All of this is accessed through a SCAN address, which is like having one single name such as "DB1". You then ask for a service, "SERV1" or "SERV2", for Production and Non-Production. Each of these services can use the performance of both nodes thanks to the ODA (using Oracle GRID).

To sum up: we have one simple name to address the database, and nothing in the middleware is at fault.

 

Now our real problem: when you shut down a node properly, for maintenance for example, Talend doesn't detect it as a failure and continues to send requests to this channel.

It seems to consider that Oracle is just "late" and does not treat the unanswered requests as failures.
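
One generic way to tell "late" from "gone" is to probe the connection with a bounded timeout before using it. The sketch below uses only the standard JDBC calls `Connection.isClosed()` and `Connection.isValid(int)`. The class name `ConnectionProbe` and the method `isUsable` are hypothetical, how the `java.sql.Connection` is obtained inside a Talend job is left out, and I have not verified that this catches the clean node shutdown described above.

```java
import java.sql.Connection;
import java.sql.SQLException;

/** Hypothetical helper: checks a JDBC connection with a bounded timeout. */
final class ConnectionProbe {
    private ConnectionProbe() {
    }

    /**
     * Returns true only if the connection is open and answers the driver's
     * validity check within timeoutSeconds; any SQLException counts as "not usable".
     */
    static boolean isUsable(Connection connection, int timeoutSeconds) {
        if (connection == null) {
            return false;
        }
        try {
            return !connection.isClosed() && connection.isValid(timeoutSeconds);
        } catch (SQLException e) {
            // Treat a failing probe as a dead connection rather than a slow one.
            return false;
        }
    }
}
```

A caller could run `ConnectionProbe.isUsable(conn, 5)` before each request and fall back to the close and re-open sequence when it returns false, so that a silent channel is handled the same way as a component error.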

 

I'm currently handling every error raised by Talend, and I've run out of ideas for catching this one. Above all, I wonder why it behaves this way.

 

If you survived my whole post, congratulations, and thank you for reading. I hope you'll be able to help me!

 

 

Sincerely,

3 REPLIES
Moderator

Re: [OracleConnection] Issue with 24/7 Webservice

Hello,

How can we reproduce your issue step by step? Could you please create a Jira issue on the Talend bug tracker?

https://jira.talendforge.org/secure/Dashboard.jspa

 

Best regards

Sabrina


Re: [OracleConnection] Issue with 24/7 Webservice

Hi Sabrina, and sorry for the delay.

 

I can open a JIRA issue, but this is really long and complex to reproduce, mainly because of these two issues:

  • We have an Oracle RAC Database under SCAN (this is the Qual and Prod environment) and a simple Oracle 11gR2 in Dev, so we need to choose a structure
  • You need to build the whole project and create a complex Database schema to reproduce our problem (3 schemas involved)

 

So, I could take the time to open an issue on JIRA, it's up to me (and only for my interest ^^) but I wonder if you'll be able to reproduce the same situation "exactly".

I'll try to build a simpler pattern, easily reproducible by your teams, to highlight the problem.

 

Question: is there anything else I need to provide when opening the issue?

 

I'll update this post with the JIRA link once the issue is open, and keep it updated.

 

 

Sincerely,

 


Re: [OracleConnection] Issue with 24/7 Webservice

Hello,

 

Update on the subject !!

 

I managed to create a small Webservice in Keep-listening mode, with my pattern :

[Image: Job after Oracle Session kill and a call on the Webservice]

As you can see, as soon as the flow reaches the tMap, it breaks with an error and cuts the flow, but it is not reported as a "Subjob Error". It's considered a Component Error only.

To close the connections, I put an "OnSubjobOk" trigger, because when several connections are open, only one breaks and you still need to close the others. Otherwise they stay alive and occupy processes on the DB. By the way, I wasn't able to reproduce the multi-channel problem.

 

The Provider Request seems to override any error that does not come from itself, which is ... problematic, because this extends to the whole job, and not only the first subjob.

 

I have many other observations. For example, if you put 3 tMap components after the first one, each doing a lookup, they won't break, because the flow never reaches them ^^

Same thing with a tReplicate. It goes top to bottom, but stops at the first error encountered...

 

Sabrina, I'll open a JIRA issue this Thursday, but I think you can reproduce this one easily, maybe find something, and add it to the JIRA issue. I'll provide you with the link.

 

Sincerely,