Understanding and using performance metrics in Talend ESB Web Services

Scenario

This article demonstrates how to monitor the performance of Web Services deployed in Talend Runtime. It explains the statistics generated by Talend Web Services and shows how to leverage these metrics, using JMX and Nagios, to improve the quality of your services.

 

Background

The Metrics feature of the CXF management module provides aggregate statistics, such as response times, endpoint status, and throughput, for Web Services running in the CXF Bus. Understanding these metrics facilitates the development of high-performance Web Services and, on production systems, helps you find problems at an early stage and troubleshoot them faster.

 

Prerequisites

  1. Linux distribution—this example uses Ubuntu 17.10

  2. Active Internet connection — to download dependencies and installation packages

  3. Talend ESB — installed

  4. DemoService (the ESB demo sample in Studio) is deployed successfully

  5. JConsole — for monitoring the metrics using JMX

  6. Nagios installed and configured with ESB templates for monitoring (optional)

 

Monitoring Web Services using Nagios (optional)

Nagios is an open source monitoring solution that allows users to identify infrastructure problems before they affect important business processes. Nagios monitors the entire IT infrastructure to ensure that services, applications, and business processes are working as expected. Talend ESB can also be monitored using Nagios.

 

Prerequisites for monitoring Talend ESB with Nagios

  1. Install Nagios and Jmx4Perl.

  2. Configure the CXF template files (cxf.cfg, cxf-host.cfg, and jmx_commands.cfg) delivered with Talend Runtime in the TalendRuntimePath/add-ons/adapters/nagios directory. For details, see Talend ESB Nagios configuration template files.

Note: The CXF template files, located in the cxf_templates_nagios.zip file attached to this article, are customized for this demonstration. If you use these templates, modify the Critical and Warning status parameters to reflect your business needs.

 

Host monitoring using Nagios

Monitor host_esb to display the statistics of the host, for example, host status and uptime.

Image: Monitoring Talend Runtime Server using Nagios (Host_Monitoring.jpg)

 

Monitoring Web Services using JMX

  1. If you are not using a monitoring solution like Nagios, you can monitor the JMX metrics using JConsole.

  2. If your Talend Runtime is installed on a remote server, modify the settings in the container/etc/SERVICE_NAME-wrapper.conf file (for example, Talend-Runtime-wrapper.conf or karaf-wrapper.conf) to enable remote JMX monitoring:

    # Uncomment or add the lines below to enable remote JMX
    wrapper.java.additional.10=-Dcom.sun.management.jmxremote.port=1616
    wrapper.java.additional.11=-Dcom.sun.management.jmxremote.authenticate=false
    wrapper.java.additional.12=-Dcom.sun.management.jmxremote.ssl=false
    wrapper.java.additional.13=-Djava.rmi.server.hostname=<HOST_NAME_OR_IP_ADDRESS_OF_RUNTIME>
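
    Note: After restarting the container so that the wrapper settings take effect, you can point JConsole at <HOST_NAME_OR_IP_ADDRESS_OF_RUNTIME>:1616. Because authentication and SSL are disabled in this example, restrict access to this port to trusted networks.
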
  3. Open JConsole, and navigate to MBeans > Metrics.Server.

    Image: Metrics MBeans for DemoService (Jmx_Metrics_Attibutes.jpg)

 

Using the Metrics feature in Talend ESB

The Metrics feature in Talend ESB is implemented using the Codahale (Dropwizard) Metrics library, and provides several commonly used metric types, such as Meter, Counter, Timer, and Histogram, to output metric values. Before analyzing these metrics, you need to know some basics about the metric types.
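
As a minimal sketch of how these metric types behave, the following uses the Codahale (Dropwizard) Metrics API directly; in Talend ESB, CXF creates and registers these metrics for you, so the registry, the metric names, and the metrics-core 3.x JmxReporter below are illustrative assumptions only:

    import com.codahale.metrics.Counter;
    import com.codahale.metrics.Histogram;
    import com.codahale.metrics.JmxReporter;
    import com.codahale.metrics.Meter;
    import com.codahale.metrics.MetricRegistry;
    import com.codahale.metrics.Timer;

    public class MetricTypesDemo {
        public static void main(String[] args) throws InterruptedException {
            MetricRegistry registry = new MetricRegistry();

            // Publish every metric in this registry as a JMX MBean
            JmxReporter reporter = JmxReporter.forRegistry(registry).build();
            reporter.start();

            Meter requests = registry.meter("demo.requests");     // count and rates
            Counter inFlight = registry.counter("demo.inFlight"); // increments and decrements
            Timer responses = registry.timer("demo.responses");   // durations, plus Meter/Histogram stats
            Histogram sizes = registry.histogram("demo.sizes");   // distribution of long values

            requests.mark();                                 // one event occurred
            inFlight.inc();                                  // a request entered processing
            try (Timer.Context ctx = responses.time()) {     // measure a unit of work
                Thread.sleep(50);                            // simulated processing time
            }
            inFlight.dec();                                  // the request finished
            sizes.update(1024);                              // record an observed value
        }
    }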

 

  • Meter: measures the count and rate of event occurrences. A meter metric measures the mean throughput and the one-, five-, and fifteen-minute exponentially weighted moving average throughputs.

    Image: MBeans type Meter and its attributes (Meter.jpg)

    Average Rates

    • nMinuteRate: returns the n-minute exponentially weighted moving average rate at which events have occurred since the meter was created, where n = 1, 5, or 15.

    • MeanRate: the average rate at which events have occurred since the meter was started.

     

    An example from Wikipedia, comparing common averages of the values { 1, 2, 2, 3, 4, 7, 9 }:

    Type | Description | Example | Result
    Mean | Sum of the values of a data set divided by the number of values: \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i | (1+2+2+3+4+7+9) / 7 | 4
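
    In contrast to the plain mean above, the nMinuteRate values are exponentially weighted moving averages, which favor recent events over old ones. As a sketch of the idea (the 5-second update tick is the Codahale library's convention and is stated here as an assumption):

        \text{rate}_t = \text{rate}_{t-1} + \alpha \, (r_t - \text{rate}_{t-1}),
        \qquad \alpha = 1 - e^{-5/(60n)}

    where r_t is the rate observed over the most recent tick and n is 1, 5, or 15 minutes, so older samples decay and the rate tracks the recent load.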

 

  • Counter: records increments and decrements, for example, counting the number of DemoService invocations.

    Image: MBeans type Counter and its attributes (Counter.jpg)

  • Timer: provides duration statistics. It aggregates the Min, Mean, and Max durations since the start of the timer, and it encapsulates throughput statistics from a Meter, response-time distribution statistics from a Histogram, and event counts from a Counter.

    Image: MBeans type Timer and its attributes (Timer.jpg)

  • Histogram: keeps track of a stream of long values and analyzes their statistical characteristics, such as Max, Min, Mean, Median, standard deviation, and the 75th and 99th percentiles. Statistics generated by a Histogram can be retrieved from a Timer object as Snapshots (see the sketch after this list).

    • Mean: is the average response time an application takes to process a request.

    • Max: is the maximum response time an application takes to process a request from a user.

    • Min: is the minimum response time an application takes to return a response to a user.

    • Percentiles: give the relative standing of a value, that is, how one particular data value compares to the rest of the data. The Timer object also provides metrics for the 50th, 75th, 95th, and 98th percentiles.

    For example, consider a service that has been running continuously for a few days. Its average response time could be low, for example, 1 to 2 secs, but the Max response time may climb much higher, for example, to 60 secs, depending on the health of the server, the network, an increase in load, or the number of concurrent users.
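
A minimal sketch of reading those Histogram statistics from a Timer as a Snapshot (the timer name is hypothetical, and a Timer's Snapshot reports durations in nanoseconds):

    import com.codahale.metrics.MetricRegistry;
    import com.codahale.metrics.Snapshot;
    import com.codahale.metrics.Timer;
    import java.util.concurrent.TimeUnit;

    public class SnapshotDemo {
        public static void main(String[] args) {
            MetricRegistry registry = new MetricRegistry();
            Timer responses = registry.timer("demo.responses"); // hypothetical name

            responses.update(61, TimeUnit.MILLISECONDS);        // record one observed duration

            // A Snapshot is a point-in-time view of the timer's duration distribution
            Snapshot snapshot = responses.getSnapshot();
            System.out.println("min    = " + snapshot.getMin());
            System.out.println("mean   = " + snapshot.getMean());
            System.out.println("max    = " + snapshot.getMax());
            System.out.println("median = " + snapshot.getMedian());
            System.out.println("p75    = " + snapshot.get75thPercentile());
            System.out.println("p99    = " + snapshot.get99thPercentile());
        }
    }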

Choosing which metrics to monitor for Web Services

Web Services performance can be measured in terms of throughput, response times, latency, availability, and many other metrics. Higher throughput, lower latency, low response times, low error rate, and high availability are some of the characteristics every application should have.

  • Throughput: is the total number of transactions processed by the server in a given time. The time is calculated from the start of the first sample to the end of the last sample. Throughput is also measured in terms of the number of bytes exchanged per second.

  • Response Time: is the time the server takes to process a request before returning a response.

  • Latency: is the additional delay involved for a processed request to reach the remote client. Latency increases if the network quality decreases.

  • Faults and Errors: must stay within an acceptable limit and might increase with network issues and high load.

  • Availability: measures if the Endpoint is reachable.

 

For each measure, the following lists the MBeans type, the metric attributes, and the statistics to monitor:

Measure: Throughput (events/sec)
MBeans type: type=Metrics.Server, Attribute=Totals
Metric attributes: MeanRate, OneMinuteRate, FiveMinuteRate, FifteenMinuteRate
Statistics to monitor: MeanRate measures throughput from the start of the Timer, and may not represent the actual current load. nMinuteRate measures the moving average throughput for the last n minutes; it is useful for understanding the real-time load on the system over the last 1, 5, and 15 minutes respectively.

Measure: Response Times (millisecs)
MBeans type: type=Metrics.Server, Attribute=Totals
Metric attributes: Min, Max, Mean, Percentiles
Statistics to monitor: Observe the deviations between the Min and Max response times. Percentiles can also be monitored if average measurements do not reflect actual load conditions.

Measure: In Flight orders (count)
MBeans type: type=Metrics.Server, Attribute=Totals
Metric attributes: Count
Statistics to monitor: Returns the count of pending requests. If the count is increasing, either the load on the system is increasing, or the server needs performance tuning.

Measure: Availability
MBeans type: type=Bus.Service.Endpoint
Metric attributes: State
Statistics to monitor: The Endpoint is reachable if it is in the STARTED state.

Measure: Errors (count)
MBeans type: type=Metrics.Server, with Attribute=Checked Application Faults, Attribute=Logical Runtime Faults, Attribute=Runtime Faults, and Attribute=Unchecked Application Faults
Metric attributes: Count
Statistics to monitor: Returns the count of checked, unchecked, logical, and runtime faults.

Measure: Number of Invocations (count)
MBeans type: type=Metrics.Server, Attribute=Totals
Metric attributes: Count
Statistics to monitor: The number of requests processed since the service was started.
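
If you prefer to pull these figures programmatically rather than through JConsole or Nagios, a minimal sketch using the standard JMX remote API follows; the host, port, and object-name pattern mirror the examples above and should be adjusted to your installation:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class CxfMetricsReader {
        public static void main(String[] args) throws Exception {
            // JMX port as configured in the wrapper.conf example earlier
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:1616/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbsc = connector.getMBeanServerConnection();
                // Match the Totals MBean of every service on the CXF bus
                ObjectName pattern = new ObjectName(
                        "org.apache.cxf:type=Metrics.Server,Attribute=Totals,*");
                for (ObjectName name : mbsc.queryNames(pattern, null)) {
                    System.out.println(name
                            + " Count=" + mbsc.getAttribute(name, "Count")
                            + " OneMinuteRate=" + mbsc.getAttribute(name, "OneMinuteRate"));
                }
            }
        }
    }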

 

Performing a Load test

To get a better understanding of the performance of your Web Service, and of the Metric attributes, you should perform load tests. Run them with many concurrent users for a minimum duration of 1 or 2 minutes, depending on your business requirements. JMeter and SoapUI are popular tools for load testing Web Services; a bare-bones alternative is sketched below.
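
If you just want to drive quick traffic at an endpoint without setting up JMeter, a simple load generator can be sketched in plain Java; the endpoint URL is a hypothetical placeholder, and the user and sample counts mirror the test plan below:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class SimpleLoadTest {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            // Hypothetical address; replace with your deployed service endpoint
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:8040/services/demo"))
                    .GET()
                    .build();

            ExecutorService users = Executors.newFixedThreadPool(25);   // 25 concurrent users
            for (int i = 0; i < 10_000; i++) {                          // 10,000 samples
                users.submit(() -> client.send(request,
                        HttpResponse.BodyHandlers.discarding()));
            }
            users.shutdown();
            users.awaitTermination(5, TimeUnit.MINUTES);
        }
    }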

 

Test Summary

Server: Talend Runtime is restarted to ensure that the metrics are reset to their default value of 0.

Service: DemoService is deployed in Karaf.

Nagios: Metric refresh interval is 90 secs (open source)

Jmeter settings:

  • Uses Web Service template
  • Number of samples: 10,000
  • Number of users: 25
  • Loop count: 400
  • Ramp-up period: 1 sec

Goal

  • Observe the statistics generated by the metrics in multiple steps, and understand how the readings change with and without a Load on the server.

  • Capture screenshots in both JConsole and Nagios.

Test 1: Statistics generated one minute after triggering the Load test.

Note: The Load test is initiated, and the container is under a load of 10,000 requests.

Images: JMX and Nagios views one minute into the test (Test_10000_records_1min.jpg, Test_Nagios_1min.jpg)

 

Throughput

  • 1minuteRate, 5minuteRate, MeanRate: 180 transactions/sec
  • Note: Observe the warning message in the Nagios console (high load, 10,000 transactions)

Response Times

  • Min - 1.73 ms
  • Max - 304 ms
  • Mean - 61 ms
  • 99th Percentile - 175 ms

NumOfInvocations

  • 10000
  • No errors are reported, and the Endpoint state is Green.

Conclusion:

  • nMinuteRate and the Max response time increased to a new high level.

  • In production scenarios, the InFlight or Pending orders should be closely monitored.

 

Test 2: Statistics generated two minutes after triggering the Load test.

Note: Load test is complete, and the container is idle.

Images: JMX and Nagios views two minutes after triggering the test (Test_10000_3_jmx.jpg, Nagios_Results_2mins.jpg)

 

Throughput

  • 1minuteRate: 11 transactions/sec; the warning signal in Nagios disappears.
  • 5minuteRate and MeanRate also dropped, to between 50 and 100 transactions/sec.

Response Times

  • Min - 1.73 ms
  • Max - 304 ms
  • Mean - 61 ms
  • 99thPercentile - 175 ms

NumOfInvocations

  • 10000
  • No errors are reported, and the Endpoint state is Green.

Conclusion:

  • The nMinuteRate readings reflect the real-time load on the system, and the Max response time remained the same.

  • The Max and Min response times help when analyzing response-time spikes and setting read timeouts.

  • The Count from the InFlight and Error metric attributes can identify increasing load or performance issues.

 

Frequently asked questions

  1. Question: Can I monitor Talend ESB Rest services?

    Answer: Yes. CXF templates can be used for both SOAP and REST Data Services.

     

  2. Question: Can I monitor Talend ESB Camel Routes?

    Answer: Yes. Camel templates must be used for Routes using cSOAP and cREST components. For more information, see the Camel metrics templates.

     

  3. Question: Why is the Nagios console displaying an alert or warning message?

    Answer: The CXF template used in this demo is configured to display a critical alert if the number of transactions/sec crosses 100, and a warning alert for more than 50 transactions/sec. So, the color for oneMinuteRate fluctuates between green, pink, and yellow.

    <Check OneMinuteRate>
     MBean = org.apache.cxf:bus.id=*,type=Metrics.Server,service="$1",port="$0",Attribute=Totals
     Attribute = OneMinuteRate
     Name = OneMinuteRate
     #no of events per sec
     Critical 100
     Warning 50
    </Check>

     

  4. Question: What other monitoring solutions are supported by Talend ESB?

    Answer: Talend supports monitoring using the JMX protocol, and any vendor API that can query JMX while complying with standard security policies.
