Proactive Alerting using Service Mesh metrics

Aim

To query the Prometheus server of the installed service mesh for real-time values of the 4 golden signals.
Based on these values and user-defined thresholds, alert the user whenever a value crosses its threshold.
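
As a rough illustration of the querying step, the sketch below builds instant-query URLs for the Prometheus HTTP API (/api/v1/query) and reads the value out of a response. The Prometheus address, the helper names and the PromQL expressions are assumptions: the metric names are Istio's standard ones, and other meshes expose different metrics. It is not the RoostApi implementation.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Assumption: address where the mesh's Prometheus server is reachable.
PROMETHEUS_URL = "http://localhost:9090"

# Assumption: Istio metric names; adjust for the mesh that is installed.
QUERIES = {
    "requestRate": 'sum(rate(istio_requests_total[1m]))',
    "errorRate": 'sum(rate(istio_requests_total{response_code=~"5.."}[1m]))',
    "latency": (
        'histogram_quantile(0.95, '
        'sum(rate(istio_request_duration_milliseconds_bucket[1m])) by (le))'
    ),
}


def build_query_url(base_url, promql):
    """Return the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def extract_value(response):
    """Read the first sample from an instant-query response of type 'vector'.

    Returns 0.0 when the query matched no series (e.g. no traffic yet).
    """
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response}")
    result = response["data"]["result"]
    if not result:
        return 0.0
    # Each sample is [unix_timestamp, "value-as-string"].
    return float(result[0]["value"][1])
```

The URLs can be fetched with any HTTP client; the decoded JSON body is what extract_value expects.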


The 4 Golden Signals and their thresholds

The 4 golden signals we are looking at are:

  • Request Rate - denotes the number of requests per second (req/sec) that your services are handling.

  • Error Rate - denotes the number of error requests per second. Any request with a 5XX response status code is currently counted as an error. Which response codes count as errors may depend on the application being monitored.

  • Latency - denotes the time taken to service a request. We are using P95 latency.
    It means that 95% of the requests should be faster than the given latency threshold. In other words, only 5% of the requests are allowed to be slower. The most relevant latency values used in the industry are P99, P95, P90 and P50.

    For example, a P95 latency of 50ms means that 95% of requests completed in 50ms or less.

  • Saturation - denotes the load on the system, in terms of CPU percentage and memory.
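
To make the P95 definition concrete, here is a small worked example using the nearest-rank method on raw latency samples. This is only an illustration of what the percentile means; the function name and sample data are made up, and a Prometheus-backed setup would normally estimate the percentile from histogram buckets instead.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # p * n is an exact integer for integer p, so the ceiling is reliable.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical data: 100 requests that took 1ms, 2ms, ..., 100ms.
latencies_ms = list(range(1, 101))
p95 = percentile(latencies_ms, 95)
slower = [x for x in latencies_ms if x > p95]
```

With this data p95 is 95, and exactly 5 of the 100 requests (5%) are slower than it.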


Alerts have been implemented for only 3 of the 4 golden signals. Alerts for saturation are yet to be decided.

The Config File

Since servicemesh.json is located under .roostrde, the threshold config file is stored alongside it.

  • Local Roost Scenario - $HOME/.roostrde/threshold.json

  • Remote Roost Scenario - /root/.roostrde/threshold.json

The user can set the threshold values using the UI provided during service mesh installation.

The implementation of this UI will be tweaked once the cluster management view is ready.

The threshold.json file stores the values in JSON format:

{ "requestRate": 20, "errorRate": 10, "latency": 5000, "saturation": 15, "interval_time": 1 }
  • The RoostApi endpoint that receives this data is /api/metricThresholds.

  • This JSON structure is subject to change once we start supporting application-specific thresholds. As of now, these values are system-wide thresholds.
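
The threshold check itself can be sketched as follows: parse the JSON shown above and report every signal whose current value exceeds its threshold. The function name, the metric key names and the strict "greater than" comparison are assumptions made for illustration. Saturation is skipped because its alerts are not implemented yet, and interval_time is not a threshold.

```python
import json

# Same content as the threshold.json example above.
SAMPLE_THRESHOLDS = (
    '{ "requestRate": 20, "errorRate": 10, "latency": 5000, '
    '"saturation": 15, "interval_time": 1 }'
)

# Only these three signals currently raise alerts.
ALERTED_SIGNALS = ("requestRate", "errorRate", "latency")


def find_breaches(thresholds, metrics):
    """Return {signal: (value, threshold)} for every signal over its threshold.

    Assumption: "crossing" a threshold means value > threshold.
    """
    breaches = {}
    for signal in ALERTED_SIGNALS:
        if signal not in thresholds or signal not in metrics:
            continue
        if metrics[signal] > thresholds[signal]:
            breaches[signal] = (metrics[signal], thresholds[signal])
    return breaches


thresholds = json.loads(SAMPLE_THRESHOLDS)
# Hypothetical current values for the three alerted signals.
current = {"requestRate": 7.55, "errorRate": 100, "latency": 1940.5}
breaches = find_breaches(thresholds, current)
```

With these hypothetical values only errorRate is over its threshold (100 against 10), so it is the only signal that would raise an alert.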

 

How to Use/Test the Alerts

For testing purposes we have tweaked our ballot image so that any vote given to k3d returns a status code of 500, which counts as an error.
One can use the modified ballot image by running docker pull ashrr108/ballot:latest, or by simply changing the image field in ballot.yaml to ashrr108/ballot:latest.

 

Steps

  1. Install any service mesh and set appropriate thresholds.

  2. Once installed, inject any namespace using the namespace tab in the table view of Workload Analytics.

  3. Deploy the voter/ballot example in the injected namespace so that the service mesh automatically injects a proxy during pod creation.

  4. Test the application normally by casting a few votes. You should see the metrics being logged in /var/tmp/Roost/startRoostApi.log every 1 minute, which is the default query interval.

  5. If the value of any golden signal crosses the set threshold, an alert will be triggered and posted in the Root Event Viewer.


You will be able to see the metrics in the RoostApi logs as well, for example:
2021/07/05 15:07:42 ROOSTAPI:asm_amd64.s:1374 goexit(): GoRoutine-29945: INFO: map[ErrorRate:100 Latency:1940.5 RequestRate:7.552380952380952]
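
When testing, it can be handy to pull the metric values out of such a log line programmatically. The sketch below assumes the line always ends with a Go-style map[Key:value ...] dump of numeric values, as in the sample above; the function name is hypothetical.

```python
import re

LOG_LINE = (
    "2021/07/05 15:07:42 ROOSTAPI:asm_amd64.s:1374 goexit(): "
    "GoRoutine-29945: INFO: "
    "map[ErrorRate:100 Latency:1940.5 RequestRate:7.552380952380952]"
)

# Captures everything between "map[" and the closing "]".
MAP_PATTERN = re.compile(r"map\[([^\]]*)\]")


def parse_metrics(line):
    """Return the metrics in a log line as {name: float}, or None if absent."""
    match = MAP_PATTERN.search(line)
    if match is None:
        return None
    metrics = {}
    for pair in match.group(1).split():
        key, _, value = pair.partition(":")
        metrics[key] = float(value)
    return metrics


metrics = parse_metrics(LOG_LINE)
```

For the sample line this yields ErrorRate, Latency and RequestRate as floats; lines without a map dump return None.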

 

Roadmap

  • Right now, the threshold values refer to system-wide metrics. In the future we should support application-specific thresholds.

  • Error requests are requests with a response code of 5XX. Which codes count as errors depends on the application, so we may need to support configuring this based on user input.

  • We will also support auto-scaling of deployments when the request rate becomes too high or too low. In this scenario, the concerned deployment will be scaled up or down in order to keep the request rate under the threshold.
