Ninja Van’s monitoring stack

Danial · Ninja Van Tech · Sep 7, 2022

Summary

As a major player in the Southeast Asian logistics field, Ninja Van delivers millions of parcels daily across six different countries. In order to ensure the smooth running of our proprietary ecosystem of logistics microservices, we track many different metrics over the entire delivery lifecycle of each parcel. This easily translates to trillions of data points being collected daily. All these metrics then need to be displayed on Grafana dashboards for engineers to monitor their services. With so much data, it becomes a challenge to guarantee the performance of these dashboards, particularly for queries with long time ranges.

In this article, I will be going through the technology stack we use to monitor our services, and how we maintain swift query performance despite our data volume. Throughout the article, I will be using a real-world example: our Apdex calculations. The Apdex calculation comprises a number of long-range queries, making it a perfect fit for this topic.

Monitoring technology stack

A simplified overview of our monitoring stack

The diagram above is a simplified overview of our monitoring stack. We will go through how the main components work with one another and how we use them to store, query and visualise the trillions of data points we deal with every day.

Istio

Istio diagram

A microservice architecture greatly benefits from the use of a service mesh like Istio. A service mesh controls service-to-service communication, decoupling networking logic from the application.

One such benefit of a service mesh is observability. Istio provides direct integration with Prometheus out of the box, allowing our Prometheus servers to scrape the telemetry metrics captured by Istio. On top of the standard telemetry metrics, Istio also allows us to define custom metrics.

Some useful telemetry metrics from Istio:

  • istio_requests_total: Counter incremented for every request handled by the Istio proxy.
  • istio_request_duration_milliseconds: Distribution that measures the duration of requests.

These telemetry metrics scraped from Istio form the basis of the Apdex calculation examples in the later sections.
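
For illustration, a scraped sample of the duration histogram looks something like the line below (the label values are hypothetical, but the label names are the ones used in the queries later in this article):

istio_request_duration_milliseconds_bucket{reporter="destination", destination_workload_namespace="prod", destination_app="app-name", request_operation="endpoint-1", response_code="200", le="100"} 41327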

Prometheus

Prometheus is an open-source system for monitoring and alerting. It scrapes endpoints to obtain metrics from services and stores them in a local time series database (TSDB), which is optimised for querying metrics. Moreover, Prometheus's service discovery mechanism allows it to discover and scrape new targets automatically.
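
As a rough sketch (the job name and annotation below are assumptions, not our exact configuration), a Kubernetes service discovery scrape job looks something like this:

scrape_configs:
  - job_name: kubernetes-pods        # hypothetical job name
    kubernetes_sd_configs:
      - role: pod                    # discover every pod in the cluster
    relabel_configs:
      # keep only pods that opt in via the prometheus.io/scrape annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"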

Prometheus recording rules

Recording rules are rules set in the Prometheus configuration that instruct Prometheus to precompute a query and save the result as a new series. Metrics produced this way are what we call federated metrics. These federated metrics can be used in place of the original query to reduce the compute time and resource consumption of our Prometheus instances.

Looking at the Apdex example:

rules: |
  groups:
    - name: "istio.metrics_aggregation"
      interval: 1m
      rules:
        - record: istio_request_duration_milliseconds_bucket:rate5m
          expr: sum(rate(istio_request_duration_milliseconds_bucket[5m])) by (destination_workload_namespace, request_operation, destination_app, response_code, le, reporter)

In the example above, we have configured Prometheus to precompute the metric from Istio that logs the response duration of requests to a service. The query sums the per-second average rate of increase of the metric over 5 minutes. So now, instead of querying sum(rate(istio_request_duration_milliseconds_bucket[5m])), we can use the recording rule istio_request_duration_milliseconds_bucket:rate5m, which is equivalent to the query, except that it has already been precomputed and stored as a new series. This allows for faster retrieval of the result.

Now you might be tempted to create a recording rule for every query you have. However, this should be treated with caution, as a large set of rules can quickly become hard to maintain and keep track of.

The recommended approach is to use only one range for all your recording rules. This not only keeps the management of your rules simple, but also allows for easy comparison across recording rules, as we cannot directly compare rates with different ranges. It is also recommended to use a short range that is at least four times the length of your instance's scrape interval. If a longer range is required, the avg_over_time function can be used to average the values over the longer range. For our Apdex example, we used a standardised 5m range for the recording rule, but performed a 1h avg_over_time to get the hourly Apdex performance, as you will see in the following section.

Apdex calculation

We use the recording rule above to calculate the Apdex scores of our services. To calculate Apdex, we need 3 values: satisfied count, tolerated count and the total count of requests.

Satisfied count:

sum(avg_over_time(istio_request_duration_milliseconds_bucket:rate5m{destination_workload_namespace=~"prod", destination_app=~"app-name", request_operation=~"endpoint-1", response_code!~"5.*", le="100"}[1h]))

In the query above, we are getting the count of requests that do not have a 5xx response code and have a response time below 100ms for an endpoint in a service. Note that we also perform a 1 hour avg_over_time to translate the 5m recording rule into hourly performance.

Tolerated count:

(sum(avg_over_time(istio_request_duration_milliseconds_bucket:rate5m{destination_workload_namespace=~"prod", destination_app=~"app-name", request_operation=~"endpoint-1", response_code!~"5.*", le="500"}[1h]))
-
sum(avg_over_time(istio_request_duration_milliseconds_bucket:rate5m{destination_workload_namespace=~"prod", destination_app=~"app-name", request_operation=~"endpoint-1", response_code!~"5.*", le="100"}[1h])))

In the query above, we are getting the count of requests that do not have a 5xx response code and have a response time between 100ms and 500ms.

Total count:

sum(avg_over_time(istio_request_duration_milliseconds_bucket:rate5m{destination_workload_namespace=~"prod", destination_app=~"app-name", request_operation=~"endpoint-1",le="+Inf"}[1h]))

In the query above, we are getting the total number of requests regardless of response code and response time.

Note: Since the result is an average per-second rate over 1 hour, we have to multiply it by 3600 to get the count for that hour.
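
For completeness, the three counts are combined using the standard Apdex formula (the names below are just placeholders for the three queries above):

Apdex = (satisfied count + (tolerated count / 2)) / total count

Since all three counts carry the same ×3600 factor, it cancels out in the ratio, so the multiplication only matters when you want the raw hourly counts rather than the score itself.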

Query breakdown

To help break down the query, let's add some context and use an example where I am trying to query the total count for a particular service between 14:00 and 16:00 (2 hours).

Recall the recording rule from earlier, which sums the rate of increase of the request duration over 5 minutes. The raw query for total count is something like this:

sum(rate(istio_request_duration_milliseconds_bucket{destination_workload_namespace=~"prod", destination_app=~"app-name", request_operation=~"endpoint-1", le="+Inf"}[5m]))

Explanation:

istio_request_duration_milliseconds_bucket : The metric that we want to query.

{destination_workload_namespace=~"prod", destination_app=~"app-name", request_operation=~"endpoint-1", le="+Inf"} : Labels are the key-value pairs in the curly braces after the metric name. They serve to identify the app/endpoint to be queried for that metric.

rate (rate of increase) : The per-second average rate of increase over the 5-minute window, roughly the difference between the metric's value at the end and at the start of the window divided by 300 seconds. rate also helps extrapolate values for missed scrapes or target restarts.
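
As a quick illustration with made-up numbers: if the +Inf bucket of the counter reads 12,000 at 13:55 and 12,600 at 14:00, then over that 5-minute window

rate ≈ (12600 - 12000) / 300s = 2 requests per second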

When querying over 2 hours, the query will result in 25 samples returned:

  • 24 steps, with 1 step every 5 min interval (120/5)
  • 1 additional step for the current time

Now compare this with the same query using the recording rule:

sum(avg_over_time(istio_request_duration_milliseconds_bucket:rate5m{destination_workload_namespace=~"prod", destination_app=~"app-name", request_operation=~"endpoint-1",le="+Inf"}[1h]))

Explanation:

avg_over_time : The average of the recording rule's value over the hour preceding each sample in the query. In this case, it averages the values from 13:00 to 14:00 for the first sample, from 13:05 to 14:05 for the second sample, and so on.

Thanos

Simplified Thanos diagram

Prometheus by itself has a few downsides, such as not supporting high availability (HA) or long-term storage of metrics out of the box.

This is where Thanos comes in. We use Thanos to federate data across all our Prometheus instances and store it in Google Cloud Storage (GCS), our cloud storage system, for long-term retention. The stored data can then be queried through the Thanos query layer, which fetches recent data directly from Prometheus and, for longer-term queries, reads from GCS.

Thanos integrates seamlessly with Prometheus as it uses PromQL, the same query language as Prometheus. Moreover, it is also able to query the federated metrics produced by the recording rules configured in Prometheus, like the ones we specified above.
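
As a rough sketch of how this is wired up (the paths, URLs and bucket name below are assumptions, not our actual setup), a Thanos sidecar runs next to each Prometheus instance and ships TSDB blocks to object storage:

# Thanos sidecar next to each Prometheus instance
thanos sidecar \
  --tsdb.path=/prometheus \
  --prometheus.url=http://localhost:9090 \
  --objstore.config-file=/etc/thanos/gcs.yaml

# /etc/thanos/gcs.yaml (hypothetical bucket name)
type: GCS
config:
  bucket: example-metrics-bucket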

Grafana

Grafana Apdex graph

Grafana is an open-source visualisation tool popularly used for tracking operational performance. Grafana can use Thanos as a data source, allowing us to query Thanos directly for the historical data that is no longer stored in Prometheus. We query Thanos with PromQL to create the graphs.

In the case of Apdex, we have created Grafana dashboards to allow our service owners to easily monitor their applications’ Apdex scores.
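
Since the Thanos query layer exposes a Prometheus-compatible API, it can be added to Grafana as a regular Prometheus data source. A minimal provisioning sketch (the name and URL below are assumptions):

apiVersion: 1
datasources:
  - name: Thanos                    # hypothetical data source name
    type: prometheus                # Thanos Query speaks the Prometheus API
    access: proxy
    url: http://thanos-query:9090   # hypothetical Thanos Query endpoint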

Conclusion

To monitor our services successfully, we use multiple tools:

  • Istio: captures useful telemetry metrics from our Kubernetes pods.
  • Prometheus: stores metrics as time series to be queried.
    — Recording rules: precompute queries, improving query performance and reducing resource consumption.
  • Thanos: federates Prometheus instances for long-term retention of metrics and high availability.
  • Grafana: helps make sense of metrics by allowing us to create graphs that easily represent trends and statistics.

We hope you have learned how these tools have helped us and can be of use to you too.

Interested in building reliable and innovative products? We have good news! Ninja Van is hiring!! If you’re from Singapore, Indonesia or Vietnam, or are willing to relocate here, you can find more at our careers page!

P.S. Special thanks to my team, Luqi Chen & Timothy Ong 😄
