The Idea of Synthetic Monitoring


Introduction

Synthetic monitoring is a proactive strategy used to ensure the reliability, performance, and availability of systems by generating synthetic (artificial) traffic. It relies on scripted scenarios that mimic user interactions with your system. These scripted tests are run at regular intervals (e.g., every minute, five minutes, or hourly). The frequency of these tests depends on the criticality of the monitored paths and the desired level of monitoring granularity.

The results of these scripted tests are typically stored as metrics using traditional tools like Prometheus. By leveraging this data, we can analyze traffic patterns, identify early signs of issues, and configure additional alerts to notify us of potential problems.

Synthetic monitoring does not replace traditional monitoring, which gathers data from real user traffic. Instead, it is a complementary approach that enhances your overall monitoring strategy. While conventional monitoring examines how the system behaves from an internal perspective, synthetic monitoring provides measurements from an external viewpoint (treating the system under observation as a black box).

The main benefits of this approach are:

  • Proactive Issue Detection: Synthetic monitoring can detect issues before they impact real users by continuously running tests that simulate critical user interactions, helping you catch and fix problems early.

  • Continuous Monitoring: Synthetic monitoring ensures your system is being tested and monitored even during periods of low or no real traffic. This is particularly useful outside of peak hours or during scheduled maintenance windows.

  • Testing Critical Paths: You can focus on key user journeys or business processes, ensuring that the most important parts of your system are always functioning correctly.

  • Alerting for Low or Seasonal Traffic Services: In scenarios where real traffic is disrupted or absent (e.g., a sudden drop in user activity or a service with seasonal traffic), synthetic monitoring still verifies that your system remains functional.

In the context of modern distributed systems, where failures can occur in many components, synthetic monitoring resembles a form of continuous testing. Performing a robust test of functionality that touches several components is hard. This does not mean that we should abandon traditional testing: it is a great way to proactively find issues in your codebase, but it can sometimes be limited. Synthetic monitoring can be a great supplementary technique to quickly detect problems that evade standard testing methods and regular observability approaches.

In this post, I would like to show the process of building a basic synthetic monitoring service in Python that uses Prometheus to collect metrics.

High-level Architecture

The high-level architecture consists of a simple reservations service written in Python that exposes a REST API, and an independently running monitor, also written in Python, that periodically calls the POST /reservations endpoint. Both services are integrated with Prometheus, which collects their metrics.
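
To keep things concrete, the project layout used throughout the post looks roughly like this (the root folder name is an assumption, based on the container names shown later):

synthetic-monitoring/
├── docker-compose.yml
├── prometheus.yml
├── prometheus-rules.yml
├── reservations-service/
│   ├── app.py
│   ├── requirements.txt
│   └── Dockerfile
└── monitor/
    ├── app.py
    ├── requirements.txt
    └── Dockerfile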

Prometheus Setup

To collect metrics from our services, we need to configure a Prometheus instance. I presented the details of this process in one of my previous posts.

The first step is to define the Prometheus config file (prometheus.yml):

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 10s
scrape_configs:
  - job_name: "prometheus"
    scrape_interval: 5s
    static_configs:
      - targets: ["localhost:9090"]

At this point, the Prometheus instance will be scraping metrics only from itself (localhost:9090). When the Reservations Service and the Synthetic Monitor are ready, we will have to update the scrape_configs section so that Prometheus scrapes these services as well.
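
Optionally, we can validate the file before starting anything with promtool, the command-line utility that ships with Prometheus:

$ promtool check config prometheus.yml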

The Prometheus instance can be started using the Docker Compose file (docker-compose.yml):

# docker-compose.yml
version: "3.7"

services:
  prometheus:
    image: prom/prometheus
    ports:
      - target: 9090
        published: 9090
    volumes:
      - type: bind
        source: ./prometheus.yml
        target: /etc/prometheus/prometheus.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'

We publish port 9090 to get access to the Prometheus UI. The configuration file (prometheus.yml) defined above has to be mounted into the container using the volumes section.

After running docker compose up in a terminal, we should see several Prometheus logs along with confirmation that the Prometheus container was created:

$ docker compose up
 ✔ Container synthetic-monitoring-prometheus-1  Created

We can visit localhost:9090/targets and check whether our scraping target (Prometheus instance) is recognized.

Everything looks OK: the Prometheus instance (prometheus) is up and recognized as a scraping target. We can move on to the next step: implementing the Reservations Service and connecting it to the running Prometheus instance.

Service Under Observation

Reservations Service is a simple application written in Python using the Flask framework. It exposes a single HTTP endpoint: /reservations.

# reservations-service/app.py
import functools
import random
import time

import prometheus_client
import structlog
from flask import Flask, request
from prometheus_client import Histogram, make_wsgi_app
from werkzeug import Response
from werkzeug.middleware.dispatcher import DispatcherMiddleware

app = Flask(__name__)
# Expose the collected Prometheus metrics under /metrics
app.wsgi_app = DispatcherMiddleware(app.wsgi_app, {"/metrics": make_wsgi_app()})

# Disable unnecessary default metrics
prometheus_client.REGISTRY.unregister(prometheus_client.GC_COLLECTOR)
prometheus_client.REGISTRY.unregister(prometheus_client.PLATFORM_COLLECTOR)
prometheus_client.REGISTRY.unregister(prometheus_client.PROCESS_COLLECTOR)

logger = structlog.get_logger()

HTTP_REQUEST_DURATION = Histogram(
    "http_request_duration",
    "Requests durations",
    ["method", "url", "code"],
    buckets=[0.01, 0.1, 0.5, 2, float("inf")],
)

def observe_http(func):
    # functools.wraps preserves the wrapped function's name, so Flask
    # registers the real view name instead of "wrapper"
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        response = func(*args, **kwargs)
        end = time.time()
        HTTP_REQUEST_DURATION.labels(
            method=request.method,
            code=response.status_code,
            url=request.url,
        ).observe(end - start)
        return response

    return wrapper

@app.route("/reservations", methods=["POST"])
@observe_http
def reservations():
    logger.info("Reservation request", data=request.data)
    random_duration = (
        random.choice([1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3])
        * random.randint(1, 100)
        * 0.001
    )
    time.sleep(random_duration)

    response_code = random.choice(
        [
            200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200,
            400, 401, 500
        ]
    )
    logger.info("Reservation response", code=response_code)
    return Response(str(response_code), status=response_code)

The implementation of the endpoint is not important for our topic; we only mimic real behavior using random durations and response codes. The endpoint is wrapped with the observe_http decorator, which records the http_request_duration histogram metric. We use the prometheus_client library to expose the stored metrics at the /metrics endpoint.

We define a Dockerfile within the reservations-service folder. It will be used to build the application image.

# reservations-service/Dockerfile
FROM python:3.9-alpine
COPY app.py requirements.txt /
RUN pip install -r requirements.txt
EXPOSE 8000
CMD ["flask", "run", "-h", "0.0.0.0", "-p", "8000"]

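The Dockerfile copies a requirements.txt file that is not listed in the post. Based on the imports in app.py, a minimal version could look like this (the exact contents are my assumption; the monitor's requirements.txt would additionally need requests):

# reservations-service/requirements.txt
flask
prometheus-client
structlog
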
To build the image, we need to run the following command within the reservations-service folder:

$ docker build -t reservations-service:v0 ./

Now the service image is built and can be added to the docker-compose.yml file.

# docker-compose.yml
version: "3.7"

services:
  ...

  reservations-service:
    image: reservations-service:v0
    ports:
      - target: 8000
        published: 8080

The prometheus.yml file has to be updated with a new entry to gather metrics from Reservations Service:

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 10s
scrape_configs:
  - job_name: "prometheus"
    scrape_interval: 5s
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "reservations-service"
    scrape_interval: 5s
    static_configs:
      - targets: ["reservations-service:8000"]

We have to use the hostname that was declared for Reservations Service in the docker-compose.yml file. In our case it is reservations-service, so the target address is reservations-service:8000. We can restart Docker Compose and check whether the new target is visible in the Prometheus UI.

The Reservations Service /metrics endpoint is now visible to the Prometheus instance.

Now, we can hit the /reservations endpoint several times and check whether the metrics are collected.

$ curl -X POST http://localhost:8080/reservations

The Prometheus UI is not the greatest tool for visualizing metrics. Nevertheless, at localhost:9090/graph we can run a query (sum(rate(http_request_duration_count[1m])) by (code, url, method)) and see the per-second request rate for the /reservations endpoint grouped by response code (the method and url labels stay the same across all series).
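
The same histogram can also back latency queries; for example, an approximate 95th percentile of request duration (with our coarse bucket boundaries the estimate is rough, but it illustrates the idea):

# PromQL: approximate p95 request duration over the last 5 minutes
histogram_quantile(0.95, sum(rate(http_request_duration_bucket[5m])) by (le))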

Synthetic Monitor

In the previous section, we manually invoked the /reservations endpoint several times. The idea of synthetic monitoring is to automate the process of probing an endpoint and storing the results as metrics.

The Synthetic Monitor can be a separate process or application that runs independently of the other services. In our example, we implement the monitor as a Python service.

# monitor/app.py
import os
import time
import threading

import prometheus_client
import structlog
import requests
from flask import Flask
from prometheus_client import Histogram, make_wsgi_app
from werkzeug.middleware.dispatcher import DispatcherMiddleware

logger = structlog.get_logger()

app = Flask(__name__)

# Add endpoint that will expose the metrics
app.wsgi_app = DispatcherMiddleware(app.wsgi_app, {"/metrics": make_wsgi_app()})

prometheus_client.REGISTRY.unregister(prometheus_client.GC_COLLECTOR)
prometheus_client.REGISTRY.unregister(prometheus_client.PLATFORM_COLLECTOR)
prometheus_client.REGISTRY.unregister(prometheus_client.PROCESS_COLLECTOR)

INTERVAL_SEC = int(os.getenv("INTERVAL_SEC", 5))
RESERVATIONS_SERVICE_URL = os.getenv("RESERVATIONS_SERVICE_URL")
if not RESERVATIONS_SERVICE_URL:
    raise ValueError("RESERVATIONS_SERVICE_URL is not set")

HTTP_REQUEST_DURATION = Histogram(
    "synthetic_request_duration",
    "Synthetic requests durations",
    ["method", "url", "result"],
    buckets=[0.01, 0.1, 0.5, 2, float("inf")],
)

logger.info("Monitor")

def start_monitor():
    logger.info("Starting monitor")
    while True:
        _make_request()
        time.sleep(INTERVAL_SEC)

def _make_request():
    logger.info("Sending request to reservations service")
    endpoint = f"{RESERVATIONS_SERVICE_URL}/reservations"
    result = 'success'
    start = time.time()
    try:
        response = requests.post(endpoint, data='{"username": "synthetic"}')
        if response.status_code != 200:
            result = 'failure'
    except Exception as e:
        logger.warn(f"Error: {e}")
        result = 'failure'
    finally:
        end = time.time()
        HTTP_REQUEST_DURATION.labels(
            method='POST',
            url=endpoint,
            result=result
        ).observe(end - start)

# Start the monitor in a separate thread in the background
t = threading.Thread(target=start_monitor)
t.start()

The monitor is an infinite loop that hits the desired endpoint (defined by the RESERVATIONS_SERVICE_URL environment variable) at a predefined interval (INTERVAL_SEC). Each response is analyzed and stored in the Prometheus histogram metric synthetic_request_duration. Instead of saving the response code, we determine a result value for each call: it is success only if we get a 200 response; for both client errors (4xx) and server errors (5xx), the result is failure.

We used the Flask framework to set up an HTTP server for exposing Prometheus metrics from the monitor.

The monitor runs in a separate background thread (threading.Thread(target=start_monitor)) to avoid blocking the HTTP server.
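
One small refinement worth considering (my suggestion, not part of the code above) is to mark the thread as a daemon, so it does not keep the process alive when the Flask server shuts down:

# monitor/app.py (variation, assumption)
# A daemon thread is terminated together with the main process,
# so stopping the Flask server also stops the monitor loop.
t = threading.Thread(target=start_monitor, daemon=True)
t.start()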

To build the service image, we define a Dockerfile similar to the one created for Reservations Service.

# monitor/Dockerfile
FROM python:3.9-alpine
COPY app.py requirements.txt /
RUN pip install -r requirements.txt
EXPOSE 8000
CMD ["flask", "run", "-h", "0.0.0.0", "-p", "8000"]

The build command should be run within the monitor folder:

$ docker build -t monitor:v0 ./

To start the monitor, we need to add it to the docker-compose.yml file:

# docker-compose.yml
version: "3.7"

services:
  ...

  reservations-service:
    image: reservations-service:v0
    ports:
      - target: 8000
        published: 8080

  monitor:
    image: monitor:v0
    ports:
      - target: 8000
        published: 8081
    environment:
      - RESERVATIONS_SERVICE_URL=http://reservations-service:8000
      - INTERVAL_SEC=5

The last step for the monitor is to add it as a target for the Prometheus instance.

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 10s
scrape_configs:
  - job_name: "prometheus"
    scrape_interval: 5s
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "reservations-service"
    scrape_interval: 5s
    static_configs:
      - targets: ["reservations-service:8000"]
  - job_name: "monitor"
    scrape_interval: 5s
    static_configs:
      - targets: ["monitor:8000"]

As with Reservations Service, we use the hostname defined for the Synthetic Monitor in the docker-compose.yml file: monitor.

After restarting Docker Compose, the monitor will periodically call Reservations Service, and we should be able to collect metrics from both services.

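To confirm this from the command line, we can query both /metrics endpoints through the ports published in docker-compose.yml; the first response should contain http_request_duration series and the second one synthetic_request_duration series:

$ curl -s http://localhost:8080/metrics | grep http_request_duration
$ curl -s http://localhost:8081/metrics | grep synthetic_request_duration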

We can also examine the logs to verify whether the services are communicating with each other as expected.

reservations-service-1  | 192.168.80.4 - - [12/Dec/2024 10:22:05] "POST /reservations HTTP/1.1" 200 -
monitor-1               | 2024-12-12 10:22:10 [info     ] Sending request to reservations service
reservations-service-1  | 2024-12-12 10:22:10 [info     ] Reservation request            data=b'{"username": "synthetic"}'
reservations-service-1  | 2024-12-12 10:22:10 [info     ] Reservation response           code=500
reservations-service-1  | 192.168.80.4 - - [12/Dec/2024 10:22:10] "POST /reservations HTTP/1.1" 500 -
monitor-1               | 192.168.80.3 - - [12/Dec/2024 10:22:12] "GET /metrics HTTP/1.1" 200 -
reservations-service-1  | 192.168.80.3 - - [12/Dec/2024 10:22:15] "GET /metrics HTTP/1.1" 200 -
monitor-1               | 2024-12-12 10:22:15 [info     ] Sending request to reservations service
reservations-service-1  | 2024-12-12 10:22:15 [info     ] Reservation request            data=b'{"username": "synthetic"}'
reservations-service-1  | 2024-12-12 10:22:15 [info     ] Reservation response           code=200

Beyond logs showing communication between services, we can also observe logs indicating that Prometheus is successfully reaching the /metrics endpoint: "GET /metrics HTTP/1.1" 200.

Alerting

We collect metrics from both Reservations Service and Synthetic Monitor, and while they are quite similar, they can be used in different ways with distinct alerting strategies.

For Reservations Service, we treat internal server errors (such as 500 errors) as failures, as they indicate issues within the service itself.

However, for Synthetic Monitoring, we treat even client errors (such as 400, 401, 403, and 404) as failures (not only 5xx errors). This is because we have full knowledge of the conditions and context of the request. For instance:

  • 404 Errors: We know that the user exists in the database, so a 404 error (resource not found) should never occur unless there is an issue with database retrieval.

  • 401 and 403 Errors: Since we pass the correct credentials, these errors should never happen unless there's an authentication issue.

  • 400 Errors: As long as we send well-formed requests, a 400 error (bad request) should not occur. If it does, it suggests that there is a problem with request validation.

Last but not least, in a situation where our /reservations endpoint stops functioning altogether (for example, because a URL mapping it relied on broke), synthetic monitoring can detect a problem that traditional monitoring misses. In that scenario, the synthetic monitor will alert because it cannot reach the service, but internal monitoring will stay silent since there is no traffic to measure.
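
A complementary idea (not part of the rules file defined below, just a sketch) is to let Prometheus alert when a scrape target disappears entirely, using the built-in up metric:

# PromQL: returns a result while the reservations-service target cannot be scraped
up{job="reservations-service"} == 0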

To set up different alerting strategies for Reservations Service and the Monitor, we can define rules in the prometheus-rules.yml file.

# prometheus-rules.yml
groups:
  - name: reservations-service
    rules:
      - alert: SyntheticErrorRateExceeds10Percent
        for: 30s
        expr: sum(rate(synthetic_request_duration_count{result=~"failure"}[1m])) / sum(rate(synthetic_request_duration_count[1m])) * 100 > 10
        labels:
          severity: critical

      - alert: ErrorRateExceeds10Percent
        for: 30s
        expr: sum(rate(http_request_duration_count{code=~"5.."}[1m])) / sum(rate(http_request_duration_count[1m])) * 100 > 10
        labels:
          severity: critical

The criterion for both alerts is the same: an error rate above 10% for at least 30 seconds fires an alert. However, because they define errors differently (for the Monitor, an error is any response code other than 200, while for Reservations Service only requests that end in 5xx codes count as errors), the alerts' sensitivity differs.
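
Before wiring the file into Prometheus, it is worth validating it with promtool, which can also check rule files:

$ promtool check rules prometheus-rules.yml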

The rules will be enabled once we mount the prometheus-rules.yml file into the correct location in the Prometheus container:

# docker-compose.yml
version: "3.7"

services:
  prometheus:
    image: prom/prometheus
    ports:
      - target: 9090
        published: 9090
    volumes:
      - type: bind
        source: ./prometheus.yml
        target: /etc/prometheus/prometheus.yml
      - type: bind
        source: ./prometheus-rules.yml
        target: /etc/prometheus/rules.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'

In the Prometheus config file (prometheus.yml), we also have to add a reference to the rules file. It is mounted as /etc/prometheus/rules.yml in the Prometheus container, so we only need to pass the relative path rules.yml in the rule_files section.

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 10s
rule_files:
  - "rules.yml"
scrape_configs:
  - job_name: "prometheus"
    scrape_interval: 5s
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "reservations-service"
    scrape_interval: 5s
    static_configs:
      - targets: ["reservations-service:8000"]
  - job_name: "monitor"
    scrape_interval: 5s
    static_configs:
      - targets: ["monitor:8000"]

Now, on the localhost:9090/alerts page, we can observe that there are indeed situations where only the synthetic monitoring alert is firing.

Conclusion

In this post, we demonstrated how to utilize synthetic monitoring as a complementary tool alongside traditional monitoring and testing to enhance system reliability. We walked through the implementation of a simple synthetic monitor and highlighted how this service can help identify issues in situations where other indicators might remain silent. By simulating user behavior and observing the system from an external perspective, synthetic monitoring provides proactive insights that traditional methods may miss, ensuring that potential problems are detected and addressed early.

The codebase covering the topic can be found here.