How to build a complete observability stack for production machine learning systems — metrics with Prometheus, dashboards with Grafana, and log analytics with Elasticsearch, Logstash, and Kibana.

Why ML Systems Need Special Observability

Traditional software fails loudly — exceptions, non-200 status codes, process crashes. ML systems fail silently: the inference endpoint returns 200, latency looks normal, but the model's predictions have drifted and users are getting wrong answers. This requires an observability layer that goes beyond infrastructure metrics to capture model behaviour, data quality, and prediction semantics.

The Three Layers of ML Observability

Infrastructure — CPU, memory, GPU utilisation, request latency, error rates (Prometheus + Grafana)
Model performance — prediction distribution, feature drift, accuracy against delayed ground truth (custom Prometheus metrics)
Data and audit logs — structured logs of every prediction request for debugging and compliance (ELK stack)
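The "accuracy against delayed ground truth" item is the least obvious of the three, because labels usually arrive hours or days after the prediction. A minimal sketch of the bookkeeping follows, using only the standard library. The class name, method names, and the `max_pending` limit are hypothetical and chosen for illustration. In practice the value returned by `accuracy()` could be exported through a Prometheus gauge.

```python
from collections import OrderedDict
from typing import Optional


class DelayedAccuracyTracker:
    """Joins predictions with ground-truth labels that arrive later.

    Hypothetical helper for illustration; not part of any library.
    """

    def __init__(self, max_pending: int = 100_000) -> None:
        self._pending: "OrderedDict[str, str]" = OrderedDict()  # request_id -> predicted label
        self._max_pending = max_pending
        self.correct = 0
        self.total = 0

    def record_prediction(self, request_id: str, predicted_label: str) -> None:
        self._pending[request_id] = predicted_label
        # Evict the oldest entries so requests that never get a label cannot grow memory forever.
        while len(self._pending) > self._max_pending:
            self._pending.popitem(last=False)

    def record_ground_truth(self, request_id: str, true_label: str) -> bool:
        """Score one labelled request. Returns False if the request is unknown or was evicted."""
        predicted = self._pending.pop(request_id, None)
        if predicted is None:
            return False
        self.total += 1
        self.correct += int(predicted == true_label)
        return True

    def accuracy(self) -> Optional[float]:
        # None until at least one label has arrived, so dashboards can show "no data" instead of 0.
        return self.correct / self.total if self.total else None


# Usage: predictions are recorded at inference time, labels whenever they become available.
tracker = DelayedAccuracyTracker()
tracker.record_prediction("req-1", "fraud")
tracker.record_prediction("req-2", "legit")
tracker.record_ground_truth("req-1", "fraud")
tracker.record_ground_truth("req-2", "fraud")
```

Keeping cumulative `correct` and `total` counts, rather than a precomputed ratio, makes it possible to derive accuracy over any time window later.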

Instrumenting Your Model Server

Add Prometheus metrics to your FastAPI model server:

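import time

from fastapi import FastAPI, Response
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest
from pydantic import BaseModel

app = FastAPI()

# Placeholders: read MODEL_VERSION from your deployment config and load `model` at startup.
# `model.infer()` is assumed to return an object with `label` and `confidence` attributes.
MODEL_VERSION = "v1"

class PredictRequest(BaseModel):
    features: dict

REQUEST_COUNT = Counter("model_requests_total", "Total inference requests", ["model_version", "status"])
REQUEST_LATENCY = Histogram("model_latency_seconds", "Inference latency", ["model_version"],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5])
PREDICTION_DIST = Counter("model_predictions_total", "Prediction distribution", ["label"])
MODEL_CONFIDENCE = Histogram("model_confidence", "Prediction confidence scores",
    buckets=[0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99])

# A plain `def` lets FastAPI run the blocking model.infer() call in its threadpool
# instead of stalling the event loop.
@app.post("/predict")
def predict(request: PredictRequest):
    start = time.perf_counter()  # monotonic clock, unaffected by wall-clock adjustments
    try:
        result = model.infer(request.features)
        REQUEST_COUNT.labels(model_version=MODEL_VERSION, status="success").inc()
        PREDICTION_DIST.labels(label=result.label).inc()
        MODEL_CONFIDENCE.observe(result.confidence)
        return result
    except Exception:
        REQUEST_COUNT.labels(model_version=MODEL_VERSION, status="error").inc()
        raise
    finally:
        # Runs on both success and failure, so failed requests are timed too.
        REQUEST_LATENCY.labels(model_version=MODEL_VERSION).observe(time.perf_counter() - start)

@app.get("/metrics")
def metrics():
    # CONTENT_TYPE_LATEST carries the exposition-format version that Prometheus expects.
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)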

Prometheus Configuration

Scrape your model server from Prometheus:

scrape_configs:
  - job_name: ml_model_servers
    scrape_interval: 15s
    static_configs:
      - targets:
          - anomaly-detector:8080
          - chatbot-api:8080
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance

Grafana Dashboards for Model Health

Build a model health dashboard with four key panels:

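Request rate: sum(rate(model_requests_total[5m])), which shows traffic patterns and sudden drops
Error rate: sum(rate(model_requests_total{status="error"}[5m])) / sum(rate(model_requests_total[5m])). The sum() on both sides matters: dividing the raw series matches labels one-to-one, so error series would only be divided by themselves and the result would always be 1
p99 latency: histogram_quantile(0.99, sum by (le) (rate(model_latency_seconds_bucket[5m]))), ideally shown next to a heatmap of the raw buckets
Prediction distribution over time: sum by (label) (rate(model_predictions_total[5m])), to detect shifts in output class balance that signal data drift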

Set Grafana alerts on p99 latency exceeding 500 ms or error rate exceeding 1%.
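It helps to know how histogram_quantile arrives at its number, because the result is an estimate that depends on the bucket boundaries. The sketch below is a simplified, standard-library approximation of the linear interpolation Prometheus applies to cumulative buckets. The function name and the sample counts are hypothetical.

```python
import math
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, int]]) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper_bound, with math.inf as the last bound (the `le="+Inf"` bucket).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0  # the lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies beyond the largest finite bucket, so the best
                # available answer is that bucket's upper bound.
                return buckets[-2][0]
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative counts for the latency buckets defined earlier (seconds).
latency_buckets = [
    (0.01, 100), (0.05, 400), (0.1, 700), (0.25, 900),
    (0.5, 980), (1.0, 995), (2.5, 1000), (math.inf, 1000),
]
p99 = estimate_quantile(0.99, latency_buckets)
```

With these counts the estimate is roughly 0.83 s, interpolated inside the 0.5 to 1.0 bucket. The real p99 could be anywhere in that bucket, which is why it is worth keeping a bucket boundary at the alert threshold (0.5 s here).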

Structured Logging with the ELK Stack

Log every inference request as structured JSON for debugging and compliance:

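import hashlib
import json
import time

import structlog

# Render each event as one JSON object per line so Filebeat and Logstash can parse it directly.
structlog.configure(processors=[
    structlog.processors.TimeStamper(fmt="iso"),
    structlog.processors.JSONRenderer(),
])
log = structlog.get_logger()

def hash_features(features: dict) -> str:
    # One possible implementation: a stable digest that identifies an input without logging raw features.
    return hashlib.sha256(json.dumps(features, sort_keys=True, default=str).encode()).hexdigest()

def predict_and_log(request_id: str, features: dict, model_version: str):
    start = time.perf_counter()
    result = model.infer(features)  # `model` is the loaded model object used by the server
    log.info("inference",
        request_id=request_id,
        model_version=model_version,
        input_hash=hash_features(features),
        prediction=result.label,
        confidence=result.confidence,
        latency_ms=round((time.perf_counter() - start) * 1000, 2),
    )
    return result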

Ship these logs via Filebeat to Logstash for enrichment, then index them in Elasticsearch. Kibana's Discover view and Lens visualisations then let you answer ad-hoc questions like "all low-confidence predictions in the last hour for input features matching pattern X."
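The same question can be asked of Elasticsearch directly. The sketch below only builds the Query DSL body as a Python dict. The helper name and defaults are hypothetical, and it assumes the log fields are indexed under the names used in the logging call, with `@timestamp` set by the shipping pipeline and `model_version` mapped as a keyword field.

```python
import json
from typing import Optional


def low_confidence_query(threshold: float = 0.6, window: str = "1h",
                         model_version: Optional[str] = None) -> dict:
    """Build a search body for recent predictions below a confidence threshold."""
    filters = [
        {"range": {"confidence": {"lt": threshold}}},
        # Elasticsearch date math: "now-1h" means one hour before query time.
        {"range": {"@timestamp": {"gte": f"now-{window}"}}},
    ]
    if model_version is not None:
        # Exact match; assumes a keyword mapping for this field.
        filters.append({"term": {"model_version": model_version}})
    return {
        "query": {"bool": {"filter": filters}},  # filter context: no scoring, cacheable
        "sort": [{"@timestamp": {"order": "desc"}}],
        "size": 100,
    }


body = low_confidence_query(threshold=0.6, window="1h", model_version="v1")
payload = json.dumps(body)  # ready to send as the body of a _search request
```

Filter context is used instead of `must` because these clauses are yes/no conditions, so relevance scoring adds nothing.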

Data Drift Detection

Statistical drift detection at the feature level catches distribution shifts before they degrade accuracy. Compute PSI (Population Stability Index) on incoming feature distributions versus the training baseline, and alert when PSI exceeds 0.2:

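import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, buckets: int = 10, eps: float = 1e-8) -> float:
    # Bin edges come from the training baseline so both histograms share the same bins.
    edges = np.histogram_bin_edges(expected, bins=buckets)
    expected_pcts = np.histogram(expected, bins=edges)[0] / len(expected)
    # Clip so out-of-range production values land in the outer bins instead of being silently dropped.
    actual_pcts = np.histogram(np.clip(actual, edges[0], edges[-1]), bins=edges)[0] / len(actual)
    # eps avoids log(0) and division by zero for empty bins.
    return float(np.sum((actual_pcts - expected_pcts) * np.log((actual_pcts + eps) / (expected_pcts + eps))))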
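To make the arithmetic concrete, here is a dependency-free sketch that computes PSI from bin proportions that have already been calculated, and flags features above the 0.2 threshold. The function names, feature names, and proportions are hypothetical and for illustration only.

```python
import math
from typing import Dict, Sequence


def psi_from_proportions(expected_pcts: Sequence[float], actual_pcts: Sequence[float],
                         eps: float = 1e-8) -> float:
    """PSI = sum over bins of (actual - expected) * ln(actual / expected)."""
    if len(expected_pcts) != len(actual_pcts):
        raise ValueError("both distributions must use the same bins")
    return sum(
        (a - e) * math.log((a + eps) / (e + eps))
        for e, a in zip(expected_pcts, actual_pcts)
    )


def drifted_features(baseline: Dict[str, Sequence[float]],
                     current: Dict[str, Sequence[float]],
                     threshold: float = 0.2) -> Dict[str, float]:
    """Return {feature: psi} for every feature whose PSI exceeds the threshold."""
    scores = {name: psi_from_proportions(baseline[name], current[name]) for name in baseline}
    return {name: score for name, score in scores.items() if score > threshold}


# Hypothetical per-bin proportions for two features, four bins each.
baseline = {"amount": [0.25, 0.25, 0.25, 0.25], "age": [0.25, 0.25, 0.25, 0.25]}
current = {"amount": [0.10, 0.20, 0.30, 0.40], "age": [0.24, 0.26, 0.25, 0.25]}
alerts = drifted_features(baseline, current)
```

Here `amount` scores about 0.23 and is flagged, while `age` scores well under 0.01 and is not. Each term of the sum is non-negative, so mass moving in either direction between bins increases the score.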

Conclusion

A production ML system without observability is flying blind. Instrument your model servers from day one, build dashboards that surface model behaviour — not just infrastructure health — and invest in drift detection so you know when your model's world has changed. The operational cost of a silent model degradation vastly exceeds the engineering cost of building the monitoring upfront.

Observability for ML Systems: Prometheus, Grafana, and the ELK Stack

Worksprout Team
