TECHNICAL BLOG
Deep Dives for Engineers
Detailed technical articles covering the real problems we solve in embedded systems, AI, and robotics engineering.
How to build a complete observability stack for production machine learning systems — metrics with Prometheus, dashboards with Grafana, and log analytics with Elasticsearch, Logstash, and Kibana.
Traditional software fails loudly — exceptions, non-200 status codes, process crashes. ML systems fail silently: the inference endpoint returns 200, latency looks normal, but the model's predictions have drifted and users are getting wrong answers. This requires an observability layer that goes beyond infrastructure metrics to capture model behaviour, data quality, and prediction semantics.
Add Prometheus metrics to your FastAPI model server:
import time

from fastapi import FastAPI, Response
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest
from pydantic import BaseModel

app = FastAPI()
MODEL_VERSION = "v1"  # placeholder; set this from your deployment metadata

class PredictRequest(BaseModel):
    features: list[float]  # assumed request schema; adapt to your model's inputs

# `model` is assumed to be loaded at startup; infer() returns an object
# exposing .label and .confidence.

REQUEST_COUNT = Counter("model_requests_total", "Total inference requests", ["model_version", "status"])
REQUEST_LATENCY = Histogram("model_latency_seconds", "Inference latency", ["model_version"],
                            buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5])
PREDICTION_DIST = Counter("model_predictions_total", "Prediction distribution", ["label"])
MODEL_CONFIDENCE = Histogram("model_confidence", "Prediction confidence scores",
                             buckets=[0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99])

@app.post("/predict")
async def predict(request: PredictRequest):
    start = time.perf_counter()  # monotonic clock, safe for measuring durations
    try:
        result = model.infer(request.features)
        REQUEST_COUNT.labels(model_version=MODEL_VERSION, status="success").inc()
        PREDICTION_DIST.labels(label=result.label).inc()
        MODEL_CONFIDENCE.observe(result.confidence)
        return result
    except Exception:
        REQUEST_COUNT.labels(model_version=MODEL_VERSION, status="error").inc()
        raise
    finally:
        # Latency is recorded for both successful and failed requests.
        REQUEST_LATENCY.labels(model_version=MODEL_VERSION).observe(time.perf_counter() - start)

@app.get("/metrics")
def metrics():
    # Serve the exposition format with the content type Prometheus expects.
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
Scrape your model server from Prometheus:
scrape_configs:
  - job_name: ml_model_servers
    scrape_interval: 15s
    static_configs:
      - targets:
          - anomaly-detector:8080
          - chatbot-api:8080
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
Build a model health dashboard with four key panels:
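Request rate: rate(model_requests_total[5m]), which shows traffic patterns and drops.
Error rate: sum(rate(model_requests_total{status="error"}[5m])) / sum(rate(model_requests_total[5m])). The sum() on each side is needed because PromQL division matches series by their full label set; without it, the status label stops error series from being divided by total traffic.
p99 latency: histogram_quantile(0.99, rate(model_latency_seconds_bucket[5m]))
Set Grafana alerts on p99 latency exceeding 500 ms or error rate exceeding 1%.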
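To make the p99 alert concrete, here is a minimal sketch in plain Python of the linear interpolation that histogram_quantile applies to cumulative bucket counts. The function name estimate_quantile and the sample counts are illustrative assumptions, not output from a real server; only the bucket boundaries come from the model_latency_seconds histogram defined earlier.

```python
import math

def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound_seconds, cumulative_count) pairs, sorted by
    upper bound, with at least one finite bucket and a final (math.inf, total)
    bucket, mirroring how Prometheus exposes histograms.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan

# Hypothetical cumulative counts for 100 requests (made up for illustration).
SAMPLE_BUCKETS = [
    (0.01, 10), (0.05, 40), (0.1, 70), (0.25, 90),
    (0.5, 98), (1.0, 100), (2.5, 100), (math.inf, 100),
]

p99 = estimate_quantile(0.99, SAMPLE_BUCKETS)
should_alert = p99 > 0.5  # the 500 ms threshold from the alert rule
print(f"p99 estimate: {p99:.3f}s, alert: {should_alert}")
```

With these sample counts the 99th request falls in the 0.5 to 1.0 s bucket, so the estimate lands at about 0.75 s and the alert would fire. The estimate is only as fine-grained as the buckets, which is why it pays to place bucket boundaries close to the latency threshold you alert on.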
Log every inference request as structured JSON for debugging and compliance:
import structlog

log = structlog.get_logger()

# `model` and `hash_features` are assumed to be defined elsewhere in your service;
# hashing the inputs keeps raw feature values out of the logs.
def predict_and_log(request_id: str, features: dict, model_version: str):
    result = model.infer(features)
    log.info(
        "inference",
        request_id=request_id,
        model_version=model_version,
        input_hash=hash_features(features),
        prediction=result.label,
        confidence=result.confidence,
        latency_ms=result.latency_ms,
    )
    return result
Ship these logs via Filebeat to Logstash for enrichment, then index in Elasticsearch. Kibana's Lens visualisation lets you build ad-hoc queries like "all low-confidence predictions in the last hour for input features matching pattern X."
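As a rough illustration of the kind of filter such a query expresses, the sketch below scans JSON log lines for low-confidence predictions within the last hour. It assumes structlog is configured to render JSON with an ISO 8601 "timestamp" field alongside its "event" key; the function name, the 0.6 threshold, and the sample records are hypothetical. In production you would run this as a Kibana query against Elasticsearch rather than in Python.

```python
import json
from datetime import datetime, timedelta, timezone

def low_confidence_events(lines, now, threshold=0.6, window=timedelta(hours=1)):
    """Return inference events newer than now - window with confidence below threshold."""
    cutoff = now - window
    matches = []
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not structured JSON
        if not isinstance(event, dict) or event.get("event") != "inference":
            continue
        # Accept a trailing "Z" as UTC on Python versions older than 3.11.
        timestamp = datetime.fromisoformat(event["timestamp"].replace("Z", "+00:00"))
        if timestamp >= cutoff and event["confidence"] < threshold:
            matches.append(event)
    return matches

# Made-up sample records using the field names from the logging call above.
NOW = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
SAMPLE_LINES = [
    json.dumps({"event": "inference", "timestamp": "2024-01-01T11:30:00+00:00",
                "request_id": "a", "prediction": "anomaly", "confidence": 0.55}),
    json.dumps({"event": "inference", "timestamp": "2024-01-01T11:45:00Z",
                "request_id": "b", "prediction": "normal", "confidence": 0.93}),
    json.dumps({"event": "inference", "timestamp": "2024-01-01T10:00:00+00:00",
                "request_id": "c", "prediction": "anomaly", "confidence": 0.40}),
    "plain text line that is not JSON",
]

flagged = low_confidence_events(SAMPLE_LINES, NOW)
print([event["request_id"] for event in flagged])
```

Only request "a" is both recent and below the threshold; "b" is confident and "c" is outside the one-hour window.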
Statistical drift detection at the feature level catches distribution shifts before they degrade accuracy. Compute PSI (Population Stability Index) on incoming feature distributions versus the training baseline, and alert when PSI exceeds 0.2:
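import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, buckets: int = 10) -> float:
    # Bin edges come from the training baseline so both histograms share the same bins.
    expected_counts, edges = np.histogram(expected, bins=buckets)
    # Clip live values into the baseline range so out-of-range samples count in the edge bins.
    actual_counts = np.histogram(np.clip(actual, edges[0], edges[-1]), bins=edges)[0]
    expected_pcts = expected_counts / len(expected)
    actual_pcts = actual_counts / len(actual)
    eps = 1e-8  # avoids log(0) and division by zero for empty bins
    return float(np.sum((actual_pcts - expected_pcts) * np.log((actual_pcts + eps) / (expected_pcts + eps))))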
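To see how the 0.2 threshold behaves, here is a small worked example that applies the same PSI formula to bin proportions that are already computed, using only the standard library. The helper name psi_from_proportions and the two distributions are made up for illustration; they are not taken from a real feature.

```python
import math

def psi_from_proportions(expected_pcts, actual_pcts, eps=1e-8):
    """PSI over pre-binned proportions: sum of (a - e) * ln(a / e) per bin."""
    if len(expected_pcts) != len(actual_pcts):
        raise ValueError("both distributions must use the same bins")
    return sum(
        (a - e) * math.log((a + eps) / (e + eps))
        for e, a in zip(expected_pcts, actual_pcts)
    )

# Hypothetical four-bin feature: uniform at training time, skewed in production.
BASELINE = [0.25, 0.25, 0.25, 0.25]
LIVE = [0.10, 0.20, 0.30, 0.40]

score = psi_from_proportions(BASELINE, LIVE)
drifted = score > 0.2  # alert threshold used in this article
print(f"PSI = {score:.3f}, drifted = {drifted}")
```

Here the live traffic has shifted mass from the lowest bin to the highest, giving a PSI of roughly 0.23, just over the alert threshold. Identical distributions score 0, and every bin contributes a non-negative term, so PSI only grows as the distributions diverge.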
A production ML system without observability is flying blind. Instrument your model servers from day one, build dashboards that surface model behaviour — not just infrastructure health — and invest in drift detection so you know when your model's world has changed. The operational cost of a silent model degradation vastly exceeds the engineering cost of building the monitoring upfront.