Monitoring Guide

How to use the Orion monitoring stack: Prometheus metrics, Grafana dashboards, the System Health dashboard page, and the CLI.

Starting the Monitoring Stack

The monitoring stack (Prometheus + Grafana) is defined in a separate Docker Compose file that is layered on top of the base file:

make up-monitoring

Or manually:

docker compose -f deploy/docker-compose.yml -f deploy/docker-compose.monitoring.yml up -d

This starts all Orion services plus:

  • Prometheus at http://localhost:9090
  • Grafana at http://localhost:3003 (default login: admin / admin)
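
Both containers need a few seconds before they accept requests, so scripts that run right after startup may want to wait for them. The following is a minimal sketch, not part of Orion: it assumes the host ports listed above and uses the standard readiness endpoints of the two tools (`/-/ready` for Prometheus, `/api/health` for Grafana). The names `ENDPOINTS`, `http_ok`, and `wait_until_ready` are hypothetical.

```python
import time
import urllib.error
import urllib.request

# Assumption: host ports as listed above; the paths are the tools' own
# readiness endpoints, not Orion-specific routes.
ENDPOINTS = {
    "prometheus": "http://localhost:9090/-/ready",
    "grafana": "http://localhost:3003/api/health",
}


def http_ok(url, timeout=2.0):
    """Return True if the URL answers with HTTP 200, False on any connection error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(endpoints, probe=http_ok, attempts=30, delay=2.0, sleep=time.sleep):
    """Poll each endpoint until it responds, for at most `attempts` rounds.

    Returns the set of endpoint names that never became ready (empty on success).
    """
    pending = dict(endpoints)
    for _ in range(attempts):
        # Keep only the endpoints that still fail the probe.
        pending = {name: url for name, url in pending.items() if not probe(url)}
        if not pending:
            break
        sleep(delay)
    return set(pending)


# Example usage (performs real HTTP requests):
#   not_ready = wait_until_ready(ENDPOINTS)
#   if not_ready:
#       raise SystemExit("Not ready: " + ", ".join(sorted(not_ready)))
```

The `probe` and `sleep` parameters are injectable so the wait loop can be exercised without a running stack.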

Grafana Dashboards

Default Dashboards

Orion ships with three pre-built Grafana dashboards, auto-provisioned on startup from deploy/grafana/provisioning/dashboards/json/. No manual configuration is required: start the monitoring stack and open Grafana at http://localhost:3003.

Orion Overview

The main operational dashboard. Key panels:

| Panel | What It Shows |
| --- | --- |
| Request Rate | Requests per second by service |
| Error Rate | 5xx errors per second |
| P95 Latency | 95th percentile response time |
| Active WebSockets | Current WebSocket connections |
| Pipeline Status | Pipeline runs by status (completed, failed, etc.) |

Provider Health

AI provider availability and performance:

| Panel | What It Shows |
| --- | --- |
| Provider Status | Connection status for each AI provider |
| Response Times | Latency for LLM, image, video, TTS calls |
| Cost Tracking | Estimated cost per provider per hour |
| Error Rates | Provider-specific failure rates |

GPU & Resources

GPU and host resource usage. This dashboard requires the --profile gpu flag when starting Docker Compose. Key panels:

| Panel | What It Shows |
| --- | --- |
| GPU Utilization | Real-time GPU usage percentage |
| GPU Memory | VRAM usage and available memory |
| GPU Temperature | Current temperature reading |
| CPU / RAM | Host CPU and memory utilization |

Prometheus Metrics

Every Orion service exposes a /metrics endpoint scraped by Prometheus at 15-second intervals.

Scrape Targets

| Service | Endpoint |
| --- | --- |
| Gateway | gateway:8000 |
| Scout | scout:8001 |
| Director | director:8002 |
| Media | media:8003 |
| Editor | editor:8004 |
| Pulse | pulse:8005 |
| Milvus | milvus:9091 |
| Ollama | ollama:11434 |
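
To confirm that Prometheus is actually reaching these targets, the Prometheus HTTP API exposes `/api/v1/targets`. The sketch below lists every active target that is not reporting `up`. It assumes Prometheus is reachable on the host port given earlier. The job names depend on Orion's scrape configuration, which is not shown here, so the ones in the usage comment are only illustrative. `fetch_targets` and `unhealthy_targets` are hypothetical helper names.

```python
import json
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # assumption: host port from the startup section


def fetch_targets(base_url=PROMETHEUS_URL, timeout=5.0):
    """Fetch the active scrape targets from the Prometheus HTTP API."""
    url = f"{base_url}/api/v1/targets?state=active"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def unhealthy_targets(payload):
    """Return (job, scrape_url, last_error) for every target whose health is not "up"."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        (t.get("labels", {}).get("job", "?"), t.get("scrapeUrl", "?"), t.get("lastError", ""))
        for t in targets
        if t.get("health") != "up"
    ]


# Example usage (performs a real HTTP request):
#   for job, url, err in unhealthy_targets(fetch_targets()):
#       print(f"{job}: {url} is not up ({err})")
```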

Useful PromQL Queries

# Request rate by service
rate(http_requests_total[5m])

# 95th percentile latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Error rate percentage (share of requests that returned 5xx)
100 * sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# Pipeline success rate (last hour)
sum(rate(pipeline_runs_total{status="completed"}[1h]))
  / sum(rate(pipeline_runs_total[1h]))

# Trends detected in the last 24 hours
increase(trends_detected_total[24h])
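
The same queries can be run from a script through the Prometheus HTTP API (`/api/v1/query`), which is handy for reports or ad hoc checks. This is a minimal sketch, assuming Prometheus is reachable on the host port given earlier. `query_url`, `parse_vector`, and `run_query` are hypothetical helper names, and only instant-vector results are handled.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # assumption: host port from the startup section


def query_url(expr, base_url=PROMETHEUS_URL):
    """Build the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def parse_vector(payload):
    """Convert an instant-vector response into a list of (labels, value) pairs."""
    if payload.get("status") != "success":
        raise RuntimeError(payload.get("error", "query failed"))
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector result, got {data['resultType']}")
    # Each sample is {"metric": {...labels...}, "value": [timestamp, "value-as-string"]}.
    return [(sample["metric"], float(sample["value"][1])) for sample in data["result"]]


def run_query(expr, base_url=PROMETHEUS_URL, timeout=5.0):
    """Run a PromQL instant query and return its parsed samples."""
    with urllib.request.urlopen(query_url(expr, base_url), timeout=timeout) as resp:
        return parse_vector(json.load(resp))


# Example usage (performs a real HTTP request):
#   for labels, value in run_query("increase(trends_detected_total[24h])"):
#       print(labels, value)
```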

Alerting

Use Prometheus alerting rules to get notified when error rates spike or services go down. Pre-built alert rules are available in deploy/prometheus/alert_rules.yml. Configure notification channels (Slack, email, PagerDuty) in Grafana under Alerting > Contact points.


Dashboard System Health Page

The Orion Dashboard includes a built-in System page at http://localhost:3001/system:

  • Service status cards -- health status of each microservice (gateway, scout, director, media, editor, pulse)
  • GPU utilization gauge -- real-time GPU usage when running with the GPU profile
  • Queue depth -- number of content items in each pipeline stage

The System page polls the gateway /api/v1/system/health endpoint and refreshes its status cards, gauge, and queue counts with each response.


CLI Monitoring Commands

System Status

orion system status
Orion System Status
───────────────────
Mode:       LOCAL
GPU:        Available (NVIDIA RTX 4090, 24GB)
Services:   6/6 healthy
Queue:      3 items pending
Uptime:     2h 15m

Health Check

orion system health
┌───────────┬─────────┬──────────────┬──────────┐
│ Service   │ Status  │ Latency      │ Version  │
├───────────┼─────────┼──────────────┼──────────┤
│ gateway   │ healthy │ 2ms          │ 0.1.0    │
│ scout     │ healthy │ 15ms         │ 0.1.0    │
│ director  │ healthy │ 12ms         │ 0.1.0    │
│ media     │ healthy │ 18ms         │ 0.1.0    │
│ editor    │ healthy │ 14ms         │ 0.1.0    │
│ pulse     │ healthy │ 11ms         │ 0.1.0    │
└───────────┴─────────┴──────────────┴──────────┘

JSON Output for Automation

orion system health --format json | jq '.services | to_entries[] | select(.value.status != "healthy")'

This returns only unhealthy services -- useful for CI checks or alerting scripts.
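
For scripts that need an exit code rather than filtered JSON, the same check can be written in Python. This is a sketch under one assumption taken from the jq filter above: the JSON output has a top-level `services` object that maps each service name to an object with a `status` field. `unhealthy_services` and `check` are hypothetical helper names.

```python
import json
import subprocess
import sys


def unhealthy_services(report):
    """Mirror the jq filter: keep entries of `services` whose status is not "healthy"."""
    return {
        name: info
        for name, info in report.get("services", {}).items()
        if info.get("status") != "healthy"
    }


def check(cmd=("orion", "system", "health", "--format", "json")):
    """Run the CLI, parse its JSON output, and return 0 if all services are healthy, else 1."""
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    bad = unhealthy_services(json.loads(out))
    for name, info in sorted(bad.items()):
        print(f"{name}: {info.get('status')}", file=sys.stderr)
    return 1 if bad else 0


# Example usage in a CI step (runs the orion CLI):
#   sys.exit(check())
```

Passing the command as a parameter keeps the parsing logic testable without the CLI installed.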


Next Steps