Monitoring Guide¶
How to use the Orion monitoring stack: Prometheus metrics, Grafana dashboards, the System Health dashboard page, and the CLI.
Starting the Monitoring Stack¶
The monitoring stack (Prometheus + Grafana) runs as a separate Docker Compose profile:
Or manually:
This starts all Orion services plus:
- Prometheus at
http://localhost:9090 - Grafana at
http://localhost:3003(default login:admin/admin)
Grafana Dashboards¶
Default Dashboards
Orion ships with three pre-built Grafana dashboards that are auto-provisioned on startup from deploy/grafana/provisioning/dashboards/json/. No manual configuration is required -- just start the monitoring stack and the dashboards are ready to use at http://localhost:3003.
Three pre-built dashboards are auto-provisioned from deploy/grafana/provisioning/dashboards/json/:
Orion Overview¶
The main operational dashboard. Key panels:
| Panel | What It Shows |
|---|---|
| Request Rate | Requests per second by service |
| Error Rate | 5xx errors per second |
| P95 Latency | 95th percentile response time |
| Active WebSockets | Current WebSocket connections |
| Pipeline Status | Pipeline runs by status (completed, failed, etc) |
Provider Health¶
AI provider availability and performance:
| Panel | What It Shows |
|---|---|
| Provider Status | Connection status for each AI provider |
| Response Times | Latency for LLM, image, video, TTS calls |
| Cost Tracking | Estimated cost per provider per hour |
| Error Rates | Provider-specific failure rates |
GPU & Resources¶
Requires the --profile gpu flag when starting Docker Compose:
| Panel | What It Shows |
|---|---|
| GPU Utilization | Real-time GPU usage percentage |
| GPU Memory | VRAM usage and available memory |
| GPU Temperature | Current temperature reading |
| CPU / RAM | Host CPU and memory utilization |
Prometheus Metrics¶
Every Orion service exposes a /metrics endpoint scraped by Prometheus at 15-second intervals.
Scrape Targets¶
| Service | Endpoint |
|---|---|
| Gateway | gateway:8000 |
| Scout | scout:8001 |
| Director | director:8002 |
| Media | media:8003 |
| Editor | editor:8004 |
| Pulse | pulse:8005 |
| Milvus | milvus:9091 |
| Ollama | ollama:11434 |
Useful PromQL Queries¶
# Request rate by service
rate(http_requests_total[5m])
# 95th percentile latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))
# Pipeline success rate (last hour)
sum(rate(pipeline_runs_total{status="completed"}[1h]))
/ sum(rate(pipeline_runs_total[1h]))
# Trends detected in the last 24 hours
increase(trends_detected_total[24h])
Alerting
Use Prometheus alerting rules to get notified when error rates spike or services go down. Pre-built alert rules are available in deploy/prometheus/alert_rules.yml. Configure notification channels (Slack, email, PagerDuty) in Grafana under Alerting > Contact points.
Dashboard System Health Page¶
The Orion Dashboard includes a built-in System page at http://localhost:3001/system:
- Service status cards -- health status of each microservice (gateway, scout, director, media, editor, pulse)
- GPU utilization gauge -- real-time GPU usage when running with the GPU profile
- Queue depth -- number of content items in each pipeline stage
The System page polls the gateway /api/v1/system/health endpoint and updates in real time.
CLI Monitoring Commands¶
System Status¶
Orion System Status
───────────────────
Mode: LOCAL
GPU: Available (NVIDIA RTX 4090, 24GB)
Services: 6/6 healthy
Queue: 3 items pending
Uptime: 2h 15m
Health Check¶
┌───────────┬─────────┬──────────────┬──────────┐
│ Service │ Status │ Latency │ Version │
├───────────┼─────────┼──────────────┼──────────┤
│ gateway │ healthy │ 2ms │ 0.1.0 │
│ scout │ healthy │ 15ms │ 0.1.0 │
│ director │ healthy │ 12ms │ 0.1.0 │
│ media │ healthy │ 18ms │ 0.1.0 │
│ editor │ healthy │ 14ms │ 0.1.0 │
│ pulse │ healthy │ 11ms │ 0.1.0 │
└───────────┴─────────┴──────────────┴──────────┘
JSON Output for Automation¶
orion system health --format json | jq '.services | to_entries[] | select(.value.status != "healthy")'
This returns only unhealthy services -- useful for CI checks or alerting scripts.
Next Steps¶
- System Administration -- Service health and provider configuration
- Full Pipeline Demo -- End-to-end walkthrough
- Analytics Guide -- Pipeline performance and cost tracking
- Provider Setup -- Configure AI providers