Skip to content

Prometheus

Prometheus scrapes metrics from all Orion services at 15-second intervals.

Configuration

The configuration lives at deploy/prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    project: orion

rule_files:
  - "/etc/prometheus/alert_rules.yml"

Scrape Targets

Job Target Metrics Path
prometheus localhost:9090 /metrics
gateway gateway:8000 /metrics
scout scout:8001 /metrics
director director:8002 /metrics
media media:8003 /metrics
editor editor:8004 /metrics
pulse pulse:8005 /metrics
milvus milvus:9091 /metrics
ollama ollama:11434 /api/metrics

Optional Exporters

Uncomment in prometheus.yml to enable:

Job Target Image
postgres postgres-exporter:9187 prometheuscommunity/postgres-exporter
redis redis-exporter:9121 oliver006/redis_exporter

Key Metrics

Gateway (Go)

Metric Type Description
http_requests_total Counter Total HTTP requests by method, path, status
http_request_duration_seconds Histogram Request latency distribution
http_requests_in_flight Gauge Currently active requests
websocket_connections_active Gauge Active WebSocket connections

Python Services

Metric Type Description
http_requests_total Counter FastAPI request count
http_request_duration_seconds Histogram Request duration
event_bus_messages_published_total Counter Redis events published
event_bus_messages_received_total Counter Redis events consumed

Service-Specific

Service Metric Description
Scout trends_detected_total Total trends detected
Director pipeline_runs_total Pipeline executions by status
Director pipeline_duration_seconds Pipeline execution time
Media images_generated_total Images generated by provider
Editor videos_rendered_total Videos rendered
Pulse events_aggregated_total Events processed

Querying

Access Prometheus at http://localhost:9090 and use PromQL:

# Request rate by service
rate(http_requests_total[5m])

# 95th percentile latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# Pipeline success rate
sum(rate(pipeline_runs_total{status="completed"}[1h]))
  / sum(rate(pipeline_runs_total[1h]))