Metrics Reference
Complete reference of all Prometheus metrics exposed by the Rastir collector server on the /metrics endpoint.
All metrics are derived server-side from ingested span data. The client library does not expose a metrics endpoint.
Span Counters
Core counters tracking span ingestion and call volumes.
| Metric | Type | Labels | Description |
|---|---|---|---|
rastir_spans_ingested_total | Counter | service, env, span_type, status | Total spans ingested |
rastir_llm_calls_total | Counter | service, env, model, provider, agent | LLM invocations |
rastir_tokens_input_total | Counter | service, env, model, provider, agent | Input (prompt) tokens |
rastir_tokens_output_total | Counter | service, env, model, provider, agent | Output (completion) tokens |
rastir_tool_calls_total | Counter | service, env, tool_name, agent, model, provider | Tool invocations |
rastir_retrieval_calls_total | Counter | service, env, agent, model, provider | Retrieval operations |
rastir_errors_total | Counter | service, env, span_type, error_type, model, provider, agent | Error spans by normalised category |
Histograms
Histograms track the distribution of values, enabling percentile calculations via PromQL’s histogram_quantile().
| Metric | Type | Labels | Default Buckets | Unit |
|---|---|---|---|---|
rastir_duration_seconds | Histogram | service, env, span_type, model, provider | 0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0 | seconds |
rastir_tokens_per_call | Histogram | service, env, model, provider | 10, 50, 100, 250, 500, 1000, 2000, 4000, 8000, 16000, 32000 | tokens |
rastir_cost_per_call_usd | Histogram | service, env, model | 0.0001, 0.0005, 0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, 100 | USD |
rastir_ttft_seconds | Histogram | service, env, model, provider | 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10 | seconds |
Maximum of 20 buckets per histogram. Buckets are configurable via YAML or environment variables — see Configuration.
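The bucket limits above can be checked before the server starts. Below is a minimal sketch that assumes bounds arrive either as a YAML list or as a comma-separated environment variable string. The helper name `parse_buckets` is hypothetical, and so is the choice to ignore a user-supplied `+Inf` bound. This is not Rastir's actual validator.

```python
import math

MAX_BUCKETS = 20  # documented per-histogram limit


def parse_buckets(raw):
    """Parse and validate histogram bucket bounds (hypothetical helper).

    `raw` may be a list of numbers (e.g. from YAML) or a comma-separated
    string (e.g. from an environment variable).
    """
    if isinstance(raw, str):
        values = [float(part) for part in raw.split(",") if part.strip()]
    else:
        values = [float(v) for v in raw]

    # Assumption: +Inf is implicit in Prometheus histograms, so a supplied
    # +Inf bound is dropped and does not count toward the limit.
    values = [v for v in values if not math.isinf(v)]

    if not values:
        raise ValueError("at least one bucket bound is required")
    if len(values) > MAX_BUCKETS:
        raise ValueError(f"{len(values)} buckets exceeds the maximum of {MAX_BUCKETS}")
    if any(b <= a for a, b in zip(values, values[1:])):
        raise ValueError("bucket bounds must be strictly increasing")
    return values
```

For example, `parse_buckets("0.1, 0.5, 1.0")` returns `[0.1, 0.5, 1.0]`. A list of 21 bounds raises `ValueError`.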
What Prometheus Exposes
For each histogram, the /metrics endpoint exposes cumulative bucket series plus a sum and a count:

```
rastir_duration_seconds_bucket{..., le="0.25"}  → count of spans ≤ 0.25s
rastir_duration_seconds_bucket{..., le="1.0"}   → count of spans ≤ 1.0s
rastir_duration_seconds_bucket{..., le="+Inf"}  → total count
rastir_duration_seconds_sum{...}                → sum of all values
rastir_duration_seconds_count{...}              → same as +Inf bucket
```
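The cumulative meaning of `le` can be reproduced in a few lines. Below is a minimal sketch of how a set of observed durations becomes the `_bucket`, `_sum` and `_count` values. The function name is hypothetical, and this is not the Rastir server code.

```python
import math


def histogram_series(observations, bounds):
    """Return cumulative bucket counts, sum and count for `observations`.

    Each bucket counts every observation <= its upper bound (le), so counts
    never decrease as le grows, and the +Inf bucket equals the total count.
    """
    les = list(bounds) + [math.inf]
    buckets = {le: sum(1 for v in observations if v <= le) for le in les}
    return buckets, sum(observations), len(observations)


durations = [0.12, 0.3, 0.8, 2.4]
buckets, total, count = histogram_series(durations, [0.25, 1.0])
# buckets == {0.25: 1, 1.0: 3, inf: 4}; count == buckets[inf] == 4
```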
Percentile Queries — P50, P95, P99
```promql
# P50 (median) LLM call duration.
# sum by (le) merges the per-model/provider series; le must be kept.
histogram_quantile(0.50,
  sum by (le) (
    rate(rastir_duration_seconds_bucket{span_type="llm"}[5m])
  )
)

# P95 LLM call duration
histogram_quantile(0.95,
  sum by (le) (
    rate(rastir_duration_seconds_bucket{span_type="llm"}[5m])
  )
)

# P99 LLM call duration (tail latency)
histogram_quantile(0.99,
  sum by (le) (
    rate(rastir_duration_seconds_bucket{span_type="llm"}[5m])
  )
)

# P95 duration per model
histogram_quantile(0.95,
  sum by (model, le) (
    rate(rastir_duration_seconds_bucket{span_type="llm"}[5m])
  )
)

# P50 tokens per LLM call
histogram_quantile(0.50,
  sum by (le) (
    rate(rastir_tokens_per_call_bucket[5m])
  )
)

# P95 tool execution time
histogram_quantile(0.95,
  sum by (le) (
    rate(rastir_duration_seconds_bucket{span_type="tool"}[5m])
  )
)
```
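`histogram_quantile()` never sees individual observations. It finds the bucket that contains the target rank and interpolates linearly inside that bucket. The sketch below models that behaviour in simplified form. It omits some of Prometheus's edge-case handling, such as repairing non-monotonic buckets. The arithmetic is the same whether the inputs are raw counts or `rate()` results.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (le, count) pairs.

    Simplified model of PromQL's histogram_quantile(): the highest bucket
    must be +Inf, and observations are assumed to be spread uniformly
    inside each bucket.
    """
    if q < 0:
        return -math.inf
    if q > 1:
        return math.inf
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    for i, (le, count) in enumerate(buckets[:-1]):
        if count >= rank:
            lower, prev = (0.0, 0) if i == 0 else buckets[i - 1]
            if count == prev:
                return math.nan  # only reachable for q == 0 with an empty first bucket
            # Linear interpolation between the bucket's lower and upper bound.
            return lower + (le - lower) * (rank - prev) / (count - prev)
    # Rank falls in the +Inf bucket: return the highest finite bound.
    return buckets[-2][0]


b = [(0.25, 10), (0.5, 20), (1.0, 40), (math.inf, 40)]
p50 = histogram_quantile(0.5, b)  # 0.5
```

In this example, P50 comes out as exactly 0.5 s, wherever the ten values in (0.25, 0.5] actually fall. Finer buckets around your latency thresholds therefore buy accuracy. When a rank lands in the `+Inf` bucket, the estimate is capped at the highest finite bound.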
Average & Throughput Queries
```promql
# Average LLM call duration
sum(rate(rastir_duration_seconds_sum{span_type="llm"}[5m]))
/
sum(rate(rastir_duration_seconds_count{span_type="llm"}[5m]))

# LLM calls per second
sum(rate(rastir_llm_calls_total[5m]))

# Error rate as a percentage.
# Aggregate first: the two counters carry different label sets, so a raw
# vector division would find no matching series and return nothing.
sum(rate(rastir_errors_total[5m]))
/
sum(rate(rastir_spans_ingested_total[5m])) * 100

# Average tokens per call
sum(rate(rastir_tokens_per_call_sum[5m]))
/
sum(rate(rastir_tokens_per_call_count[5m]))
```
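The average queries work because `rate()` divides both increases by the same window. The window therefore cancels, and the result is the increase in `_sum` divided by the increase in `_count`. The sketch below uses two hypothetical scrapes taken 5 minutes apart and assumes no counter reset between them. The real `rate()` also handles resets and extrapolates to the window edges.

```python
def per_second_rate(first, last, seconds):
    """Per-second increase of a counter between two scrapes (no resets assumed)."""
    return (last - first) / seconds


def average_between_scrapes(sum_0, count_0, sum_1, count_1, seconds):
    """Average observed value over the window, as rate(_sum) / rate(_count)."""
    return per_second_rate(sum_0, sum_1, seconds) / per_second_rate(count_0, count_1, seconds)


# Hypothetical scrapes of rastir_duration_seconds{span_type="llm"}, 300 s apart:
# _sum went 100.0 -> 160.0 and _count went 50 -> 80.
calls_per_second = per_second_rate(50, 80, 300)
avg_duration = average_between_scrapes(100.0, 50, 160.0, 80, 300)
```

Here 30 calls in 300 s gives 0.1 calls/s. Those calls took 60 s in total, so the average is 2.0 s.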
Cost & TTFT Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
rastir_cost_total | Counter | service, env, model, provider, agent, pricing_profile | Accumulated USD cost |
rastir_cost_per_call_usd | Histogram | service, env, model | Cost distribution per LLM call |
rastir_pricing_missing_total | Counter | service, env, model, provider | LLM calls missing pricing data |
rastir_ttft_seconds | Histogram | service, env, model, provider | Time-To-First-Token for streaming calls |
Cost metrics are only recorded when the client sends cost_usd as a span attribute (requires enable_cost_calculation=True). TTFT metrics are only recorded for streaming LLM spans that include ttft_ms.
The pricing_profile label on rastir_cost_total is cardinality-guarded with a cap of 20 distinct values. The rastir_cost_per_call_usd histogram intentionally excludes pricing_profile to prevent cardinality explosion.
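The recording conditions above amount to a per-span gate. Below is a minimal sketch that assumes a span is a plain dict of attributes. Only `cost_usd` and `ttft_ms` come from this page. The `span_type` and `streaming` keys and the function name are hypothetical.

```python
def cost_and_ttft_observations(span):
    """Return (metric, value) pairs a single span would contribute.

    Hypothetical span shape: {"span_type": "llm", "streaming": bool,
    "cost_usd": float, "ttft_ms": float}. Missing keys mean "not recorded".
    """
    observations = []
    if span.get("span_type") != "llm":
        return observations
    cost = span.get("cost_usd")
    if cost is not None:
        # Assumed: the same value feeds the running counter and the histogram.
        observations.append(("rastir_cost_total", cost))
        observations.append(("rastir_cost_per_call_usd", cost))
    ttft_ms = span.get("ttft_ms")
    if span.get("streaming") and ttft_ms is not None:
        # The client reports milliseconds; the histogram unit is seconds.
        observations.append(("rastir_ttft_seconds", ttft_ms / 1000.0))
    return observations
```

A non-streaming span without `cost_usd` contributes nothing to these metrics, even if it carries `ttft_ms`.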
Guardrail Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
rastir_guardrail_requests_total | Counter | service, env, provider, model, agent, guardrail_id, guardrail_version | Guardrail-enabled LLM calls |
rastir_guardrail_violations_total | Counter | service, env, provider, model, agent, guardrail_id, guardrail_action, guardrail_category | Guardrail interventions |
Guardrail labels are cardinality-guarded with bounded enum validation:
| Label | Allowed Values |
|---|---|
guardrail_category | CONTENT_POLICY, SENSITIVE_INFORMATION_POLICY, WORD_POLICY, TOPIC_POLICY, CONTEXTUAL_GROUNDING_POLICY, DENIED_TOPIC |
guardrail_action | GUARDRAIL_INTERVENED, NONE |
guardrail_id | Subject to cardinality cap (default: 100) |
Unknown values are replaced with __cardinality_overflow__. Validation runs on both the client adapter and the server.
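Bounded enum validation maps any value outside the allowed set to the overflow sentinel. Below is a minimal sketch using the allowed values from the table. It assumes exact, case-sensitive matching. The function and constant names are hypothetical and do not come from the Rastir adapter.

```python
OVERFLOW = "__cardinality_overflow__"

ALLOWED_GUARDRAIL_CATEGORIES = frozenset({
    "CONTENT_POLICY", "SENSITIVE_INFORMATION_POLICY", "WORD_POLICY",
    "TOPIC_POLICY", "CONTEXTUAL_GROUNDING_POLICY", "DENIED_TOPIC",
})
ALLOWED_GUARDRAIL_ACTIONS = frozenset({"GUARDRAIL_INTERVENED", "NONE"})


def bounded_label(value, allowed):
    """Return `value` if it is an allowed enum member, else the overflow sentinel."""
    return value if value in allowed else OVERFLOW
```

For example, `bounded_label("BLOCKED", ALLOWED_GUARDRAIL_ACTIONS)` yields `__cardinality_overflow__`.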
Evaluation Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
rastir_evaluation_runs_total | Counter | service, env, model, provider, agent, evaluation_type, evaluator_model, evaluator_provider | Evaluation runs |
rastir_evaluation_failures_total | Counter | service, env, model, provider, agent, evaluation_type, evaluator_model, evaluator_provider | Failed evaluations |
rastir_evaluation_latency_seconds | Histogram | service, env, model, provider, agent, evaluation_type, evaluator_model, evaluator_provider | Evaluation execution time |
rastir_evaluation_score | Gauge | service, env, model, provider, agent, evaluation_type, evaluator_model, evaluator_provider | Evaluation score |
rastir_evaluation_queue_size | Gauge | — | Evaluation queue depth |
rastir_evaluation_queue_utilization_percent | Gauge | — | Evaluation queue fill percentage |
rastir_evaluation_dropped_total | Counter | service, env | Evaluations dropped due to full queue |
Operational Metrics
Server health and performance metrics.
| Metric | Type | Description |
|---|---|---|
rastir_queue_size | Gauge | Current ingestion queue depth |
rastir_queue_utilization_percent | Gauge | Queue fill percentage |
rastir_memory_bytes | Gauge | Server process RSS memory |
rastir_trace_store_size | Gauge | Total spans in trace store |
rastir_active_traces | Gauge | Distinct trace count in store |
rastir_ingestion_rate | Gauge | Spans ingested per second |
rastir_ingestion_rejections_total | Counter | Rejected spans (backpressure) |
rastir_export_failures_total | Counter | OTLP export failures |
rastir_redaction_applied_total | Counter | Redaction rules applied (labels: service, env) |
rastir_redaction_failures_total | Counter | Redaction processing failures (labels: service, env) |
Sampling & Backpressure Metrics
| Metric | Type | Description |
|---|---|---|
rastir_spans_sampled_total | Counter | Spans retained after sampling |
rastir_spans_dropped_by_sampling_total | Counter | Spans dropped by sampling |
rastir_backpressure_warnings_total | Counter | Soft limit warnings |
rastir_spans_dropped_by_backpressure_total | Counter | Spans dropped by backpressure |
rastir_rate_limited_total | Counter | Rate-limited requests (by dimension) |
Error Type Normalisation
The rastir_errors_total counter uses normalised error categories instead of raw exception class names, preventing unbounded label cardinality.
| Normalised category | Matched exception patterns |
|---|---|
timeout | TimeoutError, asyncio.TimeoutError, httpx.TimeoutException, httpx.ReadTimeout, httpx.ConnectTimeout, requests.exceptions.Timeout, openai.APITimeoutError |
rate_limit | RateLimitError, openai.RateLimitError, anthropic.RateLimitError |
validation_error | ValueError, TypeError, ValidationError, pydantic.ValidationError |
provider_error | openai.APIError, openai.APIConnectionError, openai.APIStatusError, anthropic.APIError, anthropic.APIConnectionError, anthropic.APIStatusError, botocore.exceptions.ClientError |
internal_error | RuntimeError, Exception |
unknown | Any unrecognised exception type |
Normalisation uses exact match first, then substring heuristics (e.g., any exception with “timeout” in the name maps to timeout).
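The two-stage lookup can be sketched as follows. This is a minimal illustration, not Rastir's implementation. It takes the exception's name as a string, and its exact-match table is abbreviated from the one above. Only the documented "timeout" substring heuristic is applied, and anything else falls through to `unknown`.

```python
# Abbreviated from the normalisation table above.
EXACT = {
    "TimeoutError": "timeout",
    "asyncio.TimeoutError": "timeout",
    "openai.APITimeoutError": "timeout",
    "RateLimitError": "rate_limit",
    "openai.RateLimitError": "rate_limit",
    "anthropic.RateLimitError": "rate_limit",
    "ValueError": "validation_error",
    "TypeError": "validation_error",
    "pydantic.ValidationError": "validation_error",
    "openai.APIConnectionError": "provider_error",
    "botocore.exceptions.ClientError": "provider_error",
    "RuntimeError": "internal_error",
    "Exception": "internal_error",
}


def normalise_error_type(name):
    """Map an exception name to a bounded error_type label value."""
    if name in EXACT:                      # stage 1: exact match
        return EXACT[name]
    if "timeout" in name.lower():          # stage 2: substring heuristic
        return "timeout"
    return "unknown"
```

A caller might pass `type(exc).__name__`. In that case an unlisted class such as `UpstreamTimeoutFailure` still lands in `timeout`, while `KeyError` becomes `unknown`.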
Cardinality Guards
All high-cardinality labels are subject to per-dimension caps. Values exceeding the cap are replaced with __cardinality_overflow__.
| Label | Default Cap | Applies To |
|---|---|---|
model | 50 | llm_calls, tokens_*, cost_*, guardrail_*, evaluation_* |
provider | 10 | Same as model |
tool_name | 200 | tool_calls |
agent | 200 | llm_calls, tokens_*, tool_calls, guardrail_* |
error_type | 50 | errors |
guardrail_id | 100 | guardrail_* |
pricing_profile | 20 | cost_total |
Caps are configurable via server config — see Configuration.
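A per-dimension cap can be modelled as a first-come set of admitted values. Below is a minimal sketch that assumes values are admitted in arrival order and never evicted. The class name is hypothetical, and Rastir's actual admission policy may differ.

```python
import threading

OVERFLOW = "__cardinality_overflow__"


class CardinalityGuard:
    """Admit at most `cap` distinct values for one label; map the rest to OVERFLOW."""

    def __init__(self, cap):
        self.cap = cap
        self._seen = set()
        self._lock = threading.Lock()  # ingestion may run on several threads

    def apply(self, value):
        with self._lock:
            if value in self._seen:
                return value
            if len(self._seen) < self.cap:
                self._seen.add(value)
                return value
            return OVERFLOW


# Default caps from the table above (hypothetical wiring).
guards = {"model": CardinalityGuard(50), "provider": CardinalityGuard(10)}
```

Values admitted early keep their own series. Only values first seen after the cap is reached collapse into `__cardinality_overflow__`.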
SRE Config Gauges
These gauges are set once at server startup from the sre: configuration section. They are consumed by Prometheus recording rules to compute error budgets, cost budgets, burn rates, and exhaustion estimates.
| Metric | Type | Labels | Description |
|---|---|---|---|
rastir_slo_error_rate | Gauge | agent | Configured SLO error rate (e.g. 0.01 = 1%). Set per agent with fallback via agent="unknown" |
rastir_cost_budget_usd | Gauge | agent | Configured cost budget in USD per period. Set per agent with fallback via agent="unknown" |
These gauges are not updated at runtime — they reflect the static configuration. The derived SRE metrics (error budget remaining, burn rates, days to exhaustion, etc.) are computed entirely by Prometheus recording rules. See Server — SRE Recording Rules for details.
Derived Recording Rules
The following recording-rule time-series are generated by Prometheus (not by the Rastir server) and consumed by the SRE Budgets dashboard:
| Recording Rule | Description |
|---|---|
rastir:volume:{week,month} | Request volume capped at month boundary |
rastir:expected_volume:{week,month} | Rolling request volume for projection |
rastir:errors:{week,month} | Error count capped at month boundary |
rastir:errors_by_model:{week,month} | Error count by model (for pie charts) |
rastir:error_budget_total:{week,month} | Total error budget (volume × SLO) |
rastir:error_budget_remaining:{week,month} | Remaining error budget |
rastir:error_budget_consumed_pct:{week,month} | Consumed error budget % |
rastir:sla_status:{week,month} | SLA health (1 = healthy, 0 = breached) |
rastir:error_days_to_exhaustion:{week,month} | Days until error budget is exhausted |
rastir:error_burn_rate:{1h,6h} | Short/long window error burn rate |
rastir:cost:{week,month} | Cost consumed in period |
rastir:cost_budget_total:{week,month} | Configured cost budget |
rastir:cost_budget_remaining:{week,month} | Remaining cost budget |
rastir:cost_budget_consumed_pct:{week,month} | Cost budget consumed % |
rastir:cost_burn_rate_daily:{week,month} | Average daily cost burn rate |
rastir:cost_days_to_exhaustion:{week,month} | Days until cost budget is exhausted |
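The budget series reduce to simple arithmetic over the configured SLO and the observed counts. Below is a minimal sketch with hypothetical numbers. The function name is illustrative, and the formulas follow the table descriptions (for example, total budget = volume × SLO). Burn rate uses the common definition, observed error ratio divided by SLO error rate, over the whole period instead of a 1h or 6h window. Days to exhaustion assumes errors keep arriving at the period's average daily rate. The real recording rules may differ in these details.

```python
def error_budget(volume, errors, slo_error_rate, days_elapsed):
    """Error-budget figures for one period (illustrative, not the recording rules)."""
    total = volume * slo_error_rate                    # cf. error_budget_total
    remaining = total - errors                         # cf. error_budget_remaining
    consumed_pct = 100.0 * errors / total if total else 0.0
    burn_rate = (errors / volume) / slo_error_rate if volume else 0.0
    errors_per_day = errors / days_elapsed if days_elapsed else 0.0
    days_to_exhaustion = (
        remaining / errors_per_day if errors_per_day > 0 else float("inf")
    )
    return {
        "total": total,
        "remaining": remaining,
        "consumed_pct": consumed_pct,
        "sla_healthy": remaining > 0,                  # cf. sla_status (1/0)
        "burn_rate": burn_rate,
        "days_to_exhaustion": days_to_exhaustion,
    }


# 100,000 requests and 400 errors, 10 days into the period, SLO error rate 1%.
report = error_budget(100_000, 400, 0.01, 10)
```

Here the budget is 1,000 errors, so 40% of it is consumed. The burn rate is 0.4, which means errors are arriving at under half the sustainable pace. At 40 errors per day, the remaining 600 last about 15 more days.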
Exemplar Support
Exemplars attach a trace_id to histogram observations and counter increments, creating a direct link from a Prometheus metric to the distributed trace that produced it.
Metrics That Carry Exemplars
| Metric | Exemplar Label |
|---|---|
rastir_duration_seconds | trace_id |
rastir_llm_calls_total | trace_id |
Enabling Exemplars
```yaml
# server-config.yml
exemplars:
  enabled: true
```

Or set the environment variable:

```bash
export RASTIR_SERVER_EXEMPLARS_ENABLED=true
```
When enabled, the /metrics endpoint automatically switches to OpenMetrics format (required for exemplars).
Output Format
```
# Without exemplars (classic Prometheus)
rastir_duration_seconds_bucket{...,le="1.0"} 42

# With exemplars (OpenMetrics)
rastir_duration_seconds_bucket{...,le="1.0"} 42 # {trace_id="a1b2c3d4"} 0.847 1709042400.0
```
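The exemplar suffix follows the OpenMetrics text format. After the sample value come ` # `, the exemplar's label set, its own observed value and an optional timestamp. Below is a minimal formatter that reproduces the line above. The function name is hypothetical; in practice a Prometheus client library generates this output.

```python
def openmetrics_bucket_line(name, labels, value, exemplar=None):
    """Format one OpenMetrics sample line, optionally with an exemplar.

    `exemplar` is a (labels, observed_value, timestamp) tuple, e.g.
    ({"trace_id": "a1b2c3d4"}, 0.847, 1709042400.0).
    Label values are not escaped here; real output must escape quotes,
    backslashes and newlines.
    """
    def fmt(lbls):
        return ",".join(f'{k}="{v}"' for k, v in lbls.items())

    line = f"{name}{{{fmt(labels)}}} {value}"
    if exemplar is not None:
        ex_labels, ex_value, ex_ts = exemplar
        line += f" # {{{fmt(ex_labels)}}} {ex_value} {ex_ts}"
    return line
```

Here, 0.847 is the single duration that carried the trace, while 42 is the bucket's cumulative count.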
Grafana Integration
- Edit your Prometheus data source → enable the Exemplars toggle
- Set Internal link to your Jaeger/Tempo data source
- Map the label trace_id to the trace ID field
- In panel queries, toggle Exemplars on; they appear as diamond markers
- Click a diamond to jump directly to the trace
```
Grafana: P95 latency spike at 14:32 →
  Click exemplar diamond →
    Tempo: trace_id=a1b2c3d4 →
      research_agent (2.3s)
        ├─ plan_step (0.8s) OK
        ├─ web_search (1.2s) ← slow!
        └─ synthesize (0.3s) OK
```