Core concepts
Observability
Traces joined to runs, per-node metrics, and cost attribution.
Every step is traced
Each node execution is an activity with an OpenTelemetry span; spans join to the run, so the
question is never "what logs mention this?" but "show me the run." Model calls carry gen_ai
spans with token counts and cost; tool invocations carry the gateway's decision trail.
PII is redacted before spans leave the boundary.
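As a rough illustration of spans joining to a run and of redaction happening before export, here is a minimal stdlib-only sketch. The `Span` shape, the email-only redaction rule, the `[REDACTED]` marker, and the `prompt_preview` attribute are all assumptions made for the example, not the product's actual schema. The `gen_ai.usage.*` keys follow the OpenTelemetry GenAI semantic conventions.

```python
import re
from dataclasses import dataclass, field

# Hypothetical redaction rule: mask anything that looks like an email address.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


@dataclass
class Span:
    run_id: str          # the run this span joins to
    node_id: str         # the workflow node that produced it
    name: str            # e.g. "activity.execute", "llm.completion"
    duration_ms: float
    attributes: dict = field(default_factory=dict)


def redact(span: Span) -> Span:
    """Return a copy of the span with emails masked in string attributes."""
    clean = {
        key: EMAIL.sub("[REDACTED]", value) if isinstance(value, str) else value
        for key, value in span.attributes.items()
    }
    return Span(span.run_id, span.node_id, span.name, span.duration_ms, clean)


def spans_for_run(spans: list[Span], run_id: str) -> list[Span]:
    """Answer "show me the run": every span of one run, redacted."""
    return [redact(s) for s in spans if s.run_id == run_id]


spans = [
    Span("run-1", "classify", "activity.execute", 120.0),
    Span("run-1", "classify", "llm.completion", 95.0, {
        "gen_ai.usage.input_tokens": 412,
        "gen_ai.usage.output_tokens": 38,
        "prompt_preview": "Reply to jane@example.com",  # hypothetical attribute
    }),
    Span("run-2", "classify", "activity.execute", 80.0),
]

run_view = spans_for_run(spans, "run-1")
for span in run_view:
    print(span.name, span.attributes)
```

The lookup key is the run ID rather than a text search over logs, which is the point of joining spans to runs. A real redactor would cover more than email addresses.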
The trace inspector
Click any node in the workflow graph to open the inspector:
- Overview: status, latency, tokens, cost, and loop iterations (each iteration with its own outcome and duration).
- Input / Output: the exact JSON payloads that crossed the node boundary.
- Trace: the span timeline under the node (activity.execute, llm.completion, tool.mcp.*, rag.retrieve), each span with a duration bar.
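To make the Trace tab concrete, the sketch below renders a span timeline as text duration bars. It only illustrates the idea. The `(name, start_ms, duration_ms)` tuple shape, the `render_timeline` helper, the sample timings, and the `tool.mcp.search` span name are assumptions, not the inspector's real data model.

```python
def render_timeline(spans: list[tuple[str, float, float]], width: int = 40) -> list[str]:
    """Render (name, start_ms, duration_ms) spans as text bars.

    Offsets and lengths are scaled so the latest span end maps to `width` columns.
    """
    total = max(start + duration for _, start, duration in spans)
    lines = []
    for name, start, duration in spans:
        offset = round(start / total * width)
        length = max(1, round(duration / total * width))  # always draw at least one cell
        lines.append(f"{name:<18}{' ' * offset}{'#' * length} {duration:.0f} ms")
    return lines


# Hypothetical spans under one node, with times relative to the node's start.
node_spans = [
    ("activity.execute", 0, 200),
    ("rag.retrieve", 10, 40),
    ("llm.completion", 60, 120),
    ("tool.mcp.search", 185, 10),
]

lines = render_timeline(node_spans)
print("\n".join(lines))
```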
Metrics that matter
The observability dashboard aggregates per workflow, per agent, and per node:
- Throughput: runs per hour, live.
- Success rate and error taxonomy (transient vs terminal).
- Latency: p50/p95 per node; p95 2.4s is a first-class number here, not a footnote.
- Cost: $ / run and $ / call, attributed down to the individual model call via the provenance triple (model_id, model_version, output_id).
- Tokens: per node, per run, per agent.
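The latency and cost figures above can be sketched from raw call records as follows. This is a minimal illustration under stated assumptions. The record layout, the sample values, and the model names are invented. The nearest-rank percentile is one common definition, and the dashboard may compute p50/p95 differently.

```python
import math
from collections import defaultdict


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical records:
# (run_id, node_id, latency_s, cost_usd, (model_id, model_version, output_id))
calls = [
    ("run-1", "classify", 0.8, 0.002, ("model-a", "v1", "out-001")),
    ("run-1", "summarize", 2.4, 0.010, ("model-b", "v3", "out-002")),
    ("run-2", "classify", 1.1, 0.002, ("model-a", "v1", "out-003")),
    ("run-2", "summarize", 1.9, 0.008, ("model-b", "v3", "out-004")),
]

latency_by_node = defaultdict(list)
cost_by_run = defaultdict(float)
cost_by_call = {}  # keyed by the provenance triple, so cost traces to one model call

for run_id, node_id, latency_s, cost_usd, provenance in calls:
    latency_by_node[node_id].append(latency_s)
    cost_by_run[run_id] += cost_usd
    cost_by_call[provenance] = cost_usd

p50_by_node = {node: percentile(v, 50) for node, v in latency_by_node.items()}
p95_by_node = {node: percentile(v, 95) for node, v in latency_by_node.items()}
cost_per_run = sum(cost_by_run.values()) / len(cost_by_run)
cost_per_call = sum(cost_by_call.values()) / len(cost_by_call)

print(p50_by_node, p95_by_node)
print(f"$ / run = {cost_per_run:.4f}, $ / call = {cost_per_call:.4f}")
```

Keying cost by the full triple rather than by model alone is what lets a total be traced back to the single call that produced a given output.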
Audit-first
Every security-relevant event — policy decisions, secret access, tool invocations, version publishes, permission changes — lands in an append-only audit log. Every surface answers what changed, who did it, when, and why.
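One way to picture an append-only log that answers what, who, when, and why is sketched below. The `AuditEvent` and `AuditLog` names, the field layout, and the sample events are hypothetical. A production log would also need durable, tamper-resistant storage, which this in-memory sketch does not attempt.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)  # frozen: a recorded event cannot be modified afterwards
class AuditEvent:
    what: str       # e.g. "permission.change", "secret.access"
    who: str        # the user or workflow that acted
    when: datetime  # UTC timestamp assigned at append time
    why: str        # the stated reason


class AuditLog:
    """Append-only: events can be added and read, never edited or removed."""

    def __init__(self) -> None:
        self._events: list[AuditEvent] = []

    def append(self, what: str, who: str, why: str) -> AuditEvent:
        event = AuditEvent(what, who, datetime.now(timezone.utc), why)
        self._events.append(event)
        return event

    def events(self) -> tuple[AuditEvent, ...]:
        # Return an immutable snapshot so callers cannot alter the log's contents.
        return tuple(self._events)


log = AuditLog()
log.append("permission.change", "alice", "grant deploy rights to the release team")
log.append("secret.access", "workflow:billing", "tool call needed the payments API key")

for event in log.events():
    print(event.when.isoformat(), event.who, event.what, "-", event.why)
```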
