InfraWeave Docs

Observability

Traces joined to runs, per-node metrics, and cost attribution.

Every step is traced

Each node execution is an activity with an OpenTelemetry span; spans join to the run, so the question is never "what logs mention this?" but "show me the run." Model calls carry gen_ai spans with token counts and cost; tool invocations carry the gateway's decision trail. PII is redacted before spans leave the boundary.
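
A minimal sketch of the two ideas above: spans keyed by run, and redaction applied before export. The `Span` class, the `redact` and `spans_for_run` helpers, and the email-only PII pattern are all hypothetical and are not InfraWeave's API. The `gen_ai.usage.*` attribute names follow the OpenTelemetry GenAI semantic conventions; whether InfraWeave uses exactly these names is an assumption.

```python
import re
from dataclasses import dataclass, field

# Assumption: only email addresses count as PII in this sketch.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


@dataclass
class Span:
    run_id: str   # joins the span to its run
    node_id: str
    name: str     # e.g. "llm.completion"
    attributes: dict = field(default_factory=dict)


def redact(span: Span) -> Span:
    """Return a copy of the span with emails masked in string attributes."""
    clean = {
        key: EMAIL.sub("[REDACTED]", value) if isinstance(value, str) else value
        for key, value in span.attributes.items()
    }
    return Span(span.run_id, span.node_id, span.name, clean)


def spans_for_run(spans: list[Span], run_id: str) -> list[Span]:
    """'Show me the run': select spans by run id instead of searching logs."""
    return [s for s in spans if s.run_id == run_id]


spans = [
    Span("run-1", "summarize", "llm.completion",
         {"gen_ai.usage.input_tokens": 120,
          "gen_ai.usage.output_tokens": 45,
          "prompt": "Email alice@example.com the report"}),
    Span("run-2", "fetch", "tool.mcp.search", {"query": "status"}),
]

# Redact first, then query: nothing unredacted crosses the boundary.
exported = [redact(s) for s in spans]
run_1 = spans_for_run(exported, "run-1")
```

Redacting into a copy rather than mutating in place keeps the original span usable inside the boundary while only the cleaned copy is exported.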

The trace inspector

Click any node in the workflow graph to open the inspector:

  • Overview — status, latency, tokens, cost, and loop iterations (each iteration with its own outcome and duration).
  • Input / Output — the exact JSON payloads that crossed the node boundary.
  • Trace — the span timeline under the node: activity.execute, llm.completion, tool.mcp.*, rag.retrieve, each with duration bars.
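
The Trace tab's timeline can be pictured as one row per span, offset by start time and scaled by duration. The sketch below is a plain-text approximation, not the inspector's rendering code. `SpanTiming`, `render_timeline`, the timings, and the `tool.mcp.search` span name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SpanTiming:
    name: str
    start_ms: int  # offset from the start of the node's activity
    end_ms: int


def render_timeline(spans: list[SpanTiming], width: int = 40) -> list[str]:
    """Render one text row per span, scaled to the latest end time."""
    total = max(s.end_ms for s in spans)
    rows = []
    for s in sorted(spans, key=lambda s: s.start_ms):
        duration = s.end_ms - s.start_ms
        offset = s.start_ms * width // total
        length = max(1, duration * width // total)  # always draw at least one cell
        rows.append(f"{s.name:<18}{' ' * offset}{'#' * length} {duration}ms")
    return rows


timeline = render_timeline([
    SpanTiming("activity.execute", 0, 2000),
    SpanTiming("rag.retrieve", 50, 450),
    SpanTiming("llm.completion", 500, 1900),
    SpanTiming("tool.mcp.search", 1900, 1980),
])
print("\n".join(timeline))
```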

Metrics that matter

The observability dashboard aggregates per workflow, per agent, and per node:

  • Throughput — runs per hour, live.
  • Success rate and error taxonomy (transient vs terminal).
  • Latency — p50/p95 per node; a figure such as a p95 of 2.4s is a first-class number here, not a footnote.
  • Cost — $ / run and $ / call, attributed down to the individual model call via the provenance triple (model_id, model_version, output_id).
  • Tokens — per node, per run, per agent.
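
As an illustration of how these aggregates could be derived from per-call records, here is a small sketch. The record fields, the sample values, and the nearest-rank percentile method are assumptions; the source does not say which percentile definition the dashboard uses. Costs are held as integer micro-dollars to avoid floating-point drift when summing.

```python
import math
from collections import defaultdict


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical per-call records; provenance is (model_id, model_version, output_id).
calls = [
    {"run": "run-1", "node": "summarize", "latency_s": 0.8,
     "cost_micro_usd": 4000, "provenance": ("model-a", "v1", "out-001")},
    {"run": "run-1", "node": "summarize", "latency_s": 2.4,
     "cost_micro_usd": 10000, "provenance": ("model-a", "v1", "out-002")},
    {"run": "run-2", "node": "summarize", "latency_s": 1.1,
     "cost_micro_usd": 6000, "provenance": ("model-a", "v2", "out-003")},
    {"run": "run-2", "node": "classify", "latency_s": 0.3,
     "cost_micro_usd": 1000, "provenance": ("model-b", "v1", "out-004")},
]

latency_by_node = defaultdict(list)
cost_by_run = defaultdict(int)
cost_by_call = {}
for call in calls:
    latency_by_node[call["node"]].append(call["latency_s"])
    cost_by_run[call["run"]] += call["cost_micro_usd"]
    # The provenance triple identifies one model call, so it can key $ / call.
    cost_by_call[call["provenance"]] = call["cost_micro_usd"]

p50 = {node: percentile(v, 50) for node, v in latency_by_node.items()}
p95 = {node: percentile(v, 95) for node, v in latency_by_node.items()}
```

Because `output_id` is part of the key, two calls to the same model version remain separately attributable.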

Audit-first

Every security-relevant event — policy decisions, secret access, tool invocations, version publishes, permission changes — lands in an append-only audit log. Every surface answers what changed, who did it, when, and why.
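
One way to picture an append-only log that answers what, who, when, and why is sketched below. The `AuditEvent` and `AuditLog` names are hypothetical. The hash chain linking each entry to its predecessor is a common tamper-evidence technique added here for illustration; the source does not state that InfraWeave uses it.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AuditEvent:
    what: str       # the change or decision
    who: str        # the actor
    when: str       # ISO 8601 timestamp
    why: str        # the stated reason
    prev_hash: str  # digest of the previous entry (assumed scheme)


class AuditLog:
    """Append-only: entries can be added and read, never updated or deleted."""

    def __init__(self) -> None:
        self._entries: list[AuditEvent] = []

    @staticmethod
    def _digest(event: AuditEvent) -> str:
        payload = json.dumps(asdict(event), sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()

    def append(self, what: str, who: str, when: str, why: str) -> AuditEvent:
        prev = self._digest(self._entries[-1]) if self._entries else ""
        event = AuditEvent(what, who, when, why, prev)
        self._entries.append(event)
        return event

    def entries(self) -> tuple[AuditEvent, ...]:
        return tuple(self._entries)  # read-only view for callers

    def verify(self) -> bool:
        """Check that every entry still points at the digest of its predecessor."""
        prev = ""
        for event in self._entries:
            if event.prev_hash != prev:
                return False
            prev = self._digest(event)
        return True


log = AuditLog()
log.append("workflow version published", "user:dana",
           "2024-01-01T09:00:00Z", "scheduled release")
log.append("secret read", "agent:reporter",
           "2024-01-01T09:05:00Z", "tool invocation needed credentials")
chain_ok = log.verify()
```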