Showcase 5 (wip) · Telemetry UI

Agent Skills Dashboard

A sample observability surface for agent runs — KPIs, a latency-vs-success scatter across models, and per-tool accuracy. Designed as the aggregation layer that would sit over a production agent_runs telemetry table.

Sample observability UI — metrics are placeholders. Numbers below come from a static mock dataset, not a live telemetry pipeline.

Total runs

6,237

Success rate

91.4%

Avg latency

836ms

Tool accuracy

96.7%

Success Rate vs. Latency

Each dot is a model — top-left is the sweet spot (high success, low latency).

Tool Usage Accuracy

Per-tool success vs. failure across the mock dataset.

Model breakdown

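| Model | Hosting | Success | Latency | Runs | $/1k runs |
| --- | --- | --- | --- | --- | --- |
| Claude Sonnet 4.5 | cloud | 96.4% | 920ms | 1,284 | $18.40 |
| GPT-4o | cloud | 95.1% | 1180ms | 1,547 | $12.90 |
| GPT-4o-mini | cloud | 91.8% | 610ms | 2,103 | $1.85 |
| Llama 3 70B | local | 87.2% | 1640ms | 412 | $0.45 |
| Llama 3 8B | local | 78.6% | 280ms | 891 | $0.12 |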

How it works

Data source (today)

A static TypeScript module — src/lib/agent/dashboard-data.ts — exports a small mock dataset of model and tool stats with realistic distributions. Aggregations (weighted success, weighted latency, tool accuracy) are computed in-process on import.
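As a rough illustration of the run-weighted aggregation described above, here is a minimal sketch. The interface and function names (`ModelStat`, `ToolStat`, `weightedMean`, `toolAccuracy`) are assumptions made for this example and may not match the real `dashboard-data.ts`. The model rows are copied from the Model breakdown table; per-tool counts are not listed on this page, so no tool data is included.

```typescript
// Hypothetical shapes; the real dashboard-data.ts module may differ.
interface ModelStat {
  model: string;
  hosting: "cloud" | "local";
  successRate: number; // percent, 0-100
  latencyMs: number;
  runs: number;
}

interface ToolStat {
  tool: string;
  success: number; // count of successful tool calls
  failure: number; // count of failed tool calls
}

// Rows copied from the Model breakdown table on this page.
const modelStats: ModelStat[] = [
  { model: "Claude Sonnet 4.5", hosting: "cloud", successRate: 96.4, latencyMs: 920, runs: 1284 },
  { model: "GPT-4o", hosting: "cloud", successRate: 95.1, latencyMs: 1180, runs: 1547 },
  { model: "GPT-4o-mini", hosting: "cloud", successRate: 91.8, latencyMs: 610, runs: 2103 },
  { model: "Llama 3 70B", hosting: "local", successRate: 87.2, latencyMs: 1640, runs: 412 },
  { model: "Llama 3 8B", hosting: "local", successRate: 78.6, latencyMs: 280, runs: 891 },
];

// Mean of a per-model metric, weighted by each model's run count.
function weightedMean(rows: ModelStat[], pick: (r: ModelStat) => number): number {
  const totalRuns = rows.reduce((sum, r) => sum + r.runs, 0);
  if (totalRuns === 0) return 0;
  return rows.reduce((sum, r) => sum + pick(r) * r.runs, 0) / totalRuns;
}

// Share of successful tool calls across all tools, as a percentage.
function toolAccuracy(tools: ToolStat[]): number {
  const ok = tools.reduce((sum, t) => sum + t.success, 0);
  const total = tools.reduce((sum, t) => sum + t.success + t.failure, 0);
  return total === 0 ? 0 : (100 * ok) / total;
}

const totalRuns = modelStats.reduce((sum, r) => sum + r.runs, 0);
const successRate = weightedMean(modelStats, (r) => r.successRate);
const avgLatencyMs = weightedMean(modelStats, (r) => r.latencyMs);

console.log(totalRuns, successRate.toFixed(1), Math.round(avgLatencyMs));
```

With the table's rows, these weighted values round to the KPI cards shown above (6,237 runs, 91.4% success, 836ms average latency).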

Data source (planned)

The intended production shape: every agent run from the other showcases writes a row to a Postgres agent_runs table — model, tool calls, latency, success, cost — and this page becomes a thin aggregation layer over it.
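One possible shape for that aggregation layer, sketched in TypeScript. Everything here is hypothetical: the `AgentRun` field names only mirror the columns listed above (model, tool calls, latency, success, cost), and `summarizeByModel` works on rows already loaded into memory rather than showing any particular Postgres client or query.

```typescript
// Hypothetical row shape for the planned agent_runs table.
interface AgentRun {
  model: string;
  toolCalls: { tool: string; ok: boolean }[];
  latencyMs: number;
  success: boolean;
  costUsd: number;
}

// Hypothetical per-model summary, matching the Model breakdown columns.
interface ModelSummary {
  model: string;
  runs: number;
  successRate: number; // percent, 0-100
  avgLatencyMs: number;
  costPer1kRuns: number; // USD
}

// Group raw runs by model and compute the per-model summary figures.
function summarizeByModel(runs: AgentRun[]): ModelSummary[] {
  const groups = new Map<string, AgentRun[]>();
  runs.forEach((run) => {
    const bucket = groups.get(run.model) || [];
    bucket.push(run);
    groups.set(run.model, bucket);
  });

  const summaries: ModelSummary[] = [];
  groups.forEach((rows, model) => {
    const n = rows.length; // always >= 1 for an existing group
    const successes = rows.filter((r) => r.success).length;
    const latencySum = rows.reduce((sum, r) => sum + r.latencyMs, 0);
    const costSum = rows.reduce((sum, r) => sum + r.costUsd, 0);
    summaries.push({
      model,
      runs: n,
      successRate: (100 * successes) / n,
      avgLatencyMs: latencySum / n,
      costPer1kRuns: (costSum / n) * 1000,
    });
  });
  return summaries;
}
```

In a real deployment the grouping would more likely be pushed down into SQL; the in-memory version is only meant to show which figures the page would need per model.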

Charts

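Powered by Recharts. The scatter plot shows the latency-vs-quality trade-off in the mock data: the larger cloud models (Claude Sonnet 4.5, GPT-4o) have the highest success rates at roughly 0.9 to 1.2 s per run; GPT-4o-mini is faster (610ms) at 91.8% success; Llama 3 8B is the fastest and cheapest model but the least accurate; and Llama 3 70B is the slowest model here (1640ms), so it trades both quality and speed for low cost.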
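To make the idea of a latency-vs-success frontier concrete, the sketch below computes which models are not dominated on those two axes, meaning no other model is at least as accurate and at least as fast while being strictly better on one of them. The `ScatterPoint` type and `paretoFrontier` function are illustrative names, not part of the page's code, and cost is deliberately ignored. The data points are taken from the Model breakdown table.

```typescript
// Hypothetical point shape for the scatter plot: one point per model.
interface ScatterPoint {
  model: string;
  latencyMs: number; // x axis
  successRate: number; // y axis, percent
}

// Values copied from the Model breakdown table on this page.
const points: ScatterPoint[] = [
  { model: "Claude Sonnet 4.5", latencyMs: 920, successRate: 96.4 },
  { model: "GPT-4o", latencyMs: 1180, successRate: 95.1 },
  { model: "GPT-4o-mini", latencyMs: 610, successRate: 91.8 },
  { model: "Llama 3 70B", latencyMs: 1640, successRate: 87.2 },
  { model: "Llama 3 8B", latencyMs: 280, successRate: 78.6 },
];

// True if q is at least as good as p on both axes and strictly better on one.
function dominates(q: ScatterPoint, p: ScatterPoint): boolean {
  const noWorse = q.successRate >= p.successRate && q.latencyMs <= p.latencyMs;
  const strictlyBetter = q.successRate > p.successRate || q.latencyMs < p.latencyMs;
  return noWorse && strictlyBetter;
}

// Keep only the points that no other point dominates.
function paretoFrontier(all: ScatterPoint[]): ScatterPoint[] {
  return all.filter((p) => !all.some((q) => q !== p && dominates(q, p)));
}

const frontier = paretoFrontier(points).map((p) => p.model);
console.log(frontier);
```

On these two axes alone, Claude Sonnet 4.5, GPT-4o-mini, and Llama 3 8B are on the frontier; GPT-4o and Llama 3 70B are each both slower and less accurate than Claude Sonnet 4.5 in the mock data, although both are cheaper per 1k runs.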

Live tool-use playground

The ReAct tool-use playground lives on its own page — see /playground to trigger an agent run and watch the trace stream in.