Example Input
agent_runtime: OpenClaw
skill_format: markdown SKILL.md with YAML frontmatter
trace_store: Postgres (table: agent_traces)
deploy_surface: Git-tracked skills/ directory, PR-based rollout
improvement_goals: Reduce 'missing tool' failures by 50% over 30 days without any skill regressing on its existing eval set.
language: Python 3.12
Example Output
[Architecture diagram] trace_ingest → failure_classifier (Claude Sonnet) → candidate_generator (Claude Opus) → shadow_evaluator (parallel replay) → PR bot → human review → merge.
[Trace schema] pydantic Trace(id, agent_id, skill_id, tool_calls: list[ToolCall], outcome: Literal['ok','error','timeout'], latency_ms, tokens_in, tokens_out, created_at). SkillRevision(skill_id, parent_sha, diff_text, rationale, linked_traces: list[UUID], shadow_metrics: ShadowMetrics, status: Literal['shadow','promoted','rolled_back']).
[Classifier prompt] …
[Generator] reads ≥ 5 traces sharing a failure class, emits a unified diff against SKILL.md, and attaches a rationale referencing trace IDs.
[Shadow harness] replays the last 200 traces in parallel (asyncio.gather, semaphore=16) and computes pass_rate_delta, cost_delta, p95_delta.
[Promotion rule] promote iff pass_rate_delta ≥ +0.05 AND cost_delta ≤ +10% AND p95_delta ≤ +15% AND no regression on core_eval_suite.
[Rollback] wired to the live SLO dashboard: if error_rate > 2× baseline for 10 min, auto-revert the last 3 merged SkillRevisions and page on-call.
[Kill switch] SKILLS_AUTOPILOT=off disables the generator; existing skills are untouched.
Full Python code for each component provided.