Autoresearch
Problem Statement
Claude Code skills "work" but nobody measures how reliably. You write a skill, run it a few times, it looks fine, you ship it. But when you actually measure with structured evals, you discover your skill is passing only 59% of the time. Vibes-based development doesn't scale.
Hypothesis
Apply Karpathy's autoresearch methodology (autonomous research loops with self-evaluation) to Claude Code skills. Define binary evals (yes/no, no scales), run the skill repeatedly, score outputs, analyze failures, mutate the prompt, and keep improvements. Let the system optimize itself.
Solution
Autoresearch is a Claude Code skill that autonomously optimizes other skills through a loop:
- Run — execute the target skill on test inputs
- Score — evaluate outputs against 3-6 binary evals (yes/no only)
- Analyze — identify failure patterns and root causes
- Mutate — modify the skill's prompt to address failures
- Keep or discard — only keep mutations that improve the score
The loop repeats until the skill hits 95%+ or you stop it.
Case Study: 59% to 97%
The /product-manager skill handles gap analysis, competitor research, and PRD generation. I defined 6 binary evals and ran 20 tests per experiment (120 checks total):
- Baseline: 59% pass rate. The skill was silently failing on structure, scoring format, and citation quality
- Experiment 1: Jumped to 82%. Mutations fixed scoring format and added explicit section headers
- Experiment 2: Hit 91%. Addressed edge cases in competitor research depth
- Experiment 3: Reached 97%. Final mutations tightened citation requirements and output consistency
Key Product Decisions
- Binary evals only. No 1-5 scales, no "mostly good." Either the output has the required structure or it doesn't. Binary evals are unambiguous and automatable
- Small eval sets. 3-6 evals per skill. More evals means more noise and slower iteration. Focus on the criteria that actually matter
- Mutation, not rewrite. Each iteration makes targeted changes to the prompt, not wholesale rewrites. This preserves what's already working and isolates the effect of each change