
Autoresearch

Personal Product · Mar 2026
Tags: AI/ML, Claude Code, Evals, Prompt Engineering, Andrej Karpathy, Binary Evals, Hamel Husain

Problem Statement

Claude Code skills "work" but nobody measures how reliably. You write a skill, run it a few times, it looks fine, you ship it. But when you actually measure with structured evals, you discover your skill is passing only 59% of the time. Vibes-based development doesn't scale.

Hypothesis

Apply Karpathy's autoresearch methodology (autonomous research loops with self-evaluation) to Claude Code skills. Define binary evals (yes/no, no scales), run the skill repeatedly, score outputs, analyze failures, mutate the prompt, and keep improvements. Let the system optimize itself.

Solution

Autoresearch is a Claude Code skill that autonomously optimizes other skills through a loop:

  • Run — execute the target skill on test inputs
  • Score — evaluate outputs against 3-6 binary evals (yes/no only)
  • Analyze — identify failure patterns and root causes
  • Mutate — modify the skill's prompt to address failures
  • Keep or discard — only keep mutations that improve the score

The loop repeats until the skill hits 95%+ or you stop it.
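The loop above can be sketched in a few lines of Python. This is a minimal keep-if-better sketch, not the actual implementation: `run_skill` and `mutate` are hypothetical callables standing in for invoking the target skill and editing its prompt, and the demo "skill" at the bottom just echoes its prompt so the loop is runnable end to end.

```python
def optimize(prompt, test_inputs, evals, run_skill, mutate, target=0.95, max_iters=20):
    """Run, score against binary evals, mutate, and keep only improvements."""
    def score(p):
        # Every (test input, eval) pair is one binary check; score is the pass rate.
        checks = [ev(run_skill(p, x)) for x in test_inputs for ev in evals]
        return sum(checks) / len(checks)

    best = score(prompt)
    for _ in range(max_iters):
        if best >= target:                 # stop once the skill hits the target
            break
        candidate = mutate(prompt)         # targeted change, not a wholesale rewrite
        cand_score = score(candidate)
        if cand_score > best:              # discard mutations that don't improve
            prompt, best = candidate, cand_score
    return prompt, best

# Toy demo: the "skill" echoes its prompt; evals check for required markers.
evals = [lambda out: "## Summary" in out, lambda out: "[source]" in out]

def mutate(p):
    for marker in ("## Summary", "[source]"):
        if marker not in p:
            return p + " " + marker        # add one missing requirement per mutation
    return p

final, rate = optimize("base", [0], evals,
                       run_skill=lambda p, x: p, mutate=mutate)
# rate climbs 0.0 → 0.5 → 1.0 as each mutation is kept
```

The only state carried between iterations is the best prompt and its score, which is what makes "keep or discard" trivial to implement.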

Case Study: 59% to 97%

The /product-manager skill handles gap analysis, competitor research, and PRD generation. I defined 6 binary evals and ran 20 tests per experiment (6 × 20 = 120 binary checks per experiment):

  • Baseline: 59% pass rate. The skill was silently failing on structure, scoring format, and citation quality
  • Experiment 1: Jumped to 82%. Mutations fixed scoring format and added explicit section headers
  • Experiment 2: Hit 91%. Addressed edge cases in competitor research depth
  • Experiment 3: Reached 97%. Final mutations tightened citation requirements and output consistency
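The check counts behind those percentages follow directly from 6 evals × 20 runs. A quick sanity-check of the arithmetic:

```python
EVALS_PER_SKILL = 6
RUNS_PER_EXPERIMENT = 20
total_checks = EVALS_PER_SKILL * RUNS_PER_EXPERIMENT   # 120 binary checks per experiment

# Approximate number of passing checks at each reported pass rate
passing = {rate: round(rate * total_checks) for rate in (0.59, 0.82, 0.91, 0.97)}
# 59% ≈ 71/120 passing at baseline; 97% ≈ 116/120 after experiment 3
```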

Key Product Decisions

  • Binary evals only. No 1-5 scales, no "mostly good." Either the output has the required structure or it doesn't. Binary evals are unambiguous and automatable
  • Small eval sets. 3-6 evals per skill. More evals means more noise and slower iteration. Focus on the criteria that actually matter
  • Mutation, not rewrite. Each iteration makes targeted changes to the prompt, not wholesale rewrites. This preserves what's already working and isolates the effect of each change
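Binary evals of this kind are just predicates over the skill's output. The checks below are illustrative, not the actual eval set: hypothetical structure, scoring-format, and citation checks modelled on the failure modes named in the case study.

```python
import re

# Hypothetical binary evals for a PRD-generating skill. Each takes the
# skill's output string and returns a strict yes/no -- no partial credit.
def has_required_sections(out):
    return all(h in out for h in ("## Problem", "## Competitors", "## Requirements"))

def uses_scoring_format(out):
    # e.g. every score must look like "Score: 7/10" (illustrative format)
    return bool(re.findall(r"Score: (\d+)/10", out))

def cites_sources(out):
    return bool(re.findall(r"\[\d+\]", out))  # e.g. numbered citations like [1]

output = "## Problem\n## Competitors\nScore: 8/10 [1]\n## Requirements"
results = [has_required_sections(output), uses_scoring_format(output), cites_sources(output)]
pass_rate = sum(results) / len(results)   # 1.0 for this output
```

Because each check returns a bool, scoring a run is just counting Trues, which is what makes the whole loop automatable.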

Impact & Metrics

  • 59% → 97% skill pass rate improvement in 3 experiments
  • 120 checks: binary evals across 20 test runs per experiment
  • Open Source

Arush Sharma