
Autoresearch

Personal Product · Mar 2026
Tags: AI/ML, Claude Code, Evals, Prompt Engineering, Andrej Karpathy, Binary Evals, Hamel Husain

Problem Statement

Claude Code skills "work" but nobody measures how reliably. You write a skill, run it a few times, it looks fine, you ship it. But when you actually measure with structured evals, you discover your skill is passing only 59% of the time. Vibes-based development doesn't scale.

Hypothesis

Apply Karpathy's autoresearch methodology (autonomous research loops with self-evaluation) to Claude Code skills. Define binary evals (yes/no, no scales), run the skill repeatedly, score outputs, analyze failures, mutate the prompt, and keep improvements. Let the system optimize itself.

Solution

Autoresearch is a Claude Code skill that autonomously optimizes other skills through a loop:

  • Run — execute the target skill on test inputs
  • Score — evaluate outputs against 3-6 binary evals (yes/no only)
  • Analyze — identify failure patterns and root causes
  • Mutate — modify the skill's prompt to address failures
  • Keep or discard — only keep mutations that improve the score

The loop repeats until the skill hits 95%+ or you stop it.
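The loop above can be sketched in a few lines of Python. This is a minimal keep-if-better sketch, not the actual implementation: `run_skill` and `mutate` are hypothetical callables standing in for invoking the target skill and editing its prompt, and the demo "skill" at the bottom just echoes its prompt so the loop is runnable end to end.

```python
def optimize(prompt, test_inputs, evals, run_skill, mutate, target=0.95, max_iters=20):
    """Run, score against binary evals, mutate, and keep only improvements."""
    def score(p):
        # Every (test input, eval) pair is one binary check; score is the pass rate.
        checks = [ev(run_skill(p, x)) for x in test_inputs for ev in evals]
        return sum(checks) / len(checks)

    best = score(prompt)
    for _ in range(max_iters):
        if best >= target:                 # stop once the skill hits the target
            break
        candidate = mutate(prompt)         # targeted change, not a wholesale rewrite
        cand_score = score(candidate)
        if cand_score > best:              # discard mutations that don't improve
            prompt, best = candidate, cand_score
    return prompt, best

# Toy demo: the "skill" echoes its prompt; evals check for required markers.
evals = [lambda out: "## Summary" in out, lambda out: "[source]" in out]

def mutate(p):
    for marker in ("## Summary", "[source]"):
        if marker not in p:
            return p + " " + marker        # add one missing requirement per mutation
    return p

final, rate = optimize("base", [0], evals,
                       run_skill=lambda p, x: p, mutate=mutate)
# rate climbs 0.0 → 0.5 → 1.0 as each mutation is kept
```

The only state carried between iterations is the best prompt and its score, which is what makes "keep or discard" trivial to implement.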

Case Study: 59% to 97%

The /product-manager skill handles gap analysis, competitor research, and PRD generation. I defined 6 binary evals and ran 20 tests per experiment (6 × 20 = 120 binary checks per experiment):

  • Baseline: 59% pass rate. The skill was silently failing on structure, scoring format, and citation quality
  • Experiment 1: Jumped to 82%. Mutations fixed scoring format and added explicit section headers
  • Experiment 2: Hit 91%. Addressed edge cases in competitor research depth
  • Experiment 3: Reached 97%. Final mutations tightened citation requirements and output consistency
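The check counts behind those percentages follow directly from 6 evals × 20 runs. A quick sanity-check of the arithmetic:

```python
EVALS_PER_SKILL = 6
RUNS_PER_EXPERIMENT = 20
total_checks = EVALS_PER_SKILL * RUNS_PER_EXPERIMENT   # 120 binary checks per experiment

# Approximate number of passing checks at each reported pass rate
passing = {rate: round(rate * total_checks) for rate in (0.59, 0.82, 0.91, 0.97)}
# 59% ≈ 71/120 passing at baseline; 97% ≈ 116/120 after experiment 3
```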

Key Product Decisions

  • Binary evals only. No 1-5 scales, no "mostly good." Either the output has the required structure or it doesn't. Binary evals are unambiguous and automatable
  • Small eval sets. 3-6 evals per skill. More evals means more noise and slower iteration. Focus on the criteria that actually matter
  • Mutation, not rewrite. Each iteration makes targeted changes to the prompt, not wholesale rewrites. This preserves what's already working and isolates the effect of each change
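Binary evals of this kind are just predicates over the skill's output. The checks below are illustrative, not the actual eval set: hypothetical structure, scoring-format, and citation checks modelled on the failure modes named in the case study.

```python
import re

# Hypothetical binary evals for a PRD-generating skill. Each takes the
# skill's output string and returns a strict yes/no -- no partial credit.
def has_required_sections(out):
    return all(h in out for h in ("## Problem", "## Competitors", "## Requirements"))

def uses_scoring_format(out):
    # e.g. every score must look like "Score: 7/10" (illustrative format)
    return bool(re.findall(r"Score: (\d+)/10", out))

def cites_sources(out):
    return bool(re.findall(r"\[\d+\]", out))  # e.g. numbered citations like [1]

output = "## Problem\n## Competitors\nScore: 8/10 [1]\n## Requirements"
results = [has_required_sections(output), uses_scoring_format(output), cites_sources(output)]
pass_rate = sum(results) / len(results)   # 1.0 for this output
```

Because each check returns a bool, scoring a run is just counting Trues, which is what makes the whole loop automatable.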

Impact & Metrics

  • 59% → 97% skill pass rate improvement in 3 experiments
  • 120 checks: binary evals across 20 test runs per experiment
  • Open Source

Arush Sharma