CraftKit is a cross-agent toolkit for creating, improving, and operationalizing prompts and skills for coding agents such as Claude Code and Codex.
Prompt assets and agent skills often become fragmented, provider-specific, and hard to reuse. CraftKit exists to keep them file-first, portable, reviewable, and easy to improve over time.
CraftKit is an artifact-quality toolkit: it helps author, critique, tune, and carry forward prompts and skills. It is not a general coding-agent workflow suite, project-management layer, deployment system, or runtime framework. When a workflow needs those things, CraftKit should produce clear files, specs, or handoffs that another tool can use rather than becoming the tool itself.
All eight skills install as Agent Skills and are invocable in Claude Code through slash-style skill discovery. They are also plain SKILL.md files that can be used by Codex and other compatible agents.
npx skills add sungjunlee/craftkitAdd -g -y for global install without prompts:
npx skills add sungjunlee/craftkit -g -y/plugin marketplace add https://github.com/sungjunlee/craftkit.git
/plugin install craftkit@craftkit
Install from a local clone
git clone https://github.com/sungjunlee/craftkit.git
cd craftkit
npx skills add . -g -yFor Codex or any other agent, see Use in other agents below.
| Skill | Use when | Side effect |
|---|---|---|
craft-prompt |
a new prompt is needed from scratch for any LLM (Claude, GPT, Gemini, Perplexity, etc.) | returns copy-pasteable text |
craft-skill-spec |
a new skill needs a concrete spec based on current CraftKit skill-radar judgments before writing SKILL.md |
returns a spec; reads radar references |
craft-harness |
a project-specific agent harness needs to be built, repaired, synced, pruned, or evolved across Codex and Claude Code | may inspect and edit repo-local harness files; gates risky surfaces |
craft-critique |
an existing prompt or skill needs a read-only review before editing or shipping | surfaces strengths, prioritized findings, recommendations, and a rewrite plan without editing |
craft-tune |
an existing prompt or skill needs sharpening applied | runs an autonomous review-and-fix loop and edits the artifact |
craft-survey |
a new skill should be grounded in prior art before drafting | returns read-only recommendations |
craft-autoresearch |
a prompt or skill works "sometimes" and needs eval-driven iteration | runs evals and may edit mutable files |
craft-handoff |
a session is ending and the next session needs a copy-paste-ready continuation prompt | writes handoff files and may copy to clipboard |
When two skills could trigger, choose the least invasive one that answers the request: review-only wording goes to craft-critique; apply/fix/improve wording goes to craft-tune; repeated measurable failures go to craft-autoresearch; prior-art questions go to craft-survey; repo harness placement and Codex/Claude setup work goes to craft-harness.
Each skill lives at skills/<skill-name>/SKILL.md — plain markdown with YAML frontmatter, loadable as a Claude Code skill or copy-pasteable into any other agent.
Six of the eight skills (craft-prompt, craft-critique, craft-tune, craft-survey, craft-autoresearch, craft-handoff) have been optimized through craft-autoresearch passes against eval suites — including craft-autoresearch itself (reflexive meta-pass). craft-tune was reshaped to run an autonomous self-converging review-and-fix loop; the read-only diagnose role stays with the separate craft-critique skill. The next autoresearch pass will run against this shape. craft-skill-spec and craft-harness are new and have not yet been through an autoresearch pass. Per-session baseline → kept-state scores and mutation rationale live in the commit bodies. Run artifacts are preserved at ~/.craftkit/autoresearch/<skill>/<date-slug>/ outside the repo.
| Skill | Eval status | Score source | Known gap |
|---|---|---|---|
craft-prompt |
autoresearch pass completed | commit body + ~/.craftkit/autoresearch/craft-prompt/<date-slug>/ |
keep volatile provider guidance in guides |
craft-critique |
autoresearch pass completed | commit body + ~/.craftkit/autoresearch/craft-critique/<date-slug>/ |
re-run on fresh failure examples after major wording changes |
craft-tune |
autoresearch pass completed, then reshaped into self-converging loop | commit body + ~/.craftkit/autoresearch/craft-tune/<date-slug>/ |
next pass should test the newer loop contract |
craft-survey |
autoresearch pass completed | commit body + ~/.craftkit/autoresearch/craft-survey/<date-slug>/ |
example must keep proving provenance and edit-target rules |
craft-autoresearch |
reflexive autoresearch pass completed | commit body + ~/.craftkit/autoresearch/craft-autoresearch/<date-slug>/ |
examples must stay synchronized with the contract fields |
craft-skill-spec |
not yet autoresearched | none yet | first pass should test radar-dependent standalone behavior |
craft-harness |
not yet autoresearched | none yet | first pass should test lifecycle modes, Codex/Claude target separation, and risk gates |
craft-handoff |
autoresearch pass completed | commit body + ~/.craftkit/autoresearch/craft-handoff/2026-05-26-goal-pressure/ |
replay against real agent outputs to catch wording failures beyond deterministic checks |
- generating new prompts from scratch (task, research, session handoff, templates)
- prompt design and restructuring
- reusable skill design
- project-specific agent harness design and maintenance
- diagnostic review and minimal-diff editing
- iterative improvement loops
- survey-backed best practices
- time-aware curation of evolving skill-authoring patterns
- copy-pasteable outputs for agent workflows
- File-first and diff-friendly
- Small composable units
- Explicit inputs and outputs
- Cross-agent portability (core skill spines stay provider-neutral; platform-specific detail stays in guides or reference files)
- Eval-driven improvement when possible
- Copy-pasteable results over fancy abstractions
For evolving skill-authoring guidance, the craft-skill-spec skill carries its own radar layer at skills/craft-skill-spec/references/radar/ — start with current.md there and consult the dated snapshots only when a watch item needs deeper context.
craft-harness also ships reviewable hook asset recipes under skills/craft-harness/assets/hooks/. They provide shared .agents/hooks/scripts/ scripts plus Codex and Claude adapter snippets; they are intentionally not auto-installers.
AGENTS.md keeps a hard 500-line ceiling for each SKILL.md, but day-to-day edits should aim much lower:
- Normal skills: about 100-160 lines.
- Complex loop or orchestration skills: about 160-220 lines.
- Anything growing past that should move examples, platform notes, maintenance commands, or edge-case catalogs into
references/.
The spine should still be understandable alone: purpose, inputs, steps, output contract, one compact example, limitations, and links to on-demand references. References carry depth; the spine carries the operating path.
CraftKit skills are plain markdown with YAML frontmatter, so they port easily:
- Open the relevant
SKILL.md. - Paste the body (everything after the frontmatter) into the target agent's system prompt or instructions.
- Keep the frontmatter
descriptionline as context so the agent knows when to apply the skill.
See docs/examples/tune-a-prompt.md for a walk-through of diagnosing and tuning an existing prompt, then optionally running a short improvement loop.
sungjunlee/prompt-builder— predecessor project. Its mature prompt-authoring asset (5-step process, 6 building blocks, platform guides, templates) was absorbed wholesale intocraft-prompt. Kept on GitHub for reference; new work happens here.karpathy/autoresearch— Andrej Karpathy's ML training-loop project that introduced the autoresearch methodology (give an agent a baseline, let it experiment overnight, keep what improves, discard what doesn't).craft-autoresearchadapts that loop discipline to prompt and skill artifacts instead of model training code.byungjunjang/jangpm-meta-skills— four-skill meta toolkit for Claude Code and Codex (blueprint,deep-dive,reflect,autoresearch). Itsautoresearchskill contributed implementation patterns — experiment contract shape, the three-eval-type taxonomy (binary / comparative / fidelity), deletion discipline — thatcraft-autoresearchbuilds on.
MIT — see LICENSE.