Project Instructions — explainshell

A web tool that parses man pages and explains command-line arguments by matching each argument to its help text.
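As a toy illustration of the core idea (not the real matcher), each argument of a command is mapped to the help text of the option it matches; the help strings below are made up for the example:

```python
# Hypothetical help-text table; the real data comes from parsed man pages.
HELP = {
    "-l": "use a long listing format",
    "-a": "do not ignore entries starting with .",
}

def explain(tokens: list[str]) -> dict[str, str]:
    """Return help text for every token that matches a known option."""
    return {t: HELP[t] for t in tokens if t in HELP}
```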

Tech Stack

  • Python 3.12, Flask, SQLite, bashlex, OpenAI SDK, Google Gemini SDK, LiteLLM (fallback)
  • Linting: ruff (Python), biome (JS)
  • Testing: pytest (unit + doctests + parsing regression), Playwright Test (e2e, JavaScript)
  • Dependencies: requirements.txt (main), package.json (Playwright e2e)

Workflow Requirements

Before finishing any task, always:

  1. Run make format
  2. Run tests — choose the right suite based on what changed:
    • make tests-quick (lint + unit + parsing regression) — use when changes clearly cannot affect what the web app serves (e.g., extraction pipeline, CLI tooling, tests themselves)
    • make tests-all (lint + unit + e2e + parsing regression) — use when changes might affect the web serving path (rendering, matching, storage, templates, static assets, config)
    • When in doubt, run make tests-all
    • If e2e tests fail due to snapshot diffs, assess whether the diff is expected, and get user confirmation before running make e2e-update
  3. Update README.md if the change adds/removes/renames CLI commands, env vars, or user-facing features
  4. Update AGENTS.md if the change affects structure, convention, workflow, etc.
  5. Provide a draft commit message using Conventional Commits format
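A minimal sketch of the Conventional Commits first-line shape checked in step 5 (type, optional scope, optional `!`, description); the draft message is a hypothetical example:

```python
import re

# type(scope)!: description — scope and "!" are optional.
FIRST_LINE = re.compile(r"^(feat|fix|docs|style|refactor|perf|test|chore)(\([\w.-]+\))?!?: .+$")

draft = "fix(matcher): handle combined short flags in visitword"  # hypothetical example
assert FIRST_LINE.match(draft)
```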

LLM Benchmarking

Use the benchmark tool (tools/llm_bench.py) to compare before/after metrics when making changes to the LLM extractor. It runs extraction on a default 10-file corpus (tests/regression/llm-bench/manpages/) and produces a JSON report with aggregate metrics: extracted files, failed files, total options, zero-option pages, multi-chunk pages, and token usage. Each run creates a timestamped directory under tests/regression/llm-bench/ containing the report and raw LLM responses (prompts and response text per chunk) for post-hoc investigation. Reports include git metadata (commit, dirty state).

Each run accepts an optional -d "..." to label what this run represents. When running benchmarks, always provide a description inferred from context — e.g. the task you're working on, the nature of local changes, or "baseline (clean)" for a pre-change run. This makes list and compare output self-explanatory.
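Conceptually, `compare` boils down to per-metric deltas between two reports; a hypothetical sketch (field names and numbers assumed, the real report schema may differ):

```python
def diff_metrics(old: dict[str, int], new: dict[str, int]) -> dict[str, int]:
    """Delta of each metric present in both reports, new minus old."""
    return {k: new[k] - old[k] for k in old.keys() & new.keys()}

# Made-up numbers for illustration.
baseline = {"extracted_files": 10, "failed_files": 0, "total_options": 412}
candidate = {"extracted_files": 10, "failed_files": 1, "total_options": 430}
```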

Workflow for code changes (API, prompt, chunking, post-processing):

# 1. Stash your changes to get a clean baseline
git stash push -- explainshell/extraction/llm/

# 2. Run benchmark on the old code
python tools/llm_bench.py run --model openai/gpt-5-mini --batch 50 -d "baseline before <short summary of change>"

# 3. Restore your changes
git stash pop

# 4. Run benchmark on the new code
python tools/llm_bench.py run --model openai/gpt-5-mini --batch 50 -d "<short summary of change>"

# 5. Compare the two most recent reports
python tools/llm_bench.py compare

Usage:

# Run on the default corpus (auto-saves to report directory)
python tools/llm_bench.py run --model openai/gpt-5-mini

# Run with batch API
python tools/llm_bench.py run --model openai/gpt-5-mini --batch 50

# Run on specific files
python tools/llm_bench.py run --model openai/gpt-5-mini path/to/file.1.gz

# Save to a specific directory instead of the default report directory
python tools/llm_bench.py run --model openai/gpt-5-mini -o runs/my-test/report.json tests/regression/manpages/

# Compare the two most recent reports
python tools/llm_bench.py compare

# Compare two specific reports
python tools/llm_bench.py compare report1.json report2.json

# List all reports
python tools/llm_bench.py list

Code Style

  • Use Python type annotations on all new code (function signatures, return types, and non-obvious variables). Do not retroactively annotate existing code unless you are already modifying it.
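For instance, new code should look roughly like this (a hypothetical function, shown only to illustrate the expected annotation style):

```python
from collections.abc import Iterable

def count_short_flags(flags: Iterable[str]) -> int:
    """Count short options like -v, excluding long options like --verbose."""
    total: int = 0  # annotating a simple counter is shown here only for illustration
    for flag in flags:
        if flag.startswith("-") and not flag.startswith("--"):
            total += 1
    return total
```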

Environment

  • Python virtualenv: repo-local .venv
  • CRITICAL: Every Bash tool call runs in a fresh shell with NO venv active. You MUST prefix every Python/pip/pytest/ruff/make command with source .venv/bin/activate &&. Example: source .venv/bin/activate && make tests. Never run bare python, pytest, ruff, pip, or make without activating first.

Common Commands

# Run unit tests + doctests (excludes e2e)
make tests

# Run a single test file
pytest tests/test_matcher.py -v

# Run a single test method
pytest tests/test_matcher.py::test_matcher::test_no_options -v

# Lint
make lint

# Format
make format

# Run e2e tests (requires playwright)
make e2e

# Update e2e snapshots
make e2e-update

# Run LLM integration test (requires API key in .env)
make test-llm

# Run parsing regression tests (requires DB)
make parsing-regression

# Update DB to accept current parser output for regression manpages
make parsing-update

# Run quick tests (unit + parsing regression, no e2e)
make tests-quick

# Run all tests (unit + e2e + parsing regression)
make tests-all

# Run DB integrity checks
make db-check

# Run web server locally
make serve

# Generate Ubuntu manpage archive (requires Go)
make ubuntu-archive UBUNTU_RELEASE=resolute

# Generate Arch Linux manpage archive (requires manned.org dump)
make arch-archive

# Process a man page into the database
python -m explainshell.manager extract --mode source /path/to/manpage.1.gz

Project Structure

  • explainshell/ - Main package
    • manager.py - CLI entry point for man page processing (python -m explainshell.manager <command>)
    • db_check.py - Database integrity checks (used by manager.py db-check)
    • matcher.py - Core logic: walks bash AST and matches tokens to help text
    • models.py - Core domain types (Option, ParsedManpage, RawManpage) as Pydantic/dataclass models
    • store.py - SQLite storage layer
    • errors.py - Exception hierarchy (ProgramDoesNotExist, DuplicateManpage, InvalidSourcePath, ExtractionError, SkippedExtraction, LowConfidenceError)
    • diff.py - Man page comparison and diff formatting
    • tree_parser.py - Mandoc -T tree output parser with confidence assessment
    • roff_parser.py - Roff macro parser (man/mdoc dialects)
    • roff_utils.py - Roff source detection helpers (dashless options, nested commands)
    • manpage.py - Man page reading and HTML conversion
    • help_constants.py - Shell constant definitions for help text
    • util.py - Shared utilities (group_continuous, Peekable, name_section)
    • config.py - Configuration (DB_PATH, HOST_IP, DEBUG, MANPAGE_URLS)
    • extraction/ - Man page option extraction pipeline
      • __init__.py - Public API: make_extractor(mode) factory
      • types.py - Shared types (ExtractionResult, ExtractionStats, BatchResult, ExtractorConfig, Extractor protocol)
      • source.py - Roff-based extractor (via roff_parser.py)
      • mandoc.py - Mandoc-based extractor (via tree_parser.py)
      • hybrid.py - Hybrid extractor: mandoc with LLM fallback
      • runner.py - Execution orchestration (sequential, parallel, batch)
      • common.py - Shared metadata assembly for all extractors
      • postprocess.py - Extractor-agnostic option post-processing
      • llm/ - LLM-based extraction subpackage
        • extractor.py - LLM extractor orchestration
        • prompt.py - Prompt construction
        • response.py - LLM response parsing
        • text.py - Man page text preparation and chunking
        • providers/ - LLM provider implementations (OpenAI, Gemini, LiteLLM fallback)
    • web/views.py - Flask routes with URL-based distro/release routing
  • tools/ - Standalone scripts
    • llm_bench.py - LLM extractor benchmark tool (run/compare metrics reports)
    • fetch_manned.py - Fetch man pages from manned.org weekly dump
    • mandoc-md - Custom mandoc binary with markdown output support
  • tests/ - Unit tests (test_*.py), fixtures
  • tests/e2e/ - Playwright e2e tests, snapshots, and dedicated e2e.db
  • tests/regression/ - Parsing regression tests and manpage .gz fixtures
  • runserver.py - Flask app entry point
  • manpages/ - Git submodule (explainshell-manpages)
    • ubuntu-manpages-operator/ - Go pipeline that fetches Ubuntu .deb packages, extracts manpages, and converts them to markdown

Architecture

Man Page Processing Pipeline

manager.py orchestrates: raw .gz → parse → extract options → store in SQLite.
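Schematically, the pipeline does something like the following hedged sketch, with a stand-in extractor and a simplified table (the real stages live in extraction/ and store.py):

```python
import gzip
import json
import sqlite3

def extract_options(text: str) -> list[dict[str, str]]:
    # Stand-in for the real extractor: collect words that look like flags.
    return [{"flag": w} for line in text.splitlines()
            for w in line.split() if w.startswith("-")]

def process(gz_bytes: bytes, source: str, db: sqlite3.Connection) -> None:
    """raw .gz -> parse -> extract options -> store."""
    text = gzip.decompress(gz_bytes).decode("utf-8", errors="replace")
    db.execute(
        "INSERT INTO manpage (source, options) VALUES (?, ?)",
        (source, json.dumps(extract_options(text))),
    )
```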

The CLI uses subcommands. Most commands require a database path, set via DB_PATH env var or --db <path>. Commands that don't need a database (e.g. extract --dry-run, diff extractors) work without it. Main commands:

  • extract --mode <mode> [options] files... — Extract options from manpages and store in DB
  • diff db --mode <mode> files... — Diff fresh extraction against the database
  • diff extractors <A..B> files... — Compare two extractors head-to-head
  • show {manpage,distros,sections,manpages,mappings,stats} — Query the database
  • db-check — Run database integrity checks

Extraction modes (passed via --mode to extract or diff db):

  • source - Parses roff macros directly via roff_parser.py + extraction/source.py
  • mandoc - Uses mandoc -T tree parser via extraction/mandoc.py
  • llm:<provider/model> - Sends man page text to an LLM (e.g., llm:openai/gpt-5-mini, llm:azure/my-deployment). Supports Gemini, OpenAI, Azure OpenAI, and LiteLLM (fallback) providers. For azure/..., the model suffix is the Azure deployment name and requires AZURE_OPENAI_API_KEY plus either AZURE_OPENAI_BASE_URL or AZURE_OPENAI_ENDPOINT.
  • hybrid:<provider/model> - Tries mandoc first, falls back to LLM on low confidence

Extract flags: --overwrite, --filter-db <spec> (conditional overwrite; requires --overwrite; same syntax as --mode minus hybrid), --dry-run, --debug, --drop, -j/--jobs <int> (parallel extraction, default 1), --batch <int> (provider batch API). All run output (logs, debug artifacts, manifests) goes to logs/{timestamp}/.

Data Model (models.py, store.py)

SQLite with two tables:

  • manpage - source (unique basename), name, synopsis, options (JSON), aliases, flags
  • mapping - command name → manpage id lookup (many-to-one, with score for preference)
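A hedged sketch of that layout in SQL (column sets inferred from the description above; the real schema in store.py may differ):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE manpage (
    id INTEGER PRIMARY KEY,
    source TEXT UNIQUE,   -- unique basename, e.g. ls.1.gz
    name TEXT,
    synopsis TEXT,
    options TEXT          -- JSON-encoded option list
);
CREATE TABLE mapping (
    name TEXT,            -- command name typed by the user
    manpage_id INTEGER REFERENCES manpage(id),
    score INTEGER         -- preference among several candidate manpages
);
""")
```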

Key classes (Pydantic models in models.py):

  • Option - text, short/long flag lists, has_argument, positional, nested_cmd
  • ParsedManpage - container with options/positionals properties and find_option(flag) lookup
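As a rough illustration of Option's shape (a plain-dataclass sketch; the real class is a Pydantic model and the exact field names may differ):

```python
from dataclasses import dataclass, field

@dataclass
class Option:
    text: str                                        # help text shown to the user
    short: list[str] = field(default_factory=list)   # e.g. ["-v"]
    long: list[str] = field(default_factory=list)    # e.g. ["--verbose"]
    has_argument: bool = False
    positional: bool = False
    nested_cmd: bool = False
```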

Command Matching (matcher.py)

Uses bashlex AST visitor pattern:

  • Matcher inherits from bashlex.ast.nodevisitor
  • visitcommand() - looks up man page, handles multi-command (e.g., git commit)
  • visitword() - matches tokens to options (exact match, then fuzzy split for combined short flags like -abc)
  • Produces MatchResult(start, end, text, match) where start/end are character positions in the original string

E2E Tests

Hermetic setup: uses a dedicated tests/e2e/e2e.db and random port selection. Server is started fresh per run (reuseExistingServer: false).
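Random port selection typically means asking the OS for an ephemeral port; a minimal sketch of that pattern (not necessarily how the Playwright config does it):

```python
import socket

def free_port() -> int:
    """Bind to port 0 so the OS picks a free ephemeral port, then return it."""
    with socket.socket() as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]
```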

Deployment

The app is deployed to Fly.io with two machines in the iad (Virginia) region. The SQLite database is baked into the Docker image at build time (downloaded as .zst from the GitHub release, decompressed during docker build).

Production infrastructure:

  • Domain: explainshell.com → Cloudflare (orange cloud proxy) → Fly.io
  • Cloudflare: DNS + proxy, SSL mode set to Full (Strict)
  • Fly app: explainshell — VM size, region, and machine config are in fly.toml
  • Direct origin access: The .fly.dev hostname is disabled (auto_assign_hostname = false) so all traffic must pass through Cloudflare. Use fly proxy 8080 to reach the origin directly for debugging.
  • Old DigitalOcean box: 174.138.81.104 (New Jersey) — kept as fallback; rollback = point Cloudflare DNS back to this IP

Latency baseline (Cloudflare → origin TTFB for /explain): ~140ms average.

Deploy code changes:

fly deploy

Update the database:

  1. make upload-live-db (uploads a date-stamped explainshell-{date}.db.zst to the GitHub release)
  2. fly deploy (rebuilds the image with the new DB)