A web tool that parses man pages and explains command-line arguments by matching each argument to its help text.
- Python 3.12, Flask, SQLite, bashlex, OpenAI SDK, Google Gemini SDK, LiteLLM (fallback)
- Linting: ruff (Python), biome (JS)
- Testing: pytest (unit + doctests + parsing regression), JS Playwright Test (e2e)
- Dependencies: `requirements.txt` (main), `package.json` (Playwright e2e)
Before finishing any task, always:
- Run `make format`
- Run tests — choose the right suite based on what changed:
  - `make tests-quick` (lint + unit + parsing regression) — use when changes clearly cannot affect what the web app serves (e.g., extraction pipeline, CLI tooling, tests themselves)
  - `make tests-all` (lint + unit + e2e + parsing regression) — use when changes might affect the web serving path (rendering, matching, storage, templates, static assets, config)
  - When in doubt, run `make tests-all`
- If e2e tests fail due to snapshot diffs, assess whether the diff is expected, and get user confirmation before running `make e2e-update`
- Update README.md if the change adds/removes/renames CLI commands, env vars, or user-facing features
- Update AGENTS.md if the change affects project structure, conventions, or workflow
- Provide a draft commit message using Conventional Commits format
Use the benchmark tool (`tools/llm_bench.py`) to compare before/after metrics when making changes to the LLM extractor. It runs extraction on a default 10-file corpus (`tests/regression/llm-bench/manpages/`) and produces a JSON report with aggregate metrics: extracted files, failed files, total options, zero-option pages, multi-chunk pages, and token usage. Each run creates a timestamped directory under `tests/regression/llm-bench/` containing the report and raw LLM responses (prompts and response text per chunk) for post-hoc investigation. Reports include git metadata (commit, dirty state).
Each run accepts an optional `-d "..."` flag to label what the run represents. When running benchmarks, always provide a description inferred from context — e.g. the task you're working on, the nature of local changes, or "baseline (clean)" for a pre-change run. This makes `list` and `compare` output self-explanatory.
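A minimal sketch of the kind of before/after delta a comparison could compute from two report files (the field names here are assumptions inferred from the metrics listed above, not the tool's actual report schema):

```python
def compare_reports(baseline: dict, candidate: dict) -> dict[str, int]:
    """Return metric deltas (candidate - baseline) for metrics present in both reports."""
    metrics = ("extracted_files", "failed_files", "total_options",
               "zero_option_pages", "multi_chunk_pages")
    return {m: candidate[m] - baseline[m] for m in metrics
            if m in baseline and m in candidate}

baseline = {"extracted_files": 10, "failed_files": 0, "total_options": 412}
candidate = {"extracted_files": 10, "failed_files": 1, "total_options": 425}
print(compare_reports(baseline, candidate))
# {'extracted_files': 0, 'failed_files': 1, 'total_options': 13}
```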
Workflow for code changes (API, prompt, chunking, post-processing):
```
# 1. Stash your changes to get a clean baseline
git stash push -- explainshell/extraction/llm/
# 2. Run benchmark on the old code
python tools/llm_bench.py run --model openai/gpt-5-mini --batch 50 -d "baseline before <short summary of change>"
# 3. Restore your changes
git stash pop
# 4. Run benchmark on the new code
python tools/llm_bench.py run --model openai/gpt-5-mini --batch 50 -d "<short summary of change>"
# 5. Compare the two most recent reports
python tools/llm_bench.py compare
```
Usage:
```
# Run on the default corpus (auto-saves to report directory)
python tools/llm_bench.py run --model openai/gpt-5-mini
# Run with batch API
python tools/llm_bench.py run --model openai/gpt-5-mini --batch 50
# Run on specific files
python tools/llm_bench.py run --model openai/gpt-5-mini path/to/file.1.gz
# Save to a specific directory instead of the default report directory
python tools/llm_bench.py run --model openai/gpt-5-mini -o runs/my-test/report.json tests/regression/manpages/
# Compare the two most recent reports
python tools/llm_bench.py compare
# Compare two specific reports
python tools/llm_bench.py compare report1.json report2.json
# List all reports
python tools/llm_bench.py list
```
- Use Python type annotations on all new code (function signatures, return types, and non-obvious variables). Do not retroactively annotate existing code unless you are already modifying it.
- Python virtualenv: repo-local `.venv`
- CRITICAL: Every Bash tool call runs in a fresh shell with NO venv active. You MUST prefix every Python/pip/pytest/ruff/make command with `source .venv/bin/activate &&`. Example: `source .venv/bin/activate && make tests`. Never run bare `python`, `pytest`, `ruff`, `pip`, or `make` without activating first.
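The annotation convention above might look like this in practice, using a hypothetical helper (not actual project code):

```python
def count_option_tokens(text: str, *, min_length: int = 2) -> int:
    """Hypothetical helper illustrating the expected annotation style:
    count whitespace-separated tokens that look like option flags."""
    tokens: list[str] = text.split()
    return sum(1 for tok in tokens if tok.startswith("-") and len(tok) >= min_length)

n: int = count_option_tokens("-x --verbose file.txt -")
print(n)  # 2  ('-x' and '--verbose'; the bare '-' is too short)
```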
```
# Run unit tests + doctests (excludes e2e)
make tests
# Run a single test file
pytest tests/test_matcher.py -v
# Run a single test method
pytest tests/test_matcher.py::test_matcher::test_no_options -v
# Lint
make lint
# Format
make format
# Run e2e tests (requires playwright)
make e2e
# Update e2e snapshots
make e2e-update
# Run LLM integration test (requires API key in .env)
make test-llm
# Run parsing regression tests (requires DB)
make parsing-regression
# Update DB to accept current parser output for regression manpages
make parsing-update
# Run quick tests (unit + parsing regression, no e2e)
make tests-quick
# Run all tests (unit + e2e + parsing regression)
make tests-all
# Run DB integrity checks
make db-check
# Run web server locally
make serve
# Generate Ubuntu manpage archive (requires Go)
make ubuntu-archive UBUNTU_RELEASE=resolute
# Generate Arch Linux manpage archive (requires manned.org dump)
make arch-archive
# Process a man page into the database
python -m explainshell.manager extract --mode source /path/to/manpage.1.gz
```
Repository layout:
- `explainshell/` - Main package
  - `manager.py` - CLI entry point for man page processing (`python -m explainshell.manager <command>`)
  - `db_check.py` - Database integrity checks (used by `manager.py db-check`)
  - `matcher.py` - Core logic: walks bash AST and matches tokens to help text
  - `models.py` - Core domain types (Option, ParsedManpage, RawManpage) as Pydantic/dataclass models
  - `store.py` - SQLite storage layer
  - `errors.py` - Exception hierarchy (ProgramDoesNotExist, DuplicateManpage, InvalidSourcePath, ExtractionError, SkippedExtraction, LowConfidenceError)
  - `diff.py` - Man page comparison and diff formatting
  - `tree_parser.py` - Mandoc `-T tree` output parser with confidence assessment
  - `roff_parser.py` - Roff macro parser (man/mdoc dialects)
  - `roff_utils.py` - Roff source detection (dashless opts, nested cmd)
  - `manpage.py` - Man page reading and HTML conversion
  - `help_constants.py` - Shell constant definitions for help text
  - `util.py` - Shared utilities (group_continuous, Peekable, name_section)
  - `config.py` - Configuration (DB_PATH, HOST_IP, DEBUG, MANPAGE_URLS)
  - `extraction/` - Man page option extraction pipeline
    - `__init__.py` - Public API: `make_extractor(mode)` factory
    - `types.py` - Shared types (ExtractionResult, ExtractionStats, BatchResult, ExtractorConfig, Extractor protocol)
    - `source.py` - Roff-based extractor (via `roff_parser.py`)
    - `mandoc.py` - Mandoc-based extractor (via `tree_parser.py`)
    - `hybrid.py` - Hybrid extractor: mandoc with LLM fallback
    - `runner.py` - Execution orchestration (sequential, parallel, batch)
    - `common.py` - Shared metadata assembly for all extractors
    - `postprocess.py` - Extractor-agnostic option post-processing
    - `llm/` - LLM-based extraction subpackage
      - `extractor.py` - LLM extractor orchestration
      - `prompt.py` - Prompt construction
      - `response.py` - LLM response parsing
      - `text.py` - Man page text preparation and chunking
      - `providers/` - LLM provider implementations (OpenAI, Gemini, LiteLLM fallback)
  - `web/views.py` - Flask routes with URL-based distro/release routing
- `tools/` - Standalone scripts
  - `llm_bench.py` - LLM extractor benchmark tool (run/compare metrics reports)
  - `fetch_manned.py` - Fetch man pages from manned.org weekly dump
  - `mandoc-md` - Custom mandoc binary with markdown output support
- `tests/` - Unit tests (`test_*.py`), fixtures
  - `tests/e2e/` - Playwright e2e tests, snapshots, and a dedicated `e2e.db`
  - `tests/regression/` - Parsing regression tests and manpage `.gz` fixtures
- `runserver.py` - Flask app entry point
- `manpages/` - Git submodule (explainshell-manpages)
- `ubuntu-manpages-operator/` - Go pipeline that fetches Ubuntu `.deb` packages, extracts manpages, and converts them to markdown
`manager.py` orchestrates the pipeline: raw `.gz` → parse → extract options → store in SQLite.
The CLI uses subcommands. Most commands require a database path, set via the `DB_PATH` env var or `--db <path>`. Commands that don't need a database (e.g. `extract --dry-run`, `diff extractors`) work without it. Main commands:
- `extract --mode <mode> [options] files...` — Extract options from manpages and store in DB
- `diff db --mode <mode> files...` — Diff fresh extraction against the database
- `diff extractors <A..B> files...` — Compare two extractors head-to-head
- `show {manpage,distros,sections,manpages,mappings,stats}` — Query the database
- `db-check` — Run database integrity checks
Extraction modes (passed via `--mode` to `extract` or `diff db`):
- `source` - Parses roff macros directly via `roff_parser.py` + `extraction/source.py`
- `mandoc` - Uses the mandoc `-T tree` parser via `extraction/mandoc.py`
- `llm:<provider/model>` - Sends man page text to an LLM (e.g., `llm:openai/gpt-5-mini`, `llm:azure/my-deployment`). Supports Gemini, OpenAI, Azure OpenAI, and LiteLLM (fallback) providers. For `azure/...`, the model suffix is the Azure deployment name and requires `AZURE_OPENAI_API_KEY` plus either `AZURE_OPENAI_BASE_URL` or `AZURE_OPENAI_ENDPOINT`.
- `hybrid:<provider/model>` - Tries mandoc first, falls back to LLM on low confidence
Extract flags: `--overwrite`, `--filter-db <spec>` (conditional overwrite; requires `--overwrite`; same syntax as `--mode` minus `hybrid`), `--dry-run`, `--debug`, `--drop`, `-j`/`--jobs <int>` (parallel extraction, default 1), `--batch <int>` (provider batch API). All run output (logs, debug artifacts, manifests) goes to `logs/{timestamp}/`.
SQLite with two tables:
- `manpage` - source (unique basename), name, synopsis, options (JSON), aliases, flags
- `mapping` - command name → manpage id lookup (many-to-one, with score for preference)
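A minimal runnable sketch of that layout and the lookup it supports (the exact DDL, and column names beyond those listed above, are assumptions; some columns are omitted):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE manpage (
    id INTEGER PRIMARY KEY,
    source TEXT UNIQUE NOT NULL,  -- unique basename, e.g. 'tar.1.gz'
    name TEXT,
    synopsis TEXT,
    options TEXT                  -- JSON-encoded option list
);
CREATE TABLE mapping (
    command TEXT NOT NULL,                             -- command name as typed
    manpage_id INTEGER NOT NULL REFERENCES manpage(id),
    score INTEGER NOT NULL                             -- preference among candidates
);
""")
conn.execute("INSERT INTO manpage (id, source, name) VALUES (1, 'tar.1.gz', 'tar')")
conn.execute("INSERT INTO mapping VALUES ('tar', 1, 100)")

# Many-to-one lookup: pick the highest-scoring manpage for a command
row = conn.execute(
    "SELECT m.name FROM mapping map JOIN manpage m ON m.id = map.manpage_id "
    "WHERE map.command = ? ORDER BY map.score DESC LIMIT 1", ("tar",)
).fetchone()
print(row[0])  # tar
```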
Key classes (Pydantic models in `models.py`):
- `Option` - text, short/long flag lists, has_argument, positional, nested_cmd
- `ParsedManpage` - container with options/positionals properties and a `find_option(flag)` lookup
Uses the bashlex AST visitor pattern:
- `Matcher` inherits from `bashlex.ast.nodevisitor`
- `visitcommand()` - looks up the man page; handles multi-command invocations (e.g., `git commit`)
- `visitword()` - matches tokens to options (exact match first, then fuzzy split for combined short flags like `-abc`)
- Produces `MatchResult(start, end, text, match)` where start/end are character positions in the original string
Hermetic setup: uses a dedicated `tests/e2e/e2e.db` and random port selection. The server is started fresh per run (`reuseExistingServer: false`).
The app is deployed to Fly.io with two machines in the iad (Virginia) region. The SQLite database is baked into the Docker image at build time (downloaded as `.zst` from the GitHub release and decompressed during `docker build`).
Production infrastructure:
- Domain: `explainshell.com` → Cloudflare (orange-cloud proxy) → Fly.io
- Cloudflare: DNS + proxy, SSL mode set to Full (Strict)
- Fly app: `explainshell` — VM size, region, and machine config are in `fly.toml`
- Direct origin access: the `.fly.dev` hostname is disabled (`auto_assign_hostname = false`), so all traffic must pass through Cloudflare. Use `fly proxy 8080` to reach the origin directly for debugging.
- Old DigitalOcean box: `174.138.81.104` (New Jersey) — kept as a fallback; rollback = point Cloudflare DNS back to this IP
Latency baseline (Cloudflare → origin TTFB for `/explain`): ~140 ms on average.
Deploy code changes: `fly deploy`

Update the database:
1. `make upload-live-db` (uploads a date-stamped `explainshell-{date}.db.zst` to the GitHub release)
2. `fly deploy` (rebuilds the image with the new DB)