fix(cli): bundle MarkItDown in workspace installs + diagnose missing converter#469
fix(cli): bundle MarkItDown in workspace installs + diagnose missing converter#469Jurij89 wants to merge 2 commits into
Conversation
…converter Resolves #467. Workspace/dev `pnpm install` now stages the MarkItDown binary so PDF/DOCX/etc. extraction works out of the box, matching the published-install experience. Postinstall flow (packages/cli/scripts/bundle-markitdown-binaries.mjs): - Workspace checkouts attempt the release-asset download by default (previously a silent skip). On HTTP 404 — common when the local package.json version is ahead of the latest published tag — fall back to the newest non-draft release and rewrite the meta sidecar to reflect the local cliVersion (preserving the actual release tag in an `effectiveTag` field) so the daemon's converter check accepts the binary instead of treating it as a stale fingerprint mismatch. - Honor DKG_SKIP_MARKITDOWN_DOWNLOAD=1 opt-out and skip silently in CI environments to avoid burning CI quota on implicit downloads. - Wrap the `python3 -m venv` step in `markitdown:build` so missing python3-venv / Python emits a per-OS install hint instead of an opaque `Command failed` error. - New `--verify-release-artifacts <dir>` mode plus a release.yml step that fails the workflow when any matrix-built binary or its sidecar is missing/mismatched before `softprops/action-gh-release` publishes. Runtime diagnostic surfaces (issue #467 acceptance criterion 5): - /api/status now returns `extractionPipelines: string[]` and `warnings: [{code, message, remediation?}]`. When MarkItDown is missing the response carries a `markitdown_unavailable` warning with the remediation command. Surfaces via the existing `dkg_status` MCP tool so agents see the gap programmatically. - /.well-known/skill.md gains an `Unavailable extraction pipelines` line when MarkItDown is absent, listing the MIME types and the remediation command. - Static SKILL.md template points agents at `dkg_status` for the live pipeline list (templates copied to OpenClaw/Hermes home folders retain `(dynamic)` placeholders). - Daemon startup warning now tells operators to run `pnpm --filter @origintrail-official/dkg run markitdown:bundle` and restart, mirroring the bundle script's own success log. Tests: 14 new vitest cases in packages/cli/test/markitdown-binaries.test.ts cover the implicit current-platform flow (workspace download, already- staged skip, version-ahead fallback, total-failure remediation, opt-out env var, CI skip), the venv error rewriter, and the new release-artifact verifier. Defensive: packages/cli/bin/ added to .gitignore with a .gitkeep so the directory stays present for the bundler to write into. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| } | ||
|
|
||
| const result = await ensureCurrentPlatformBinary({ | ||
| await bundleImplicitCurrentPlatform({ |
There was a problem hiding this comment.
🔴 Bug: bundleImplicitCurrentPlatform() now turns download problems into { status: 'failed' } instead of throwing, but this caller drops the result. In --current-platform mode that means the script exits 0 even when staging failed, unless --best-effort happens to be set. Please check the returned status here and throw on failure/unexpected outcomes so explicit runs still fail fast.
There was a problem hiding this comment.
Fixed in cc8ab93. main() now inspects bundleImplicitCurrentPlatform's return value and throws when status === 'failed' unless --best-effort is set, so explicit pnpm run markitdown:bundle (no --best-effort) exits 1. Verified locally with --release-repo this-org-does-not-exist/dkg-fake: remediation message printed, then the new wrapper error, exit code 1. Postinstall (which sets --best-effort) still exits 0 with the warning, unchanged.
| message: | ||
| "MarkItDown binary not registered — PDF, DOCX, PPTX, XLSX, CSV, HTML, EPUB, and XML extraction unavailable. Imports of those types will return extraction.status='skipped' and the file will be stored as a blob.", | ||
| remediation: | ||
| "Run `pnpm --filter @origintrail-official/dkg run markitdown:bundle`, then restart the daemon.", |
There was a problem hiding this comment.
🔴 Bug: this remediation tells operators to run pnpm --filter @origintrail-official/dkg run markitdown:bundle, but packages/cli/package.json only defines markitdown:build today. Following the warning will hit a missing-script error and leave extraction unavailable. Either add a real markitdown:bundle script or change this and the matching log/skill text to a command that actually exists.
There was a problem hiding this comment.
Fixed in cc8ab93. Added "markitdown:bundle": "node ./scripts/bundle-markitdown-binaries.mjs --current-platform --force" to packages/cli/package.json. pnpm --filter @origintrail-official/dkg run markitdown:bundle now resolves, downloads the binary (using the workspace fallback if the local version is ahead of the latest published tag), and exits non-zero on failure — verified locally.
| } | ||
| }); | ||
|
|
||
| it('workspace + 404 + injected latest URL stages the fallback binary', async () => { |
There was a problem hiding this comment.
🟡 Issue: this test title/comment says the fallback binary is staged, but the assertions only verify the failure path (status === 'failed'). That leaves the new fallback-staged branch effectively untested and makes the coverage misleading. Either exercise the happy path here or rename the test/comments to match the behavior being asserted.
There was a problem hiding this comment.
Fixed in cc8ab93. Added an injectable buildReleaseUrl(version, repo) parameter to bundleImplicitCurrentPlatform so tests can redirect BOTH the primary and fallback URLs to a local mock server (previously the fallback path called releaseBaseUrl directly and hit github.com). Removed the weak-assertion variant. The new test workspace + version-tag 404 + injected URL builder stages the fallback binary with rewritten meta exercises the full happy path: 404 on version tag → fetchLatestTag returns the latest → fallback downloads → status === 'fallback-staged', binary present on disk, and (critical for issue #467) meta sidecar reports cliVersion: LOCAL with effectiveTag: LATEST_TAG so the daemon's converter check accepts it.
…n:bundle script, real fallback test Three Codex review findings on PR #469: 1. Explicit `--current-platform` runs (without --best-effort) used to exit 0 even when staging failed because main() dropped the result of bundleImplicitCurrentPlatform(). Now main() inspects the result and throws when status === 'failed' so the IIFE error handler surfaces a non-zero exit, while postinstall (which sets --best-effort) keeps its swallowing behavior. Verified by running with a bogus --release-repo: remediation message printed, exit 1. 2. Remediation messages told operators to run `markitdown:bundle` but packages/cli/package.json defined only `markitdown:build`. Added the missing script — `node ./scripts/bundle-markitdown-binaries.mjs --current-platform --force` — so `pnpm --filter @origintrail-official/dkg run markitdown:bundle` resolves and actually downloads the binary. 3. The earlier "workspace + 404 + injected latest URL stages the fallback binary" test only asserted the failure path (the local server could not redirect the fallback URL because release.mjs called github.com directly). Added an injectable `buildReleaseUrl` parameter on bundleImplicitCurrentPlatform, removed the weak-assertion variant, and the new test exercises the full happy path including the critical meta-rewrite invariant (meta.cliVersion = LOCAL version, effectiveTag = actual release tag — otherwise the daemon's converter check rejects the binary). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
pnpm installnow stages the MarkItDown binary by default so PDF/DOCX/PPTX/XLSX/CSV/HTML/EPUB/XML extraction works out of the box, matching the published-install experience. Previously the postinstall silently skipped the release-asset download for source checkouts and PDFs returnedextraction.status='skipped'.package.jsonversion is ahead of the latest published tag (the common pre-release case), the bundler falls back to the newest non-draft release and rewrites the meta sidecar with the localcliVersion(preserving the release tag in aneffectiveTagfield) so the daemon's converter check accepts the binary instead of rejecting it as a fingerprint mismatch./api/statuscarriesextractionPipelines: string[]+warnings: [{code, message, remediation?}], the live/.well-known/skill.mdadds anUnavailable extraction pipelinesline, the staticSKILL.mdtemplate points agents at thedkg_statustool, and the daemon startup warning tells operators exactly which command to run and that a restart is required.DKG_SKIP_MARKITDOWN_DOWNLOAD=1opt-out, automatic silent skip in CI, per-OS install hint whenpython3-venvis missing frommarkitdown:build, and a release-time--verify-release-artifactscheck inrelease.ymlthat fails the workflow if any matrix-built binary or sidecar is missing/mismatched beforesoftprops/action-gh-releasepublishes.Related
skippedresponse with a structured reason field, and loosening the OpenClaw attachment-ref normalizer to allow non-completedstatuses so agents see the skip reasonFiles changed
packages/cli/scripts/bundle-markitdown-binaries.mjsbundleImplicitCurrentPlatform()workflow with workspace fallback to latest tag,fetchLatestReleaseTag()(uses/releaseslisting —/releases/latestreturns 404 on prerelease-only repos),verifyReleaseArtifacts(),rewriteVenvError(), restart-hint logging, opt-out env vars. Removed the silent workspace-skip short-circuit.packages/cli/src/daemon/routes/status.tsextractionPipelinesandwarnings[]on/api/status;unavailablePipelinespassed tobuildSkillMdfor the live skill manifest.packages/cli/src/daemon/manifest.tsbuildSkillMdacceptsunavailablePipelines, renders the diagnostic line, placeholder match updated in lockstep with the template wording.packages/cli/src/daemon/lifecycle.tspnpm --filter @origintrail-official/dkg run markitdown:bundle+ restart-the-daemon remediation.packages/cli/skills/dkg-node/SKILL.mddkg_statustool (matching the template's tool-first convention) instead of raw HTTP for the live pipeline list..github/workflows/release.yml--verify-release-artifacts) runs beforesoftprops/action-gh-releaseand fails the workflow if any matrix-built binary or sidecar is missing/mismatched..gitignore+packages/cli/bin/.gitkeeppackages/cli/test/markitdown-binaries.test.tsTest plan
pnpm exec vitest run test/markitdown-binaries.test.ts test/extraction-markitdown.test.ts test/document-processor-e2e.test.ts→ 50 passed, 4 skipped (the skipIf'd e2e cases need a real converter; follow-up could remove the guards now that the workspace flow actually stages one).pnpm exec tsc --noEmitinpackages/cli/→ clean.pnpm installstagesmarkitdown-win32-x64.exevia fallback tov10.0.0-rc.4;bin/markitdown-win32-x64.exe.meta.jsonreportscliVersion: "10.0.0-rc.6", effectiveTag: "v10.0.0-rc.4";extraction-markitdown.test.tsno longer logs the "Ignoring bundled binary with incompatible metadata sidecar" warning that fires before this PR.slot-helpers,migration,rollback,blue-green-integration,auto-update, etc. on the dev machine are pre-existing and unrelated to these files.curl /api/status | jq '.extractionPipelines, .warnings'returns the full pipeline list and[]. With binary deleted and daemon restarted, returns justtext/markdownand a singlemarkitdown_unavailablewarning withremediation.curl /.well-known/skill.md | grep -i 'extraction pipelines'shows theAvailableline when the binary is present, and BOTH anAvailableand anUnavailableline (listing the 9 MIME types) when it is missing.POST /api/assertion/test-pdf/import-file→ response carriesextraction.status: "completed",pipelineUsed: "application/pdf",tripleCount > 0, and anmdIntermediateHash.🤖 Generated with Claude Code