web, api: agent-readiness discovery and proper 404s #518

Open: snormore wants to merge 3 commits into main from snor/agent-ready-discovery

@snormore snormore commented Apr 23, 2026

Contributor

Summary of Changes

  • Add the standard agent-readiness discovery surface so the site stops scoring Level 0 on https://isitagentready.com: robots.txt, sitemap.xml, /.well-known/mcp/server-card.json, and llms.txt. After deploy the PR preview scores Level 2 (Bot-Aware).
  • Server card and llms.txt describe the MCP endpoint at /api/mcp and enumerate its tools (execute_sql, execute_cypher, get_schema, read_docs), resources, and prompts so agents can discover them without speaking MCP first.
  • Declare Content Signals in robots.txt: search=yes, ai-input=yes, ai-train=yes. Policy default is permissive since the content is public network telemetry and the platform's posture is agent-friendly; flip ai-train to no if that changes.
  • Sitemap covers the 41 user-reachable pages (sidebar nav + tab/popover links). Hidden routes (/query, /timeline, /dz/ledger, /dz/shreds/funders) are excluded; /query/:id, /chat/:id, and /settings are disallowed in robots.
  • Fix the Go spaHandler to 404 any missing /.well-known/* path instead of falling through to the SPA shell. Previously, extensionless agent-discovery probes (api-catalog, openid-configuration, oauth-protected-resource, ucp, web-bot-auth directory) were soft-404s returning 200 HTML, which made the scanner flag them as misconfigured rather than simply missing.
  • Fix the local dev nginx config to return real 404s for missing extension-bearing paths (matching the Go handler's behavior), and mark /api/ with ^~ so the new regex can never shadow API requests with file-shaped paths.

Diff Breakdown

| Category | Files | Lines (+/-) | Net |
|---|---|---|---|
| Core logic | 1 | +8 / -0 | +8 |
| Scaffolding | 5 | +128 / -3 | +125 |
| Tests | 1 | +62 / -0 | +62 |
| Total | 7 | +198 / -3 | +195 |

Mostly static discovery files and config; the one behavior change is the eight-line /.well-known/* 404 branch in spaHandler.
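
For readers who have not opened the diff, the shape of that branch can be sketched as follows. This is an approximation under stated assumptions, not the code in api/main.go: `newSPAHandler`, `staticExts`, `demoFS`, and `statusFor` are hypothetical names, the extension whitelist is illustrative, and the build output is modeled with an in-memory filesystem.

```go
package main

import (
	"fmt"
	"io/fs"
	"net/http"
	"net/http/httptest"
	"path"
	"strings"
	"testing/fstest"
)

// staticExts is an illustrative whitelist; the real set is not listed in this PR description.
var staticExts = map[string]bool{".js": true, ".css": true, ".json": true, ".png": true}

// newSPAHandler is a hypothetical stand-in for spaHandler.
func newSPAHandler(dist fs.FS) http.Handler {
	files := http.FileServer(http.FS(dist))
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		clean := path.Clean("/" + r.URL.Path)
		name := strings.TrimPrefix(clean, "/")

		// Real files in the build output are served as-is.
		if name != "" {
			if info, err := fs.Stat(dist, name); err == nil && !info.IsDir() {
				files.ServeHTTP(w, r)
				return
			}
		}
		// RFC 8615 discovery paths are never SPA routes: 404 on miss,
		// with or without a file extension.
		if strings.HasPrefix(clean, "/.well-known/") {
			http.NotFound(w, r)
			return
		}
		// Missing static assets also get a real 404.
		if staticExts[path.Ext(clean)] {
			http.NotFound(w, r)
			return
		}
		// Everything else is treated as a client-side route.
		index, err := fs.ReadFile(dist, "index.html")
		if err != nil {
			http.Error(w, "index.html not found", http.StatusInternalServerError)
			return
		}
		w.Header().Set("Content-Type", "text/html; charset=utf-8")
		w.Write(index)
	})
}

// demoFS models a tiny build output with one discovery document.
func demoFS() fs.FS {
	return fstest.MapFS{
		"index.html":                       {Data: []byte("<!doctype html>")},
		".well-known/mcp/server-card.json": {Data: []byte("{}")},
	}
}

// statusFor returns the status code h produces for an in-memory GET.
func statusFor(h http.Handler, target string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, target, nil))
	return rec.Code
}

func main() {
	h := newSPAHandler(demoFS())
	for _, p := range []string{"/status", "/.well-known/mcp/server-card.json",
		"/.well-known/api-catalog", "/missing-asset.css"} {
		fmt.Println(p, statusFor(h, p))
	}
}
```

Placing the /.well-known/ check ahead of the extension whitelist is what lets extensionless probes such as /.well-known/api-catalog return 404 instead of the shell.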

Key files
  • api/main.go: spaHandler now 404s missing /.well-known/* paths instead of returning the SPA shell.
  • api/main_test.go — New TestSpaHandler covering existing files, SPA fallback, whitelisted asset 404s, and both extension-bearing and extensionless /.well-known/* 404s (9 cases).
  • web/public/llms.txt — Markdown overview of the MCP server, tools, resources, prompts, and main web app sections.
  • web/public/sitemap.xml — 41 canonical URLs covering every page reachable from the UI.
  • web/public/.well-known/mcp/server-card.json — Standard MCP discovery document pointing agents at /api/mcp.
  • k8s/docker/nginx.conf — Regex location 404s missing extension-bearing paths; ^~ on /api/ keeps proxy intact.
  • web/public/robots.txt — Content Signals line plus allow/disallow rules and sitemap pointer.

Testing Verification

  • Added TestSpaHandler with 9 subtests covering existing files, SPA fallback, whitelisted static-asset 404s, and /.well-known/* 404s (both extension-bearing and extensionless). All pass locally.
  • Validated the nginx config syntax (nginx -t) in an alpine container with the updated file mounted, then spun up nginx against web/dist and confirmed:
    • /, /status, /dz/devices/ABC123 → 200 SPA shell.
    • /robots.txt, /sitemap.xml, /.well-known/mcp/server-card.json, /llms.txt → 200 with the right content-types (text/plain, text/xml, application/json, text/plain).
    • /nope.json, /missing-asset.css → 404 (was 200 HTML before).
  • Ran the isitagentready.com scanner against the deployed PR preview after each commit. Progression: Level 0 → Level 1 (robots/sitemap/MCP card land) → Level 2 (Content Signals). The 5 paths that previously soft-404'd (/.well-known/api-catalog, openid-configuration, oauth-protected-resource, ucp, web-bot-auth dir) all report clean "not found" now.
  • Cross-checked every URL in sitemap.xml against the React source to confirm each has at least one non-Route reference (sidebar, tab, popover, or in-page link); removed entries that didn't.
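
The sitemap cross-check in the last bullet was done by hand against the React source; a rough sketch of how it could be automated is below. The sample sitemap, the example.com host, and the `linked` set are illustrative assumptions, and `sitemapPaths` and `unlinked` are hypothetical helper names.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"net/url"
	"strings"
)

// sampleSitemap is illustrative; the real file lists 41 URLs.
const sampleSitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/status</loc></url>
  <url><loc>https://example.com/query</loc></url>
</urlset>`

// urlSet mirrors the parts of the sitemap schema needed here.
type urlSet struct {
	URLs []struct {
		Loc string `xml:"loc"`
	} `xml:"url"`
}

// sitemapPaths returns the URL path of every <loc> entry, in document order.
func sitemapPaths(data []byte) ([]string, error) {
	var set urlSet
	if err := xml.Unmarshal(data, &set); err != nil {
		return nil, err
	}
	paths := make([]string, 0, len(set.URLs))
	for _, u := range set.URLs {
		parsed, err := url.Parse(strings.TrimSpace(u.Loc))
		if err != nil {
			return nil, err
		}
		paths = append(paths, parsed.Path)
	}
	return paths, nil
}

// unlinked returns the sitemap paths that have no known UI reference.
func unlinked(paths []string, linked map[string]bool) []string {
	var out []string
	for _, p := range paths {
		if !linked[p] {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	paths, err := sitemapPaths([]byte(sampleSitemap))
	if err != nil {
		panic(err)
	}
	// linked would come from scanning sidebar, tab, and popover links (assumed input).
	linked := map[string]bool{"/status": true}
	fmt.Println("candidates to remove:", unlinked(paths, linked))
}
```

Building the `linked` set still requires reading the React source, so this only mechanizes the comparison step.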

Adds standard agent-readiness discovery surface:
- robots.txt with sitemap pointer
- sitemap.xml of 41 user-reachable pages
- /.well-known/mcp/server-card.json pointing at /api/mcp
- llms.txt summarizing the MCP tools, resources, and site sections

Updates nginx so paths with file extensions return real 404s instead of
falling through to the SPA shell (was a soft-404 for every missing asset),
and short-circuits regex evaluation for /api/ to keep proxy paths intact.

github-actions Bot commented Apr 23, 2026

🔗 Preview: https://pr-518.data.malbeclabs.com

The Go spaHandler whitelisted a fixed set of static extensions to 404 on
miss, and fell back to index.html for everything else. That meant
extensionless agent-discovery probes (/.well-known/api-catalog,
openid-configuration, oauth-protected-resource, etc.) returned 200 HTML —
a soft-404 that flagged the site as not agent-ready.

Anything under /.well-known/ is an RFC 8615 discovery endpoint and is
never a SPA route, so 404 missing files unconditionally.

Adds Cloudflare-style Content Signals declaring AI usage preferences:
search, ai-input (live RAG/assistants), and ai-train all set to 'yes' —
the data is public network telemetry and the platform's posture is
agent-friendly. Bumps the agent-readiness scanner to Level 2 (Bot-Aware).
@snormore snormore changed the title from "web, k8s: add agent discovery files and proper 404s" to "web: add agent discovery files" Apr 23, 2026
@snormore snormore changed the title from "web: add agent discovery files" to "web, api, k8s: agent-readiness discovery and proper 404s" Apr 23, 2026
@snormore snormore changed the title from "web, api, k8s: agent-readiness discovery and proper 404s" to "web, api: agent-readiness discovery and proper 404s" Apr 23, 2026

@ben-dz ben-dz left a comment

Contributor

LGTM as long as we're okay with the site as an AI training data source.

@snormore snormore removed the preview label May 11, 2026