web, api: agent-readiness discovery and proper 404s #518
Open
snormore wants to merge 3 commits into
Adds standard agent-readiness discovery surface:
- robots.txt with sitemap pointer
- sitemap.xml of 41 user-reachable pages
- /.well-known/mcp/server-card.json pointing at /api/mcp
- llms.txt summarizing the MCP tools, resources, and site sections

Updates nginx so paths with file extensions return real 404s instead of falling through to the SPA shell (previously a soft-404 for every missing asset), and short-circuits regex evaluation for /api/ to keep proxy paths intact.
🔗 Preview: https://pr-518.data.malbeclabs.com
The Go spaHandler whitelisted a fixed set of static extensions to 404 on miss, and fell back to index.html for everything else. That meant extensionless agent-discovery probes (/.well-known/api-catalog, openid-configuration, oauth-protected-resource, etc.) returned 200 HTML — a soft-404 that flagged the site as not agent-ready. Anything under /.well-known/ is an RFC 8615 discovery endpoint and is never a SPA route, so 404 missing files unconditionally.
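The behavior described in this commit can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in api/main.go: the names spaHandlerSketch, serveIfExists, staticExts, demoFS, and probeStatus are hypothetical, and the extension whitelist is a placeholder because the real list is not shown here.

```go
package main

import (
	"io/fs"
	"mime"
	"net/http"
	"net/http/httptest"
	"path"
	"strings"
	"testing/fstest"
)

// staticExts is a hypothetical whitelist of asset extensions that 404 on a
// miss; the actual list in api/main.go is not shown in this PR description.
var staticExts = map[string]bool{".js": true, ".css": true, ".json": true, ".png": true}

// serveIfExists writes the named file from root and reports whether it was
// found. Directories and invalid names (such as "") count as missing.
func serveIfExists(w http.ResponseWriter, root fs.FS, name string) bool {
	data, err := fs.ReadFile(root, name)
	if err != nil {
		return false
	}
	if ct := mime.TypeByExtension(path.Ext(name)); ct != "" {
		w.Header().Set("Content-Type", ct)
	}
	_, _ = w.Write(data)
	return true
}

// spaHandlerSketch serves existing files, returns a real 404 for missing
// /.well-known/* paths and whitelisted asset extensions, and falls back to
// index.html for everything else.
func spaHandlerSketch(root fs.FS) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		clean := path.Clean("/" + r.URL.Path)
		if serveIfExists(w, root, strings.TrimPrefix(clean, "/")) {
			return
		}
		// Missing file: discovery paths are never SPA routes, so 404 them
		// unconditionally, along with whitelisted static assets.
		if strings.HasPrefix(clean, "/.well-known/") || staticExts[path.Ext(clean)] {
			http.NotFound(w, r)
			return
		}
		// Anything else is assumed to be a client-side route.
		if !serveIfExists(w, root, "index.html") {
			http.NotFound(w, r)
		}
	})
}

// demoFS is an in-memory stand-in for the built web assets.
func demoFS() fs.FS {
	return fstest.MapFS{
		"index.html": {Data: []byte("<!doctype html><div id=root></div>")},
		"robots.txt": {Data: []byte("User-agent: *\nAllow: /\n")},
		".well-known/mcp/server-card.json": {Data: []byte(`{}`)},
	}
}

// probeStatus issues an in-process GET and returns the response status code.
func probeStatus(h http.Handler, target string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, target, nil))
	return rec.Code
}
```

With this sketch, probeStatus(spaHandlerSketch(demoFS()), "/.well-known/api-catalog") yields 404, while "/status" still yields 200 from the index.html fallback.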
Adds Cloudflare-style Content Signals declaring AI usage preferences: search, ai-input (live RAG/assistants), and ai-train are all set to 'yes', since the data is public network telemetry and the platform's posture is agent-friendly. Raises the site's agent-readiness scanner score to Level 2 (Bot-Aware).
ben-dz
approved these changes
Apr 30, 2026
ben-dz
left a comment
Contributor
LGTM as long as we're okay with the site as an ai training data source.
Summary of Changes
- Adds robots.txt, sitemap.xml, /.well-known/mcp/server-card.json, and llms.txt. After deploy the PR preview scores Level 2 (Bot-Aware).
- llms.txt describes the MCP endpoint at /api/mcp and enumerates its tools (execute_sql, execute_cypher, get_schema, read_docs), resources, and prompts so agents can discover them without speaking MCP first.
- Content Signals in robots.txt: search=yes, ai-input=yes, ai-train=yes. The default policy is permissive since the content is public network telemetry and the platform's posture is agent-friendly; flip ai-train to no if that changes.
- Routes /query, /timeline, /dz/ledger, and /dz/shreds/funders are excluded from the sitemap; /query/:id, /chat/:id, and /settings are disallowed in robots.txt.
- Updates spaHandler to 404 any missing /.well-known/* path instead of falling through to the SPA shell. Previously, extensionless agent-discovery probes (api-catalog, openid-configuration, oauth-protected-resource, ucp, web-bot-auth directory) were soft-404s returning 200 HTML, which made the scanner flag them as misconfigured rather than simply missing.
- Marks the nginx /api/ location with ^~ so the new regex can never shadow API requests with file-shaped paths.

Diff Breakdown
Mostly static discovery files and config; the one behavior change is the eight-line /.well-known/* 404 branch in spaHandler.

Key files
- api/main.go: spaHandler now 404s missing /.well-known/* paths instead of returning the SPA shell.
- api/main_test.go: New TestSpaHandler covering existing files, SPA fallback, whitelisted asset 404s, and both extension-bearing and extensionless /.well-known/* 404s (9 cases).
- web/public/llms.txt: Markdown overview of the MCP server, tools, resources, prompts, and main web app sections.
- web/public/sitemap.xml: 41 canonical URLs covering every page reachable from the UI.
- web/public/.well-known/mcp/server-card.json: Standard MCP discovery document pointing agents at /api/mcp.
- k8s/docker/nginx.conf: Regex location 404s missing extension-bearing paths; ^~ on /api/ keeps proxy intact.
- web/public/robots.txt: Content Signals line plus allow/disallow rules and sitemap pointer.

Testing Verification
- TestSpaHandler has 9 subtests covering existing files, SPA fallback, whitelisted static-asset 404s, and /.well-known/* 404s (both extension-bearing and extensionless). All pass locally.
- Validated the nginx config (nginx -t) in an alpine container with the updated file mounted, then spun up nginx against web/dist and confirmed:
  - /, /status, /dz/devices/ABC123 → 200 SPA shell.
  - /robots.txt, /sitemap.xml, /.well-known/mcp/server-card.json, /llms.txt → 200 with the right content-types (text/plain, text/xml, application/json, text/plain).
  - /nope.json, /missing-asset.css → 404 (was 200 HTML before).
- Extensionless discovery probes (/.well-known/api-catalog, openid-configuration, oauth-protected-resource, ucp, web-bot-auth dir) all report clean "not found" now.
- Checked each sitemap.xml entry against the React source to confirm it has at least one non-Route reference (sidebar, tab, popover, or in-page link); removed entries that didn't.
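The manual checks above turn on the difference between a clean 404 and a soft-404 (a 200 response carrying the SPA's HTML). A probe along these lines could automate that distinction. It is a sketch under assumptions: classifyProbe, probeURL, and the result names are hypothetical, and the actual scanner's rules are not documented in this PR.

```go
package main

import (
	"fmt"
	"mime"
	"net/http"
)

// probeResult labels what a discovery probe found.
type probeResult string

const (
	resultOK      probeResult = "ok"
	resultMissing probeResult = "not found"
	resultSoft404 probeResult = "soft-404"
)

// classifyProbe applies an assumed rule: a 404 is a clean miss, a 200 with an
// HTML body is treated as the SPA shell leaking through (a soft-404), and any
// other 200 is a real document.
func classifyProbe(status int, contentType string) probeResult {
	// An unparsable or empty Content-Type leaves mediaType as "".
	mediaType, _, _ := mime.ParseMediaType(contentType)
	switch {
	case status == http.StatusNotFound:
		return resultMissing
	case status == http.StatusOK && mediaType == "text/html":
		return resultSoft404
	case status == http.StatusOK:
		return resultOK
	default:
		return probeResult(fmt.Sprintf("unexpected status %d", status))
	}
}

// probeURL fetches baseURL+path and classifies the response.
func probeURL(baseURL, path string) (probeResult, error) {
	resp, err := http.Get(baseURL + path)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	return classifyProbe(resp.StatusCode, resp.Header.Get("Content-Type")), nil
}
```

Under this rule, classifyProbe(200, "text/html; charset=utf-8") returns "soft-404" and classifyProbe(404, "text/plain") returns "not found". The rule only makes sense for paths that should never be HTML, such as /.well-known/* or asset URLs; ordinary SPA routes legitimately return 200 HTML.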