feat(portal): optimize portal discovery for AI agents and crawlers #60
Merged
Add a Content-Signal directive (draft-romm-aipref-contentsignals) to robots.txt to opt out of AI training while keeping classic search indexing and live agent retrieval enabled.
- search=yes: keep visibility in Google/Bing
- ai-train=no: data changes too frequently to be safely frozen in a training corpus
- ai-input=yes: stay reachable for live RAG (Perplexity, ChatGPT browse, Claude web search)
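As a rough illustration of the three signals above, the directive line could be serialized as follows. The helper name `contentSignalLine` and the `Preferences` type are hypothetical and not taken from this PR; the exact serialization should be checked against the draft.

```typescript
// Hypothetical helper (not from the PR): serializes content-signal
// preferences into a single robots.txt directive line.
type Signal = "search" | "ai-train" | "ai-input";
type Preferences = Record<Signal, boolean>;

function contentSignalLine(prefs: Preferences): string {
  // Fixed order so the output is stable across requests.
  const order: Signal[] = ["search", "ai-train", "ai-input"];
  const pairs = order.map((name) => `${name}=${prefs[name] ? "yes" : "no"}`);
  return `Content-Signal: ${pairs.join(", ")}`;
}

const contentSignal = contentSignalLine({
  search: true,
  "ai-train": false,
  "ai-input": true,
});
console.log(contentSignal);
```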
Adds /.well-known/security.txt advertising a contact mailbox for security disclosures. Expires is recomputed at each request (now + 1 year) so the file never falls out of RFC compliance without a deploy. Served regardless of allowRobots / draft status — the security contact must remain reachable on any deployed portal.
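The per-request Expires computation can be sketched like this. It is a minimal sketch, not the route's actual code: `buildSecurityTxt` is a hypothetical name, the contact value is a placeholder, and the `now` parameter exists only to make the example deterministic.

```typescript
// Hypothetical sketch: builds a security.txt body whose Expires field is
// always one year after the moment the request is served.
function buildSecurityTxt(contacts: string[], now: Date = new Date()): string {
  const expires = new Date(now.getTime());
  expires.setUTCFullYear(expires.getUTCFullYear() + 1);
  const lines = contacts.map((contact) => `Contact: ${contact}`);
  // toISOString() yields an RFC 3339 compatible UTC timestamp.
  lines.push(`Expires: ${expires.toISOString()}`);
  return lines.join("\n") + "\n";
}

// Placeholder contact, for illustration only.
const securityTxt = buildSecurityTxt(
  ["mailto:security@example.org"],
  new Date("2025-01-15T10:00:00.000Z"),
);
console.log(securityTxt);
```

Because the date is derived from the request time, the body never needs a redeploy to stay valid, at the cost of the response changing on every request.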
Adds /.well-known/change-password redirecting to the simple-directory password-change flow. Used by password managers (1Password, Bitwarden, Apple Passwords, Chrome) to auto-navigate users when triggered from the saved-passwords UI. Returns 404 when authentication is disabled on the portal or when the portal is a draft — no point advertising a password flow that has no accounts to manage.
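The gating described above amounts to a small decision function. The sketch below is an assumption about its shape: `PortalState`, `changePasswordResponse`, and the `"optional"` authentication value are hypothetical, while the redirect target and the `'none'` value are the ones named in the PR description.

```typescript
// Hypothetical types; only the fields needed for the decision are modeled.
interface PortalState {
  authentication: string; // 'none' disables accounts on the portal
  draft: boolean;
}

type WellKnownResponse =
  | { status: 302; location: string }
  | { status: 404 };

function changePasswordResponse(portal: PortalState): WellKnownResponse {
  // No accounts to manage: do not advertise a password flow.
  if (portal.draft || portal.authentication === "none") {
    return { status: 404 };
  }
  return {
    status: 302,
    location: "/simple-directory/login?action=changePassword",
  };
}

const published = changePasswordResponse({ authentication: "optional", draft: false });
const noAuth = changePasswordResponse({ authentication: "none", draft: false });
const draftPortal = changePasswordResponse({ authentication: "optional", draft: true });
console.log(published, noAuth, draftPortal);
```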
Adds /.well-known/api-catalog returning application/linkset+json. The linkset starts with the global data-fair API entry (service-desc: OpenAPI spec, service-doc: /catalog-api-doc, status: /ping, collection: /datasets) and then enumerates one entry per dataset published on the portal, each pointing to its own filtered OpenAPI spec and human-doc page. Each dataset is genuinely a distinct API surface (filtered actions, dedicated OpenAPI spec), so enumeration honours the RFC 9727 model of listing APIs rather than resources. Capped at 1000 entries (same limit as sitemap.xml). Gated by allowRobots and draft like sitemap / robots.txt: hidden portals return 404.

Also adds e2e coverage in seo-indexing for the three new well-known endpoints (security.txt, change-password, api-catalog).
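The shape of such a linkset document might look roughly like the sketch below. Everything path-related is an assumption: `apiRoot`, the `api-docs.json` spec paths, and the per-dataset `api-doc` page are hypothetical placeholders, and the sketch applies the 1000-entry cap to the dataset entries only, which the commit message does not specify.

```typescript
// Minimal linkset types: each context object has an anchor plus
// relation-type keys mapping to arrays of link targets.
interface LinkTarget {
  href: string;
}
interface LinkContext {
  anchor: string;
  [rel: string]: LinkTarget[] | string;
}

const MAX_ENTRIES = 1000; // same limit as sitemap.xml, per the commit message

function buildApiCatalog(apiRoot: string, datasetIds: string[]): { linkset: LinkContext[] } {
  const root: LinkContext = {
    anchor: apiRoot,
    "service-desc": [{ href: `${apiRoot}/api-docs.json` }], // hypothetical spec path
    "service-doc": [{ href: "/catalog-api-doc" }],
    status: [{ href: `${apiRoot}/ping` }],
    collection: [{ href: `${apiRoot}/datasets` }],
  };
  const datasets = datasetIds.slice(0, MAX_ENTRIES).map((id): LinkContext => {
    const base = `${apiRoot}/datasets/${encodeURIComponent(id)}`;
    return {
      anchor: base,
      "service-desc": [{ href: `${base}/api-docs.json` }], // hypothetical spec path
      "service-doc": [{ href: `/datasets/${encodeURIComponent(id)}/api-doc` }], // hypothetical page
    };
  });
  return { linkset: [root, ...datasets] };
}

const manyIds = Array.from({ length: 1500 }, (_, i) => `dataset-${i}`);
const catalog = buildApiCatalog("/data-fair/api/v1", manyIds);
console.log(catalog.linkset.length);
```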
The robots.txt response was refactored to publish an explicit Allow-list of public sections followed by a fallback Disallow: / (so unknown paths are blocked rather than reachable). The legacy assertions still expected a single Allow: / with no Disallow, which always fails on the new output. Update the indexable-portal assertions to check for representative Allow rules (Allow: /$, Allow: /datasets) and drop the Disallow exclusion. On the hidden-portal side, additionally assert no Allow rule leaks through.
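The new output shape and the updated assertions can be approximated as follows. This is a sketch under assumptions: `buildRobotsTxt` and the section list are hypothetical, and the real route may emit additional rules.

```typescript
// Hypothetical sketch: explicit Allow-list for public sections, followed by
// a fallback "Disallow: /" so unknown paths are blocked.
function buildRobotsTxt(indexable: boolean, publicSections: string[]): string {
  const lines = ["User-agent: *"];
  if (indexable) {
    lines.push("Allow: /$"); // home page only ("$" anchors the end of the path)
    for (const section of publicSections) {
      lines.push(`Allow: ${section}`);
    }
  }
  lines.push("Disallow: /");
  return lines.join("\n") + "\n";
}

const indexableRobots = buildRobotsTxt(true, ["/datasets", "/applications"]);
const hiddenRobots = buildRobotsTxt(false, ["/datasets", "/applications"]);

// Assertions in the spirit of the updated tests.
const hasHomeRule = indexableRobots.includes("Allow: /$");
const hasDatasetsRule = indexableRobots.includes("Allow: /datasets");
const hiddenLeaksAllow = /^Allow:/m.test(hiddenRobots);
console.log(hasHomeRule, hasDatasetsRule, hiddenLeaksAllow);
```

The Allow-list works alongside the blanket Disallow because robots.txt matching picks the most specific (longest) matching rule, so `/datasets` stays crawlable while unlisted paths fall through to `Disallow: /`.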
Align with sitemap.xml.ts which lets errors propagate to h3's default 500 handler instead of returning a partial linkset.
Surface https://github.com/data-fair as a stable Contact, plus the portal's /contact page when it has actually been configured (same mongo existence check as the sitemap route). The private contactInformations.email is intentionally never exposed.
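The contact selection reduces to a short rule, sketched here with a hypothetical `securityContacts` helper. The `hasContactPage` flag stands in for the mongo existence check, which is not reproduced, and the origin is a placeholder.

```typescript
// Hypothetical sketch: the GitHub organization is always listed; the portal's
// /contact page is added only when it has been configured.
function securityContacts(portalOrigin: string, hasContactPage: boolean): string[] {
  const contacts = ["https://github.com/data-fair"];
  if (hasContactPage) {
    contacts.push(`${portalOrigin}/contact`);
  }
  // contactInformations.email is deliberately never added here.
  return contacts;
}

const withPage = securityContacts("https://portal.example.org", true);
const withoutPage = securityContacts("https://portal.example.org", false);
console.log(withPage, withoutPage);
```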
- `Content-Signal` header (search=yes, ai-train=no, ai-input=yes)
- `/.well-known/security.txt` (RFC 9116) with a 1-year `Expires` and two Contact lines: `https://github.com/data-fair` (always) and the portal's `/contact` page when that page is actually configured (same mongo existence check as the sitemap route); `contactInformations.email` is intentionally kept private
- `/.well-known/change-password` (W3C): 302 to `/simple-directory/login?action=changePassword` when authentication is enabled, 404 when `authentication: 'none'` or for drafts
- `/.well-known/api-catalog` (RFC 9727) as `application/linkset+json`, anchored on the data-fair root API + one entry per published dataset; 404 when `allowRobots: false` or for drafts; data-fair fetch errors propagate as 500 (aligned on `sitemap.xml.ts`)
- `application-card` elevation/rounded fallback on `portalConfig.defaults` to match `news-card`, `reuse-card`, `dataset-metadata`, etc.
- `.well-known` endpoints