fix(core): scope graph exports to selected snapshot pages #104

Open
Noctivoro wants to merge 1 commit into Crawlith:main from Noctivoro:fix/snapshot-scoped-page-exports

Summary

Thank you for maintaining Crawlith — I ran into a small snapshot/export edge case while using Crawlith for SEO internal-link graphing and opened #103 with the reproduction details.

This PR scopes page loading for non-single snapshots to pages whose last_seen_snapshot_id matches the selected snapshot. That keeps graph exports aligned with the current crawl's normalization policy instead of including pages that were only seen in older snapshots.

Fixes #103.

Problem

If a site is first crawled with query strings preserved and later crawled with --no-query, the latest export can still include older query-URL nodes such as:

/book-a-call?intent=header-cta
/free-audit?intent=home-hero

The crawler is normalizing new discoveries correctly, but loadGraphFromSnapshot() ultimately relies on page repository snapshot queries that include pages first seen in older snapshots. That makes --no-query appear ineffective in exported graph nodes.
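To make the failure mode concrete, here is a toy model of two crawls writing into one shared page table. It is only a sketch: the `PageRow` shape, `stripQuery()`, and `recordCrawl()` are hypothetical stand-ins and not Crawlith's actual schema or normalization code. It assumes each URL is stored once and its last-seen snapshot is bumped whenever a later crawl sees it again.

```typescript
// Toy model, not Crawlith's real schema: one row per URL.
interface PageRow {
  url: string;
  firstSeenSnapshotId: number;
  lastSeenSnapshotId: number;
}

// Hypothetical stand-in for --no-query normalization:
// drop everything from the first "?" onward (fragments are ignored here).
function stripQuery(url: string): string {
  const q = url.indexOf("?");
  return q === -1 ? url : url.slice(0, q);
}

// Upsert the URLs discovered during one snapshot into the shared table.
function recordCrawl(
  table: Map<string, PageRow>,
  snapshotId: number,
  discovered: string[],
  noQuery: boolean,
): void {
  for (const raw of discovered) {
    const url = noQuery ? stripQuery(raw) : raw;
    const existing = table.get(url);
    if (existing) {
      // Seen before: only the last-seen marker moves forward.
      existing.lastSeenSnapshotId = snapshotId;
    } else {
      table.set(url, {
        url,
        firstSeenSnapshotId: snapshotId,
        lastSeenSnapshotId: snapshotId,
      });
    }
  }
}

const table = new Map<string, PageRow>();
const links = ["/", "/book-a-call?intent=header-cta", "/free-audit?intent=home-hero"];

recordCrawl(table, 1, links, false); // first crawl keeps query strings
recordCrawl(table, 2, links, true); // second crawl runs with --no-query

// An unscoped read returns every row, including query URLs last seen in snapshot 1.
const allUrls = Array.from(table.values()).map((p) => p.url);
```

In this model the second crawl never creates a query URL, yet the two rows from snapshot 1 are still in the table, so any loader that does not filter on the last-seen snapshot will hand them to the exporter.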

Changes

  • Scope getPagesBySnapshot() to p.last_seen_snapshot_id = ? for non-single snapshots.
  • Scope getPagesIteratorBySnapshot() the same way for graph loading/export.
  • Scope getPagesIdentityBySnapshot() to the current snapshot so edge materialization uses the selected snapshot's page set.
  • Preserve existing single snapshot behavior, which is still metrics-scoped.
  • Add a DB-layer regression test covering stale query URL exclusion on a later snapshot.
  • Add a changeset for @crawlith/core.
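The scoping described in the list above can be sketched in memory as follows. This is an illustration under assumptions, not the repository code: `SnapshotPage`, `SnapshotEdge`, `selectPagesForSnapshot()`, and `selectEdgesForPages()` are hypothetical names, and the only detail taken from the PR is the `last_seen_snapshot_id = ?` condition. Dropping an edge when either endpoint is out of scope is my reading of "edge materialization uses the selected snapshot's page set".

```typescript
// Hypothetical row shapes; only the last-seen column is named in this PR.
interface SnapshotPage {
  id: number;
  normalizedUrl: string;
  lastSeenSnapshotId: number;
}

interface SnapshotEdge {
  sourceId: number;
  targetId: number;
}

// In-memory equivalent of `WHERE p.last_seen_snapshot_id = ?`.
function selectPagesForSnapshot(
  rows: readonly SnapshotPage[],
  snapshotId: number,
): SnapshotPage[] {
  return rows.filter((p) => p.lastSeenSnapshotId === snapshotId);
}

// Assumption: an edge is kept only if both endpoints are in the scoped page set.
function selectEdgesForPages(
  edges: readonly SnapshotEdge[],
  pages: readonly SnapshotPage[],
): SnapshotEdge[] {
  const ids = new Set(pages.map((p) => p.id));
  return edges.filter((e) => ids.has(e.sourceId) && ids.has(e.targetId));
}

const rows: SnapshotPage[] = [
  { id: 1, normalizedUrl: "/", lastSeenSnapshotId: 2 },
  { id: 2, normalizedUrl: "/book-a-call?intent=header-cta", lastSeenSnapshotId: 1 },
  { id: 3, normalizedUrl: "/book-a-call", lastSeenSnapshotId: 2 },
];
const edges: SnapshotEdge[] = [
  { sourceId: 1, targetId: 2 }, // points at the stale query URL
  { sourceId: 1, targetId: 3 },
];

const scopedPages = selectPagesForSnapshot(rows, 2);
const scopedEdges = selectEdgesForPages(edges, scopedPages);
```

With snapshot 2 selected, the stale query-URL row and the edge pointing at it both fall out of the export, which is the behavior the regression test is meant to lock in.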

Validation

Ran locally:

pnpm run lint
pnpm test
pnpm build

Also verified manually against a real crawl workflow:

  1. Created a dirty snapshot with query URLs preserved.
  2. Reran the same site with --no-query without cleaning the DB.
  3. Exported the latest graph.
  4. Confirmed queryUrlNodes: 0 in the exported graph.json.
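For anyone repeating the manual check, a count like queryUrlNodes can be computed with a small helper along these lines. This is a sketch only: the `ExportedGraph` shape (a `nodes` array with a `url` field) is an assumption about graph.json, and `hasQueryString()` and `countQueryUrlNodes()` are hypothetical helpers, not part of Crawlith.

```typescript
// Assumed export shape; the real graph.json schema may differ.
interface ExportedGraph {
  nodes: { url: string }[];
}

// True when the URL carries a query string; a "?" inside the fragment is ignored.
function hasQueryString(url: string): boolean {
  const hash = url.indexOf("#");
  const beforeFragment = hash === -1 ? url : url.slice(0, hash);
  return beforeFragment.indexOf("?") !== -1;
}

// Number of exported nodes whose URL still has a query string.
function countQueryUrlNodes(graph: ExportedGraph): number {
  return graph.nodes.filter((n) => hasQueryString(n.url)).length;
}

// Sample data mirroring the URLs from the Problem section.
const dirtyExport: ExportedGraph = {
  nodes: [
    { url: "/" },
    { url: "/book-a-call?intent=header-cta" },
    { url: "/free-audit?intent=home-hero" },
  ],
};
const cleanExport: ExportedGraph = {
  nodes: [{ url: "/" }, { url: "/book-a-call" }, { url: "/free-audit" }],
};

const dirtyCount = countQueryUrlNodes(dirtyExport);
const cleanCount = countQueryUrlNodes(cleanExport);
```

A zero count on the post-fix export, next to a non-zero count on an export taken before the fix, is the signal the manual steps above rely on.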

Thanks again — happy to adjust the query semantics if you'd prefer a different snapshot-scoping approach.

Linked issue that merging this pull request may close: Graph exports can retain stale query URL nodes after rerunning with --no-query
