Summary
When rerunning a site crawl with a different URL normalization policy, stale URLs from previous snapshots can still appear in graph exports for the latest snapshot.
A concrete example is rerunning a crawl with --no-query: query-string variants such as /book-a-call?intent=... can continue to appear as nodes in the exported graph if they were discovered in an earlier snapshot.
Reproduction
- Crawl a site where internal links include query-string CTA URLs:
  crawlith crawl https://example.com/ --limit 60 --depth 2
- Rerun the same site with query stripping enabled:
  crawlith crawl https://example.com/ --limit 60 --depth 2 --no-query
- Export the latest graph:
  crawlith export https://example.com/ --export json,csv --output /tmp/crawlith-export
- Inspect graph.json or nodes.csv.
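To make the inspection step less manual, a small check along these lines could flag any node URL that still carries a query string. This is only a sketch: it works on a plain list of URLs, because the exact layout of graph.json and nodes.csv is not shown here, and the sample URLs (including the intent value) are placeholders rather than real export data.

```python
from urllib.parse import urlsplit


def query_string_urls(urls):
    """Return the URLs that still carry a non-empty query string."""
    return [u for u in urls if urlsplit(u).query]


# Placeholder node URLs; in practice these would be read from the export.
node_urls = [
    "https://example.com/",
    "https://example.com/book-a-call",
    "https://example.com/book-a-call?intent=placeholder",
]

stale = query_string_urls(node_urls)
print(stale)
```

After a --no-query crawl, the expectation described in this report is that such a check finds nothing in the latest export.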
Expected behavior
The latest snapshot export should reflect the latest crawl's normalization policy. If the latest crawl used --no-query, graph nodes for query-string URL variants should not remain from older snapshots.
Actual behavior
Older query-string URL variants can remain in the exported graph because snapshot page loading includes pages first seen in earlier snapshots, even when the selected (latest) snapshot did not see them.
Why this matters
For SEO/internal-link graphing, stale query nodes make visualizations and link metrics noisy. They can inflate orphan/low-inlink counts and make --no-query appear ineffective even though the crawler is normalizing newly discovered links correctly.
Proposed fix
Scope snapshot page queries to pages whose last_seen_snapshot_id matches the selected snapshot, while preserving single snapshot behavior. This keeps graph exports aligned to the current crawl rather than all historical pages for the site.
I'm happy to open a small PR with a repository-level test covering this behavior.
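As a rough illustration of the proposed scoping, and of the kind of test that could cover it, here is a sketch against an in-memory SQLite database. Only last_seen_snapshot_id is taken from this report; the pages table, its other columns, the function names, and the shape of the "unscoped" query are assumptions for illustration and may not match the real repository code.

```python
import sqlite3

# Hypothetical schema: only last_seen_snapshot_id is named in this issue.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pages ("
    " site_id INTEGER, url TEXT,"
    " first_seen_snapshot_id INTEGER, last_seen_snapshot_id INTEGER)"
)
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?, ?)",
    [
        # Seen in snapshot 1 and again in snapshot 2 (the --no-query rerun).
        (1, "https://example.com/", 1, 2),
        # First seen in snapshot 2.
        (1, "https://example.com/book-a-call", 2, 2),
        # Query variant seen only in snapshot 1 (placeholder intent value).
        (1, "https://example.com/book-a-call?intent=placeholder", 1, 1),
    ],
)


def pages_first_seen_up_to(conn, site_id, snapshot_id):
    """One plausible shape of a query that yields the stale nodes."""
    rows = conn.execute(
        "SELECT url FROM pages"
        " WHERE site_id = ? AND first_seen_snapshot_id <= ?"
        " ORDER BY url",
        (site_id, snapshot_id),
    )
    return [r[0] for r in rows]


def pages_for_snapshot(conn, site_id, snapshot_id):
    """Proposed scoping: keep only pages last seen in the selected snapshot."""
    rows = conn.execute(
        "SELECT url FROM pages"
        " WHERE site_id = ? AND last_seen_snapshot_id = ?"
        " ORDER BY url",
        (site_id, snapshot_id),
    )
    return [r[0] for r in rows]


print(pages_first_seen_up_to(conn, 1, 2))  # includes the query variant
print(pages_for_snapshot(conn, 1, 2))      # query variant is excluded
```

In this toy data the scoped query drops the query-string variant for snapshot 2 while still returning it for snapshot 1, which is the behavior a repository-level test would assert.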