feat(wayback): add collapse param to query_cdx for IA CDX deduplication #15
Open
prezis wants to merge 1 commit into
The Internet Archive CDX server natively supports a ``collapse=`` query
param that deduplicates adjacent rows on the server side. Two forms:
- ``collapse=urlkey`` — first snapshot per unique URL within the window
(the canonical "earliest snapshot per URL in this date range" query).
- ``collapse=timestamp:N`` — collapse adjacent rows where the first N
chars of timestamp match. ``timestamp:8`` = one row per YYYYMMDD,
``timestamp:6`` = one per YYYYMM, etc.
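To make the two forms concrete, here is a small client-side simulation of adjacent-row collapsing. It is only a sketch of the semantics described above, not IA's implementation: the rows are invented, and ``collapse_adjacent`` is a hypothetical helper that does not exist in this PR.

```python
from itertools import groupby

# Invented (timestamp, urlkey) rows, ordered by urlkey then timestamp.
rows = [
    ("20200101120000", "com,example)/a"),
    ("20200101180000", "com,example)/a"),
    ("20200315090000", "com,example)/a"),
    ("20200102000000", "com,example)/b"),
]


def collapse_adjacent(rows, key):
    """Keep the first row of each run of ADJACENT rows that share key(row)."""
    # groupby only merges neighbouring items, which mirrors "adjacent rows".
    return [next(group) for _, group in groupby(rows, key=key)]


# Analogue of collapse=urlkey: first snapshot per run of identical urlkeys.
by_urlkey = collapse_adjacent(rows, key=lambda r: r[1])

# Analogue of collapse=timestamp:8: one row per run sharing a YYYYMMDD prefix.
by_day = collapse_adjacent(rows, key=lambda r: r[0][:8])

print(len(by_urlkey), len(by_day))  # prints "2 3"
```

Because only neighbouring rows are merged, the result depends on row order, which is why the last row survives the ``timestamp:8`` analogue even though its date falls between the earlier ones.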
Without server-side collapse, callers do the year-widen-then-client-filter
dance: pull every snapshot in a wide window, dedupe in Python, throw most
of them away. Pushing this filter to IA cuts response sizes 10-100x for
"earliest per URL" workloads.
The param is OPTIONAL; default behaviour (collapse=None) is unchanged.
Field names are validated against the CDX schema (urlkey, timestamp,
original, statuscode, mimetype, digest, length); the optional :N must be
a positive integer.
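The validation rules above could be implemented roughly as follows. This is a minimal sketch under assumptions: the PR's actual ``_validate_collapse`` is not shown here, so the name ``validate_collapse_sketch``, the ``CDX_FIELDS`` constant, and the error messages are all hypothetical.

```python
# Field names taken from the CDX schema listed in the commit message.
CDX_FIELDS = frozenset(
    {"urlkey", "timestamp", "original", "statuscode", "mimetype", "digest", "length"}
)


def validate_collapse_sketch(collapse):
    """Return collapse unchanged if it is None, 'field' or 'field:N'.

    Raises ValueError for an unknown field or a non-positive / non-integer N.
    Hypothetical stand-in for the PR's _validate_collapse.
    """
    if collapse is None:
        return None
    field, sep, n = collapse.partition(":")
    if field not in CDX_FIELDS:
        raise ValueError(f"unknown CDX field {field!r} in collapse={collapse!r}")
    if sep:
        # isdecimal() rejects '', '-1', '1.5' and 'abc'; int(n) <= 0 rejects '0'.
        if not (n.isascii() and n.isdecimal()) or int(n) <= 0:
            raise ValueError(f"collapse N must be a positive integer, got {n!r}")
    return collapse
```

Checking the field before N means an input such as ``bogus:4`` is reported as an unknown field rather than passing on the strength of a valid N.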
Adds 7 unit tests in tests/test_wayback.py covering: omission default,
bare-field, ``field:N`` form, and 4 validation rejections (unknown field,
unknown field with N, non-int N, zero N).
Reference: https://archive.org/developers/wayback-cdx-server.html#collapsing
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why
The Internet Archive CDX server natively supports a ``collapse=`` query param that deduplicates adjacent rows on the server side. Two forms:
- ``collapse=urlkey``: first snapshot per unique URL within the window. Canonical use: "earliest snapshot per URL in this date range".
- ``collapse=timestamp:N``: collapse adjacent rows where the first N chars of timestamp match. ``timestamp:8`` = one row per YYYYMMDD, ``timestamp:6`` = one per YYYYMM, etc.
Without server-side collapse, callers do the year-widen-then-client-filter dance: pull every snapshot in a wide window, dedupe in Python, throw most of them away. Pushing this filter to IA cuts response sizes 10-100x for "earliest per URL" workloads and is much friendlier to IA.
Real-world example: wojak-wojtek's ``fetch_ism_pmi_historical.py:wayback_cdx_query`` adapter does exactly this client-side filter today. With this PR it can pass ``collapse="urlkey"`` and let IA do the work.
Behaviour
- ``collapse: str | None = None``: optional, default unchanged. Existing callers see no difference.
- ``query_cdx(..., collapse="urlkey")`` → URL gains ``&collapse=urlkey``.
- ``query_cdx(..., collapse="timestamp:8")`` → URL gains ``&collapse=timestamp%3A8`` (colon URL-quoted).
Validation
``_validate_collapse`` rejects bad inputs at URL-build time so users get a Python ``ValueError`` with a helpful message instead of an opaque CDX 500:
- Field name must be one of ``{urlkey, timestamp, original, statuscode, mimetype, digest, length}`` (the CDX schema).
- The optional ``:N`` tail must be a positive integer.
Reference
https://archive.org/developers/wayback-cdx-server.html#collapsing
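The URL behaviour described in this PR (no clause by default, ``&collapse=urlkey``, and the colon quoted as ``%3A``) can be reproduced with the standard library alone. The sketch below is an assumption about shape, not the PR's code: ``build_cdx_url_sketch`` and its parameter list are hypothetical, and the real ``query_cdx`` may assemble its URL differently.

```python
from urllib.parse import urlencode

# Public Wayback CDX search endpoint.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url_sketch(url, collapse=None):
    """Build a CDX query URL, appending collapse only when it is given."""
    params = [("url", url), ("output", "json")]
    if collapse is not None:
        # urlencode percent-quotes ':' so "timestamp:8" becomes "timestamp%3A8".
        params.append(("collapse", collapse))
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


print(build_cdx_url_sketch("example.com"))
# https://web.archive.org/cdx/search/cdx?url=example.com&output=json
print(build_cdx_url_sketch("example.com", collapse="timestamp:8"))
# https://web.archive.org/cdx/search/cdx?url=example.com&output=json&collapse=timestamp%3A8
```

Appending the pair only when ``collapse`` is not ``None`` is what keeps existing callers' URLs byte-for-byte unchanged.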
Tests
Adds 7 unit tests in ``tests/test_wayback.py``:
- ``test_build_cdx_url_collapse_omitted_by_default``: None default emits no clause.
- ``test_build_cdx_url_collapse_bare_field``: ``"urlkey"`` produces ``collapse=urlkey``.
- ``test_build_cdx_url_collapse_field_with_n``: ``"timestamp:8"`` is preserved (URL-quoted or raw).
Full wayback suite: 23/23 passing (16 prior + 7 new). No regressions.
Backwards-compat
Param is optional with a sensible default (``None``). No callers need to change.
🤖 Generated with Claude Code