feat(wayback): add collapse param to query_cdx for IA CDX deduplication #15

Open
prezis wants to merge 1 commit into main from feat/wayback-cdx-collapse-param

prezis (Owner) commented May 2, 2026

Why

The Internet Archive CDX server natively supports a collapse= query param that deduplicates adjacent rows on the server side. Two forms:

  • collapse=urlkey — first snapshot per unique URL within the window. Canonical use: "earliest snapshot per URL in this date range".
  • collapse=timestamp:N — collapse adjacent rows where the first N chars of timestamp match. timestamp:8 = one row per YYYYMMDD, timestamp:6 = one per YYYYMM, etc.
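
To make the "adjacent rows" semantics concrete, here is a minimal local simulation of the two forms. It is a sketch only: the rows are made-up sample data, `collapse_adjacent` is a hypothetical helper that is not part of this PR, and the real collapsing happens on the IA server.

```python
from itertools import groupby

# Made-up sample rows as (urlkey, timestamp), in the order a CDX query
# would typically return them (sorted by urlkey, then timestamp).
rows = [
    ("com,example)/a", "20200101120000"),
    ("com,example)/a", "20200101180000"),
    ("com,example)/a", "20200102090000"),
    ("com,example)/b", "20200102100000"),
]


def collapse_adjacent(rows, field_index, n=None):
    """Keep the first row of every run of adjacent rows that share a key.

    The key is the chosen field, optionally cut to its first n characters,
    which mirrors the collapse=field and collapse=field:N forms.
    """
    def key(row):
        value = row[field_index]
        return value if n is None else value[:n]

    return [next(group) for _, group in groupby(rows, key=key)]


by_url = collapse_adjacent(rows, 0)      # like collapse=urlkey
by_day = collapse_adjacent(rows, 1, 8)   # like collapse=timestamp:8
```

In this simulation `by_day` drops the `/b` row, because it sits directly after an `/a` row with the same YYYYMMDD prefix. Collapsing only compares adjacent rows, so it is not a per-URL grouping.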

Without server-side collapse, callers do the year-widen-then-client-filter dance: pull every snapshot in a wide window, dedupe in Python, throw most of them away. Pushing this filter to IA cuts response sizes 10-100x for "earliest per URL" workloads and is much friendlier to IA.

Real-world example: wojak-wojtek's fetch_ism_pmi_historical.py:wayback_cdx_query adapter does exactly this client-side filter today. With this PR it can pass collapse="urlkey" and let IA do the work.

Behaviour

  • collapse: str | None = None is optional and the default is unchanged. Existing callers see no difference.
  • query_cdx(..., collapse="urlkey") → URL gains &collapse=urlkey.
  • query_cdx(..., collapse="timestamp:8") → URL gains &collapse=timestamp%3A8 (colon URL-quoted).
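
The URL behaviour above can be sketched with the standard library alone. This is an illustration under assumptions, not the code in this PR: `sketch_build_cdx_url` and its parameters are hypothetical names, and the real builder may order or quote parameters differently.

```python
from urllib.parse import urlencode

# Public Wayback CDX endpoint.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def sketch_build_cdx_url(url, from_ts=None, to_ts=None, collapse=None):
    """Hypothetical stand-in for the URL builder behind query_cdx."""
    params = {"url": url, "output": "json"}
    if from_ts is not None:
        params["from"] = from_ts
    if to_ts is not None:
        params["to"] = to_ts
    # Only emit the clause when the caller asked for it, so the default
    # (collapse=None) produces the same URL as before.
    if collapse is not None:
        params["collapse"] = collapse
    # urlencode percent-quotes the colon in "timestamp:8" as %3A.
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


default_url = sketch_build_cdx_url("example.com")
urlkey_url = sketch_build_cdx_url("example.com", collapse="urlkey")
daily_url = sketch_build_cdx_url("example.com", collapse="timestamp:8")
```

Here `urlkey_url` ends in `&collapse=urlkey` and `daily_url` ends in `&collapse=timestamp%3A8`, while `default_url` contains no collapse clause.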

Validation

_validate_collapse rejects bad inputs at URL-build time so users get a Python ValueError with a helpful message instead of an opaque CDX 500:

  • Field must be one of {urlkey, timestamp, original, statuscode, mimetype, digest, length} (the CDX schema).
  • The optional :N tail must be a positive integer.
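
A minimal sketch of these two rules follows. The function name `validate_collapse_sketch` and the error messages are assumptions made for illustration. The actual `_validate_collapse` in this PR may be structured differently.

```python
# Field names from the CDX schema, as listed above.
CDX_FIELDS = frozenset(
    {"urlkey", "timestamp", "original", "statuscode", "mimetype", "digest", "length"}
)


def validate_collapse_sketch(collapse):
    """Return collapse unchanged if it is well-formed, else raise ValueError."""
    # Split "field:N" into its parts; sep is "" when there is no colon.
    field, sep, tail = collapse.partition(":")
    if field not in CDX_FIELDS:
        raise ValueError(
            f"collapse field {field!r} is not a CDX field; "
            f"expected one of {sorted(CDX_FIELDS)}"
        )
    if sep:
        # The tail must be ASCII digits and must not evaluate to zero.
        if not (tail.isascii() and tail.isdigit()) or int(tail) < 1:
            raise ValueError(
                f"collapse width {tail!r} must be a positive integer"
            )
    return collapse
```

Checking before the request is sent means a typo such as `collapse="timestmp:8"` fails immediately with a message naming the bad field, rather than surfacing as a server error.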

Reference

https://archive.org/developers/wayback-cdx-server.html#collapsing

Tests

Adds 7 unit tests in tests/test_wayback.py:

  • test_build_cdx_url_collapse_omitted_by_default: None default emits no clause.
  • test_build_cdx_url_collapse_bare_field: "urlkey" produces collapse=urlkey.
  • test_build_cdx_url_collapse_field_with_n: "timestamp:8" is preserved (URL-quoted or raw).
  • 4 validation-rejection tests covering unknown field, unknown field with N, non-int N, and zero N.

Full wayback suite: 23/23 passing (16 prior + 7 new). No regressions.

Backwards-compat

Param is optional with a sensible default (None). No callers need to change.

🤖 Generated with Claude Code

The Internet Archive CDX server natively supports a ``collapse=`` query
param that deduplicates adjacent rows on the server side. Two forms:

  - ``collapse=urlkey`` — first snapshot per unique URL within the window
    (the canonical "earliest snapshot per URL in this date range" query).
  - ``collapse=timestamp:N`` — collapse adjacent rows where the first N
    chars of timestamp match. ``timestamp:8`` = one row per YYYYMMDD,
    ``timestamp:6`` = one per YYYYMM, etc.

Without server-side collapse, callers do the year-widen-then-client-filter
dance: pull every snapshot in a wide window, dedupe in Python, throw most
of them away. Pushing this filter to IA cuts response sizes 10-100x for
"earliest per URL" workloads.

The param is OPTIONAL; default behaviour (collapse=None) is unchanged.
Field names are validated against the CDX schema (urlkey, timestamp,
original, statuscode, mimetype, digest, length); the optional :N must be
a positive integer.

Adds 7 unit tests in tests/test_wayback.py covering: omission default,
bare-field, ``field:N`` form, and 4 validation rejections (unknown field,
unknown field with N, non-int N, zero N).

Reference: https://archive.org/developers/wayback-cdx-server.html#collapsing

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>