
Add downsample and upsample transforms#258

Open
david-rx wants to merge 1 commit into main from david/downsample-upsample

Conversation

@david-rx
Contributor

@david-rx david-rx commented Apr 1, 2026

Adds two transforms: one that randomly downsamples a dataset to a given fraction, and one that upsamples it by a given factor. These could live either here or in NatureLM-specific code.

Summary

  • Downsample: New transform that randomly keeps a fraction of rows. Configured via fraction (0, 1] and seed. Useful for quick dev iterations or ablation experiments.
  • Upsample: New transform that repeats every row N times via concatenation. Configured via factor (>= 2). Useful for epoch-matching small datasets with larger ones during training.
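As a rough illustration of the two behaviours described above, here is a minimal sketch on a plain Python list of rows. It is not the code in this PR: the list stands in for the DataBackend, the function names `downsample` and `upsample` are hypothetical, and `random.Random(seed).sample` is swapped in for the backend's own row sampling. The row count uses the same `max(1, round(len * fraction))` rule as the diff.

```python
import random


def downsample(rows, fraction, seed=0):
    """Keep a random subset of about `fraction` of the rows (at least one row)."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Same row-count rule as the diff: never round down to zero rows.
    n = max(1, round(len(rows) * fraction))
    # A seeded generator makes the selection reproducible across runs.
    return random.Random(seed).sample(rows, n)


def upsample(rows, factor):
    """Repeat the whole dataset `factor` times by concatenation."""
    if factor < 2:
        raise ValueError("factor must be >= 2")
    return rows * factor


rows = list(range(10))
kept = downsample(rows, fraction=0.3, seed=42)  # 3 of the 10 rows
repeated = upsample(rows, factor=3)             # 30 rows, each original row 3 times
```

Under these assumptions, a very small fraction still yields one row, and upsampling preserves every original row exactly `factor` times.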

Both transforms follow the existing esp-data conventions (Pydantic config with Literal type discriminator, from_config classmethod, DataBackend protocol, register_transform registration).
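The convention named above might look roughly like the following sketch. To keep it dependency-free, a stdlib dataclass is used in place of a Pydantic model, and `register_transform`, `_REGISTRY`, `build_transform`, and the `Config` class attribute are hypothetical stand-ins; the real esp-data signatures may differ.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Literal

# Hypothetical registry; stands in for esp-data's real registration mechanism.
_REGISTRY: Dict[str, type] = {}


def register_transform(name: str) -> Callable[[type], type]:
    """Hypothetical decorator that records a transform class under `name`."""
    def decorator(cls: type) -> type:
        _REGISTRY[name] = cls
        return cls
    return decorator


@dataclass(frozen=True)
class UpsampleConfig:
    """Dataclass stand-in for the Pydantic config; `type` is the discriminator."""
    factor: int
    type: Literal["upsample"] = "upsample"

    def __post_init__(self) -> None:
        if self.factor < 2:
            raise ValueError("factor must be >= 2")


@register_transform("upsample")
class Upsample:
    Config = UpsampleConfig  # assumption: lets the builder find the config class

    def __init__(self, factor: int) -> None:
        self.factor = factor

    @classmethod
    def from_config(cls, cfg: UpsampleConfig) -> "Upsample":
        return cls(factor=cfg.factor)

    def __call__(self, rows: list) -> tuple:
        # Returns the transformed data and an empty metadata dict.
        return rows * self.factor, {}


def build_transform(raw: Dict[str, Any]):
    """Look up the transform by its `type` field and build it from config."""
    transform_cls = _REGISTRY[raw["type"]]
    return transform_cls.from_config(transform_cls.Config(**raw))


from_cfg = build_transform({"type": "upsample", "factor": 3})
manual = Upsample(factor=3)
```

In this sketch the config-built and manually built transforms behave identically, which is the parity the test plan checks.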

Test plan

  • test_downsample.py: fraction accuracy, full fraction (1.0), small fraction floor (>= 1 row), manual vs config parity, config validation — all parametrized over pandas and polars backends
  • test_upsample.py: length correctness, row preservation, manual vs config parity, config validation — all parametrized over pandas and polars backends
  • All 16 tests pass locally
  • Pre-commit hooks pass (ruff, codespell, trailing whitespace, etc.)

Made with Cursor

Downsample randomly keeps a fraction of rows (useful for dev/ablation).
Upsample repeats every row N times (useful for epoch-matching small datasets).

Both follow the existing transform conventions: Pydantic config with
Literal type discriminator, from_config classmethod, DataBackend protocol,
and register_transform registration. Tests cover both pandas and polars
backends.

Made-with: Cursor
@david-rx david-rx requested a review from a team as a code owner April 1, 2026 04:18
@david-rx david-rx requested a review from GaganNarula April 1, 2026 04:49
The downsampled backend and empty metadata dict.
"""
n = max(1, round(len(backend) * self.fraction))
return backend.sample_rows(n=n, seed=self.seed), {}
Collaborator

@GaganNarula GaganNarula Apr 7, 2026


If fraction == 1.0, maybe we should just return the backend unchanged? I'm not sure whether the backends shuffle the rows even when fraction == 1.0 (polars sets shuffle = False by default).
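One way the suggested shortcut could look, sketched on a plain list rather than the real backend (the function name `downsample_rows` is hypothetical, and `random.Random(seed).sample` stands in for `backend.sample_rows`):

```python
import random


def downsample_rows(rows, fraction, seed=0):
    """Downsample, but skip sampling entirely when the full dataset is requested."""
    if fraction == 1.0:
        # Nothing to drop: return the input untouched so row order is preserved
        # regardless of whether the sampler would shuffle.
        return rows
    n = max(1, round(len(rows) * fraction))
    return random.Random(seed).sample(rows, n)


rows = ["a", "b", "c", "d"]
same = downsample_rows(rows, fraction=1.0)
half = downsample_rows(rows, fraction=0.5, seed=7)
```

The early return makes the fraction == 1.0 case independent of each backend's shuffling behaviour, at the cost of returning the same object rather than a copy.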

Collaborator

@GaganNarula GaganNarula left a comment


Can we add docs for these transforms to docs/transforms.md?


2 participants