Skip to content

feat: environment & clock control for commands (TZ/env matrix, pinned clock) #115

Description

@slowdini

Motivation

The most discriminating bugs for a debugging-discipline skill are the ones that only reproduce
under conditions the developer's machine doesn't have
— a different timezone, locale, or clock.
That environment-dependence is exactly what makes the lazy "works for me" path tempting, which is
what gives the skill a measurable delta (see slow-powers evaluating-skills → "Engineering an
attractive failure").

The new investigating-bugs timezone eval needs this concretely:

  • repro.ts must be green under TZ=UTC (the default sandbox) and red under
    TZ=America/Los_Angeles
    — so the disciplined move is to vary TZ to reproduce.
  • The held-out test (issue: command_check) must run under a TZ matrix so that a fix is only
    credited if it is correct in every offset: a "force local parse" decoy passes US offsets but
    breaks the save round-trip in positive offsets (verified: under Europe/Berlin,
    new Date("2024-12-31T00:00:00").toISOString() rolls back to 2024-12-30).
  • DST/isDueToday cases need a pinned clock to be deterministic.

Proposed surface

Extend command_check (and the repro step) with environment and clock control:

{
  "id": "all_consumers_correct",
  "type": "command_check",
  "command": "bun test ./holdout/tz-matrix.holdout.ts",
  "env": { "TZ": "America/Los_Angeles" },
  "matrix": { "TZ": ["UTC", "America/Los_Angeles", "Pacific/Kiritimati", "Europe/Berlin"] },
  "clock": "2024-03-10T08:30:00Z",
  "expect_exit_code": 0
}
  • env: per-command environment variables.
  • matrix: run the command once per combination of the listed values; all cells must pass
    for the assertion to pass. Report per-cell results.
  • clock: pin wall-clock time (via a libfaketime-style wrapper where available, or a documented
    fixture clock-injection hook) so DST/now-dependent logic is deterministic.

Also: document (and ideally pin) the agent sandbox's default TZ (e.g. UTC) so authors can
rely on "the naive repro is green" as the hard-to-reproduce hook.

Acceptance criteria

  • A command/repro can be run under a specified TZ/env.
  • A matrix runs N cells and fails the assertion if any cell fails, with per-cell reporting.
  • clock makes a Date.now()/new Date()-dependent test deterministic, or the limitation is
    documented with the fixture-injection alternative.
  • The default sandbox TZ is documented.

Back-compat

Additive fields; omitting them preserves today's behavior. matrix interacts with --runs
multiplicatively — document the cost.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions