Motivation
The most discriminating bugs for a debugging-discipline skill are the ones that only reproduce
under conditions the developer's machine doesn't have — a different timezone, locale, or clock.
That environment-dependence is exactly what makes the lazy "works for me" path tempting, which is
what gives the skill a measurable delta (see slow-powers evaluating-skills → "Engineering an
attractive failure").
The new investigating-bugs timezone eval needs this concretely:
repro.ts must be green under TZ=UTC (the default sandbox) and red under
TZ=America/Los_Angeles — so the disciplined move is to vary TZ to reproduce.
- The held-out test (issue:
command_check) must run under a TZ matrix so that a fix is only
credited if it is correct in every offset: a "force local parse" decoy passes US offsets but
breaks the save round-trip in positive offsets (verified: under Europe/Berlin,
new Date("2024-12-31T00:00:00").toISOString() rolls back to 2024-12-30).
- DST/
isDueToday cases need a pinned clock to be deterministic.
Proposed surface
Extend command_check (and the repro step) with environment and clock control:
{
"id": "all_consumers_correct",
"type": "command_check",
"command": "bun test ./holdout/tz-matrix.holdout.ts",
"env": { "TZ": "America/Los_Angeles" },
"matrix": { "TZ": ["UTC", "America/Los_Angeles", "Pacific/Kiritimati", "Europe/Berlin"] },
"clock": "2024-03-10T08:30:00Z",
"expect_exit_code": 0
}
env: per-command environment variables.
matrix: run the command once per combination of the listed values; all cells must pass
for the assertion to pass. Report per-cell results.
clock: pin wall-clock time (via a libfaketime-style wrapper where available, or a documented
fixture clock-injection hook) so DST/now-dependent logic is deterministic.
Also: document (and ideally pin) the agent sandbox's default TZ (e.g. UTC) so authors can
rely on "the naive repro is green" as the hard-to-reproduce hook.
Acceptance criteria
- A command/repro can be run under a specified
TZ/env.
- A
matrix runs N cells and fails the assertion if any cell fails, with per-cell reporting.
clock makes a Date.now()/new Date()-dependent test deterministic, or the limitation is
documented with the fixture-injection alternative.
- The default sandbox TZ is documented.
Back-compat
Additive fields; omitting them preserves today's behavior. matrix interacts with --runs
multiplicatively — document the cost.
Motivation
The most discriminating bugs for a debugging-discipline skill are the ones that only reproduce
under conditions the developer's machine doesn't have — a different timezone, locale, or clock.
That environment-dependence is exactly what makes the lazy "works for me" path tempting, which is
what gives the skill a measurable delta (see slow-powers
evaluating-skills→ "Engineering anattractive failure").
The new
investigating-bugstimezone eval needs this concretely:repro.tsmust be green underTZ=UTC(the default sandbox) and red underTZ=America/Los_Angeles— so the disciplined move is to vary TZ to reproduce.command_check) must run under a TZ matrix so that a fix is onlycredited if it is correct in every offset: a "force local parse" decoy passes US offsets but
breaks the save round-trip in positive offsets (verified: under
Europe/Berlin,new Date("2024-12-31T00:00:00").toISOString()rolls back to2024-12-30).isDueTodaycases need a pinned clock to be deterministic.Proposed surface
Extend
command_check(and the repro step) with environment and clock control:{ "id": "all_consumers_correct", "type": "command_check", "command": "bun test ./holdout/tz-matrix.holdout.ts", "env": { "TZ": "America/Los_Angeles" }, "matrix": { "TZ": ["UTC", "America/Los_Angeles", "Pacific/Kiritimati", "Europe/Berlin"] }, "clock": "2024-03-10T08:30:00Z", "expect_exit_code": 0 }env: per-command environment variables.matrix: run the command once per combination of the listed values; all cells must passfor the assertion to pass. Report per-cell results.
clock: pin wall-clock time (via alibfaketime-style wrapper where available, or a documentedfixture clock-injection hook) so DST/
now-dependent logic is deterministic.Also: document (and ideally pin) the agent sandbox's default
TZ(e.g.UTC) so authors canrely on "the naive repro is green" as the hard-to-reproduce hook.
Acceptance criteria
TZ/env.matrixruns N cells and fails the assertion if any cell fails, with per-cell reporting.clockmakes aDate.now()/new Date()-dependent test deterministic, or the limitation isdocumented with the fixture-injection alternative.
Back-compat
Additive fields; omitting them preserves today's behavior.
matrixinteracts with--runsmultiplicatively — document the cost.