chore(deps)!: 7-day dependency cooldown (uv exclude-newer + CI assert) by xdotli · Pull Request #788 · benchflow-ai/benchflow

xdotli · 2026-06-15T23:27:26Z

What

Adds a hard 7-day dependency cooldown: uv may never resolve a dependency to a release younger than ~7 days old. Motivation — the freshly-published-CVE churn during #787 (advisories landing on day-0 releases); a cooldown gives new releases time to be vetted.

How (two layers, both evaluated against today)

Static cap, dynamically set. [tool.uv] exclude-newer caps every uv lock at a timestamp so the default resolution can't reach a brand-new release. It has to be a static literal (CI's uv sync --locked / uv export --locked must match the lock deterministically — a live value in pyproject would break --locked). So the date isn't hand-typed: python tools/lock.py computes midnight-UTC of now − 7d, writes it, and re-locks in one step. tests/test_dep_cooldown.py::test_uv_exclude_newer_caps_resolution fails if the committed cutoff is younger than 7 days.
Dynamic enforcement, offline. test_locked_packages_respect_cooldown reads the upload-time that uv bakes into uv.lock for every resolved package and fails if any is younger than now − 7d. This is the genuinely live "7 days ago from the current date" check — evaluated every CI run against today, no stored date trusted and no PyPI queries (timestamps are already in the lock). It never flakes on an unchanged lock: upload times are immutable, so packages only age out of the window.

COOLDOWN_DAYS = 7 lives once in tools/lock.py and is imported by the gate, so the writer and the check can't drift.
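The static layer can be sketched roughly as follows. This is a minimal illustration, not the actual contents of tools/lock.py: the function names compute_cooldown_cutoff and rewrite_exclude_newer appear in the commit message, but their signatures, the regex, and the error handling here are assumptions.

```python
import datetime
import re

COOLDOWN_DAYS = 7  # single source of truth, as described above


def compute_cooldown_cutoff(now: datetime.datetime) -> str:
    """Return midnight-UTC of (now - COOLDOWN_DAYS) as an RFC 3339 string.

    Assumes `now` is timezone-aware.
    """
    now_utc = now.astimezone(datetime.timezone.utc)
    day = (now_utc - datetime.timedelta(days=COOLDOWN_DAYS)).date()
    return f"{day.isoformat()}T00:00:00Z"


# Hypothetical pattern: matches only a line that starts with the global key,
# so entries inside [tool.uv.exclude-newer-package] are left alone.
_EXCLUDE_NEWER = re.compile(r'^exclude-newer\s*=\s*"[^"]*"', re.MULTILINE)


def rewrite_exclude_newer(text: str, cutoff: str) -> str:
    """Replace the global exclude-newer value in pyproject text."""
    new_text, count = _EXCLUDE_NEWER.subn(f'exclude-newer = "{cutoff}"', text, count=1)
    if count == 0:
        raise ValueError("no global exclude-newer line found")
    return new_text


now = datetime.datetime(2026, 6, 15, 23, 27, 26, tzinfo=datetime.timezone.utc)
print(compute_cooldown_cutoff(now))  # 2026-06-08T00:00:00Z
```

With the PR's opening timestamp as `now`, the sketch reproduces the committed cutoff of 2026-06-08T00:00:00Z. Truncating to midnight keeps the value stable for a whole UTC day, so re-running the helper on the same day leaves the file unchanged.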

Security / grandfather exception

uv has no rolling cooldown natively, so deps the repo already pins that are still <7 days old get a one-time override in [tool.uv.exclude-newer-package] (also the escape hatch for an urgent CVE fix). Each entry is exempted from layer 2 and should be removed as it ages past 7d:

litellm[proxy]==1.89.0 (2026-06-13) — exact pin; resolution fails without it.
starlette 1.3.1 (2026-06-12) — the CVE-2026-54282/54283 fix; the override keeps it instead of rolling back to the vulnerable 1.2.1.

Verified the audit has teeth: with exemptions removed it flags exactly litellm 1.89.0 and starlette 1.3.1; with them, it's clean.

Impact

Re-locking under the cap rolled 6 langchain-family packages back by a patch/minor (langchain, -anthropic, -core, -google-genai, -openai, langsmith) — their newest releases are <7d old. The CVE fixes from #787 (aiohttp 3.14.1, python-multipart 0.0.32, starlette 1.3.1) are preserved. Full suite 4202 passed; ruff/format/ty clean. python tools/lock.py --check today prints the committed 2026-06-08T00:00:00Z, so the lock is unchanged.

… hard rule

Never resolve a dependency to a release younger than ~7 days (less-vetted; the freshly-published-CVE churn during #787 is the motivating case). Enforced two ways:

- `[tool.uv] exclude-newer` caps every `uv lock` at a fixed timestamp (currently 2026-06-08, >=7 days ago); advance it (kept >=7d in the past) to take updates.
- `tests/test_dep_cooldown.py` fails CI if `exclude-newer` is ever within the last 7 days, so the cap can't silently drift forward.

Grandfather overrides in `[tool.uv.exclude-newer-package]` for deps the repo already pins that are still <7d old (each should be removed as it ages past 7d):

- litellm[proxy]==1.89.0 (2026-06-13): exact pin; blocks resolution otherwise.
- starlette 1.3.1 (2026-06-12): CVE-2026-54282/54283 fix; don't roll back to the vulnerable 1.2.1.

Security exception is the override mechanism: an urgent fix younger than the cap is allowed via a commented per-package entry.

Re-locking under the cap rolled 6 langchain-family packages back by a patch/minor (their newest releases are <7d old); full suite green (4190).

greptile-apps · 2026-06-15T23:30:40Z

Greptile Summary

This PR introduces a hard 7-day dependency cooldown enforced at two layers: a static [tool.uv] exclude-newer timestamp in pyproject.toml (kept at now − 7 days by tools/lock.py) and a dynamic CI gate in tests/test_dep_cooldown.py that reads upload timestamps from uv.lock on every run. The approach avoids PyPI network calls and is deterministic because upload times in uv.lock are immutable. Six langchain-family packages were rolled back as a result of the tighter cap; the pydantic-settings CVE fix is preserved via a one-time per-package override.

tools/lock.py: new helper that computes midnight-UTC of now − 7d, rewrites exclude-newer in pyproject.toml, then shells out to uv lock. Includes LockError wrapping for I/O and subprocess failures, and exposes pure functions consumed directly by the test suite.
tests/test_dep_cooldown.py + tests/test_lock_tool.py: two complementary test files — one gating the live repo state, one unit-testing the helper's pure functions with a fixed synthetic lock.
pyproject.toml: adds [tool.uv] required-version = ">=0.8.4" (floor for the per-package override table), sets the initial exclude-newer, and registers the pydantic-settings temporary exception with a documented comment.

Confidence Score: 5/5

Safe to merge. The cooldown enforcement is self-consistent across both layers, all previous findings have been addressed, and the pydantic-settings override will self-destruct via the CI gate on 2026-06-27 as designed.

The two-layer enforcement is logically coherent: the static cap and the dynamic lock audit use the same COOLDOWN_DAYS constant imported from the single source of truth, OSError wrapping is now complete across I/O and subprocess paths, and per-package overrides have an enforced expiry. The pydantic-settings override will cause test_exclude_newer_package_overrides_expire_with_cooldown to fail starting tomorrow — but that is the mechanism working as intended, not a defect.

The pydantic-settings override in pyproject.toml ages out on 2026-06-27 and must be removed (along with a re-lock) promptly after merge.

Important Files Changed

Filename	Overview
tools/lock.py	New helper: computes cooldown cutoff, rewrites pyproject.toml in-place, and shells out to `uv lock`. I/O and subprocess failures are all wrapped in `LockError`. Logic is clean; regex correctly targets only the global `exclude-newer` line, not per-package table values.
tests/test_dep_cooldown.py	Live CI gate: checks the global cutoff is ≥ 7 days old, audits locked packages for cooldown violations, and enforces that per-package overrides are removed once they age past the window. Three-test structure cleanly separates the two enforcement layers.
tests/test_lock_tool.py	Comprehensive unit tests for the pure functions in `tools/lock.py` using a fixed synthetic lock. Covers cutoff arithmetic, idempotency, timezone normalisation, regex behaviour, OSError wrapping, and the override-expiry logic.
pyproject.toml	Adds `[tool.uv]` block with `required-version`, `exclude-newer`, and a `pydantic-settings` temporary override. The override cutoff (2026-06-20) will trigger `test_exclude_newer_package_overrides_expire_with_cooldown` starting 2026-06-27 — this is the intended behaviour of the system.
uv.lock	Six langchain packages rolled back to earlier versions that predate the cooldown window; `[options]` block added to record the global and per-package cutoffs. Mechanical output of `uv lock` under the new cap.
AGENTS.md	Single-line addition documenting the cooldown workflow for contributors.

Reviews (6): Last reviewed commit: "Fix dependency cooldown override expiry"

greptile-apps · 2026-06-15T23:30:44Z

+def test_uv_exclude_newer_enforces_dependency_cooldown() -> None:
+    cfg = tomllib.loads(_PYPROJECT.read_text(encoding="utf-8"))
+    raw = cfg.get("tool", {}).get("uv", {}).get("exclude-newer")
+    assert raw, (
+        "[tool.uv] exclude-newer is missing from pyproject.toml — it enforces the "
+        f"{COOLDOWN_DAYS}-day dependency cooldown and must be set."
+    )
+    cutoff = datetime.datetime.fromisoformat(str(raw).replace("Z", "+00:00"))
+    if cutoff.tzinfo is None:
+        cutoff = cutoff.replace(tzinfo=datetime.UTC)
+    age = datetime.datetime.now(datetime.UTC) - cutoff
+    assert age >= datetime.timedelta(days=COOLDOWN_DAYS), (
+        f"[tool.uv] exclude-newer ({raw}) is only {age.days}d old; the dependency "
+        f"cooldown requires it to be >= {COOLDOWN_DAYS} days in the past. Bump it "
+        f"to an older date (>= {COOLDOWN_DAYS}d ago) when taking dependency updates."
+    )


Grandfather overrides have no CI expiry enforcement

The test guards [tool.uv] exclude-newer but never inspects [tool.uv.exclude-newer-package]. The per-package overrides are the one channel through which a package newer than the global cooldown can enter the lockfile, yet nothing checks that those entries carry a required comment, or that they're removed once the package crosses the 7-day mark. In practice the litellm and starlette entries will silently linger forever after they age out, and a future engineer can add another entry without any guardrail. Adding a second assertion that iterates cfg["tool"]["uv"].get("exclude-newer-package", {}) and verifies each override timestamp is still within some reasonable tolerance (e.g., entry date ≥ global exclude-newer date) would close this gap without much code.

greptile-apps · 2026-06-15T23:30:45Z

+    assert age >= datetime.timedelta(days=COOLDOWN_DAYS), (
+        f"[tool.uv] exclude-newer ({raw}) is only {age.days}d old; the dependency "
+        f"cooldown requires it to be >= {COOLDOWN_DAYS} days in the past. Bump it "
+        f"to an older date (>= {COOLDOWN_DAYS}d ago) when taking dependency updates."
+    )


The error message uses age.days, which is the integer days component of the timedelta, not the total elapsed days rounded. For a gap like 6 days 23 h 50 min, it prints "6d old" while the assertion correctly sees it as still less than 7 days — the diagnostic makes the boundary look farther away than it is. Prefer age.total_seconds() / 86400 for accurate sub-day precision.

Suggested change

-    assert age >= datetime.timedelta(days=COOLDOWN_DAYS), (
-        f"[tool.uv] exclude-newer ({raw}) is only {age.days}d old; the dependency "
-        f"cooldown requires it to be >= {COOLDOWN_DAYS} days in the past. Bump it "
-        f"to an older date (>= {COOLDOWN_DAYS}d ago) when taking dependency updates."
-    )
+    assert age >= datetime.timedelta(days=COOLDOWN_DAYS), (
+        f"[tool.uv] exclude-newer ({raw}) is only {age.total_seconds() / 86400:.1f}d old; "
+        f"the dependency cooldown requires it to be >= {COOLDOWN_DAYS} days in the past. "
+        f"Bump it to an older date (>= {COOLDOWN_DAYS}d ago) when taking dependency updates."
+    )
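To make the difference concrete, here is a small standalone comparison of the two formattings for the 6 days 23 h 50 min gap mentioned above. The variable names are illustrative only. One caveat worth noting: with a single decimal place the value rounds up to 7.0 at this boundary, so a sketch that wants the message to stay visibly below the threshold might use two decimals instead.

```python
import datetime

# The boundary case from the review comment: just under 7 days.
age = datetime.timedelta(days=6, hours=23, minutes=50)

coarse = f"{age.days}d"                               # integer days component only
one_decimal = f"{age.total_seconds() / 86400:.1f}d"   # rounds up at this boundary
two_decimals = f"{age.total_seconds() / 86400:.2f}d"  # stays below 7

print(coarse)        # 6d
print(one_decimal)   # 7.0d
print(two_decimals)  # 6.99d
```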

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: edfa207867


chatgpt-codex-connector · 2026-06-15T23:32:34Z

+# comment recording why. Remove the override once the package is >= 7 days old.
+exclude-newer = "2026-06-08T00:00:00Z"
+
+[tool.uv.exclude-newer-package]


Pin uv before using exclude-newer-package

In environments that still have uv 0.7.x, this new table is an unknown [tool.uv] key: uv sync --locked reports unknown field exclude-newer-package, ignores the lockfile cutoff as removed, and tries to re-resolve from PyPI instead of installing the committed lock. Since the repo only requires users to have uv and does not set tool.uv.required-version, local setup and any older CI image can fail or bypass the intended cooldown; pin/install a uv version that supports per-package cutoffs before relying on this syntax.


The 7-day cooldown date in [tool.uv] exclude-newer was hand-typed. Add tools/lock.py to compute it as midnight-UTC of (today - 7d) and re-lock in one step, so the date is set dynamically instead of guessed. The committed value stays static (CI's uv sync --locked must be deterministic against the lock); the helper just rolls it forward to the newest still-vetted day.

- tools/lock.py: compute_cooldown_cutoff + rewrite_exclude_newer + CLI (--check / --no-lock); single COOLDOWN_DAYS source imported by the gate.
- tests/test_lock_tool.py: unit tests for the date/text transforms.
- test_dep_cooldown.py: import COOLDOWN_DAYS from tools.lock (no drift).
- pyproject/AGENTS.md: point contributors at the helper, not hand-edits.

…times

The cutoff date in [tool.uv] exclude-newer is inherently static (CI's uv sync --locked must match the lock), so trusting that date alone isn't a live '7 days ago from today' check. Add the dynamic half: read the upload-time uv bakes into uv.lock for every resolved package and fail if any is younger than now-7d, honoring documented [tool.uv.exclude-newer-package] exemptions. This evaluates the window against the current date on every CI run, stays offline (timestamps are in the lock, so no PyPI queries), and never flakes on an unchanged lock (upload times are immutable, so packages only age out).

- tools/lock.py: newest_upload_times + find_cooldown_violations.
- test_dep_cooldown.py: split into caps-resolution (static cutoff >= 7d) + locked-packages-respect-cooldown (dynamic, the real invariant).
- test_lock_tool.py: audit unit tests (latest-artifact, sourceless skip, exemptions, monotonic leniency over time).

greptile-apps · 2026-06-15T23:53:39Z

+        original = args.pyproject.read_text(encoding="utf-8")
+        updated = rewrite_exclude_newer(original, cutoff)
+        if updated != original:
+            args.pyproject.write_text(updated, encoding="utf-8")


OSError from file I/O is not caught. If args.pyproject doesn't exist (e.g., --pyproject wrong.toml) or lacks write permission, read_text() / write_text() raise FileNotFoundError / PermissionError, which fall outside the except LockError handler and produce a raw Python traceback. Catching OSError and converting it to LockError would give the same clean exit as every other failure path.

Suggested change

-        original = args.pyproject.read_text(encoding="utf-8")
-        updated = rewrite_exclude_newer(original, cutoff)
-        if updated != original:
-            args.pyproject.write_text(updated, encoding="utf-8")
+        try:
+            original = args.pyproject.read_text(encoding="utf-8")
+        except OSError as exc:
+            raise LockError(f"Cannot read {args.pyproject}: {exc}") from exc
+        updated = rewrite_exclude_newer(original, cutoff)
+        if updated != original:
+            try:
+                args.pyproject.write_text(updated, encoding="utf-8")
+            except OSError as exc:
+                raise LockError(f"Cannot write {args.pyproject}: {exc}") from exc

bingran-you · 2026-06-16T05:34:44Z

Automation triage (2026-06-16): CI is green (test, pip-audit, eval-and-judge) and the PR is mergeable, but it has no human approval yet. Per repo policy, bot/self review is not enough for main, so I labeled it review:pending and am leaving it unmerged until a human reviewer approves.

No code changes made in this scan.

bingran-you · 2026-06-16T05:39:29Z

Follow-up automation review (2026-06-16): holding this despite green CI. Two dependency-policy blockers need author follow-up before human review/merge:

1. [tool.uv.exclude-newer-package] relies on uv support that landed in uv 0.8.4, but the PR does not enforce a minimum uv version. Add [tool.uv] required-version = ">=0.8.4" and consider pinning the CI setup-uv version so local/CI behavior cannot silently diverge on older uv.
2. The sdist allowlist currently includes tests/ but not tools/, while the new tests import tools.lock. Either include tools in [tool.hatch.build.targets.sdist].only-include or stop shipping tests that depend on repo-only helper code.

I moved the PR from review:pending to review:changes-requested. No merge until these are addressed, checks rerun, and a human review approves.

xdotli · 2026-06-16T05:52:36Z

When this matters in practice

The cooldown only changes your workflow when you're updating dependencies (adding one, bumping a version, or pulling in a CVE fix). At that point, re-lock with:

python tools/lock.py

It sets [tool.uv] exclude-newer to today − 7d and re-locks in one step. Day-to-day there's nothing to do — no daily/scheduled bumps. The static cutoff only ages (safe), and the dynamic lock check only gets more lenient over time, so an unchanged lock never flips red.

The guardrail holds even if someone doesn't know about the helper

python tools/lock.py → cooldown applied at resolve time, lands clean ✓
plain uv lock that grabs a day-0 release → CI fails, listing any package younger than 7 days: "re-lock with python tools/lock.py, or add a commented override" 🛑

Escape hatch

A genuinely urgent fix that's <7 days old (e.g. an active CVE) goes through a commented [tool.uv.exclude-newer-package] override — a deliberate "accepting this fresh package, here's why" entry that exempts it from the dynamic check. Each one ages out (becomes a no-op) once the package crosses 7 days, so it's cleanup-when-convenient, not a scheduled chore.

TL;DR: update deps → run tools/lock.py; everything else is automatic.

Addresses two dep-cooldown policy gaps:

1. `required-version = ">=0.8.4"`: the `[tool.uv.exclude-newer-package]` override table is only honored by uv >= 0.8.4. Without a floor an older uv would silently ignore it and resolve a different lock.
2. Add `tools` to the sdist `only-include` allowlist. The shipped `tests/test_dep_cooldown.py` does `from tools.lock import ...`, so the sdist must include `tools` or the suite fails to import from a source distribution. Verified `uv build --sdist` now ships `tools/lock.py`.

uv lock --check clean; test_dep_cooldown.py passes (2).

bingran-you · 2026-06-16T06:05:33Z

Automation (2026-06-16): addressed both policy blockers I raised earlier, pushed as 7b92cbd3:

uv version floor — added required-version = ">=0.8.4" under [tool.uv]. The exclude-newer-package override table is only honored by uv ≥ 0.8.4, so an older uv would silently ignore it.
sdist ships tools/ — added tools to [tool.hatch.build.targets.sdist].only-include. tests/test_dep_cooldown.py does from tools.lock import ...; verified uv build --sdist now bundles tools/lock.py next to the test.

Verification: uv lock --check clean (lock unchanged), test_dep_cooldown.py 2 passed, sdist tarball confirmed to contain tools/lock.py.

Still holding for a human approval before merge per repo policy (bot/self review insufficient for main).

bingran-you · 2026-06-21T08:48:45Z

Automation update (2026-06-21): rebased/merged current main into this branch and resolved the uv.lock conflict at head ea645df.

Validation passed locally:

uv sync --extra dev --extra sandbox-daytona --locked
uv lock --check
uv run python -m pytest tests/test_dep_cooldown.py tests/test_lock_tool.py -q
uv run ruff check pyproject.toml tests/test_dep_cooldown.py tests/test_lock_tool.py tools/lock.py
uv run ty check tools/lock.py tests/test_dep_cooldown.py tests/test_lock_tool.py
uv run ruff check .
uv run ty check src/
uv run python -m pytest tests/ -q (4464 passed, 12 skipped, 7 deselected)

GitHub checks are green on the rebased head: test, pip-audit, detect-scope, rollout-smoke; integration-scope matrix/review-pack skipped by scope. The PR is now merge-clean. Since the previous approval predates the rebase commit, I moved labels back to review:pending for fresh non-author review before merge.

bingran-you · 2026-06-22T06:12:30Z

Automation triage (2026-06-22): not auto-merging — needs human decision on two points.

Approval is stale. The only APPROVED review is on commit 7b92cbd3, but HEAD is now ea645df4 after the later main-rebase that resolved the uv.lock conflict. Under the repo's non-author-review rule, the current head is unreviewed.
Cooldown cutoff vs. security pin (design tension). This PR sets [tool.uv] exclude-newer = "2026-06-08T00:00:00Z", but main (fix(integration): calibrate L3 gate — slot matching, V-TAMPER false-positive, codex robustness #814) pins pydantic-settings==2.14.2 — the GHSA-4xgf-cpjx-pc3j fix published 2026-06-19, i.e. newer than the cutoff. uv lock --check passes today only because the version is explicitly pinned in uv.lock; a fresh uv lock regeneration would resolve away from 2.14.2 (it is newer than the cutoff) and silently revert the security fix. The cooldown policy and the security pin currently contradict each other.

Decision needed (owner/@xdotli): either (a) advance exclude-newer past the most recent required security pin (≥ 2026-06-19) and document that security pins override the cooldown, or (b) add an explicit cooldown-exception mechanism for security-driven pins. After that, re-request a fresh non-author review on the new head. Current normal CI is green (test, pip-audit, detect-scope, rollout-smoke); the blocker is policy correctness, not CI.

bingran-you · 2026-06-24T08:45:24Z

Automation triage (2026-06-24): no merge today. I added review:pending so the label state reflects the current gate.

The PR is merge-clean and checks are green, but it still lacks fresh non-author approval and the cooldown policy needs an owner decision against the newer security/dependency pins. Please choose whether to apply a security override or advance/wait out the cutoff, then rerun the lock/test flow before re-review.

bingran-you · 2026-06-26T20:21:38Z

Automation user-simulation review (2026-06-26): still blocked on policy, not command behavior.

Focused local simulation in the PR worktree passed:

PYTHONPATH=/tmp/benchflow-pr-sims/pr788/src python -m pytest tests/test_dep_cooldown.py tests/test_lock_tool.py -q
# 13 passed

A subagent also checked the real user surface: python tools/lock.py --check, uv lock --check, benchflow --help, and import benchflow.sdk all work. I do not see a direct BenchFlow CLI/SDK regression from the cooldown helper itself.

The remaining blocker is policy correctness. The current exemption model makes [tool.uv.exclude-newer-package] a permanent package allowlist: once a package name is listed there, the dynamic cooldown test skips it unconditionally, even after the package is older than the 7-day window. That weakens the stated "hard" cooldown unless stale overrides are failed or time-bounded. This sits on top of the existing unresolved decision about newer security pins versus the global cutoff.

Recommended owner decision before merge:

Make package overrides explicitly time-bound and fail CI once an override ages out, or
document that overrides are permanent and rename the policy accordingly.

Until then, keeping status:blocked; normal command simulation is green.

bingran-you · 2026-06-26T21:11:53Z

Automation update (2026-06-26): addressed the cooldown override blocker and pushed a4ffe465 to chore/dep-cooldown-7d.

What changed:

[tool.uv.exclude-newer-package] is now time-bounded by CI: overrides fail once their per-package cutoff is older than the active 7-day cooldown floor, so the table cannot become a permanent package allowlist.
The dynamic cooldown exemption lookup now normalizes package names.
tools/lock.py now wraps pyproject read/write and uv lock subprocess OSErrors as LockError diagnostics.
Stale litellm/starlette overrides were removed; the current active escape hatch is pydantic-settings = "2026-06-20T00:00:00Z" for the GHSA security pin, with global exclude-newer = "2026-06-19T00:00:00Z".
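The time-bounded override rule and the name normalization can be sketched together. This is an assumption-laden illustration rather than the code pushed in a4ffe465: normalize_name and find_expired_overrides are hypothetical names, the strict less-than comparison is inferred from "older than the active 7-day cooldown floor", and the normalization follows the standard PEP 503 rule.

```python
import datetime
import re

COOLDOWN_DAYS = 7
UTC = datetime.timezone.utc


def normalize_name(name: str) -> str:
    """PEP 503 normalization: collapse runs of -, _ and . to a single dash, lowercase."""
    return re.sub(r"[-_.]+", "-", name).lower()


def parse_timestamp(raw: str) -> datetime.datetime:
    """Parse an RFC 3339 timestamp, treating a trailing Z (or no offset) as UTC."""
    parsed = datetime.datetime.fromisoformat(raw.replace("Z", "+00:00"))
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=UTC)


def find_expired_overrides(overrides: dict[str, str], now: datetime.datetime) -> list[str]:
    """Return normalized names whose per-package cutoff is older than now - COOLDOWN_DAYS."""
    floor = now - datetime.timedelta(days=COOLDOWN_DAYS)
    return sorted(
        normalize_name(name)
        for name, raw in overrides.items()
        if parse_timestamp(raw) < floor
    )


# Spelled differently from the lock entry on purpose, to exercise normalization.
overrides = {"Pydantic_Settings": "2026-06-20T00:00:00Z"}

at_push = datetime.datetime(2026, 6, 26, 21, 11, tzinfo=UTC)
next_day = datetime.datetime(2026, 6, 27, 12, 21, tzinfo=UTC)
print(find_expired_overrides(overrides, at_push))   # []
print(find_expired_overrides(overrides, next_day))  # ['pydantic-settings']
```

Under these assumptions the override is still active when the fix is pushed on 2026-06-26 and counts as expired from 2026-06-27, which is the expiry date stated in the Greptile summary.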

Validation:

uv sync --extra dev --locked
uv lock --check
uv run python -m pytest tests/test_dep_cooldown.py tests/test_lock_tool.py -> 19 passed
uv run ruff check pyproject.toml tools/lock.py tests/test_dep_cooldown.py tests/test_lock_tool.py
uv run ty check tools/lock.py tests/test_dep_cooldown.py tests/test_lock_tool.py
python tools/lock.py --check -> 2026-06-19T00:00:00Z

bingran-you · 2026-06-26T21:20:29Z

Automation follow-up (2026-06-26): GitHub checks are now green on a4ffe465 after the time-bounded override fix.

Green checks:

test
pip-audit
integration-light / detect-scope
integration-light / rollout-smoke
integration-scope / detect-scope

The previous policy blocker is now encoded in CI: package-specific cooldown overrides expire once their cutoff ages past the 7-day cooldown floor, stale litellm/starlette overrides are removed, and the active pydantic-settings security exception is explicit and temporary. I moved status:blocked -> status:ready; leaving review:pending because the pushed head still needs fresh non-author human review before merge.

bingran-you · 2026-06-26T21:55:22Z

Users Simulation review (2026-06-26): blocked.

The dependency-cooldown tooling passes its current tests, but the simulation found that the audit is fail-open. tools/lock.py only inspects packages with upload-time; a quick audit of the current uv.lock found 179 registry packages without timestamps, including agent-client-protocol, aiofiles, and aiohappyeyeballs. That means find_cooldown_violations() skips most of the resolved graph, while the test/docstring claims full-lock coverage.

Checks recorded:

uv --version
uv lock --check
uv sync --extra dev --locked
uv run python tools/lock.py --check
uv run bench --help
uv run bench eval run --help
uv run bench tasks --help
uv run python -m pytest tests/test_dep_cooldown.py tests/test_lock_tool.py -q  # 19 passed
uv run ruff format --check tools/lock.py tests/test_dep_cooldown.py tests/test_lock_tool.py
uv run ruff check tools/lock.py tests/test_dep_cooldown.py tests/test_lock_tool.py
uv run ty check tools
uv build --no-sources --out-dir /tmp/benchflow-build-pr788-full

I set status:blocked and review:changes-requested. Please either fail closed on missing registry timestamps or narrow the test/doc claim so the check does not overstate its coverage.

bingran-you · 2026-06-27T12:22:02Z

Users Simulation automation review (2026-06-27T12:21Z): blocked.

The new cooldown gate is rejecting this branch itself. pyproject.toml:207 and the mirrored lock entry at uv.lock:13 still pin pydantic-settings = "2026-06-20T00:00:00Z"; as of this run, that is past the 7-day floor, and tests/test_dep_cooldown.py:70 fails. The focused pytest result was 1 failed, 18 passed.

Commands/evidence:

uv sync --locked --extra dev --dry-run
uv lock --check
uv run python tools/lock.py --check
CLI smoke: benchflow --help, benchflow eval --help, benchflow eval run --help
SDK/import smoke for benchflow, benchflow.sdk, and tools.lock
uv run python -m pytest tests/test_dep_cooldown.py tests/test_lock_tool.py -q reproduced the failure
uv run ruff check tools/lock.py tests/test_dep_cooldown.py tests/test_lock_tool.py passed
uv run ty check src/ tools/ passed

Please remove or refresh the expired pydantic-settings override and re-lock.

bingran-you · 2026-06-27T17:14:55Z

Automation follow-up (2026-06-27): pushed 40d71b33 and 7facf574 to clear the dependency-cooldown blockers.

Fixed:

removed the expired pydantic-settings per-package override instead of refreshing it;
refreshed the global cooldown cutoff with tools/lock.py to 2026-06-20T00:00:00Z and re-locked;
regenerated uv.lock now has upload-time metadata for every registry package I checked (0 missing), so the dynamic cooldown audit no longer silently skips most of the graph;
fixed the stricter ty diagnostic exposed by the refreshed lock in src/benchflow/task/prompts.py by narrowing model is None before prefix checks.

Validation:

uv sync --extra dev --locked
uv lock --check
uv run python tools/lock.py --check
# 2026-06-20T00:00:00Z
uv run python -m pytest tests/test_dep_cooldown.py tests/test_lock_tool.py -q
# 19 passed
uv run ty check
uv run ruff check src/benchflow/task/prompts.py tools/lock.py tests/test_dep_cooldown.py tests/test_lock_tool.py
uv run ruff format --check src/benchflow/task/prompts.py tools/lock.py tests/test_dep_cooldown.py tests/test_lock_tool.py

GitHub checks on the pushed head are green (test, pip-audit, parity, detect-scope, rollout-smoke; scope-gated jobs skipped as expected). Moving status:blocked / review:changes-requested to status:ready / review:pending; it still needs fresh non-author human review before merge.

bingran-you · 2026-06-27T23:11:20Z

Users Simulation automation follow-up (2026-06-27T23:15Z): ready after the new cooldown fixes.

Validated current head 7facf57403160baf1c754291aa1850d1e41c3202 in isolated worktrees after the commits pushed at 2026-06-27T17:14Z. The previous blocker is cleared: pydantic-settings no longer has a stale [tool.uv.exclude-newer-package] override, the override table is now empty, and the current uv.lock audit reports violations=0 / expired_overrides=0.

Evidence:

uv sync --extra dev --extra sandbox-daytona --locked  # pass
uv run python tools/lock.py --check                  # 2026-06-20T00:00:00Z
uv run python -m pytest tests/test_dep_cooldown.py tests/test_lock_tool.py -q  # 19 passed
uv run ruff check .                                  # pass
uv run ty check src/                                 # pass
uv run python -m pytest tests/test_cli_docs_drift.py tests/test_tasks.py tests/test_eval_zero_task_guard.py tests/test_usage_tracking.py -q  # 71 passed
bench --help / bench agent list / bench tasks check tests/examples/hello-world-task  # pass
Python SDK smoke: import benchflow, RolloutConfig, Scene.single(agent="oracle")  # pass

A PR-scoped subagent also ran the full test suite on this head and reported 4470 passed / 12 skipped / 7 deselected, plus uv run ty check src tools and uv run ruff check src tools tests passing. GitHub checks on the new head are green for test, pip-audit, parity, and rollout-smoke.

I also attempted a low-cost model-backed Docker canary using openhands + deepseek/deepseek-v4-flash with --reasoning-effort xhigh and --usage-tracking required. It failed before any model call because OpenHands does not declare an ACP effort config option for xhigh; benchflow-experiment-review correctly rejected the artifacts as unhealthy (empty ACP trajectory, missing llm_trajectory.jsonl, zero usage, no reward). I am not treating that harness capability mismatch as a blocker for this dependency-policy PR.

Thermo-nuclear maintainability review: no blocker. tools/lock.py is a focused 281-line helper, tests cover the cooldown and temporary-override invariants, and the only runtime code change is the explicit None guard in src/benchflow/task/prompts.py.

Verdict: ready for the Users Simulation scope; keep status:ready / review:pending.

xdotli temporarily deployed to pypi-internal-preview June 15, 2026 23:27 — with GitHub Actions Inactive

greptile-apps Bot reviewed Jun 15, 2026

chatgpt-codex-connector Bot reviewed Jun 15, 2026

xdotli temporarily deployed to pypi-internal-preview June 15, 2026 23:39 — with GitHub Actions Inactive

xdotli temporarily deployed to pypi-internal-preview June 15, 2026 23:49 — with GitHub Actions Inactive

greptile-apps Bot reviewed Jun 15, 2026

bingran-you added the enhancement, P2, and review:pending labels Jun 16, 2026

bingran-you added review:changes-requested and removed review:pending labels Jun 16, 2026

bingran-you mentioned this pull request Jun 16, 2026

Capture raw provider traces for native-protocol (GenerateContent) agents so they reach llm_trajectory parity #671

bingran-you temporarily deployed to pypi-internal-preview June 16, 2026 06:05 — with GitHub Actions Inactive

bingran-you added review:pending and removed review:changes-requested labels Jun 16, 2026

bingran-you approved these changes Jun 16, 2026

bingran-you added the review:pending label Jun 21, 2026

bingran-you added status:blocked and review:pending and removed review:pending labels Jun 22, 2026

Fix dependency cooldown override expiry

a4ffe46

bingran-you temporarily deployed to pypi-internal-preview June 26, 2026 21:11 — with GitHub Actions Inactive

bingran-you added status:ready and removed status:blocked labels Jun 26, 2026

chore(deps): expire pydantic cooldown override

40d71b3

bingran-you temporarily deployed to pypi-internal-preview June 27, 2026 16:55 — with GitHub Actions Inactive

fix(types): narrow document user model before prefix checks

7facf57

bingran-you temporarily deployed to pypi-internal-preview June 27, 2026 17:06 — with GitHub Actions Inactive
