Skip to content

fix(cognition): max_concurrency: usize::MAX panics tokio Semaphore at registration#1419

Closed
joelteply wants to merge 1 commit into
canaryfrom
fix/cognition-max-concurrency-semaphore-overflow
Closed

fix(cognition): max_concurrency: usize::MAX panics tokio Semaphore at registration#1419
joelteply wants to merge 1 commit into
canaryfrom
fix/cognition-max-concurrency-semaphore-overflow

Conversation

@joelteply
Copy link
Copy Markdown
Contributor

Latent bug — CognitionModule panics at Runtime registration

Symptom

`CognitionModule::config()` sets `max_concurrency: usize::MAX` as a mistaken way to spell "no cap". The `Runtime::register()` path calls `tokio::sync::Semaphore::new(max_concurrency)` whenever `max_concurrency > 0`. tokio's MAX_PERMITS is `usize::MAX >> 3` (~2^61), so `Semaphore::new(usize::MAX)` panics immediately:

```
thread '...' panicked at tokio-1.50.0/src/sync/batch_semaphore.rs:141:9:
a semaphore may not have more than MAX_PERMITS permits (2305843009213693951)
```

Why it's been latent

Existing cognition tests bypass the panic by calling `module.handle_command` directly without going through `runtime.register`. The IPC path that production uses (`ipc/mod.rs:944` → `runtime.register(Arc::new(CognitionModule::new(...)))` → `runtime.route_command` for cognition commands) IS affected.

Surfaced while building Lane D `persona/turn-execute` integration tests that go through `runtime.route_command`.

Fix

```

  •        max_concurrency: usize::MAX,
    
  •        max_concurrency: 0,
    

```

The Runtime's `register()` only creates a Semaphore when `max_concurrency > 0` — `0` is the canonical spelling of "unbounded, no semaphore" used by other modules (`inference-llm`, the `RecordingModule` test fixtures, etc.). Preserves the original intent ("no cap on cognition; downstream ai_provider enforces inference serialization") without panicking.

Regression test

Added `registration_safety_tests::cognition_module_registers_without_semaphore_panic` that calls `Runtime::register` with CognitionModule. If `max_concurrency` is ever raised above tokio's MAX_PERMITS, this test panics at the original error message — easy breadcrumb back to this regression.

Provenance

Introduced in PR #1306 "perf(cognition): lift max_concurrency:1 cap so event fanout is not the gate". Author meant "no cap"; the right spelling for that in this codebase is `0`.

Validation

```
cd src/workers/continuum-core
cargo test --features metal,accelerate --lib modules::cognition
```

5/5 pass (4 pre-existing + 1 new regression test).

🤖 Generated with Claude Code

…ation

CognitionModule::config() set `max_concurrency: usize::MAX` as a
mistaken way to spell "no cap" (PR #1306). The Runtime's
`register()` calls `tokio::sync::Semaphore::new(max_concurrency)`
whenever `max_concurrency > 0`; tokio's MAX_PERMITS is
`usize::MAX >> 3` (~2^61), so `Semaphore::new(usize::MAX)` panics
immediately at registration with:

  thread 'X' panicked at tokio-1.50.0/src/sync/batch_semaphore.rs:141:9:
  a semaphore may not have more than MAX_PERMITS permits (2305843009213693951)

The Runtime's `register()` only instantiates a Semaphore when
`max_concurrency > 0` — `0` is the canonical spelling of
"unbounded, no semaphore" used by other modules
(`inference-llm`, `RecordingModule` test fixtures, etc.).
Switching to `0` preserves the original intent ("no cap on
cognition; downstream ai_provider enforces the inference
serialization") without panicking.

This was a latent bug surfaced while building integration tests
for the Lane D `persona/turn-execute` command (#1409) that go
through `runtime.route_command`. Existing cognition tests bypass
the panic by calling `module.handle_command` directly without
going through `runtime.register`.

* Add `registration_safety_tests::cognition_module_registers_without_semaphore_panic`
  as a regression guard: this test calls `Runtime::register` with
  CognitionModule, which panics if `max_concurrency > MAX_PERMITS`.
  Asserts `config.max_concurrency == 0` to make the dependency
  explicit for future readers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@joelteply
Copy link
Copy Markdown
Contributor Author

Superseded by #1417 — codex's PR carries the same max_concurrency: usize::MAX → 0 fix (commit 5bde37e) stacked with the Rust-only dispatch fix on the same Lane D issue. Per manager coordination on airc: keeping #1417 as the single canonical bug-fix PR for #1409 risks, so the bug-fix history stays clean and reviewable. My #1419 was orthogonal in concept but redundant in scope — landing both would just be churn. Closing.

@joelteply joelteply closed this May 18, 2026
@joelteply joelteply deleted the fix/cognition-max-concurrency-semaphore-overflow branch May 18, 2026 19:14
@joelteply
Copy link
Copy Markdown
Contributor Author

REOPENING — manager confirmed on airc that this Semaphore panic is the root cause for 'site does literally nothing' on canary. Every npm start hits Runtime::register → Semaphore::new(usize::MAX) → tokio MAX_PERMITS overflow → CognitionModule registration panic → worker dies. This standalone fix is the smallest fast-track patch to unblock canary; #1417 carries the same fix but bundled with the Rust-only dispatch fix which is broader scope. Landing this first restores npm start, then #1417 can land cleanly on top of fixed canary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant