[CLI/Core] Polymorphic agent dispatch + TestRun owns trial counter#893
[CLI/Core] Polymorphic agent dispatch + TestRun owns trial counter#893rutayan-nv wants to merge 5 commits into
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAgent execution was centralized into BaseAgent.run and handlers now call agent.run(); CloudAIGymEnv.step advances TestRun.step before applying params; TestRun gained increment_step(); tests and a stub agent were added to validate run() delegation and step-indexed artifacts. ChangesAgent-run integration and step semantics
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/cloudai/cli/handlers.py`:
- Around line 140-152: The finally block in _run_custom_training_loop currently
calls shutdown() directly which can raise and override the earlier return value;
wrap the shutdown invocation (getattr(agent, "shutdown", None) and the callable
check) in its own try/except Exception handler so any exceptions from shutdown
are caught and logged via logging.exception (include agent_type) and not
re-raised, ensuring the original return 0/1 from agent.train() is preserved.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 21ef8346-6b78-45d3-8a1c-29a3c393e440
📒 Files selected for processing (2)
src/cloudai/cli/handlers.pytests/test_handlers.py
- Agents that set HAS_CUSTOM_TRAINING_LOOP = True drive their own training loop; handle_dse_job calls agent.train() and skips the per-step env.step loop. - New _run_custom_training_loop helper logs exceptions, returns a process-style exit code, and always invokes agent.shutdown() (when defined) in a finally block so resources are released on both success and failure paths. - CustomTrainingLoopAgent Protocol documents the opt-in contract for type checkers and IDEs.
Pyright rejected calling _run_custom_training_loop(agent, ...) because the plain bool predicate did not narrow agent's static type from BaseAgent to CustomTrainingLoopAgent. Return TypeGuard[CustomTrainingLoopAgent] from _has_custom_training_loop so the truthy branch in handle_dse_job sees the opted-in shape and the helper can call agent.train() directly.
If agent.shutdown() raised from the finally block, Python suppressed the earlier return 0/1 from agent.train() and propagated the exception, breaking the outer test-run loop in handle_dse_job (skipped remaining scenarios, failed to accumulate err |= rc). Wrap shutdown() in its own try/except, log via logging.exception, set rc = 1, and return rc after finally so the helper always honours the (int) -> int contract. Adds tests for shutdown-only failure and combined train+shutdown failure.
3ffe893 to
9552e5a
Compare
| return installables, installer | ||
|
|
||
|
|
||
| @runtime_checkable |
There was a problem hiding this comment.
let's move this code into base_agent.py. handlers.py is already too long
as for the tests against _run_custom_training_loop: I'm starting to make the tests folder structure replicate the main code structure. so in this case, I'd place all the relevant tests you added into tests/configurator/test_base_agent.py
(not related to tests against handle_dse_job)
| agent = agent_class(env, agent_config) | ||
|
|
||
| if _has_custom_training_loop(agent): | ||
| err |= _run_custom_training_loop(agent, agent_type) |
There was a problem hiding this comment.
shouldn't we exit (immediate return err) if err is greater than zero? The existing code above doesn't really treat the err well but maybe it's the time to start doing so :D
…mutator Previously, ``test_run.step`` had no clear owner: the dispatcher set it from outside, the adapter rewound it on ``reset()``, and other callers wrote to it ad hoc. In RLlib custom-loop runs this collapsed every trial onto ``step=1``, overwriting ``trajectory.csv`` and ``env.csv`` rows. Centralize the invariant: ``TestRun.increment_step()`` is the single named mutator, and ``CloudAIGymEnv.step()`` is its only caller. One ``env.step()`` call advances the trial counter by exactly one — independent of any episode or dispatcher concept above the gym env. Contract tests in ``TestIncrementStep`` cover the API; ``test_cloudaigym`` asserts ``step`` is advanced *before* ``output_path`` and trajectory rows are computed, so cached and live trials both record the post-increment value.
…orphism Earlier commits in this PR introduced ``HAS_CUSTOM_TRAINING_LOOP`` + a ``CustomTrainingLoopAgent`` Protocol + a TypeGuard helper + an ``if/else`` in ``handle_dse_job`` to switch between the cloudai step loop and an agent-owned ``train()`` loop. That is a type-tagged conditional dispatching on agent identity — the textbook signal to replace conditional with polymorphism (Fowler). Add a default ``BaseAgent.run() -> int`` that holds the step-loop body (``select_action`` / ``env.step`` / ``update_policy`` per trial). Agents that drive their own training (RLlib, etc.) override ``run()`` to delegate to whatever loop they own and return a process-style exit code. ``handle_dse_job`` collapses to ``err |= agent.run()`` — one line, no branching, no Protocol vocabulary. The handler no longer knows that "custom training loops" exist as a category; that's an agent implementation detail. Net: -89 lines on cloudai. Surface area shrinks (no Protocol, no TypeGuard, no flag). ``test_handlers`` replaces the 5 helper unit tests + 2 dispatcher integration tests with 2 polymorphic tests asserting ``handle_dse_job`` delegates to ``agent.run()`` and propagates its return code.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/cloudai/cli/handlers.py`:
- Line 160: handle_dse_job currently calls agent.run() directly which can throw
and abort the whole DSE batch; wrap the call to agent.run() in a try/except that
catches any exception, sets/updates the existing err variable to a non-zero
return code (e.g., err |= 1 or err = 1) and continues processing remaining runs
instead of re-raising; ensure the catch logs the exception (including agent
identity) for debugging and references the agent.run() call and err variable so
the change is applied in the handle_dse_job function.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 843e393f-7ffd-4c66-ba6b-cd08d22ab804
📒 Files selected for processing (7)
src/cloudai/_core/test_scenario.pysrc/cloudai/cli/handlers.pysrc/cloudai/configurator/base_agent.pysrc/cloudai/configurator/cloudai_gym.pytests/test_cloudaigym.pytests/test_handlers.pytests/test_test_scenario.py
| feedback = {"trial_index": step, "value": reward} | ||
| agent.update_policy(feedback) | ||
| logging.info(f"Step {step}: Observation: {[round(obs, 4) for obs in observation]}, Reward: {reward:.4f}") | ||
| err |= agent.run() |
There was a problem hiding this comment.
Guard agent.run() failures to preserve DSE batch execution.
Line 160 can raise out of handle_dse_job if an agent throws, which aborts remaining test runs and bypasses intended err accumulation. Convert exceptions into a non-zero rc and continue.
Suggested patch
- err |= agent.run()
+ try:
+ err |= agent.run()
+ except Exception:
+ logging.exception("Agent %s failed during run().", agent_type)
+ err |= 1
+ continue🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/cloudai/cli/handlers.py` at line 160, handle_dse_job currently calls
agent.run() directly which can throw and abort the whole DSE batch; wrap the
call to agent.run() in a try/except that catches any exception, sets/updates the
existing err variable to a non-zero return code (e.g., err |= 1 or err = 1) and
continues processing remaining runs instead of re-raising; ensure the catch logs
the exception (including agent identity) for debugging and references the
agent.run() call and err variable so the change is applied in the handle_dse_job
function.
|
Downstream cloudaix PR that consumes this contract: https://github.com/Mellanox/cloudaix/pull/589 It ports Together the two PRs are the architectural fix for the trial-counter collapse: cloudai owns the trial-counter contract ( |
Summary
Replaces the original
HAS_CUSTOM_TRAINING_LOOPflag + Protocol + TypeGuard + if/else dispatch with polymorphism onBaseAgent.run(), and centralizes ownership ofTestRun.stepso the trial counter has a single mutator.Why this changed shape
The original design (flag + Protocol + helper + branched dispatcher) is a textbook type-tagged conditional. Live cluster runs surfaced two related bugs:
test_run.stephad no clear owner. The handler set it from outside, the gymnasium adapter rewound it onreset(), the reporter wrote it during recovery, the batch-unroll runner cloned it. With RLlib's contextual-bandit setup (agent_steps=1, frequentreset()s), the adapter collapsed every trial ontostep=1, overwritingtrajectory.csvandenv.csvrows.Dispatch was branched on agent identity. "Use polymorphism instead of switching on type" (Fowler) is the canonical fix.
This PR addresses both at once:
TestRun.increment_step()becomes the single mutator (CloudAIGymEnv.step()its only caller), andBaseAgent.run()becomes the single dispatch entry point that agents override.Commits
feat(core): TestRun owns trial counter; CloudAIGymEnv.step() is sole mutator— addsTestRun.increment_step(), calls it at the top ofCloudAIGymEnv.step(), removes the competingenv.test_run.step = stepwriter fromhandle_dse_job. NewTestIncrementSteptest class covers the API;test_cloudaigymasserts step is advanced beforeoutput_path/ trajectory rows are computed.refactor(cli): collapse custom-loop dispatch to BaseAgent.run() polymorphism— addsBaseAgent.run() -> intdefault holding the step-loop body;handle_dse_jobcollapses toerr |= agent.run(). DeletesHAS_CUSTOM_TRAINING_LOOP,CustomTrainingLoopAgentProtocol,_has_custom_training_loopTypeGuard,_run_custom_training_loophelper. Net -89 lines on cloudai; surface area shrinks (no Protocol, no TypeGuard, no flag).Agents that drive their own training (RLlib-based PPO/DQN in cloudaix) override
run()to call their ownalgo.train()and return a process-style rc. The handler doesn't know that category exists.Why not a separate "step ownership" PR?
The step-ownership bug only surfaces under the custom-training-loop path (RLlib's
reset()pattern is what collapses the trial counter). Fixing it as a precondition of the dispatch refactor keeps the two changes co-located and ensures the polymorphic dispatch lands on a step-counter contract that actually holds.Related
test_run.stepno longer mutated by the adapter, the adapter is a pure pass-through and belongs in cloudaix (the only consumer) rather than upstream.Test plan
pytest tests— 1444 passedruff check src tests— cleantest_gymnasium_adapter_contract,test_env_csv_adapter_contract) confirmtr.stepmonotonicity end-to-end through the adapter and the user-visibleenv.csvartifact.env.csv/trajectory.csvhave unique monotonic step values, no row overwrites.