fix(agent): use standard shell code fences as the action format #357

Merged
larstalian merged 1 commit into main from fix/standard-shell-action-fences
Jun 26, 2026

@larstalian

Collaborator

What

Shell actions in the agent loop now use the standard markdown code fence a model is trained to emit — ```bash / ```sh / ```shell / ```console / ```zsh — instead of the bespoke ```run_shell token.

Why

parse_action only recognized ```run_shell / ```finish. Any other fence failed the match and fell through to the "no recognized block → finish" path, which ends the episode and discards the action (including a submit). Instruct/cold models reach for ```bash constantly — it has overwhelming pretraining mass, while ```run_shell appears essentially only in our own prompt — so a competent rollout could die at a single step with its training signal lost.
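
To make the failure mode concrete, here is a minimal sketch of how a strict parser of this kind behaves. It is not the project's code: `parse_action_strict`, `BLOCK_RE`, and the `(kind, payload)` return shape are hypothetical names chosen for illustration, and the fence marker is built at runtime only so the example nests cleanly inside markdown.

```python
import re

FENCE = "`" * 3  # three backticks, built at runtime for clean nesting

# Hypothetical pattern: an opening fence with a tag, a body, a closing fence.
BLOCK_RE = re.compile(r"`{3}([A-Za-z_]+)[ \t]*\n(.*?)`{3}", re.DOTALL)


def parse_action_strict(reply: str) -> tuple[str, str]:
    """Illustrative model of the old behavior: only run_shell / finish count."""
    for tag, body in BLOCK_RE.findall(reply):
        if tag in ("run_shell", "finish"):
            return tag, body.strip()
    # No recognized block: the episode ends and the command is discarded.
    return "finish", ""


# A reasonable reply that uses the fence a model naturally reaches for.
reply = f"I'll submit now.\n{FENCE}bash\n./submit.sh\n{FENCE}"
action = parse_action_strict(reply)
```

Under these assumptions the submit command never runs: the bash block is skipped and the reply is treated as a finish with an empty payload.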

open-range's design is one shell primitive + a finish (no per-verb tool menu), so there is nothing for a custom fence token to disambiguate; it is pure friction against the model's prior. This also matches the CodeAct finding — executable code as the action space outperforms bespoke/JSON tool encodings (Wang et al. 2024).

Changes

  • parse_action accepts bash|sh|shell|console|zsh as a shell command and finish as the terminal block.
  • The last recognized block wins, so an illustrative snippet earlier in a reply is not executed in place of the action the model actually settled on.
  • The internal action identifier stays run_shell; only the wire format the model emits changes. No back-compat alias for the old ```run_shell fence (the only emitters were our own test fixtures).
  • finish stays a sentinel — there is no standard markdown fence for "done".
  • Tests updated, plus a new case covering the standard fences and last-block selection.
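
The changes listed above could be sketched roughly as follows. This is an illustration under stated assumptions, not the merged implementation: the `Action` type, `SHELL_TAGS`, and the regular expression are hypothetical, and the fall-through to an empty finish when nothing is recognized is assumed to be unchanged by this PR.

```python
import re
from typing import NamedTuple

FENCE = "`" * 3  # three backticks, built at runtime for clean nesting

# Fence tags treated as shell commands, per the PR description.
SHELL_TAGS = {"bash", "sh", "shell", "console", "zsh"}

# Hypothetical pattern: an opening fence with a tag, a body, a closing fence.
BLOCK_RE = re.compile(r"`{3}([A-Za-z_]+)[ \t]*\n(.*?)`{3}", re.DOTALL)


class Action(NamedTuple):
    kind: str     # internal identifier: "run_shell" or "finish"
    payload: str  # command text, or the finish message


def parse_action(reply: str) -> Action:
    """Return the last recognized block in the reply as an Action."""
    recognized = [
        (tag, body.strip())
        for tag, body in BLOCK_RE.findall(reply)
        if tag in SHELL_TAGS or tag == "finish"
    ]
    if not recognized:
        # Assumption: replies with no recognized block still end the episode.
        return Action("finish", "")
    tag, body = recognized[-1]  # the last recognized block wins
    # Every shell fence maps to the same internal identifier.
    return Action("finish" if tag == "finish" else "run_shell", body)


reply = (
    f"For example:\n{FENCE}sh\necho demo\n{FENCE}\n"
    f"Actually I'll run:\n{FENCE}bash\nls -la\n{FENCE}"
)
action = parse_action(reply)
```

In this sketch the earlier `echo demo` snippet is ignored in favor of the final `ls -la` block, and a reply that still uses the old run_shell fence is not recognized at all, which matches the stated decision not to keep a back-compat alias.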

Test

pytest tests/test_agent_harness.py tests/test_rllm_shim.py: 13 passed, 5 skipped (rllm not installed locally). ruff check, ruff format --check, and mypy are clean on the changed files.

🤖 Generated with Claude Code

Models are trained to emit shell commands as standard markdown code fences
(```bash / ```sh / ```shell / ```console / ```zsh), not a bespoke ```run_shell
token. The strict run_shell-only parser silently coerced a ```bash reply into
the "no recognized block -> finish" path, ending the episode and discarding the
action (including any submit) — so a competent rollout could die at one step
with its training signal lost.

Adopt the standard code fence as the action format (CodeAct-style: executable
code is the action), drop the bespoke run_shell fence, and parse the *last*
recognized block so an illustrative snippet earlier in a reply is not executed
in place of the action the model actually settled on. The internal action
identifier stays "run_shell"; only the wire format the model emits changes.
finish stays a sentinel — there is no standard fence for "done".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@larstalian larstalian merged commit 29f1081 into main Jun 26, 2026
2 checks passed
@larstalian larstalian deleted the fix/standard-shell-action-fences branch June 26, 2026 01:38
@github-actions github-actions Bot locked and limited conversation to collaborators Jun 26, 2026