In practice, across real eval runs using claude and codex, the "hybrid" run mode has proven the most convenient, especially now that the new runbook artifact provides a one-stop source of details about the mechanics of the run. Additionally, the main reason for starting with an "interactive" mode for claude was the upcoming pricing difference between claude -p and interactive claude sessions. That proposed pricing change has since been reversed, and it seems like claude -p will continue to function with rolling session usage for the foreseeable future.
I think this provides the opportunity to reassess whether there's value in supporting multiple run modes right now. Multiple modes present a technical challenge and add complexity to the code base, and there might now be a clear "best" option. "Headless" mode seems possible to tacitly support, since mechanically it should be the same as hybrid mode, but I don't really think there's much value in a human driving the process themselves, especially if dropping any claims of support for this mode simplifies the code further.
This leads into the ultimate goal of this exploration, what I'm calling "instant magic". If hybrid becomes the only run mode, can we remove most or all of our harness compatibility requirements? A modern harness can likely be trusted to understand its own capabilities, or to know the best way to find that information online. The only compatibility hybrid mode should require is knowing how to map harness capabilities (starting in headless mode, like claude -p) to eval-magic functionality (dispatching an eval agent). We could even maintain documentation around known mappings like this, which would be a much simpler sort of compatibility maintenance.
This would mean that any harness could be compatible with eval-magic, as long as it supports certain features natively. We can document what's required, and even provide a "minimum operating list" of the bare minimum requirements for running an eval, along with progressive enhancement functionality. We'd then no longer be forever playing catch-up with harness compatibility, and would instead provide a general interface for working with our program. We'd also get a big code maintenance win, hopefully avoiding references to any one harness and no longer hardcoding the details of how a harness works in our code.
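To make the "minimum operating list" idea concrete, here is a rough sketch of what a harness-agnostic compatibility check could look like. Everything in it is hypothetical and for illustration only: the capability names (headless_start, plan_mode, session_resume), the HarnessProfile type, and check_compatibility are not existing eval-magic code. The only detail taken from this ticket is that claude starts headless via claude -p.

```python
from dataclasses import dataclass, field

# Hypothetical capability names; this ticket does not define the real list.
REQUIRED_CAPABILITIES = frozenset({"headless_start"})
OPTIONAL_CAPABILITIES = frozenset({"plan_mode", "session_resume"})


@dataclass
class HarnessProfile:
    """A documented mapping from capability name to how a harness exposes it."""

    name: str
    # capability name -> free-form note on how the harness provides it
    capabilities: dict[str, str] = field(default_factory=dict)


def check_compatibility(profile: HarnessProfile) -> dict:
    """Compare a harness profile against the minimum operating list."""
    supported = set(profile.capabilities)
    missing = sorted(REQUIRED_CAPABILITIES - supported)
    enhancements = sorted(OPTIONAL_CAPABILITIES & supported)
    return {
        "compatible": not missing,      # every required capability is present
        "missing": missing,             # required capabilities the harness lacks
        "enhancements": enhancements,   # optional extras we could light up
    }


# Known mapping from this ticket: claude starts headless with `claude -p`.
claude = HarnessProfile("claude", {"headless_start": "claude -p"})

# A made-up harness that only exposes an optional capability.
example = HarnessProfile("example-harness", {"plan_mode": "built-in"})

if __name__ == "__main__":
    print(check_compatibility(claude))
    print(check_compatibility(example))
```

The point of the sketch is that the code only knows about capability names, never about a specific harness. Known mappings like the claude one could live in documentation or data files rather than in code paths.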
Plan mode is also technically a compatibility concern, but what we actually provide is simulated plan mode instructions based on what the harness is allowed to expose about its plan mode. In the end, these instructions are all basically the same. We may as well just provide our own sample plan mode, since a simulation can't be exactly accurate anyway.
The goal of this ticket will be to explore the feasibility of this change. There may be gotchas I haven't considered, or parts of the code base I haven't thought about that would be impacted. We should especially make sure we understand whether any true, difficult-to-replace functionality would be lost. I feel the potential upsides are very high, so it's worth thinking about seriously, even if there are trade-offs.