Add Harbor/Terminal-Bench integration via Daytona and local Docker #15

Open
doxav wants to merge 1 commit into AgentOpt:main from doxav:harbor

@doxav
Collaborator

@doxav doxav commented May 8, 2026

This PR adds support for Terminal-Bench tasks via the Harbor CLI. Because Terminal-Bench evaluates an agent's ability to execute shell commands and modify system state, these tasks require strict sandboxing so that agents cannot inadvertently damage the host system or access sensitive local files.

When to use the Daytona service?
Running these benchmarks normally requires a local Docker daemon (or nested virtualization), which cloud notebook environments such as Google Colab do not provide.
We integrated Daytona (harbor_env: daytona) as a remote execution backend because:

=> No local Docker: Sandboxed environments are offloaded securely, with no need for a local Docker daemon.
=> Security & Isolation: Agents perform potentially destructive system interventions safely away from the orchestrator.
=> Consistency: Runs are reproducible regardless of where the trace_bench script actually executes.
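
The choice between the two backends can be sketched as a small helper. This is only an illustration, not code from this PR: the function names are hypothetical, and the `DAYTONA_API_KEY` environment variable name is an assumption (the PR only says that a Daytona API key is needed). The sketch prefers local Docker when a daemon is reachable and falls back to Daytona otherwise.

```python
import shutil
import subprocess
from typing import Optional


def docker_daemon_reachable(timeout: float = 5.0) -> bool:
    """Return True if a `docker` CLI is on PATH and `docker info` succeeds."""
    if shutil.which("docker") is None:
        return False
    try:
        result = subprocess.run(
            ["docker", "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def choose_harbor_env(docker_ok: bool, daytona_key: Optional[str]) -> str:
    """Pick a value for `harbor_env` (hypothetical helper, not part of Harbor).

    Prefers local Docker; falls back to Daytona when a key is available.
    """
    if docker_ok:
        return "docker"
    if daytona_key:
        return "daytona"
    raise RuntimeError(
        "No sandbox backend available: start a local Docker daemon "
        "or provide a Daytona API key."
    )


# Possible usage (the environment variable name is an assumption):
#   import os
#   env = choose_harbor_env(docker_daemon_reachable(),
#                           os.environ.get("DAYTONA_API_KEY"))
```

On Google Colab, `docker_daemon_reachable()` would be expected to return `False`, so the helper would select `daytona` whenever a key is set.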

How to run locally (without Daytona)?
If you are running trace_bench locally on a machine with Docker installed, you do not need a Daytona API key or service. You can simply switch the Harbor environment to use the local Docker daemon.

To do this, configure harbor_env: docker instead of daytona in your evaluation configuration:

tasks:
  - id: terminal_bench:regex-log
    eval_kwargs:
      harbor_dataset: terminal-bench@2.0
      # Use "docker" for local execution, or "daytona" for cloud execution
      harbor_env: docker
      harbor_model: openrouter/openai/gpt-4o-mini

When local execution is triggered, Harbor will build and spin up isolated Docker containers directly on your machine.
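
When several Terminal-Bench tasks share the same settings, the entry above could also be generated programmatically. The following is a minimal sketch under stated assumptions: `make_terminal_bench_task` is a hypothetical helper, not part of trace_bench or Harbor, and the only keys and values it uses are the ones shown in the configuration above.

```python
import json

# The two backends named in this PR.
ALLOWED_HARBOR_ENVS = {"docker", "daytona"}


def make_terminal_bench_task(
    task_name: str,
    harbor_env: str,
    harbor_dataset: str = "terminal-bench@2.0",
    harbor_model: str = "openrouter/openai/gpt-4o-mini",
) -> dict:
    """Build one task entry mirroring the YAML structure shown above."""
    if harbor_env not in ALLOWED_HARBOR_ENVS:
        raise ValueError(
            f"harbor_env must be one of {sorted(ALLOWED_HARBOR_ENVS)}, "
            f"got {harbor_env!r}"
        )
    return {
        "id": f"terminal_bench:{task_name}",
        "eval_kwargs": {
            "harbor_dataset": harbor_dataset,
            "harbor_env": harbor_env,
            "harbor_model": harbor_model,
        },
    }


# Same task as in the example configuration, using the local Docker backend.
config = {"tasks": [make_terminal_bench_task("regex-log", "docker")]}
print(json.dumps(config, indent=2))
```

Validating `harbor_env` up front catches a mistyped backend name before any sandbox is started. Switching to cloud execution only requires passing `"daytona"` instead of `"docker"`.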

@allenanie
Member

This is so amazing -- I've been wondering whether we should process all our data into Harbor format!

@allenanie
Member

Also I'm happy to help integrate modal/daytona completely into all tasks
