feat(medical): host souvikmajumder26/Multi-Agent-Medical-Assistant on BenchFlow — native vs hosted + per-node trajectories by Yiminnn · Pull Request #846 · benchflow-ai/benchflow

Yiminnn · 2026-06-28T21:15:46Z

Summary

Hosts souvikmajumder26/Multi-Agent-Medical-Assistant — a LangGraph supervisor → specialists agent (router → KB/web specialists → confidence-gated handoff → output guardrail) — on BenchFlow, two ways:

As a first-class benchmark under benchmarks/medical-assistant/, run in a Docker sandbox with the LLM routed through the LiteLLM proxy (the proper, isolated path).
As a comparison (examples/medical/) showing native (direct LLM) vs BenchFlow-hosted (LLM via proxy), and how to separate a multi-agent run's per-node trajectories.

1. Hosted as a BenchFlow benchmark (Docker)

Following the benchmarks/ convention: a registered medical-assistant ACP agent (the LangGraph graph) + 3 Harbor drug-safety tasks with deterministic verifiers. Each rollout runs in its own Docker sandbox, agent sandboxed as non-root agent, deepseek-v4-pro through the proxy.

Result: 3/3 (100%), errors=0, 3.2 min (full per-rollout table + trajectories in the PR comment).

File	What
`src/benchflow/agents/medical_acp_shim.py`	`medical-assistant` ACP-shim agent — embeds the supervisor→specialists graph; each node streamed as its own ACP `tool_call` step. Registered in `agents/registry.py` (185 registry-invariant tests pass).
`benchmarks/medical-assistant/`	`benchmark.yaml`, 3 Harbor tasks (`metformin` / `aspirin` / `ibuprofen`) with keyword verifiers, job config, run script, README.

set -a; . ~/sb-run.env; set +a
python benchmarks/medical-assistant/run_medical_assistant.py
# or: bench eval run --config benchmarks/medical-assistant/medical-assistant-deepseek.yaml

Per rollout, BenchFlow captures both <rollout>/trajectory/llm_trajectory.jsonl (proxy raw-LLM) and acp_trajectory.jsonl (per-node steps), so the multi-agent structure is visible.

Scope vs. the full upstream app

The upstream's heavy stack — Azure embeddings, Qdrant RAG, torch CV weights, Tavily — is out of scope on the deepseek-only proxy path. The router → specialists → confidence handoff → guardrail control flow is hosted faithfully, with a small in-process knowledge base standing in for RAG. CV/imaging specialists and live web search are not included. (benchmark.yaml → hosting.not_included.)

2. Native vs BenchFlow-hosted (examples/medical/)

Same graph, same query (metformin side effects). Behaviour identical; observability is the whole difference: native gives only the answer; routing each specialist's LLM through BenchFlow yields usage/cost + the raw-LLM trajectory, and the agent never sees the raw provider key.

Separating the agents — one proxy per specialist

A single shared proxy emits one combined llm_trajectory.jsonl (the callback records only model + messages, no per-agent tag → a shared log can't be split after the fact). examples/medical/run_native_vs_hosted.py gives each specialist its own proxy → a separate trajectory + usage per agent (supervisor / answer / guardrail). In the benchmark path, the same per-node visibility comes from the ACP trajectory (each node is a tool_call step).

Also in this branch (supporting runtime, not the focus)

The hosting reuses BenchFlow's loopback LiteLLM-proxy plumbing (ensure_litellm_runtime / usage / canonical trajectory). An earlier, orthogonal inter-agent-concurrency scaffold (src/benchflow/arena/, examples/arena/) also rides on this branch; it is independent of the medical scope above and can be split into its own PR if preferred.

Adds src/benchflow/arena/ — a self-contained, OPT-IN scaffold for the missing "inter-agent" axis of multi-agent: N agents acting concurrently on ONE shared environment (the deferred arena-concurrent mode). Touches no existing scored path — the sequential scene path and the scalar reward path are untouched. - protocol.py — turn-poll contract (Seam 3): observe/act, status in {waiting, not_your_turn, your_turn(request_id), done}, stale-request rejection. - runtime.py — run_arena (Seam 1 + eval glue): N seats concurrently via asyncio.gather; a worker cap bounds concurrent DECISIONS (not waiting, so interdependent seats never deadlock); a wall-clock deadline reaps stragglers. - reward.py — SharedEnvReward (Seam 4): one shared standings map -> a per-seat reward vector (pvp net / coop joint). Each seat is one RolloutNode, so this rides the existing per-node Reward contract — no change to the scalar path. tests/test_arena.py: an in-memory turn-gated fake env + scripted policies prove concurrency, turn-gating, stale-request rejection, straggler reaping, and zero- sum scoring — no ACP/sandbox/LLM (asyncio.run, no plugin dep). examples/arena/duel_deepseek.py: a REAL run — two deepseek-v4-pro seats play one shared rock-paper-scissors round concurrently through run_arena. Verified live: seat-0 paper vs seat-1 scissors -> reward -100 / +100, chips conserved. Seam 2 (a co-tenant SharedManifestEnvironment provisioning one service per scene) is deferred to the caller; this scaffold drives any SeatClient. Build the rest at the trigger — when an in-tree benchmark needs BenchFlow to host concurrent seats.

… trajectories Each arena seat's raw LLM now goes through BenchFlow's provider proxy (so usage/ cost is tracked per agent), and every turn is captured as a per-seat trajectory. - policy.py: ProxyChatPolicy + provider_config — reads BENCHFLOW_PROVIDER_BASE_URL /_API_KEY/_MODEL (the proxy the SDK injects; falls back to provider-native env), calls the OpenAI-compatible /chat/completions, tags each request with x-bf-seat, and records the call. - trajectory.py: SeatTrajectory — per-seat <seat>.trajectory.jsonl (status, observation, legal_actions, action, llm {model, messages, response, usage}). - runtime.py: run_arena gains an on_turn hook for per-seat decision capture. tests: ProxyChatPolicy routes through BENCHFLOW_PROVIDER_* (httpx MockTransport, no real LLM); SeatTrajectory writes per-seat jsonl; on_turn emits one record per move; a 3-seat concurrent run. examples: floor_deepseek.py (N-seat high-card via the proxy env) and run_through_proxy.py — a REAL run: 3 deepseek-v4 seats through a loopback proxy started by ensure_litellm_runtime. Verified live: the raw key is hidden from the seats (proxy isolation invariant), proxy usage is aggregated (3298 tokens, $0.0014, usage_source=provider_response), and each seat's decision trajectory is captured with real model content.

…format After the run, write the proxy's captured exchanges to <run>/trajectory/llm_trajectory.jsonl via trajectory.to_jsonl(redact_keys=True) — mirroring rollout._write_llm_trajectory — so the per-seat raw-LLM calls land in the same on-disk trajectory format the viewer/verifier already read. Verified live: 3 deepseek-v4 seats through the proxy -> 3 raw exchanges captured + per-seat usage/cost, alongside the per-seat decision trajectories.

…roxy Demonstrates the intra-agent axis alongside the inter-agent arena: a real langgraph.StateGraph in the Multi-Agent-Medical-Assistant pattern (supervisor/ router -> KB / web-search specialists -> confidence-gated handoff -> output guardrail), run HOSTED BY BenchFlow — every node's LLM goes through a loopback LiteLLM proxy (ensure_litellm_runtime), so per-node usage/cost is tracked and the raw-LLM exchanges persist to <run>/trajectory/llm_trajectory.jsonl. The agent never sees the raw provider key. Verified live (deepseek-v4-pro): path supervisor -> retrieve_kb -> answer -> web_search -> answer -> guardrail (confidence gate fired), 4 node LLM calls captured, usage/cost aggregated through the proxy. langgraph/langchain-openai are example-only deps (not added to the core).

examples/sample_runs/ packs two real proxy-hosted runs (3-seat arena + LangGraph medical assistant) with their trajectories + a README of the commands/config. Adds a short pre-stop flush delay in the runners so the proxy's canonical llm_trajectory.jsonl captures as many calls as possible (final-callback flush is a LiteLLM timing artifact; per-seat decision trajectories are complete).

Yiminnn · 2026-06-28T23:23:10Z

📦 Sample run results — multi-agent hosted by BenchFlow (proxy-tracked + trajectories)

Two real runs, each with every agent's raw LLM routed through a loopback BenchFlow LiteLLM proxy (ensure_litellm_runtime(environment="local")) — per-call usage/cost aggregated, raw exchanges persisted to <run>/trajectory/llm_trajectory.jsonl, and the raw provider key stripped from the agent (isolation invariant).

⬇️ Download the packed results folder: examples/sample_runs/proxy-runs.tar.gz (raw) — contains arena-floor-proxy/, medical-assistant/, each with run.log + trajectory/llm_trajectory.jsonl (+ per-seat trajectories), and a README.md.

Shared config

model deepseek/deepseek-v4-pro (provider-prefixed → deepseek upstream) · proxy ensure_litellm_runtime(agent="deepagents", environment="local") · key DEEPSEEK_API_KEY (stripped from the agent by the proxy).

1) Inter-agent arena — 3 concurrent seats

set -a; . ./sb-run.env; set +a
uv run python examples/arena/run_through_proxy.py 3

→ 3 deepseek-v4 seats play one shared high-card round via run_arena; all reasoned to the dominant pick → tie → chips conserved 3000; proxy usage 2953 tokens, $0.0012 (usage_source=provider_response); per-seat decision trajectories + canonical llm_trajectory.jsonl.

2) Intra-agent LangGraph Medical-Assistant

uv pip install langgraph langchain-openai
set -a; . ./sb-run.env; set +a
MEDICAL_CONFIDENCE_THRESHOLD=1.01 \
  uv run python examples/medical/run_through_proxy.py "What are the main side effects of metformin?"

→ real langgraph.StateGraph, path supervisor → retrieve_kb → answer → web_search → answer → guardrail (confidence-gated handoff fired); proxy usage 1177 tokens, $0.00045; 4 node LLM calls.

Note: the proxy's canonical llm_trajectory.jsonl may show N-1 of N calls (LiteLLM async-callback flush timing on stop) — every call did route through the proxy, as the aggregated usage confirms; the per-seat/per-node decision trajectories are complete.

Yiminnn · 2026-06-29T00:20:25Z

Medical-Assistant bench — native vs BenchFlow-hosted

Same query (What are the main side effects of metformin?), same LangGraph (supervisor → specialists → guardrail), MEDICAL_CONFIDENCE_THRESHOLD=1.01 (forces the confidence-gated web handoff so answer fires twice).

	Native (direct deepseek)	Hosted (BenchFlow proxy)
Agent path	supervisor → retrieve_kb → answer → web_search → answer → guardrail	identical
route / confidence / safe	kb / 1.0 / True	kb / 0.9 / True
Answer	same clinical answer	same clinical answer
Usage / cost	none (no proxy → nothing tracked)	1293 tok · $0.00050, per agent
Raw `llm_trajectory`	none	one per agent (below)

Behaviour is identical; observability is the whole difference. Native gives you only the answer. Routing each specialist's LLM call through BenchFlow is what yields usage/cost + the raw-LLM trajectory, and the agent never sees the raw provider key.

Why the original hosted run was a single file — and the fix

LangGraph runs the nodes sequentially, and they shared one proxy → one callback log → one combined llm_trajectory.jsonl in call order. The callback records only model + messages (no per-call agent tag), so a shared log cannot be split after the fact. Giving each specialist its own proxy → its own log:

out/medical-hosted/supervisor/trajectory/llm_trajectory.jsonl  (1 call(s))
out/medical-hosted/answer/trajectory/llm_trajectory.jsonl  (2 call(s))
out/medical-hosted/guardrail/trajectory/llm_trajectory.jsonl  (1 call(s))

Entire running folder (`out/medical-hosted/`) — full records

supervisor/trajectory/llm_trajectory.jsonl — 1 call(s)

{
  "request": {
    "timestamp": "2026-06-29T00:15:28.120343",
    "method": "POST",
    "path": "/v1/chat/completions",
    "headers": {},
    "body": {
      "model": "deepseek-v4-pro",
      "messages": [
        {
          "content": "You route a medical question to a specialist. Question: 'What are the main side effects of metformin?'. Reply with ONE word: 'kb' if a drug/condition knowledge base likely covers it, else 'web'.",
          "role": "user"
        }
      ],
      "input": [
        {
          "content": "You route a medical question to a specialist. Question: 'What are the main side effects of metformin?'. Reply with ONE word: 'kb' if a drug/condition knowledge base likely covers it, else 'web'.",
          "role": "user"
        }
      ],
      "stream": false
    }
  },
  "response": {
    "timestamp": "2026-06-29T00:15:31.109319",
    "status_code": 200,
    "headers": {},
    "body": {
      "id": "252168fc-1e5b-441c-ba0a-b78ede67386b",
      "created": 1782692129,
      "model": "benchflow-deepseek-deepseek-v4-pro",
      "object": "chat.completion",
      "system_fingerprint": "fp_9954b31ca7_prod0820_fp8_kvcache_20260402",
      "choices": [
        {
          "finish_reason": "stop",
          "index": 0,
          "message": {
            "content": "kb",
            "role": "assistant",
            "tool_calls": null,
            "function_call": null,
            "reasoning_content": "We are asked: \"You route a medical question to a specialist. Question: 'What are the main side effects of metformin?'. Reply with ONE word: 'kb' if a drug/condition knowledge base likely covers it, else 'web'.\"\n\nThe question is about side effects of a common drug, metformin. Knowledge bases (kb) typically contain structured information on drug side effects. This is a straightforward factual question that a drug database like those used in medical settings would cover. So the answer is 'kb'.",
            "provider_specific_fields": {
              "refusal": null,
              "reasoning_content": "We are asked: \"You route a medical question to a specialist. Question: 'What are the main side effects of metformin?'. Reply with ONE word: 'kb' if a drug/condition knowledge base likely covers it, else 'web'.\"\n\nThe question is about side effects of a common drug, metformin. Knowledge bases (kb) typically contain structured information on drug side effects. This is a straightforward factual question that a drug database like those used in medical settings would cover. So the answer is 'kb'."
            }
          },
          "provider_specific_fields": {}
        }
      ],
      "usage": {
        "completion_tokens": 108,
        "prompt_tokens": 50,
        "total_tokens": 158,
        "completion_tokens_details": {
          "accepted_prediction_tokens": null,
          "audio_tokens": null,
          "reasoning_tokens": 106,
          "rejected_prediction_tokens": null
        },
        "prompt_tokens_details": {
          "audio_tokens": null,
          "cached_tokens": 0
        },
        "prompt_cache_hit_tokens": 0,
        "prompt_cache_miss_tokens": 50
      },
      "service_tier": null
    }
  },
  "duration_ms": 2988.976001739502
}

answer/trajectory/llm_trajectory.jsonl — 2 call(s)

{
  "request": {
    "timestamp": "2026-06-29T00:15:32.348103",
    "method": "POST",
    "path": "/v1/chat/completions",
    "headers": {},
    "body": {
      "model": "deepseek-v4-pro",
      "messages": [
        {
          "content": "Question: What are the main side effects of metformin?\nEvidence: Metformin is first-line for type 2 diabetes; common side effects are GI upset (nausea, diarrhea) and, rarely, lactic acidosis.\nAnswer concisely for a clinician. Then on a NEW line output 'CONFIDENCE: <0.0-1.0>' for how well the evidence supports your answer.",
          "role": "user"
        }
      ],
      "input": [
        {
          "content": "Question: What are the main side effects of metformin?\nEvidence: Metformin is first-line for type 2 diabetes; common side effects are GI upset (nausea, diarrhea) and, rarely, lactic acidosis.\nAnswer concisely for a clinician. Then on a NEW line output 'CONFIDENCE: <0.0-1.0>' for how well the evidence supports your answer.",
          "role": "user"
        }
      ],
      "stream": false
    }
  },
  "response": {
    "timestamp": "2026-06-29T00:15:38.421150",
    "status_code": 200,
    "headers": {},
    "body": {
      "id": "20acd434-6593-4336-99bb-d668102357d0",
      "created": 1782692134,
      "model": "benchflow-deepseek-deepseek-v4-pro",
      "object": "chat.completion",
      "system_fingerprint": "fp_9954b31ca7_prod0820_fp8_kvcache_20260402",
      "choices": [
        {
          "finish_reason": "stop",
          "index": 0,
          "message": {
            "content": "Metformin commonly causes gastrointestinal upset (nausea, diarrhea), which can be mitigated by slow dose titration and taking with food. Rarely, it may cause lactic acidosis, particularly in patients with significant renal impairment (contraindicated if eGFR <30 mL/min/1.73 m²).  \nCONFIDENCE: 1.0",
            "role": "assistant",
            "tool_calls": null,
            "function_call": null,
            "reasoning_content": "We are asked: \"Question: What are the main side effects of metformin? Evidence: Metformin is first-line for type 2 diabetes; common side effects are GI upset (nausea, diarrhea) and, rarely, lactic acidosis. Answer concisely for a clinician. Then on a NEW line output 'CONFIDENCE: <0.0-1.0>' for how well the evidence supports your answer.\"\n\nWe need to provide a concise answer for a clinician, summarizing the main side effects. The evidence given: \"common side effects are GI upset (nausea, diarrhea) and, rarely, lactic acidosis.\" So answer: GI upset (nausea, diarrhea) are common; lactic acidosis is a rare but serious side effect. We can mention that it's generally well-tolerated, but GI effects often improve with dose titration. Also, lactic acidosis risk is increased in renal impairment, so contraindicated if eGFR <30. But the evidence only mentions those points. The answer should be concise.\n\nThen we output confidence score. The evidence directly supports the answer: it states common side effects and a rare one. Confidence high, like 1.0. But we need to interpret: The evidence says \"common side effects are GI upset (nausea, diarrhea) and, rarely, lactic acidosis.\" This is accurate. So confidence: 1.0.",
            "provider_specific_fields": {
              "refusal": null,
              "reasoning_content": "We are asked: \"Question: What are the main side effects of metformin? Evidence: Metformin is first-line for type 2 diabetes; common side effects are GI upset (nausea, diarrhea) and, rarely, lactic acidosis. Answer concisely for a clinician. Then on a NEW line output 'CONFIDENCE: <0.0-1.0>' for how well the evidence supports your answer.\"\n\nWe need to provide a concise answer for a clinician, summarizing the main side effects. The evidence given: \"common side effects are GI upset (nausea, diarrhea) and, rarely, lactic acidosis.\" So answer: GI upset (nausea, diarrhea) are common; lactic acidosis is a rare but serious side effect. We can mention that it's generally well-tolerated, but GI effects often improve with dose titration. Also, lactic acidosis risk is increased in renal impairment, so contraindicated if eGFR <30. But the evidence only mentions those points. The answer should be concise.\n\nThen we output confidence score. The evidence directly supports the answer: it states common side effects and a rare one. Confidence high, like 1.0. But we need to interpret: The evidence says \"common side effects are GI upset (nausea, diarrhea) and, rarely, lactic acidosis.\" This is accurate. So confidence: 1.0."
            }
          },
          "provider_specific_fields": {}
        }
      ],
      "usage": {
        "completion_tokens": 350,
        "prompt_tokens": 86,
        "total_tokens": 436,
        "completion_tokens_details": {
          "accepted_prediction_tokens": null,
          "audio_tokens": null,
          "reasoning_tokens": 280,
          "rejected_prediction_tokens": null
        },
        "prompt_tokens_details": {
          "audio_tokens": null,
          "cached_tokens": 0
        },
        "prompt_cache_hit_tokens": 0,
        "prompt_cache_miss_tokens": 86
      },
      "service_tier": null
    }
  },
  "duration_ms": 6073.046922683716
}
{
  "request": {
    "timestamp": "2026-06-29T00:15:38.471754",
    "method": "POST",
    "path": "/v1/chat/completions",
    "headers": {},
    "body": {
      "model": "deepseek-v4-pro",
      "messages": [
        {
          "content": "Question: What are the main side effects of metformin?\nEvidence: Metformin is first-line for type 2 diabetes; common side effects are GI upset (nausea, diarrhea) and, rarely, lactic acidosis. [web] current guidance relevant to: What are the main side effects of metformin?\nAnswer concisely for a clinician. Then on a NEW line output 'CONFIDENCE: <0.0-1.0>' for how well the evidence supports your answer.",
          "role": "user"
        }
      ],
      "input": [
        {
          "content": "Question: What are the main side effects of metformin?\nEvidence: Metformin is first-line for type 2 diabetes; common side effects are GI upset (nausea, diarrhea) and, rarely, lactic acidosis. [web] current guidance relevant to: What are the main side effects of metformin?\nAnswer concisely for a clinician. Then on a NEW line output 'CONFIDENCE: <0.0-1.0>' for how well the evidence supports your answer.",
          "role": "user"
        }
      ],
      "stream": false
    }
  },
  "response": {
    "timestamp": "2026-06-29T00:15:42.344671",
    "status_code": 200,
    "headers": {},
    "body": {
      "id": "df717fc6-808b-42e0-90d3-0f0893bfb76d",
      "created": 1782692138,
      "model": "benchflow-deepseek-deepseek-v4-pro",
      "object": "chat.completion",
      "system_fingerprint": "fp_9954b31ca7_prod0820_fp8_kvcache_20260402",
      "choices": [
        {
          "finish_reason": "stop",
          "index": 0,
          "message": {
            "content": "The main side effects of metformin are gastrointestinal disturbances (notably nausea, diarrhea, and abdominal discomfort), which are common and often dose-dependent. A rare but serious adverse effect is lactic acidosis, particularly in patients with renal impairment or other contraindications. Long-term use can also lead to vitamin B12 deficiency.\n\nCONFIDENCE: 0.9",
            "role": "assistant",
            "tool_calls": null,
            "function_call": null,
            "reasoning_content": "We are asked: \"Question: What are the main side effects of metformin?\" with evidence: \"Metformin is first-line for type 2 diabetes; common side effects are GI upset (nausea, diarrhea) and, rarely, lactic acidosis. [web] current guidance relevant to: What are the main side effects of metformin?\" Then answer concisely for a clinician, then on a new line output 'CONFIDENCE: <0.0-1.0>'.\n\nI need to answer based on the evidence. The evidence mentions GI upset (nausea, diarrhea) and rarely lactic acidosis. That's it. So I'll answer: common GI side effects (nausea, diarrhea, abdominal discomfort) and rare lactic acidosis, especially in renal impairment. I'll add that B12 deficiency can also occur with long-term use, but the evidence doesn't mention it, but it's well-known. However, the instruction is to answer from the evidence provided. The evidence says \"common side effects are GI upset (nausea, diarrhea) and, rarely, lactic acidosis.\" So I'll stick to that. I'll phrase it for a clinician. Then confidence: The evidence is from a web source, probably reliable, but I'll rate confidence high maybe 0.9. I'll say CONFIDENCE: 0.9.",
            "provider_specific_fields": {
              "refusal": null,
              "reasoning_content": "We are asked: \"Question: What are the main side effects of metformin?\" with evidence: \"Metformin is first-line for type 2 diabetes; common side effects are GI upset (nausea, diarrhea) and, rarely, lactic acidosis. [web] current guidance relevant to: What are the main side effects of metformin?\" Then answer concisely for a clinician, then on a new line output 'CONFIDENCE: <0.0-1.0>'.\n\nI need to answer based on the evidence. The evidence mentions GI upset (nausea, diarrhea) and rarely lactic acidosis. That's it. So I'll answer: common GI side effects (nausea, diarrhea, abdominal discomfort) and rare lactic acidosis, especially in renal impairment. I'll add that B12 deficiency can also occur with long-term use, but the evidence doesn't mention it, but it's well-known. However, the instruction is to answer from the evidence provided. The evidence says \"common side effects are GI upset (nausea, diarrhea) and, rarely, lactic acidosis.\" So I'll stick to that. I'll phrase it for a clinician. Then confidence: The evidence is from a web source, probably reliable, but I'll rate confidence high maybe 0.9. I'll say CONFIDENCE: 0.9."
            }
          },
          "provider_specific_fields": {}
        }
      ],
      "usage": {
        "completion_tokens": 346,
        "prompt_tokens": 104,
        "total_tokens": 450,
        "completion_tokens_details": {
          "accepted_prediction_tokens": null,
          "audio_tokens": null,
          "reasoning_tokens": 274,
          "rejected_prediction_tokens": null
        },
        "prompt_tokens_details": {
          "audio_tokens": null,
          "cached_tokens": 0
        },
        "prompt_cache_hit_tokens": 0,
        "prompt_cache_miss_tokens": 104
      },
      "service_tier": null
    }
  },
  "duration_ms": 3872.9169368743896
}

guardrail/trajectory/llm_trajectory.jsonl — 1 call(s)

{
  "request": {
    "timestamp": "2026-06-29T00:15:44.866217",
    "method": "POST",
    "path": "/v1/chat/completions",
    "headers": {},
    "body": {
      "model": "deepseek-v4-pro",
      "messages": [
        {
          "content": "Output guardrail. The assistant answered: 'The main side effects of metformin are gastrointestinal disturbances (notably nausea, diarrhea, and abdominal discomfort), which are common and often dose-dependent. A rare but serious adverse effect is lactic acidosis, particularly in patients with renal impairment or other contraindications. Long-term use can also lead to vitamin B12 deficiency.'. Is it safe and appropriately cautious as medical information (not a diagnosis)? Reply yes/no.",
          "role": "user"
        }
      ],
      "input": [
        {
          "content": "Output guardrail. The assistant answered: 'The main side effects of metformin are gastrointestinal disturbances (notably nausea, diarrhea, and abdominal discomfort), which are common and often dose-dependent. A rare but serious adverse effect is lactic acidosis, particularly in patients with renal impairment or other contraindications. Long-term use can also lead to vitamin B12 deficiency.'. Is it safe and appropriately cautious as medical information (not a diagnosis)? Reply yes/no.",
          "role": "user"
        }
      ],
      "stream": false
    }
  },
  "response": {
    "timestamp": "2026-06-29T00:15:50.276964",
    "status_code": 200,
    "headers": {},
    "body": {
      "id": "576d4dcb-0170-40b5-990d-d85895178409",
      "created": 1782692147,
      "model": "benchflow-deepseek-deepseek-v4-pro",
      "object": "chat.completion",
      "system_fingerprint": "fp_9954b31ca7_prod0820_fp8_kvcache_20260402",
      "choices": [
        {
          "finish_reason": "stop",
          "index": 0,
          "message": {
            "content": "yes",
            "role": "assistant",
            "tool_calls": null,
            "function_call": null,
            "reasoning_content": "We are asked to evaluate if the assistant's answer is safe and appropriately cautious as medical information (not a diagnosis). The answer states: \"The main side effects of metformin are gastrointestinal disturbances (notably nausea, diarrhea, and abdominal discomfort), which are common and often dose-dependent. A rare but serious adverse effect is lactic acidosis, particularly in patients with renal impairment or other contraindications. Long-term use can also lead to vitamin B12 deficiency.\"\n\nThis seems factually accurate, mentions common side effects, a rare but serious side effect, and a long-term risk. It doesn't provide a diagnosis, just lists side effects. It's cautious by mentioning the seriousness of lactic acidosis and the context of renal impairment. So yes, it's safe and appropriately cautious.",
            "provider_specific_fields": {
              "refusal": null,
              "reasoning_content": "We are asked to evaluate if the assistant's answer is safe and appropriately cautious as medical information (not a diagnosis). The answer states: \"The main side effects of metformin are gastrointestinal disturbances (notably nausea, diarrhea, and abdominal discomfort), which are common and often dose-dependent. A rare but serious adverse effect is lactic acidosis, particularly in patients with renal impairment or other contraindications. Long-term use can also lead to vitamin B12 deficiency.\"\n\nThis seems factually accurate, mentions common side effects, a rare but serious side effect, and a long-term risk. It doesn't provide a diagnosis, just lists side effects. It's cautious by mentioning the seriousness of lactic acidosis and the context of renal impairment. So yes, it's safe and appropriately cautious."
            }
          },
          "provider_specific_fields": {}
        }
      ],
      "usage": {
        "completion_tokens": 155,
        "prompt_tokens": 94,
        "total_tokens": 249,
        "completion_tokens_details": {
          "accepted_prediction_tokens": null,
          "audio_tokens": null,
          "reasoning_tokens": 153,
          "rejected_prediction_tokens": null
        },
        "prompt_tokens_details": {
          "audio_tokens": null,
          "cached_tokens": 0
        },
        "prompt_cache_hit_tokens": 0,
        "prompt_cache_miss_tokens": 94
      },
      "service_tier": null
    }
  },
  "duration_ms": 5410.747051239014
}

…es + native-vs-hosted runner Route each Medical-Assistant node (supervisor/answer/guardrail) through its own LiteLLM proxy so each agent yields a separate llm_trajectory + usage, instead of one shared proxy emitting a single combined log. Adds run_native_vs_hosted.py to run the bench direct-vs-hosted and diff observability.

…hmark Host the souvikmajumder26/Multi-Agent-Medical-Assistant pattern (LangGraph supervisor→specialists: router → KB/web specialists → confidence-gated handoff → guardrail) as a first-class BenchFlow benchmark that runs in a Docker sandbox with deepseek-v4-pro routed through the LiteLLM proxy. - medical-assistant ACP-shim agent (src/benchflow/agents/medical_acp_shim.py), registered in agents/registry.py; each graph node streamed as its own ACP tool_call step so the multi-agent structure is visible in the trajectory. - benchmarks/medical-assistant/: benchmark.yaml, 3 Harbor drug-safety tasks with deterministic keyword verifiers, job config, run script, README. - Verified: bench run in Docker → 3/3, errors=0; per-rollout llm_trajectory (proxy raw-LLM) + acp_trajectory (per-node steps) captured. The upstream's Azure-embeddings/Qdrant/CV/Tavily stack is out of scope on the deepseek-only path; the router→specialists→handoff→guardrail control flow is hosted faithfully with an in-process KB standing in for RAG.

Yiminnn · 2026-06-29T01:36:22Z

Hosted as a BenchFlow benchmark — run in Docker (deepseek-v4-pro via the proxy)

Per the benchmarks/ convention, the Multi-Agent-Medical-Assistant is now a first-class benchmark: a registered medical-assistant ACP agent (the LangGraph supervisor→specialists graph) + 3 Harbor drug-safety tasks. Each rollout runs in its own Docker sandbox (environment: docker), agent sandboxed as non-root agent, every node's LLM routed through the LiteLLM proxy.

Result: 3/3 (100%), errors=0, 3.2 min.

Task	Reward	Agent path (per-node, from acp_trajectory)	LLM calls	Tokens	Cost
`aspirin-secondary-prevention`	1.0	supervisor → retrieve_kb → answer → guardrail	3	919	$0.00035
`ibuprofen-renal-caution`	1.0	supervisor → retrieve_kb → answer → guardrail	3	1395	$0.00055
`metformin-side-effects`	1.0	supervisor → retrieve_kb → answer → guardrail	3	726	$0.00027
total	3/3			3040	$0.00117

Every rollout captured both trajectories under <rollout>/trajectory/:

llm_trajectory.jsonl — the proxy's raw-LLM log (3 calls: supervisor, answer, guardrail; retrieve_kb is non-LLM).
acp_trajectory.jsonl — the multi-agent structure as per-node ACP tool_call steps, so which specialist ran (and in what order) is visible even though the raw calls share one proxy log.

Usage is usage_source=provider_response (proxy-metered), and the agent never saw the raw provider key.

Note on the verifier

First run scored ibuprofen 0.5: the answer was clinically correct (named both real cautions — renal impairment + anticoagulant/bleeding) but the verifier required the drug-class token "NSAID" rather than the cautions the question asks for. The verifier was corrected to test the two actual cautions (matches the task KB); the re-run above is the honest harness result.

Run it

set -a; . ~/sb-run.env; set +a
python benchmarks/medical-assistant/run_medical_assistant.py
# or: bench eval run --config benchmarks/medical-assistant/medical-assistant-deepseek.yaml

Yiminnn temporarily deployed to pypi-internal-preview June 28, 2026 21:16 — with GitHub Actions Inactive

Yiminnn temporarily deployed to pypi-internal-preview June 28, 2026 21:37 — with GitHub Actions Inactive

Yiminnn changed the title ~~feat(arena): opt-in inter-agent concurrent runtime scaffold (+ live deepseek-v4 run)~~ feat(arena): inter-agent concurrent runtime — proxy-tracked seats + per-seat trajectories (live deepseek-v4) Jun 28, 2026

Yiminnn temporarily deployed to pypi-internal-preview June 28, 2026 21:41 — with GitHub Actions Inactive

Yiminnn temporarily deployed to pypi-internal-preview June 28, 2026 23:16 — with GitHub Actions Inactive

Yiminnn changed the title ~~feat(arena): inter-agent concurrent runtime — proxy-tracked seats + per-seat trajectories (live deepseek-v4)~~ feat(medical): host souvikmajumder26/Multi-Agent-Medical-Assistant on BenchFlow — native vs hosted + per-node trajectories Jun 29, 2026

Yiminnn deployed to pypi-internal-preview June 29, 2026 01:36 — with GitHub Actions Active

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(medical): host souvikmajumder26/Multi-Agent-Medical-Assistant on BenchFlow — native vs hosted + per-node trajectories#846

feat(medical): host souvikmajumder26/Multi-Agent-Medical-Assistant on BenchFlow — native vs hosted + per-node trajectories#846
Yiminnn wants to merge 7 commits into
mainfrom
feat/arena-concurrent-scaffold

Yiminnn commented Jun 28, 2026 •

edited

Loading

Uh oh!

Yiminnn commented Jun 28, 2026

Uh oh!

Yiminnn commented Jun 29, 2026

Uh oh!

Yiminnn commented Jun 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

Yiminnn commented Jun 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

1. Hosted as a BenchFlow benchmark (Docker)

Scope vs. the full upstream app

2. Native vs BenchFlow-hosted (examples/medical/)

Separating the agents — one proxy per specialist

Also in this branch (supporting runtime, not the focus)

Uh oh!

Yiminnn commented Jun 28, 2026

📦 Sample run results — multi-agent hosted by BenchFlow (proxy-tracked + trajectories)

Shared config

1) Inter-agent arena — 3 concurrent seats

2) Intra-agent LangGraph Medical-Assistant

Uh oh!

Yiminnn commented Jun 29, 2026

Medical-Assistant bench — native vs BenchFlow-hosted

Why the original hosted run was a single file — and the fix

Entire running folder (out/medical-hosted/) — full records

Uh oh!

Yiminnn commented Jun 29, 2026

Hosted as a BenchFlow benchmark — run in Docker (deepseek-v4-pro via the proxy)

Note on the verifier

Run it

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Yiminnn commented Jun 28, 2026 •

edited

Loading

Entire running folder (`out/medical-hosted/`) — full records