Tajamul21/MedCTA

 
 

🩺 MedCTA

A Benchmark for Clinical Tool Agents

Evaluating multimodal medical agents through realistic clinical tool-use workflows.


✨ Clinical tool-use evaluation, but make it realistic.

MedCTA is a benchmark for evaluating clinical tool agents on clinician-verified, multimodal medical tasks.
It tests whether an agent can select the right tools, call them correctly, use evidence faithfully, and reach a clinically meaningful final answer.


🌟 Introduction

Clinical reasoning is not a single-step QA problem.
It is multimodal, iterative, tool-dependent, and evidence-sensitive.

MedCTA evaluates how well LLM-based agents can operate as clinical tool agents in realistic medical scenarios.

Unlike static medical QA benchmarks, MedCTA requires agents to:

  • understand multimodal clinical inputs,
  • decide which tools are needed,
  • call tools with valid arguments,
  • integrate intermediate observations,
  • retrieve evidence when necessary,
  • and produce clinically faithful final answers.

The benchmark is built around step-implicit clinical tasks.
The model receives a clinical objective, but it is not explicitly told which tools to use or in what order.


🧠 Why MedCTA?

Many existing medical benchmarks mainly test final-answer accuracy.
MedCTA goes deeper.

It asks:

| Question | What MedCTA Measures |
| --- | --- |
| 🧭 Can the agent plan? | Tool selection and trajectory fidelity |
| 🧰 Can it use tools correctly? | Argument validity and execution reliability |
| 🩻 Can it understand images? | Multimodal evidence extraction |
| 🔎 Can it retrieve useful knowledge? | Search and evidence integration |
| 🧮 Can it compute when needed? | Numerical and symbolic reasoning |
| 🧑‍⚕️ Can it answer clinically? | Faithfulness, completeness, and final goal accuracy |

MedCTA is designed to reveal not only whether an agent gets the answer right, but also how it reaches that answer.


📚 Dataset at a Glance

| 🌷 Item | 💡 Value |
| --- | --- |
| Clinician-verified tasks | 107 |
| Executable tools | 5 |
| Benchmarked models | 18 |
| Autonomous rollouts | 1,926 |
| Human annotation time | 321 hours |
| Anatomical regions / body systems | 34 |
| Tool steps per task | 2–4 |
| Average tool-execution steps | 3.1 |
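
The counts above are mutually consistent. As a quick sanity check, and assuming each benchmarked model is rolled out once per task (an assumption; the table does not state this), the rollout total and the average annotation time per task follow from simple arithmetic:

```python
# Figures copied from the "Dataset at a Glance" table.
NUM_TASKS = 107
NUM_MODELS = 18
ANNOTATION_HOURS = 321

# Assumption: one autonomous rollout per (model, task) pair.
rollouts = NUM_TASKS * NUM_MODELS
hours_per_task = ANNOTATION_HOURS / NUM_TASKS

print(rollouts)        # 1926, matching the reported 1,926 rollouts
print(hours_per_task)  # 3.0 hours of annotation per task on average
```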

🧩 Task Format

Each task can be viewed as:

(X, Q, U, π, A)

where:

| Symbol | Meaning |
| --- | --- |
| X | Multimodal clinical context |
| Q | Step-implicit clinical query |
| U | Hidden sufficient tool subset |
| π | Reference interaction trajectory |
| A | Final clinical outcome |

This structure allows MedCTA to evaluate both the process and the final answer.
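
To make the tuple concrete, here is a minimal sketch of how one task could be held in memory. The class and field names (`ClinicalTask`, `ToolStep`, `tool_subset`, and so on) are hypothetical and are not taken from the MedCTA dataset schema; the values in the example are placeholders, not real benchmark data.

```python
from dataclasses import dataclass, field


@dataclass
class ToolStep:
    """One step of a trajectory: a tool call and the observation it returned."""
    tool: str                  # e.g. "OCR" or "Calculator"
    arguments: dict
    observation: str = ""


@dataclass
class ClinicalTask:
    """Hypothetical container mirroring (X, Q, U, π, A)."""
    context: list[str]                                        # X: multimodal inputs (paths or IDs)
    query: str                                                # Q: step-implicit clinical query
    tool_subset: set[str]                                     # U: hidden sufficient tool subset
    trajectory: list[ToolStep] = field(default_factory=list)  # π: reference trajectory
    answer: str = ""                                          # A: final clinical outcome

    def trajectory_within_subset(self) -> bool:
        """True if every reference step uses a tool from the sufficient subset."""
        return all(step.tool in self.tool_subset for step in self.trajectory)


# Placeholder example (not real benchmark content).
task = ClinicalTask(
    context=["image_001.png"],
    query="<clinical question>",
    tool_subset={"OCR", "Calculator"},
    trajectory=[
        ToolStep("OCR", {"image": "image_001.png"}),
        ToolStep("Calculator", {"expression": "<expression>"}),
    ],
    answer="<reference answer>",
)
print(task.trajectory_within_subset())  # True
```

Keeping U separate from π reflects the description above: the agent sees only X and Q, while U, π, and A are held back for scoring.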


🧰 Tool Library

MedCTA uses five deployed tools that cover perception, retrieval, localization, and reasoning.

| Tool | Role | Typical Use |
| --- | --- | --- |
| OCR | Text extraction | Reading reports, labels, tables, scanned documents |
| ImageDescription | Global visual understanding | Summarizing medical images |
| RegionAttributeDescription | Local visual inspection | Describing specific regions or findings |
| GoogleSearch | External evidence retrieval | Looking up clinical or biomedical knowledge |
| Calculator | Numerical reasoning | Computing values, ratios, expressions, or scores |

These tools make MedCTA closer to real clinical workflows, where an agent must actively gather and combine information before answering.


📏 Evaluation Metrics

MedCTA evaluates agents from three complementary angles.

🪜 1. Step-by-step Tool-use Fidelity

| Metric | Meaning |
| --- | --- |
| InstAcc | Instruction-following accuracy |
| ToolAcc | Tool-selection accuracy |
| ArgAcc | Tool-argument prediction accuracy |
| SummAcc | Intermediate summarization accuracy |

🧑‍⚕️ 2. Clinical Reasoning Quality

| Metric | Meaning |
| --- | --- |
| Facc | Clinical faithfulness |
| Cs | Multimodal context integration |
| Scomp | Semantic completeness |

🎯 3. Final Outcome Accuracy

| Metric | Meaning |
| --- | --- |
| Gacc | Final goal / answer accuracy |

Together, these metrics help diagnose whether an agent failed because of poor tool selection, invalid arguments, weak evidence use, premature stopping, or incorrect clinical reasoning.
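
As an illustration of how step-level metrics can separate these failure types, the sketch below scores tool selection and arguments by exact match against a reference trajectory. This is a simplified stand-in, not MedCTA's actual scoring code: the function name, the exact-match rule, and the argument keys are all assumptions made for the example.

```python
def step_accuracy(predicted, reference):
    """Return (tool_acc, arg_acc) over reference steps.

    Each step is a (tool_name, arguments_dict) pair. Steps the agent never
    produced (under-calls) count as wrong for both metrics. Arguments are
    only credited when the tool itself is correct.
    """
    if not reference:
        raise ValueError("reference trajectory must not be empty")
    tool_hits = 0
    arg_hits = 0
    for index, (ref_tool, ref_args) in enumerate(reference):
        if index >= len(predicted):
            continue  # the agent stopped before this step
        pred_tool, pred_args = predicted[index]
        if pred_tool == ref_tool:
            tool_hits += 1
            if pred_args == ref_args:
                arg_hits += 1
    total = len(reference)
    return tool_hits / total, arg_hits / total


# Placeholder trajectories: the agent makes the first call correctly, then stops.
reference = [("OCR", {"image": "img_1"}), ("Calculator", {"expression": "2*3"})]
predicted = [("OCR", {"image": "img_1"})]
print(step_accuracy(predicted, reference))  # (0.5, 0.5)
```

Here a premature stop lowers both scores even though the only call made was correct, which is the kind of distinction a final-answer metric alone cannot show.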


🏆 Leaderboard

Autonomous Clinical Tool-use Performance

* denotes closed-source/API-based evaluation.

💛 API-based and closed models

| Model | Developer | Inst | Tool | Arg | Summ | Facc | Cs | Scomp | Gacc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-5.4* | OpenAI | 35.27 | 23.46 | 12.61 | 35.51 | 17.52 | 14.21 | 18.60 | 31.54 |
| GPT-5.4-mini* | OpenAI | 5.36 | 6.74 | 3.23 | 0.93 | 16.47 | 10.56 | 17.29 | 28.31 |
| GPT-5.4-nano* | OpenAI | 33.93 | 18.18 | 12.02 | 34.24 | 18.43 | 11.96 | 14.77 | 20.30 |
| Claude-opus-4-6* | Anthropic | 24.78 | 8.80 | 0.59 | 39.25 | 14.11 | 14.86 | 23.83 | 31.32 |
| Claude-sonnet-4-6* | Anthropic | 23.66 | 4.99 | 0.00 | 33.64 | 12.77 | 12.90 | 20.19 | 25.33 |
| Claude-haiku-4-5* | Anthropic | 27.46 | 13.78 | 4.69 | 43.93 | 9.35 | 3.36 | 14.11 | 23.08 |
| Gemini-3-flash* | Google | 3.35 | 17.30 | 0.00 | 5.61 | 11.31 | 8.60 | 15.98 | 25.87 |
| Gemini-3-flash-lite* | Google | 2.90 | 8.21 | 0.00 | 1.87 | 10.75 | 6.82 | 14.58 | 23.64 |

💚 Open-source models

| Model | Developer | Inst | Tool | Arg | Summ | Facc | Cs | Scomp | Gacc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3.5-9B | Qwen | 44.20 | 14.37 | 13.78 | 29.91 | 10.37 | 17.10 | 13.36 | 21.64 |
| Qwen3-8B | Qwen | 33.93 | 10.56 | 7.04 | 32.71 | 8.50 | 10.09 | 11.50 | 27.80 |
| DeepSeek-R1-Distill-7B | DeepSeek | 10.49 | 3.52 | 0.00 | 7.48 | 2.62 | 0.84 | 3.36 | 10.61 |
| Deepseek-llm-7b-chat | DeepSeek | 11.61 | 6.45 | 0.00 | 4.67 | 4.30 | 2.62 | 4.02 | 11.00 |
| DeepSeek-V2-Lite-Chat | DeepSeek | 11.83 | 11.14 | 0.29 | 0.00 | 3.83 | 3.55 | 6.54 | 6.96 |
| Llama-3.1-8B-Instruct | Meta | 23.66 | 7.92 | 0.00 | 6.54 | 7.94 | 5.42 | 11.21 | 18.94 |
| Llama-3.2-3B-Instruct | Meta | 18.53 | 1.76 | 0.00 | 4.67 | 3.08 | 1.68 | 5.14 | 11.29 |
| Mistral-7B | Mistral | 18.75 | 14.66 | 0.00 | 9.35 | 2.52 | 1.87 | 3.46 | 9.40 |
| Phi-4 | Microsoft | 20.09 | 6.45 | 0.00 | 14.02 | 6.36 | 3.36 | 6.17 | 10.65 |
| GPT-oss-20B | OpenAI | 1.79 | 0.00 | 0.00 | 0.00 | 1.31 | 0.56 | 1.68 | 3.18 |

🔍 Failure Diagnostics

MedCTA shows that clinical tool agents often struggle not only with final reasoning, but also with tool routing, protocol stability, and premature stopping.

🌱 Autonomous vs. Gold Tool Routing

Providing the gold tool route substantially improves final outcome accuracy, showing that tool planning remains a major bottleneck.

| Model | Auto Gacc | Gold Gacc | Gain |
| --- | --- | --- | --- |
| GPT-5.4 | 31.54 | 49.50 | +17.96 |
| Claude-opus-4-6 | 31.32 | 66.40 | +35.08 |
| Qwen3.5-9B | 21.64 | 49.50 | +27.86 |
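
The Gain column is the difference between Gacc with gold tool routing and Gacc with autonomous routing. A short check that reproduces it from the two reported columns (the variable names are ours):

```python
# (autonomous Gacc, gold-routing Gacc) as reported for each model.
results = {
    "GPT-5.4": (31.54, 49.50),
    "Claude-opus-4-6": (31.32, 66.40),
    "Qwen3.5-9B": (21.64, 49.50),
}

# Gain = gold Gacc - autonomous Gacc, rounded to two decimals like the table.
gains = {model: round(gold - auto, 2) for model, (auto, gold) in results.items()}

for model, gain in gains.items():
    print(f"{model}: +{gain:.2f}")
# GPT-5.4: +17.96
# Claude-opus-4-6: +35.08
# Qwen3.5-9B: +27.86
```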

🧪 Rollout-Level Diagnostics

| Diagnostic | Value | Main Issue |
| --- | --- | --- |
| API error rate | 64.2% | Protocol instability |
| Under-call rate | 99.2% | Premature stopping |
| Protocol failure | 58.3% | Rollout breakdown |
| Tool-selection failure | 41.6% | Incorrect actions |

These results suggest that reliable clinical agents need stronger controllers, not just stronger language or vision backbones.


🚀 Evaluate on MedCTA

This repository follows the OpenCompass-style evaluation pipeline with AgentLego tools and LMDeploy model serving.

If you want to add a new agent wrapper or integrate a different LLM endpoint, see:

docs/ADDING_NEW_AGENT_OR_LLM.md

1. Prepare the MedCTA Dataset

Clone this repository.

git clone <ANONYMIZED_REPOSITORY_URL>
cd MedCTA

Create the dataset directory.

mkdir -p ./opencompass/data

Download the MedCTA dataset from the release file and place it under:

./opencompass/data/

The expected file structure is:

MedCTA/
├── agentlego
├── opencompass
│   ├── data
│   │   ├── medcta_dataset
│   ├── ...
├── ...

2. Prepare Your Model

2.1 Download model weights

pip install -U huggingface_hub

# huggingface-cli download --resume-download hugging/face/repo/name \
#   --local-dir your/local/path \
#   --local-dir-use-symlinks False

huggingface-cli download --resume-download Qwen/Qwen1.5-7B-Chat \
  --local-dir ~/models/qwen1.5-7b-chat \
  --local-dir-use-symlinks False

2.2 Install LMDeploy

conda create -n lmdeploy python=3.10
conda activate lmdeploy

For CUDA 12:

pip install lmdeploy

For CUDA 11+:

export LMDEPLOY_VERSION=0.4.0
export PYTHON_VERSION=310

pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl \
  --extra-index-url https://download.pytorch.org/whl/cu118

2.3 Launch a model service

# lmdeploy serve api_server path/to/your/model \
#   --server-port [port_number] \
#   --model-name [your_model_name]

lmdeploy serve api_server ~/models/qwen1.5-7b-chat \
  --server-port 12580 \
  --model-name qwen1.5-7b-chat
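
Before wiring the service into the evaluation config, it can help to send one request by hand. The sketch below uses only the Python standard library and assumes the server from the command above is reachable at `localhost:12580` and exposes the OpenAI-compatible `/v1/chat/completions` route; the helper names `build_chat_request` and `ask` are ours, not part of LMDeploy.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt):
    """Build a POST request for an OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 32,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(base_url, model, prompt, timeout=30):
    """Send the request and return the text of the first choice."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage, once the server is running (host and port are assumptions):
# print(ask("http://localhost:12580", "qwen1.5-7b-chat", "Reply with OK."))
```

The `model` value must match the `--model-name` passed to `lmdeploy serve api_server`.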

3. Deploy Tools

3.1 Install AgentLego

conda create -n agentlego python=3.11.9
conda activate agentlego

cd agentlego

pip install -r requirements_all.txt
pip install agentlego
pip install -e .

mim install mmengine
mim install mmcv==2.1.0

Then open the following file (adjust the prefix to match your conda installation):

~/anaconda3/envs/agentlego/lib/python3.11/site-packages/transformers/modeling_utils.py

and change:

_supports_sdpa = False

to:

_supports_sdpa = True

3.2 Configure API keys

To use the GoogleSearch and MathOCR tools, first obtain:

  • a Serper API key from https://serper.dev
  • a Mathpix app ID and app key from https://mathpix.com/

Then export them:

export SERPER_API_KEY='your_serper_key_for_google_search_tool'
export MATHPIX_APP_ID='your_mathpix_app_id_for_mathocr_tool'
export MATHPIX_APP_KEY='your_mathpix_app_key_for_mathocr_tool'
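
A tool that starts without its key usually fails only when it is first called, so it is worth checking the environment before launching the tool server. A minimal sketch (the helper `missing_keys` is hypothetical; the variable names are the three exported above):

```python
import os

REQUIRED_KEYS = ("SERPER_API_KEY", "MATHPIX_APP_ID", "MATHPIX_APP_KEY")


def missing_keys(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name)]


missing = missing_keys()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All tool API keys are set.")
```

Run it in the same shell session that will start `agentlego-server`, since exported variables are only visible to that shell and its child processes.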

3.3 Start the tool server

agentlego-server start \
  --port 16181 \
  --extra ./benchmark.py `cat benchmark_toollist.txt` \
  --host 0.0.0.0

4. Start Evaluation

4.1 Install OpenCompass

conda create --name opencompass python=3.10 pytorch torchvision pytorch-cuda -c nvidia -c pytorch -y
conda activate opencompass

cd agentlego
pip install -e .

cd ../opencompass
pip install -e .

Recommended package versions:

huggingface_hub==0.25.2
transformers==4.40.1

5. Configure the Evaluation

Modify:

configs/eval_medcta_bench.py

The IP and port of openai_api_base should match your LMDeploy model service.
The IP and port of tool_server should match your AgentLego tool server.

models = [
    dict(
        abbr='qwen1.5-7b-chat',
        type=LagentAgent,
        agent_type=ReAct,
        max_turn=10,
        llm=dict(
            type=OpenAI,
            path='qwen1.5-7b-chat',
            key='EMPTY',
            openai_api_base='http://10.140.1.17:12580/v1/chat/completions',
            query_per_second=1,
            max_seq_len=4096,
            stop='<|im_end|>',
        ),
        tool_server='http://10.140.0.138:16181',
        tool_meta='data/gta_dataset/toolmeta.json',
        batch_size=8,
    ),
]
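
Because the run depends on two separate services, a quick connectivity check can save a failed launch. The sketch below extracts host and port from the two URLs in the config and tries to open a TCP connection to each. It is only a reachability probe under the assumption that both services listen on plain TCP; the helper names are ours, and the addresses in the usage comment are the example values from the config above.

```python
import socket
from urllib.parse import urlparse


def host_and_port(url):
    """Extract (host, port) from a URL, falling back to the scheme default port."""
    parsed = urlparse(url)
    if parsed.hostname is None:
        raise ValueError(f"no host in URL: {url!r}")
    default_port = 443 if parsed.scheme == "https" else 80
    return parsed.hostname, parsed.port or default_port


def is_reachable(url, timeout=3.0):
    """Return True if a TCP connection to the URL's host and port succeeds."""
    host, port = host_and_port(url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Usage with the example addresses (replace with your own):
# for name, url in {
#     "openai_api_base": "http://10.140.1.17:12580/v1/chat/completions",
#     "tool_server": "http://10.140.0.138:16181",
# }.items():
#     print(name, "reachable" if is_reachable(url) else "NOT reachable")
```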

6. Step-by-step Evaluation Mode

For step-by-step mode:

  • comment out tool_server
  • enable tool_meta
  • set infer and eval mode to every_with_gt

In:

configs/eval_medcta_bench.py

use:

models = [
    dict(
        abbr='qwen1.5-7b-chat',
        type=LagentAgent,
        agent_type=ReAct,
        max_turn=10,
        llm=dict(
            type=OpenAI,
            path='qwen1.5-7b-chat',
            key='EMPTY',
            openai_api_base='http://10.140.1.17:12580/v1/chat/completions',
            query_per_second=1,
            max_seq_len=4096,
            stop='<|im_end|>',
        ),
        # tool_server='http://10.140.0.138:16181',
        tool_meta='data/gta_dataset/toolmeta.json',
        batch_size=8,
    ),
]

In:

configs/datasets/gta_bench.py

use:

gta_bench_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template="""{questions}""",
    ),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=AgentInferencer, infer_mode='every_with_gt'),
)

gta_bench_eval_cfg = dict(
    evaluator=dict(type=GTABenchEvaluator, mode='every_with_gt')
)

7. End-to-end Evaluation Mode

For end-to-end mode:

  • enable tool_server
  • comment out tool_meta
  • set infer and eval mode to every

In:

configs/eval_medcta_bench.py

use:

models = [
    dict(
        abbr='qwen1.5-7b-chat',
        type=LagentAgent,
        agent_type=ReAct,
        max_turn=10,
        llm=dict(
            type=OpenAI,
            path='qwen1.5-7b-chat',
            key='EMPTY',
            openai_api_base='http://10.140.1.17:12580/v1/chat/completions',
            query_per_second=1,
            max_seq_len=4096,
            stop='<|im_end|>',
        ),
        tool_server='http://10.140.0.138:16181',
        # tool_meta='data/gta_dataset/toolmeta.json',
        batch_size=8,
    ),
]

In:

configs/datasets/gta_bench.py

use:

gta_bench_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template="""{questions}""",
    ),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=AgentInferencer, infer_mode='every'),
)

gta_bench_eval_cfg = dict(
    evaluator=dict(type=GTABenchEvaluator, mode='every')
)
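
The two modes differ in exactly three settings, and mixing them (for example, `infer_mode='every'` with `tool_server` commented out) is an easy mistake. The sketch below restates the rules from sections 6 and 7 as a lookup table with a consistency check; `MODE_SETTINGS` and `check_settings` are hypothetical helpers, not part of the OpenCompass configs.

```python
# Expected settings per evaluation mode, as described in sections 6 and 7.
MODE_SETTINGS = {
    "step-by-step": {"mode": "every_with_gt", "tool_server": False, "tool_meta": True},
    "end-to-end": {"mode": "every", "tool_server": True, "tool_meta": False},
}


def check_settings(mode_name, infer_mode, eval_mode, has_tool_server, has_tool_meta):
    """Return a list of mismatches for the chosen mode (empty if consistent)."""
    expected = MODE_SETTINGS[mode_name]
    problems = []
    if infer_mode != expected["mode"]:
        problems.append(f"infer_mode should be {expected['mode']!r}, got {infer_mode!r}")
    if eval_mode != expected["mode"]:
        problems.append(f"evaluator mode should be {expected['mode']!r}, got {eval_mode!r}")
    if has_tool_server != expected["tool_server"]:
        state = "set" if expected["tool_server"] else "commented out"
        problems.append(f"tool_server should be {state}")
    if has_tool_meta != expected["tool_meta"]:
        state = "set" if expected["tool_meta"] else "commented out"
        problems.append(f"tool_meta should be {state}")
    return problems


print(check_settings("end-to-end", "every", "every", True, False))
# []
print(check_settings("step-by-step", "every", "every_with_gt", False, True))
# ["infer_mode should be 'every_with_gt', got 'every'"]
```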

8. Run Inference and Evaluation

Infer only

python run.py configs/eval_medcta_bench.py \
  --max-num-workers 32 \
  --debug \
  --mode infer

Evaluate only

# srun -p llmit -q auto python run.py configs/eval_medcta_bench.py \
#   --max-num-workers 32 \
#   --debug \
#   --reuse [time_stamp_of_prediction_file] \
#   --mode eval

srun -p llmit -q auto python run.py configs/eval_medcta_bench.py \
  --max-num-workers 32 \
  --debug \
  --reuse 20240628_115514 \
  --mode eval
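
The value passed to `--reuse` is the timestamp of an earlier run, in `YYYYMMDD_HHMMSS` form as in the example above. If you do not remember it, a small helper can pick the newest timestamp-named directory. This sketch assumes predictions are stored under a directory such as `outputs/default/<timestamp>/`; that location is an assumption, so check the work directory your run actually used.

```python
import re
from pathlib import Path

# Matches names like 20240628_115514.
TIMESTAMP = re.compile(r"^\d{8}_\d{6}$")


def latest_run(output_root):
    """Return the newest timestamp-named subdirectory name, or None if there is none."""
    root = Path(output_root)
    if not root.is_dir():
        return None
    # YYYYMMDD_HHMMSS sorts chronologically when sorted as plain strings.
    stamps = sorted(
        entry.name
        for entry in root.iterdir()
        if entry.is_dir() and TIMESTAMP.match(entry.name)
    )
    return stamps[-1] if stamps else None


# Usage (the path is an assumption):
# print(latest_run("outputs/default"))
```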

Infer and evaluate

python run.py configs/eval_medcta_bench.py \
  -p llmit \
  -q auto \
  --max-num-workers 32 \
  --debug

🧡 Project Structure

MedCTA/
├── agentlego/                  # Tool deployment and AgentLego integration
├── docs/                       # Documentation for adding agents or LLMs
├── opencompass/                # OpenCompass evaluation framework
│   ├── data/
│   │   └── medcta_dataset/     # MedCTA dataset directory
│   └── configs/
├── clinical_accuracy.py        # Clinical reasoning evaluation
├── goal_accuracy.py            # Final goal accuracy evaluation
├── README.md
└── LICENSE.txt

📝 Citation

If you use MedCTA in your research, please cite the paper once the citation is available.

@misc{medcta2026,
  title        = {MedCTA: A Benchmark for Clinical Tool Agents},
  author       = {MedCTA Team},
  year         = {2026},
  note         = {Benchmark for clinician-verified multimodal clinical tool agents},
  url          = {https://ivul-kaust.github.io/MedCTA/}
}

💌 Acknowledgements

MedCTA builds on the OpenCompass-style evaluation ecosystem and AgentLego-based tool execution framework.

We thank the annotators, clinicians, and researchers who contributed to the benchmark design, validation, and evaluation.


🌷 MedCTA

Clinical tool agents should not only answer — they should observe, reason, verify, and act.


Made for careful, multimodal, clinically grounded agent evaluation.
