SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science

Overview

This repository contains the datasets and necessary scripts to run a conversational agent and evaluate its performance across various scientific domains, including fluid mechanics, partial differential equations (PDEs), solid mechanics, and materials science.

The core idea of this benchmark is to present a Large Language Model (LLM) with a scientific task that contains either ambiguities (missing keywords) or inconsistencies (contradictions). We are not testing the LLM's ability to solve the task. Instead, we are evaluating its ability to ask clarifying questions and recover a full, correct task specification devoid of any ambiguity or inconsistency.

Performance evaluation is based on both the final task specification generated by the agent and the conversational process it used to get there.

Dataset Access

The complete datasets used for evaluating the conversational agents are included directly within this repository. For convenience, data exploration, and direct downloading, they are also officially hosted on Kaggle: SCICONVBENCH Kaggle Dataset

Prerequisites and Setup

Ensure your environment is configured before running the experiments.

Python Requirements

Install all required packages listed in the repository:

pip install -r requirements.txt

API Keys

Create a hidden .env file in the root directory to store your API keys. It should look like this:

OPENAI_API_KEY=your_openai_key_here
GEMINI_API_KEY=your_gemini_key_here
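
Before a long run it can help to confirm that both keys are actually readable. The following is a minimal sketch, not part of the repository: it parses .env-style text with the standard library only and reports which of the two keys shown above are missing or empty. The helper names parse_env_text and missing_keys are hypothetical, and how the repository itself loads the file is not specified here.

```python
from pathlib import Path

# Key names taken from the .env example above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "GEMINI_API_KEY")


def parse_env_text(text):
    """Parse KEY=value lines, skipping blank lines and # comments (hypothetical helper)."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    text = env_path.read_text() if env_path.exists() else ""
    print("Missing keys:", missing_keys(parse_env_text(text)))
```

Run from the repository root, this prints an empty list when both keys are set.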

Model-Specific Configuration

Certain models require additional setup beyond the standard .env file:

AWS Bedrock

If running models via AWS Bedrock, you must export your AWS credentials to your terminal before initiating the run:

export AWS_ACCESS_KEY_ID="your_access_key"
export AWS_SECRET_ACCESS_KEY="your_secret_key"
export AWS_SESSION_TOKEN="your_session_token"
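
Because these variables only exist in the shell that exported them, a quick check from Python can catch a missing export before a Bedrock run starts. This is a small sketch under that assumption. The function name missing_aws_vars is hypothetical and is not a repository helper.

```python
import os

# Variable names taken from the export commands above.
AWS_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN")


def missing_aws_vars(environ=None):
    """Return the AWS credential variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in AWS_VARS if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_aws_vars()
    if missing:
        print("Export these before running:", ", ".join(missing))
    else:
        print("All AWS credential variables are set.")
```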

Claude SDK (claude_sdk_4_6_sonnet)

If you do not have Bedrock access and are running Claude directly through the Anthropic SDK, you need the Claude Code CLI.

Install the CLI: npm install -g @anthropic-ai/claude-code
Authenticate: claude login

Codex GPT-5 (codex_gpt5)

If running gpt-5.3-codex directly rather than through a standard OpenAI subscription:

Install the Codex CLI: npm install -g @openai/codex
Run codex in the terminal and follow the prompt to "Sign in with ChatGPT".

Local Ollama (ollama_local)

If running models locally (e.g., gpt-oss:120b), you will need dedicated GPU access and Ollama installed. If you are using an HPC cluster (such as NERSC), the general workflow is:

Request an interactive GPU compute node.
Activate your respective conda/python environment.
Start the Ollama server (e.g., ./ollama serve).
In a separate terminal session connected to the same compute node, run your application to interface with the local model.
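
Before launching the application in the second terminal, it can be useful to confirm that the server started in the previous step is reachable. The sketch below assumes Ollama is listening on its default address, http://localhost:11434, and queries the /api/tags endpoint, which lists locally available models. The function name ollama_is_up is hypothetical. If your cluster setup binds a different host or port, pass that base URL instead.

```python
import urllib.error
import urllib.request


def ollama_is_up(base_url="http://localhost:11434", timeout=3.0):
    """Return True if an Ollama server answers at base_url, False otherwise."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


if __name__ == "__main__":
    print("Ollama reachable:", ollama_is_up())
```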

Directory Structure

Scientific Domain Directories

These directories contain the main datasets, runs, and results for each domain.

foam/, fluids/, pde/, matSci/, matToolUse/, solMech/, solToolUse/

Baseline Directories

These directories contain the runs and results from the baseline configuration of our agent. The baseline agent was an experiment to observe performance when the agent is not explicitly told in advance to look out for missing entities or contradictions.

foam_baseline/, fluids_baseline/, pde_baseline/, matSci_baseline/, matToolUse_baseline/, solMech_baseline/, solToolUse_baseline/

Additional Directories

clamber/: Contains a non-scientific literature dataset and runs. This tests how well the agent performs disambiguation tasks outside of scientific domains.
judge_ablation/, prompt_ablation/, user_llm_ablation/: Contain results from the ablation tests. These indicate that changing the judge model does not introduce bias, that replacing the main explicit prompt with paraphrased versions maintains performance, and that switching the user-simulated LLM does not alter the main results.
figs/: The output directory containing all generated figures (PNG and PDF formats) used in the paper.
tables/: The output directory containing all generated LaTeX tables used in the paper.

Core Scripts and Files

ALL_*.txt (e.g., ALL_FLUIDS.txt): The exact, pre-configured Python commands needed to run the agent and evaluate performance for each respective dataset.
test_runner.py: The primary entry point for running the conversational agent.
config.py: Configuration file where you can choose which model acts as the judge for evaluations.
main.py: Contains the ConvAgent class (with system prompts) and can be run independently for simple, isolated tests.
llm_user_interface.py: Contains the logic for the user-simulated LLM, which answers the agent's questions strictly based on the ground truth.
utils.py: Helper functions for API invocation across different LLMs and token statistics tracking.
llm_judge_*.py: A suite of evaluation scripts for intent, disambiguation, and inconsistency metrics.
make_paper_artifacts.py: A single-file script that reads raw judge JSONs and automatically regenerates all figures and LaTeX tables for the paper.
ablation_runner.py / ablation_evaluation.py: Scripts for conducting and evaluating the ablation studies.
sanity_check.py: Scans all directories to ensure run files are present and flags any anomalies or errors.
test.py: A collection of supplementary scripts used for extra testing and for verifying performance outside of the main experiment loop.

Internal Directory Architecture (Results & Data)

Inside each main scientific domain directory (e.g., foam/), the files are strictly organized.

Root Level of Domain Folder

Contains two core JSON dataset files: disambiguation_[domain].json and inconsistency_[domain].json.

Model & Mode Folders

At the same level as the dataset files, you will find one folder per model per mode. For example, GEMINI_2_5_PRO_Disambiguation and GEMINI_2_5_PRO_Inconsistency.

Summary Files (Inside Model/Mode Folders)

llm_judge_chat_summary.json: Conversation metrics across all cases.
llm_judge_summary.json: Actual task recovery/success metrics across all cases.
llm_judge_intent_summary.json: Faithfulness results across all cases.
disambiguation_coverage_summary.json: Exclusive to foam/fluids/pde disambiguation tasks (for other domains, this is integrated into the chat summary file).
Note: Ignore any "human summary" files present in these folders; they are deprecated.

Individual Case Folders

Inside the Model/Mode folders are sub-folders for each specific case ID evaluated. If a dataset was trimmed, you might see case folders for IDs that are no longer in the main JSON; please ignore these. Each valid case folder contains:

complete_prompt.txt: The ground truth prompt.
generated_prompt.txt: The agent's final returned specification.
statistics.json: Token usage and the number of questions asked.
conversation_log.txt: The transcript of questions and answers. (This may be empty if the agent resolved the task without needing to ask questions).
Individual result and metric files from the LLM judges (identical to the data aggregated in the summary files).
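
The layout above can be checked programmatically. The following sketch is not the repository's sanity_check.py. It is a hypothetical helper, incomplete_cases, that walks one Model/Mode folder and reports which of the four files named above are missing from each case folder. It deliberately does not open statistics.json, since the schema of that file is not described here.

```python
from pathlib import Path

# File names taken from the list above.
EXPECTED_FILES = (
    "complete_prompt.txt",
    "generated_prompt.txt",
    "statistics.json",
    "conversation_log.txt",
)


def incomplete_cases(run_dir):
    """Map each case-folder name to the expected files it is missing."""
    problems = {}
    for case_dir in sorted(p for p in Path(run_dir).iterdir() if p.is_dir()):
        missing = [name for name in EXPECTED_FILES if not (case_dir / name).is_file()]
        if missing:
            problems[case_dir.name] = missing
    return problems


if __name__ == "__main__":
    # Example run folder name from this README; adjust to a folder that exists locally.
    run_dir = Path("fluids/CLAUDE_SONNET_4.0_Disambiguation")
    if run_dir.is_dir():
        for case, missing in incomplete_cases(run_dir).items():
            print(f"{case}: missing {', '.join(missing)}")
```

An empty conversation_log.txt still counts as present, which matches the note that the log may be empty when the agent asked no questions.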

Execution Workflow

Important Note: You do not need to manually figure out command-line arguments. Please refer directly to the ALL_*.txt files (e.g., ALL_FLUIDS.txt) for the exact, ready-to-use commands to run the agent and evaluations.

Step 1: Running the Agent

Run the agent using test_runner.py. You must specify the dataset mode, the output directory, the scientific domain (options: foam, fluids, pde, matSci, matToolUse, solMech, solToolUse), and the model.

The user-simulated LLM runs on the same model chosen for the agent. It answers questions based only on the ground truth and is under strict instructions never to make assumptions.

Example Commands:

python test_runner.py --mode inconsistency --output_dir fluids/CLAUDE_SONNET_4.0_Inconsistency --domain fluids --model bedrock_4_0_sonnet

python test_runner.py --mode disambiguation --output_dir fluids/CLAUDE_SONNET_4.0_Disambiguation --domain fluids --model bedrock_4_0_sonnet
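
To run several domain and mode combinations, the example commands can be assembled in a loop. This is a sketch only: it uses the four flags shown above and the output folder naming pattern from the examples, the function name runner_command and the run_label parameter are hypothetical, and the ALL_*.txt files remain the authoritative source of commands.

```python
import shlex

# Domain and mode options as listed in Step 1.
DOMAINS = ("foam", "fluids", "pde", "matSci", "matToolUse", "solMech", "solToolUse")
MODES = ("disambiguation", "inconsistency")


def runner_command(domain, mode, model, run_label):
    """Build the argument list for one test_runner.py invocation."""
    if domain not in DOMAINS:
        raise ValueError(f"unknown domain: {domain}")
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    # Folder naming follows the examples: <domain>/<LABEL>_<Mode>.
    output_dir = f"{domain}/{run_label}_{mode.capitalize()}"
    return [
        "python", "test_runner.py",
        "--mode", mode,
        "--output_dir", output_dir,
        "--domain", domain,
        "--model", model,
    ]


if __name__ == "__main__":
    for mode in MODES:
        cmd = runner_command("fluids", mode, "bedrock_4_0_sonnet", "CLAUDE_SONNET_4.0")
        print(shlex.join(cmd))  # pass cmd to subprocess.run(cmd, check=True) to execute
```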

Step 2: Evaluating Performance

Once the runs are complete, evaluate the results using the judge scripts. First, edit config.py to select the judge model. (The ablation tests found no judge bias when using frontier models such as Sonnet 4.6, Gemini Pro, or GPT-5.2.)

Evaluating Faithfulness (Intent)

Run the intent judge for either dataset mode:

python llm_judge_intent.py --input_dir fluids/CLAUDE_SONNET_4.0_Disambiguation --json_path fluids/disambiguation_fluids.json

python llm_judge_intent.py --input_dir fluids/CLAUDE_SONNET_4.0_Inconsistency --json_path fluids/inconsistency_fluids.json

Evaluating Disambiguation

Check if missing entities were successfully recovered:

python llm_judge_disambiguation.py --input_dir fluids/CLAUDE_SONNET_4.0_Disambiguation --json_path fluids/disambiguation_fluids.json

Evaluate the conversation metrics (grounded conversation rate, clarification recall, precision, etc.):

python llm_judge_disambiguation_chat.py --input_dir fluids/CLAUDE_SONNET_4.0_Disambiguation --json_path fluids/disambiguation_fluids.json

Evaluating Inconsistency

Check if contradictions were successfully resolved:

python llm_judge_inconsistency.py --input_dir fluids/CLAUDE_SONNET_4.0_Inconsistency --json_path fluids/inconsistency_fluids.json

Evaluate the conversation metrics for the inconsistency run:

python llm_judge_inconsistency_chat.py --input_dir fluids/CLAUDE_SONNET_4.0_Inconsistency --json_path fluids/inconsistency_fluids.json
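
Each run folder is evaluated by three judge scripts: the intent judge plus the two scripts for its mode. The sketch below collects them in one place, using only the script names, flags, and dataset file naming shown above. The function name judge_commands is hypothetical.

```python
import shlex

# Judge scripts per mode, as used in the commands above.
JUDGES = {
    "disambiguation": (
        "llm_judge_intent.py",
        "llm_judge_disambiguation.py",
        "llm_judge_disambiguation_chat.py",
    ),
    "inconsistency": (
        "llm_judge_intent.py",
        "llm_judge_inconsistency.py",
        "llm_judge_inconsistency_chat.py",
    ),
}


def judge_commands(domain, mode, run_dir):
    """Build the three judge invocations for one Model/Mode run folder."""
    json_path = f"{domain}/{mode}_{domain}.json"
    return [
        ["python", script, "--input_dir", run_dir, "--json_path", json_path]
        for script in JUDGES[mode]
    ]


if __name__ == "__main__":
    run_dir = "fluids/CLAUDE_SONNET_4.0_Inconsistency"
    for cmd in judge_commands("fluids", "inconsistency", run_dir):
        print(shlex.join(cmd))
```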

Step 3: Generating Paper Artifacts (Figures & Tables)

To generate the figures and LaTeX tables used in the paper, run the artifact generation script.

This script directly parses the raw per-case JSON outputs and builds the dataframes in memory, meaning no intermediate CSVs are required.

python make_paper_artifacts.py --root .
