
csml-rpi/SciConvBench


SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science

Overview

This repository contains the datasets and necessary scripts to run a conversational agent and evaluate its performance across various scientific domains, including fluid mechanics, partial differential equations (PDEs), solid mechanics, and materials science.

The core idea of this benchmark is to present a Large Language Model (LLM) with a scientific task that contains either ambiguities (missing keywords) or inconsistencies (contradictions). We are not testing the LLM's ability to solve the task. Instead, we are evaluating its ability to ask clarifying questions and recover a full, correct task specification devoid of any ambiguity or inconsistency.

Performance evaluation is based on both the final task specification generated by the agent and the conversational process it used to get there.

Dataset Access

The complete datasets used for evaluating the conversational agents are included directly within this repository. For convenience, data exploration, and direct downloading, they are also officially hosted on Kaggle: SCICONVBENCH Kaggle Dataset

Prerequisites and Setup

Ensure your environment is configured before running the experiments.

Python Requirements

Install all required packages listed in the repository:

pip install -r requirements.txt

API Keys

Create a hidden .env file in the repository root to store your API keys. It should look like this:

OPENAI_API_KEY=your_openai_key_here
GEMINI_API_KEY=your_gemini_key_here
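Before launching a run, it can help to confirm that both keys are actually present. The following is a minimal sketch, assuming the simple KEY=value format shown above. The helper names parse_env_text and missing_keys are hypothetical and not part of the repository, which may load the .env file differently.

```python
from pathlib import Path

# Key names taken from the .env example above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "GEMINI_API_KEY")


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict[str, str], required=REQUIRED_KEYS) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env_file = Path(".env")
    found = parse_env_text(env_file.read_text()) if env_file.exists() else {}
    print("Missing keys:", missing_keys(found) or "none")
```

Run from the repository root, this prints any key that still needs to be filled in.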

Model-Specific Configuration

Certain models require additional setup beyond the standard .env file:

AWS Bedrock

If you run models via AWS Bedrock, export your AWS credentials in your terminal before starting the run:

export AWS_ACCESS_KEY_ID="your_access_key"
export AWS_SECRET_ACCESS_KEY="your_secret_key"
export AWS_SESSION_TOKEN="your_session_token"

Claude SDK (claude_sdk_4_6_sonnet)

If you do not have Bedrock access and are running Claude directly through the Anthropic SDK, you need the Claude Code CLI.

  1. Install the CLI: npm install -g @anthropic-ai/claude-code
  2. Authenticate: claude login

Codex GPT-5 (codex_gpt5)

If you run gpt-5.3-codex directly rather than through a standard OpenAI subscription:

  1. Install the Codex CLI: npm install -g @openai/codex
  2. Run codex in the terminal and follow the prompt to "Sign in with ChatGPT".

Local Ollama (ollama_local)

To run models locally (e.g., gpt-oss:120b), you need dedicated GPU access and an Ollama installation. On an HPC cluster (such as NERSC), the general workflow is:

  1. Request an interactive GPU compute node.
  2. Activate your respective conda/python environment.
  3. Start the Ollama server (e.g., ./ollama serve).
  4. In a separate terminal session connected to the same compute node, run your application to interface with the local model.
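Before step 4, you may want to confirm from the second terminal that the server started in step 3 is answering. The sketch below assumes Ollama's default address (http://localhost:11434) and its /api/tags endpoint, which lists locally available models. The function name ollama_is_reachable is hypothetical and not part of this repository.

```python
import urllib.request

# Ollama's default address; adjust if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434"


def ollama_is_reachable(base_url: str = OLLAMA_URL, timeout: float = 2.0) -> bool:
    """Return True if an Ollama server answers on base_url."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP errors
        # (urllib's URLError and HTTPError both subclass OSError).
        return False


if __name__ == "__main__":
    print("Ollama reachable:", ollama_is_reachable())
```

If this prints False on the compute node, the server is most likely not running or is bound to a different host or port.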

Directory Structure

Scientific Domain Directories

These directories contain the main datasets, runs, and results for each domain.

  • foam/, fluids/, pde/, matSci/, matToolUse/, solMech/, solToolUse/

Baseline Directories

These directories contain the runs and results from the baseline configuration of our agent. The baseline measures performance when the agent is not explicitly told in advance to look out for missing entities or contradictions.

  • foam_baseline/, fluids_baseline/, pde_baseline/, matSci_baseline/, matToolUse_baseline/, solMech_baseline/, solToolUse_baseline/

Additional Directories

  • clamber/: Contains a non-scientific literature dataset and runs. This tests how well the agent performs disambiguation tasks outside of scientific domains.
  • judge_ablation/, prompt_ablation/, user_llm_ablation/: Contain results from the ablation tests. These tests indicate that changing the judge model does not introduce bias, that replacing the main explicit prompt with paraphrased versions maintains performance, and that switching the user-simulating LLM does not change the main results.
  • figs/: The output directory containing all generated figures (PNG and PDF formats) used in the paper.
  • tables/: The output directory containing all generated LaTeX tables used in the paper.

Core Scripts and Files

  • ALL_*.txt (e.g., ALL_FLUIDS.txt): Contain the exact, pre-configured Python commands needed to run the agent and evaluate performance for the corresponding dataset.
  • test_runner.py: The primary entry point for running the conversational agent.
  • config.py: Configuration file where you can choose which model acts as the judge for evaluations.
  • main.py: Contains the ConvAgent class (with system prompts) and can be run independently for simple, isolated tests.
  • llm_user_interface.py: Contains the logic for the user-simulated LLM, which answers the agent's questions strictly based on the ground truth.
  • utils.py: Helper functions for API invocation across different LLMs and token statistics tracking.
  • llm_judge_*.py: A suite of evaluation scripts for intent, disambiguation, and inconsistency metrics.
  • make_paper_artifacts.py: A single-file script that reads raw judge JSONs and automatically regenerates all figures and LaTeX tables for the paper.
  • ablation_runner.py / ablation_evaluation.py: Scripts for conducting and evaluating the ablation studies.
  • sanity_check.py: Scans all directories to ensure run files are present and flags any anomalies or errors.
  • test.py: A collection of supplementary scripts used for extra testing and for verifying performance outside of the main experiment loop.

Internal Directory Architecture (Results & Data)

Inside each main scientific domain directory (e.g., foam/), the files follow a fixed layout.

Root Level of Domain Folder

Contains two core JSON dataset files: disambiguation_[domain].json and inconsistency_[domain].json.

Model & Mode Folders

At the same level as the dataset files, you will find one folder per model per mode. For example, GEMINI_2_5_PRO_Disambiguation and GEMINI_2_5_PRO_Inconsistency.
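The naming scheme for dataset files and model/mode folders can be expressed as two small path helpers. This is a sketch under the conventions stated in this README. The function names dataset_path and run_dir and the model_label parameter are hypothetical and do not come from the repository's code.

```python
from pathlib import Path

# Domain and mode names as listed in this README.
DOMAINS = ("foam", "fluids", "pde", "matSci", "matToolUse", "solMech", "solToolUse")
MODES = ("disambiguation", "inconsistency")


def _check(domain: str, mode: str) -> None:
    """Reject names that do not match the documented layout."""
    if domain not in DOMAINS or mode not in MODES:
        raise ValueError(f"unknown domain or mode: {domain!r}, {mode!r}")


def dataset_path(root: Path, domain: str, mode: str) -> Path:
    """Path to a dataset file, e.g. fluids/disambiguation_fluids.json."""
    _check(domain, mode)
    return root / domain / f"{mode}_{domain}.json"


def run_dir(root: Path, domain: str, model_label: str, mode: str) -> Path:
    """Path to a model/mode folder, e.g. foam/GEMINI_2_5_PRO_Disambiguation."""
    _check(domain, mode)
    return root / domain / f"{model_label}_{mode.capitalize()}"


if __name__ == "__main__":
    print(dataset_path(Path("."), "foam", "inconsistency"))
    print(run_dir(Path("."), "foam", "GEMINI_2_5_PRO", "disambiguation"))
```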

Summary Files (Inside Model/Mode Folders)

  • llm_judge_chat_summary.json: Conversation metrics across all cases.
  • llm_judge_summary.json: Actual task recovery/success metrics across all cases.
  • llm_judge_intent_summary.json: Faithfulness results across all cases.
  • disambiguation_coverage_summary.json: Exclusive to foam/fluids/pde disambiguation tasks (for other domains, this is integrated into the chat summary file).
  • Note: Ignore any "human summary" files present in these folders; they are deprecated.
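To inspect the aggregated results for one run, the summary files can be loaded in a single pass. The sketch below makes no assumption about the JSON schema inside each file; it only uses the file names listed above. The function name load_summaries is hypothetical.

```python
import json
from pathlib import Path

# File names taken from the list above.
SUMMARY_FILES = (
    "llm_judge_chat_summary.json",
    "llm_judge_summary.json",
    "llm_judge_intent_summary.json",
    "disambiguation_coverage_summary.json",  # only for foam/fluids/pde disambiguation runs
)


def load_summaries(run_folder: Path) -> dict[str, object]:
    """Load every summary file that exists in run_folder, keyed by file name."""
    summaries = {}
    for name in SUMMARY_FILES:
        path = run_folder / name
        if path.is_file():
            with path.open(encoding="utf-8") as handle:
                summaries[name] = json.load(handle)
    return summaries


if __name__ == "__main__":
    loaded = load_summaries(Path("fluids/CLAUDE_SONNET_4.0_Disambiguation"))
    print("Loaded:", sorted(loaded))
```

Files that are absent are skipped silently, so the same call works for both modes and for domains without a separate coverage summary.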

Individual Case Folders

Inside the Model/Mode folders are sub-folders for each specific case ID evaluated. If a dataset was trimmed, you might see case folders for IDs that are no longer in the main JSON; please ignore these. Each valid case folder contains:

  • complete_prompt.txt: The ground truth prompt.
  • generated_prompt.txt: The agent's final returned specification.
  • statistics.json: Token usage and the number of questions asked.
  • conversation_log.txt: The transcript of questions and answers. (This may be empty if the agent resolved the task without needing to ask questions).
  • Individual result and metric files from the LLM judges (identical to the data aggregated in the summary files).
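A quick completeness check over a model/mode folder might look like the sketch below. It is not the repository's sanity_check.py; the function name incomplete_cases is hypothetical, and the check only looks for the four files listed above. It does not filter out case folders left over from trimmed datasets.

```python
from pathlib import Path

# Per-case files listed above; an empty conversation_log.txt still counts as present.
EXPECTED_CASE_FILES = (
    "complete_prompt.txt",
    "generated_prompt.txt",
    "statistics.json",
    "conversation_log.txt",
)


def incomplete_cases(run_folder: Path) -> dict[str, list[str]]:
    """Map each case folder name to the expected files it lacks."""
    problems = {}
    # Summary files sit next to the case folders, so only directories are visited.
    for case_dir in sorted(p for p in run_folder.iterdir() if p.is_dir()):
        missing = [name for name in EXPECTED_CASE_FILES if not (case_dir / name).is_file()]
        if missing:
            problems[case_dir.name] = missing
    return problems


if __name__ == "__main__":
    folder = Path("fluids/CLAUDE_SONNET_4.0_Disambiguation")
    if folder.is_dir():
        for case, missing in incomplete_cases(folder).items():
            print(f"{case}: missing {', '.join(missing)}")
```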

Execution Workflow

Important Note: You do not need to manually figure out command-line arguments. Please refer directly to the ALL_*.txt files (e.g., ALL_FLUIDS.txt) for the exact, ready-to-use commands to run the agent and evaluations.

Step 1: Running the Agent

Run the agent using test_runner.py. You must specify the dataset mode, the output directory, the scientific domain (options: foam, fluids, pde, matSci, matToolUse, solMech, solToolUse), and the model.

The simulated user runs on the same model as the agent. It answers the agent's questions using only the ground truth and is under strict instructions never to make assumptions.

Example Commands:

python test_runner.py --mode inconsistency --output_dir fluids/CLAUDE_SONNET_4.0_Inconsistency --domain fluids --model bedrock_4_0_sonnet

python test_runner.py --mode disambiguation --output_dir fluids/CLAUDE_SONNET_4.0_Disambiguation --domain fluids --model bedrock_4_0_sonnet
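When scripting several runs, the same commands can be assembled programmatically. The sketch below uses only the four flags shown above. The function name build_run_command and the run_label parameter are hypothetical, and the output folder naming follows the examples in this README.

```python
import shlex


def build_run_command(mode: str, domain: str, model: str, run_label: str) -> list[str]:
    """Assemble a test_runner.py invocation as an argument list."""
    # Output folders in the examples are named <LABEL>_<Mode> inside the domain folder.
    output_dir = f"{domain}/{run_label}_{mode.capitalize()}"
    return [
        "python", "test_runner.py",
        "--mode", mode,
        "--output_dir", output_dir,
        "--domain", domain,
        "--model", model,
    ]


if __name__ == "__main__":
    for mode in ("inconsistency", "disambiguation"):
        command = build_run_command(mode, "fluids", "bedrock_4_0_sonnet", "CLAUDE_SONNET_4.0")
        # Print only; pass the list to subprocess.run(command, check=True) to execute it.
        print(shlex.join(command))
```

Running the file prints the two example commands above. The ALL_*.txt files remain the authoritative source for the exact commands used in the paper.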

Step 2: Evaluating Performance

Once the runs are complete, evaluate the results using the judge scripts. First, edit config.py to select the judge model. The ablation tests found no judging bias when using frontier models such as Sonnet 4.6, Gemini Pro, or GPT-5.2.

Evaluating Faithfulness (Intent)

Run the intent judge for either dataset mode:

python llm_judge_intent.py --input_dir fluids/CLAUDE_SONNET_4.0_Disambiguation --json_path fluids/disambiguation_fluids.json

python llm_judge_intent.py --input_dir fluids/CLAUDE_SONNET_4.0_Inconsistency --json_path fluids/inconsistency_fluids.json

Evaluating Disambiguation

Check if missing entities were successfully recovered:

python llm_judge_disambiguation.py --input_dir fluids/CLAUDE_SONNET_4.0_Disambiguation --json_path fluids/disambiguation_fluids.json

Evaluate the conversation metrics (grounded conversation rate, clarification recall, precision, etc.):

python llm_judge_disambiguation_chat.py --input_dir fluids/CLAUDE_SONNET_4.0_Disambiguation --json_path fluids/disambiguation_fluids.json

Evaluating Inconsistency

Check if contradictions were successfully resolved:

python llm_judge_inconsistency.py --input_dir fluids/CLAUDE_SONNET_4.0_Inconsistency --json_path fluids/inconsistency_fluids.json

Evaluate the conversation metrics for the inconsistency run:

python llm_judge_inconsistency_chat.py --input_dir fluids/CLAUDE_SONNET_4.0_Inconsistency --json_path fluids/inconsistency_fluids.json
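Each mode therefore has three judge scripts that share the same two arguments. The sketch below collects them per mode. The script names and flags are taken from the commands above, while the JUDGE_SCRIPTS table and the judge_commands function are hypothetical helpers.

```python
import shlex

# Judge scripts per mode, in the order used in this README.
JUDGE_SCRIPTS = {
    "disambiguation": (
        "llm_judge_intent.py",
        "llm_judge_disambiguation.py",
        "llm_judge_disambiguation_chat.py",
    ),
    "inconsistency": (
        "llm_judge_intent.py",
        "llm_judge_inconsistency.py",
        "llm_judge_inconsistency_chat.py",
    ),
}


def judge_commands(mode: str, input_dir: str, json_path: str) -> list[list[str]]:
    """Return one argument list per judge script for the given mode."""
    return [
        ["python", script, "--input_dir", input_dir, "--json_path", json_path]
        for script in JUDGE_SCRIPTS[mode]
    ]


if __name__ == "__main__":
    commands = judge_commands(
        "inconsistency",
        "fluids/CLAUDE_SONNET_4.0_Inconsistency",
        "fluids/inconsistency_fluids.json",
    )
    for command in commands:
        # Print only; use subprocess.run(command, check=True) to execute each judge.
        print(shlex.join(command))
```

Remember that the judge model is selected in config.py, not on the command line, so it must be set before any of these commands are executed.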

Step 3: Generating Paper Artifacts (Figures & Tables)

To generate the figures and LaTeX tables used in the paper, run the artifact generation script.

This script directly parses the raw per-case JSON outputs and builds the dataframes in memory, meaning no intermediate CSVs are required.

python make_paper_artifacts.py --root .

About

Conversational Science Benchmark
