[DARC Intern Project] Building Customizable LLM Evaluation Pipelines for Research

Stanford's AI Playground gives researchers and staff access to almost 20 LLMs via its API. How do users determine which model is best suited to their needs? We built a pipeline that delivers a clear ranking of how different models perform on a set of benchmarks related to processing table-based images.

Context

Business and social science research, like the kind done at Stanford GSB, often requires data to be parsed from tables in scanned documents. These tables frequently have mixed resolutions, inconsistent formatting, and dense information, which makes manual extraction a time-consuming task. Can we viably outsource this to LLMs?

LLM Decision Fatigue

Choosing the right LLM is a deceptively tough task. The best choice often depends on the images being processed, the information that needs to be extracted, and the budget for a project. Researchers often don't have the time to test models individually. New model versions come out rapidly, and their predecessors are retired just as frequently. When is it worthwhile for a user to update an existing workflow to use a more recent LLM?

Our Approach

The goal of this project is to provide a personalized source of truth for LLM data parsing. As a proxy for typical research documents, we used 35 scanned newspaper TV guides that varied in formatting and PDF clarity. These images were processed by 13 multimodal LLMs for the following 6 benchmarks:

- 1. Newspaper Name: The name of the newspaper the TV guide is published in.

- 2. Newspaper Date: The date the newspaper was published on.

- 3. Day of Week: The day of the week the TV guide is for.

- 4. TV Guide Date: The date that the TV guide is for.

- 5. First Program: The name of the program for the first channel listed and the earliest time slot.

- 6. First Channel: The name of the first channel listed.

The unique combinations of images, models, and benchmarks amounted to 2,730 tasks that were fed through our pipeline.

Pipeline Overview

We designed the pipeline to be modular so that changes to inputs (models, images, benchmarks) could be easily made with minimal changes to the main script.

Try It Yourself

If you're interested in reproducing this workflow, or customizing it for your specific benchmarks, you can follow the steps below.

Clone the Repository

git clone https://github.com/gsbdarc/LLM_benchmarks
cd LLM_benchmarks

Create and Activate a Virtual Environment (YENs)

/usr/bin/python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Create and Activate a Virtual Environment (Sherlock)

Request compute resources (normal, dev, or gsb) to create a venv.

salloc -p normal -t 1:00:00 -c 1

module load python/3.12
/usr/bin/python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Environment Variables

Create a .env file in the project root with:

STANFORD_API_KEY=your_key_here
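
As a rough illustration of how a script might pick up this key at startup, here is a minimal standard-library sketch. The helper names `load_env_file` and `get_api_key` are hypothetical and not taken from the repository, which may well rely on a package such as python-dotenv instead.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Read KEY=value lines from a .env-style file into os.environ (hypothetical helper)."""
    env_path = Path(path)
    if not env_path.exists():
        return
    for raw_line in env_path.read_text().splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Variables already set in the shell take precedence over the file.
        os.environ.setdefault(key.strip(), value.strip().strip('"').strip("'"))


def get_api_key():
    """Return the Stanford API key or fail early with a readable message."""
    api_key = os.environ.get("STANFORD_API_KEY")
    if not api_key:
        raise RuntimeError("STANFORD_API_KEY is not set; check your .env file")
    return api_key
```

Failing early when the key is missing tends to be easier to debug than an authentication error surfacing halfway through an array job.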

⚠️ Note: as models get added & removed from the Stanford AI API you will need to submit a ticket to update your API key.

Update Inputs (`LLM_benchmarks/inputs/`)

`models.json`

Defines which LLMs should be used to process tasks, along with their model-specific configurations. Add or remove entries as needed.

Example:

{
    "0" : {
        "model": "gpt-4",
        "family": "gpt",
        "max_context_input": 128000,
        "max_context_output": 4096,
        "max_context_window": 132096}
}

Note: max_context parameters are helpful for reference but not actually needed to run this pipeline.
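
For orientation, a file in this shape could be loaded and sanity-checked roughly as follows. This is only a sketch: the function name `load_models` is hypothetical, and treating `model` and `family` as the required keys is an assumption rather than something the pipeline is known to enforce.

```python
import json

# The example entry from models.json above, inlined so the sketch runs on its own.
MODELS_JSON = """
{
    "0": {
        "model": "gpt-4",
        "family": "gpt",
        "max_context_input": 128000,
        "max_context_output": 4096,
        "max_context_window": 132096
    }
}
"""


def load_models(text):
    """Parse models.json content into {model_id: config}, checking required keys."""
    models = json.loads(text)
    for model_id, config in models.items():
        # Only "model" and "family" are treated as required here (an assumption);
        # the max_context_* values are reference-only per the note above.
        missing = {"model", "family"} - config.keys()
        if missing:
            raise ValueError(f"model {model_id} is missing keys: {sorted(missing)}")
    return models


models = load_models(MODELS_JSON)
for model_id, config in models.items():
    print(model_id, config["model"], config["family"])
```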

`benchmarks.json`

Defines benchmark tasks executed by LLMs.

Each benchmark should include:

A unique ID
A benchmark task name
A system prompt
A user prompt
A benchmark task description
An expected output schema

Example:

{
    "0" : {
        "task_name": "newspaper_name",
        "system_prompt": "You are a metadata extraction assistant. Extract information from newspaper TV guide image. Always return valid JSON matching the exact schema provided.",
        "user_prompt": "Extract the newspaper name from this image.",
        "task_description": "Extraction: LLM should extract the name of the newspaper the TV guide is published in.",
        "schema":{
            "class_name": "NewspaperName", 
            "fields":{
                "newspaper_name": "str"}}}
}
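
To show how the `schema` block can drive output validation, here is a small sketch that turns it into a class and checks a model reply against it. The pipeline itself passes a pydantic model to the API; this example swaps in standard-library dataclasses as a stand-in so it runs without extra dependencies. The `TYPE_MAP` entries and both function names are assumptions made for illustration.

```python
import json
from dataclasses import fields, make_dataclass

# Assumed mapping from the type names used in "schema.fields" to Python types.
TYPE_MAP = {"str": str, "int": int, "float": float, "bool": bool}


def build_schema_class(schema):
    """Create a dataclass named schema["class_name"] with the declared fields."""
    field_specs = [
        (name, TYPE_MAP[type_name]) for name, type_name in schema["fields"].items()
    ]
    return make_dataclass(schema["class_name"], field_specs)


def parse_response(schema_class, raw_json):
    """Parse an LLM's JSON reply and check that it matches the schema exactly."""
    payload = json.loads(raw_json)
    expected = {f.name: f.type for f in fields(schema_class)}
    if set(payload) != set(expected):
        raise ValueError(f"expected keys {sorted(expected)}, got {sorted(payload)}")
    for name, expected_type in expected.items():
        if not isinstance(payload[name], expected_type):
            raise TypeError(f"{name} should be {expected_type.__name__}")
    return schema_class(**payload)


# The schema from the benchmarks.json example above.
schema = {"class_name": "NewspaperName", "fields": {"newspaper_name": "str"}}
NewspaperName = build_schema_class(schema)
result = parse_response(NewspaperName, '{"newspaper_name": "Arizona Republic"}')
print(result.newspaper_name)
```

Keeping the schema in JSON means a new benchmark can be added without writing a new class by hand, which fits the modular design described earlier.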

Run scripts (`LLM_benchmarks/scripts/`)

Scripts should be run in the order that they are numbered.

`1_pdf_to_png.py`

Before running: upload PDFs to LLM_benchmarks/inputs/data/pdfs/
Converts PDFs into grayscale PNGs, saves files to LLM_benchmarks/inputs/data/pngs/.
Prints PNG paths and file sizes in MBs.

`2_make_index.py`

Before running: upload CSVs to LLM_benchmarks/inputs/data/csvs/.
Each PDF needs a source-of-truth CSV whose file name matches the converted PNG apart from the extension (.png vs .csv).
Creates a JSON snapshot of paths for PNGs and their source of truth CSVs.
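
The pairing step can be sketched as below. This is an illustration under assumptions, not the script itself: the function name `build_image_index`, the `png`/`csv` key names, and the use of sequential string ids are all hypothetical.

```python
from pathlib import Path


def build_image_index(png_dir, csv_dir):
    """Pair each PNG with the CSV that shares its file stem.

    Returns (index, missing), where index maps an image id to its paths and
    missing lists PNG stems that have no ground-truth CSV.
    """
    csv_by_stem = {path.stem: path for path in Path(csv_dir).glob("*.csv")}
    index, missing = {}, []
    # Sorting keeps image ids stable from one run to the next.
    for png_path in sorted(Path(png_dir).glob("*.png")):
        csv_path = csv_by_stem.get(png_path.stem)
        if csv_path is None:
            missing.append(png_path.stem)
            continue
        index[str(len(index))] = {"png": str(png_path), "csv": str(csv_path)}
    return index, missing
```

The returned `index` can be written out with `json.dump`, and reporting `missing` separately makes it easy to spot an image whose ground truth was never uploaded.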

`3_create_mapping.py`

Creates a mapping file that (1) finds all unique combinations of selected benchmarks, models, and images, (2) assigns a unique task id to each one, and (3) saves the results to a CSV file used by 5_main.py.
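
A minimal sketch of this step, assuming ids are plain strings and the CSV columns are named as shown (both are illustrative choices, as are the function names). With 6 benchmarks, 13 models, and 35 images, the cross product gives the 2,730 tasks reported above.

```python
import csv
from itertools import product

FIELDNAMES = ["task_id", "benchmark_id", "model_id", "image_id"]


def create_mapping(benchmark_ids, model_ids, image_ids):
    """Return one row per unique (benchmark, model, image) combination."""
    rows = []
    combinations = product(benchmark_ids, model_ids, image_ids)
    for task_id, (benchmark_id, model_id, image_id) in enumerate(combinations):
        rows.append({
            "task_id": str(task_id),
            "benchmark_id": benchmark_id,
            "model_id": model_id,
            "image_id": image_id,
        })
    return rows


def write_mapping(rows, path):
    """Save the mapping rows to a CSV file with a header row."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)


# 6 benchmarks x 13 models x 35 images, the sizes reported for this project.
rows = create_mapping(
    [str(i) for i in range(6)],
    [str(i) for i in range(13)],
    [str(i) for i in range(35)],
)
print(len(rows))  # 2730
```

Because each row carries its own task id, a SLURM array index can map directly to one row of the file.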

`4_extract_ground_truth.py`

Context: ground truth csv files were created via human extraction.
Iterates through all of the CSVs in the image_index.
Creates/updates ground truth JSON with the correct benchmark values.

⚠️ Note: this script needs to be customized based on what the benchmarks are. An example of how to extract the 'Day' field of the first row is below.

day_of_week = csv_df['Day'][0]

`5_main.py`

Orchestrates processing of a single task.
Tasks are loaded via the mapping.csv file.
If the task has not already been processed then the corresponding benchmark, model, and image are loaded from their respective JSONs.
A pydantic model and prompts are passed into an LLM via the Stanford AI API.
The following outputs are saved to the DARC MongoDB shard:

{
    "_id": "0_1",
    "task_id": "0",
    "run_id": 1,
    "output": "Arizona Republic",
    "model_name": "gpt-4",
    "model_id": "1",
    "image_id": "0",
    "benchmark_name": "newspaper_name",
    "benchmark_id": "0",
    "completion_tokens": 9,
    "total_tokens": 1196,
    "status": "processed",
    "run_number": 1,
    "updated_at": "April 1st, 12pm"
}
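
The "skip if already processed" check and the `_id` convention visible in this record (`task_id` and `run_id` joined by an underscore) can be sketched as follows. A plain dictionary stands in for the MongoDB collection so the example runs anywhere; the function names are hypothetical, only a subset of the fields is written, and the ISO timestamp format is an assumption.

```python
from datetime import datetime, timezone


def make_result_id(task_id, run_id):
    """Combine task and run ids in the "<task_id>_<run_id>" form shown above."""
    return f"{task_id}_{run_id}"


def already_processed(store, task_id, run_id):
    """True when a record for this task and run exists with status "processed"."""
    record = store.get(make_result_id(task_id, run_id))
    return record is not None and record.get("status") == "processed"


def save_result(store, task_id, run_id, output, status="processed"):
    """Insert or overwrite the record for one task run."""
    record = {
        "_id": make_result_id(task_id, run_id),
        "task_id": task_id,
        "run_id": run_id,
        "output": output,
        "status": status,
        "updated_at": datetime.now(timezone.utc).isoformat(),
    }
    store[record["_id"]] = record
    return record


store = {}  # stand-in for the MongoDB collection
if not already_processed(store, "0", 1):
    save_result(store, "0", 1, "Arizona Republic")
print(already_processed(store, "0", 1))  # True
```

Keying records on task and run makes a rerun of the array job idempotent: finished tasks are skipped and only failed or missing ones call the API again.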

Depending on whether you're working on the YENs or Sherlock, edit the appropriate SLURM script and run the array job.

Example:

sbatch sherlock.slurm

`6_combine_check_results.py`

Loads all results within the LLM_benchmarks/outputs/results/ directory.
Combines results into a single DataFrame.
Saves the results to LLM_benchmarks/outputs/results/metrics/combined_results.json
Prints the total number of successful and unsuccessful tasks and returns a dictionary of error messages with counts.
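
The summary step might look something like the sketch below. It uses plain dictionaries instead of a DataFrame to stay dependency-free, and the `error` field, the sample records, and the function name are hypothetical since the record shown earlier does not include an error message.

```python
from collections import Counter


def summarize_results(results):
    """Count successful and unsuccessful tasks and tally error messages."""
    status_counts = Counter(
        "successful" if record.get("status") == "processed" else "unsuccessful"
        for record in results
    )
    error_counts = Counter(
        record.get("error", "unknown error")
        for record in results
        if record.get("status") != "processed"
    )
    return dict(status_counts), dict(error_counts)


# Hypothetical records for illustration only.
sample_results = [
    {"_id": "0_1", "status": "processed", "output": "Arizona Republic"},
    {"_id": "1_1", "status": "error", "error": "timeout"},
    {"_id": "2_1", "status": "error", "error": "timeout"},
]
status_counts, error_counts = summarize_results(sample_results)
print(status_counts, error_counts)
```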

`7_compute.py`

Loads combined_results.json and filters for tasks that have been processed.
Evaluates model outputs against the ground truth and assigns an accuracy score based on exact matching.
Saves results to LLM_benchmarks/outputs/results/metrics/metrics.json
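
Exact-match scoring can be sketched as below. The layout of `ground_truth` (keyed by image id and benchmark name), the function name, and the sample records are assumptions made for illustration; the real script may organize these differently.

```python
from collections import defaultdict


def accuracy_by_model(results, ground_truth):
    """Exact-match accuracy per model.

    results: records with model_name, image_id, benchmark_name, and output.
    ground_truth: {(image_id, benchmark_name): expected_value} (assumed layout).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in results:
        expected = ground_truth[(record["image_id"], record["benchmark_name"])]
        total[record["model_name"]] += 1
        # Exact matching: any difference in the string counts as wrong.
        if record["output"] == expected:
            correct[record["model_name"]] += 1
    return {model: correct[model] / total[model] for model in total}


# Hypothetical data for illustration only.
ground_truth = {
    ("0", "newspaper_name"): "Arizona Republic",
    ("1", "newspaper_name"): "Arizona Republic",
}
results = [
    {"model_name": "gpt-4", "image_id": "0", "benchmark_name": "newspaper_name",
     "output": "Arizona Republic"},
    {"model_name": "gpt-4", "image_id": "1", "benchmark_name": "newspaper_name",
     "output": "The Arizona Republic"},
]
scores = accuracy_by_model(results, ground_truth)
print(scores)  # {'gpt-4': 0.5}
```

Note that exact matching is strict: the second answer is marked wrong only because of the leading "The", so prompts that pin down the expected format matter.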

Findings

We ran each task three times. The results showed that gemini-2.5-pro was the most accurate multimodal model in the Stanford AI Playground API at the time of testing (March 17th, 2026). We have since added Claude-4-5-Sonnet, Claude-Opus-4-6, Llama-4, and gpt-5.2 to our evaluation pipeline.

Accuracy by Model

Overall model accuracy across all benchmarks and images.

Accuracy by Benchmark

Simple metadata extraction benchmarks (Newspaper Name, Newspaper Date) had the highest accuracy while LLMs struggled to return accurate results for benchmarks that required some reasoning (TV Guide Date) and scanning the document (First Channel, First Program).

Looking more closely at these document-scanning benchmarks, we found that the Gemini and Llama-4 models outperformed their peers.

Across runs, accuracy rates by both benchmark and model remained stable. The most variability appeared in specific combinations of benchmarks and models; e.g., First Program with gemini-2.0-flash-001 had a standard deviation of 5.7%. Model temperatures were set to 0 (the most deterministic setting), but with only 35 images, a changed response on 1 or 2 images is enough to cause a noticeable fluctuation in accuracy.
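
To make the size of that fluctuation concrete, the short calculation below shows how much a benchmark/model accuracy moves when one or two of the 35 answers flip between runs. The function name `accuracy_step` is hypothetical; the arithmetic is the only content here.

```python
N_IMAGES = 35  # images per benchmark/model combination in this project


def accuracy_step(flipped_images, n_images=N_IMAGES):
    """Change in accuracy (percentage points) when some answers flip between runs."""
    return 100 * flipped_images / n_images


print(f"{accuracy_step(1):.2f}")  # 2.86
print(f"{accuracy_step(2):.2f}")  # 5.71
```

With so few images per combination, accuracy can only move in steps of about 2.9 percentage points, so a spread of a few points across runs reflects one or two images rather than a broad change in model behavior.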

Accuracy by Image Id

When analyzing results by image id we found that there was about a 40% difference between the image associated with the lowest accuracy and the image associated with the highest accuracy.

Token Cost

Comparing model accuracy with total token cost showed that, although o1 was only the 6th most accurate LLM, it used almost 4.7x as many tokens as the most accurate model (gemini-2.5-pro).

Looking more closely at the metadata extraction benchmarks (Newspaper Name and Newspaper Date), we found that similar results could be produced at very different costs. The Claude-3-haiku models returned Newspaper Name and Newspaper Date just as accurately as gemini-2.5-pro (100%) but cost only $0.06, about 1/24 as much. Similarly, gemini-2.0-flash-lite-001 got Newspaper Date correct 100% of the time but cost only $0.03 versus $1.49 for gemini-2.5-pro.

Limitations

While building and testing this pipeline, we identified a couple of opportunities for improvement related to the Stanford AI Playground API.

Changes to available models are not always announced consistently, and finding them often requires manually combing through the ai-playground Slack channel. Once these changes go into effect, API keys need to be re-requested; otherwise they continue to reflect access to models that are no longer available. It would be helpful if API keys automatically dropped retired models and if request forms allowed users to signal that they want access to all future models as well.

Next Steps

To create a more robust pipeline that delivers a comprehensive assessment of LLM capabilities, we're planning to implement the following:

- [ ] Add external models to the pipeline
- [ ] Modify the pipeline to evaluate the effectiveness of OCR + LLM data extraction with a variety of OCR tools (Textract, LlamaIndex, etc.)
