claws-lab/XLingEval

🌐 XLingEval

Better to Ask in English: Cross-Lingual Evaluation of Large Language Models for Healthcare Queries

Web Conference 2024 · arXiv · Website · HF Dataset · Video · DOI · License · Python · GitHub stars

Yiqiao Jin¹*, Mohit Chandra¹*, Gaurav Verma¹, Yibo Hu¹, Munmun De Choudhury¹, Srijan Kumar¹

¹ Georgia Institute of Technology

* Equal contribution


📖 Abstract

Large language models (LLMs) are transforming the ways the general public accesses and consumes information. Their influence is particularly pronounced in pivotal sectors like healthcare, where lay individuals are increasingly appropriating LLMs as conversational agents for everyday queries. While LLMs demonstrate impressive language understanding and generation proficiencies, concerns regarding their safety remain paramount in these high-stakes domains. Moreover, the development of LLMs is disproportionately focused on English. It remains unclear how these LLMs perform in the context of non-English languages, a gap that is critical for ensuring equity in the real-world use of these systems. This paper provides a framework to investigate the effectiveness of LLMs as multi-lingual dialogue systems for healthcare queries. Our empirically derived framework XLingEval focuses on three fundamental criteria for evaluating LLM responses to naturalistic human-authored health-related questions: correctness, consistency, and verifiability. Through extensive experiments on four major global languages (English, Spanish, Chinese, and Hindi) spanning three expert-annotated large health Q&A datasets, and through an amalgamation of algorithmic and human-evaluation strategies, we found a pronounced disparity in LLM responses across these languages, indicating a need for enhanced cross-lingual capabilities. We further propose XLingHealth, a cross-lingual benchmark for examining the multilingual capabilities of LLMs in the healthcare context.

✨ Highlights

  • 🌐 First cross-lingual healthcare benchmark: XLingHealth covers 4 major world languages × 3 expert-annotated Q&A datasets (HealthQA, LiveQA, MedicationQA).
  • 📉 Pronounced English bias quantified: GPT-3.5 produces 18.12% fewer comprehensive answers and is 5.82× more likely to give an incorrect response in non-English languages.
  • ⚖️ Three-axis evaluation framework: XLingEval unifies correctness, consistency, and verifiability, combining algorithmic metrics with expert human evaluation.
  • 🔬 Multi-model coverage: Validated across GPT-3.5, GPT-4, and the open-source MedAlpaca family (7B / 13B / 30B).
  • 🌐 Steep degradation in under-represented languages: Semantic consistency drops 9.1% (es) / 28.3% (zh) / 50.5% (hi) vs. English; verifiability Macro-F1 drops up to 23.4% (hi).
  • 🧰 Generalizable beyond healthcare: The same correctness / consistency / verifiability lens applies to legal, financial, and educational dialogue.

🧭 The XLingEval Framework

XLingEval evaluates LLM responses along three healthcare-critical axes. Each axis combines automated metrics with human evaluation by medical annotators across all four languages.

| Axis | What it measures | Key metrics |
| --- | --- | --- |
| ✅ Correctness | Whether the LLM's answer matches expert ground truth | LLM-judge comparative analysis (CoT prompting), human evaluation |
| 🔁 Consistency | Whether the LLM gives stable answers under sampling | n-gram & length (surface), BERTScore & SBERT (semantic), LDA / HDP (topic) |
| 🔎 Verifiability | Whether the LLM can authenticate medical claims | Macro-Precision, Macro-Recall, Macro-F1, Accuracy, AUC |

🌐 Supported Languages

| Language | Code | Role | Translation source |
| --- | --- | --- | --- |
| 🇬🇧 English | en | Baseline | Native |
| 🇪🇸 Spanish | es | Cross-lingual eval | MT + human verification |
| 🇨🇳 Simplified Chinese | zh | Cross-lingual eval | MT + human verification |
| 🇮🇳 Hindi | hi | Cross-lingual eval | MT + human verification |

🤖 Supported Models

| Family | Variants | Access |
| --- | --- | --- |
| GPT | gpt-3.5-turbo, gpt-4 | OpenAI API |
| MedAlpaca | medalpaca-7b, medalpaca-13b, medalpaca-30b | Open source (HF) |

📊 XLingHealth Dataset

The XLingHealth_Dataset/ folder in the repository root contains the cross-lingual benchmark versions of HealthQA, LiveQA, and MedicationQA as Excel files, with separate tabs for each of the four languages (English, Spanish, Chinese, Hindi).

🤗 The dataset is also published on Hugging Face: claws-lab/XLingHealth.

| Dataset | #Examples | #Words (Q) | #Words (A) |
| --- | --- | --- | --- |
| HealthQA | 1,134 | 7.72 ± 2.41 | 242.85 ± 221.88 |
| LiveQA | 246 | 41.76 ± 37.38 | 115.25 ± 112.75 |
| MedicationQA | 690 | 6.86 ± 2.83 | 61.50 ± 69.44 |
  • #Words (Q) and #Words (A) are the average word counts in the questions and ground-truth answers respectively.
  • In HealthQA, each question is paired with 1 positive and 9 negative answers, for a total of 11,340 examples.
  • LiveQA and MedicationQA do not provide negatives; we sample 4 negatives per question, yielding totals of 1,230 and 3,450 examples respectively.
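
The arithmetic behind these totals can be sketched as follows. This is a minimal illustration, not the repository's actual sampling code: the function name `build_verification_examples`, the uniform sampling from other questions' answers, and the fixed seed are all assumptions made for the example.

```python
import random


def build_verification_examples(qa_pairs, num_negatives=4, seed=0):
    """Pair each question with its own answer (label 1) and with
    answers sampled from other questions (label 0).

    Assumption: negatives are drawn uniformly without replacement from
    the answers of all other questions; the README does not specify this.
    """
    rng = random.Random(seed)
    examples = []
    for i, (question, answer) in enumerate(qa_pairs):
        examples.append({"question": question, "answer": answer, "label": 1})
        # Candidate negatives: answers that belong to any other question.
        others = [a for j, (_, a) in enumerate(qa_pairs) if j != i]
        for negative in rng.sample(others, num_negatives):
            examples.append({"question": question, "answer": negative, "label": 0})
    return examples


# Toy data standing in for LiveQA's 246 question/answer pairs.
toy_pairs = [(f"question {i}", f"answer {i}") for i in range(246)]
examples = build_verification_examples(toy_pairs)
print(len(examples))  # 246 * (1 + 4) = 1230
```

With 690 pairs the same call yields 3,450 examples, matching the MedicationQA total above.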

🚀 Installation

Create a new conda environment:

conda create -n xlingeval python=3.9
conda activate xlingeval

Install dependencies:

pip install -r requirements.txt

⚡ Quick Start

1. Correctness Experiments

1.1 Evaluation using GPT-3.5

Retrieve answers from GPT-3.5:

python correctness/correctness_get_gpt_answer.py \
    --dataset_path <path to the dataset> \
    --model gpt-3.5-turbo

Evaluate the quality of the LLM answer against the ground truth:

python correctness/correctness_answer_evaluation.py \
    --dataset_path <path to the dataset> \
    --model gpt-3.5-turbo
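
The two steps above can be chained from Python. The sketch below only reuses the script paths and flags shown in the commands; the helper names (`correctness_commands`, `run_correctness`), the `dry_run` switch, and the dataset path in the usage line are hypothetical.

```python
import subprocess
import sys


def correctness_commands(dataset_path, model="gpt-3.5-turbo"):
    """Build the two correctness commands in the order they must run."""
    scripts = [
        "correctness/correctness_get_gpt_answer.py",     # step 1: retrieve answers
        "correctness/correctness_answer_evaluation.py",  # step 2: compare with ground truth
    ]
    return [
        [sys.executable, script, "--dataset_path", dataset_path, "--model", model]
        for script in scripts
    ]


def run_correctness(dataset_path, model="gpt-3.5-turbo", dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in correctness_commands(dataset_path, model):
        print(" ".join(command))
        if not dry_run:
            # check=True stops the pipeline if a step fails.
            subprocess.run(command, check=True)


# Hypothetical dataset path; substitute the file you actually use.
run_correctness("XLingHealth_Dataset/LiveQA.xlsx")
```

Running with `dry_run=True` first is a cheap way to confirm the commands before spending API calls.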

1.2 Evaluation using MedAlpaca

Retrieve answers from MedAlpaca:

python correctness/MedAlpaca/correctness_medalpaca_get_answers.py \
    --dataset_path <path to the dataset> \
    --model medalpaca-30b \
    --batch_size 5

Evaluate the MedAlpaca answers using GPT-3.5 as a judge:

python correctness/correctness_answer_evaluation.py \
    --dataset_path <path to the dataset with MedAlpaca llm answers> \
    --model gpt-3.5-turbo

2. Consistency Experiments

Run all commands from the repository root XLingEval/.

  • Generate answers with multiple samplings:

    python consistency/consistency_get_gpt_answer.py \
        --dataset <DATASET> --model <MODEL> --num_answers <NUM_ANSWERS>
    • dataset: healthqa · liveqa · medicationqa
    • model: gpt35 · gpt4 · medalpaca-7b · medalpaca-13b · medalpaca-30b
    • num_answers: number of samples per question

    Example:

    python consistency/consistency_get_gpt_answer.py \
        --dataset liveqa --model gpt35 --num_answers 10
  • Translate the sampled answers back into English:

    python consistency/translate.py \
        --dataset <DATASET> --model <MODEL> --num_answers <NUM_ANSWERS>
  • Evaluate consistency metrics:

    python consistency/consistency_answer_evaluation.py \
        --dataset <DATASET> --model <MODEL> --num_answers <NUM_ANSWERS>

    Results are written to outputs/consistency/.
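
To give a feel for what a surface-level consistency score looks like, here is a minimal sketch that averages pairwise bigram overlap across the sampled answers to one question. It is not the metric implemented in `consistency_answer_evaluation.py`: the Jaccard overlap, the whitespace tokenization, and the function names are assumptions chosen for illustration. Whitespace tokenization is only reasonable here because the answers are translated back into English before scoring.

```python
from itertools import combinations


def ngrams(text, n=2):
    """Return the set of word n-grams in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard overlap of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def ngram_consistency(answers, n=2):
    """Mean pairwise n-gram Jaccard overlap over all answer pairs."""
    sets = [ngrams(answer, n) for answer in answers]
    pairs = list(combinations(sets, 2))
    if not pairs:
        raise ValueError("need at least two answers to measure consistency")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


score = ngram_consistency([
    "take ibuprofen with food",
    "take ibuprofen after meals",
])
print(score)
```

A score near 1 means the sampled answers share most of their wording; a score near 0 means they share almost none.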

3. Verifiability Experiments

The GPT models (GPT-3.5/4) and MedAlpaca share the same code path.

  • Prompt the LLM to verify each (question, answer) pair:

    python verifiability/verifiability_get_answer.py \
        --dataset <DATASET> --model <MODEL>

    By default, all four languages (en, es, zh, hi) are evaluated.

  • Summarize verifiability metrics:

    python verifiability/verifiability_answer_evaluation.py \
        --dataset <DATASET> --model <MODEL>

    Results are written to outputs/verifiability/.
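
For readers unfamiliar with the macro-averaged metrics listed for this axis, the sketch below computes them from binary labels in plain Python. It assumes the common convention that Macro-F1 is the unweighted mean of per-class F1 scores; the function name `macro_metrics` and the toy labels are illustrative and not taken from the repository.

```python
def macro_metrics(y_true, y_pred, labels=(0, 1)):
    """Macro-averaged precision, recall, and F1, plus accuracy.

    Each class contributes equally to the macro averages, regardless of
    how many examples it has.
    """
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return {
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
        "accuracy": accuracy,
    }


# One question with 1 correct answer and 4 negatives; the model wrongly
# accepts one of the negatives.
y_true = [1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0]
metrics = macro_metrics(y_true, y_pred)
print(metrics)
```

Macro averaging matters here because negatives outnumber positives 4:1 or 9:1, so plain accuracy would reward a model that rejects every answer.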

🗂️ Repository Structure

XLingEval/
├── correctness/        # Correctness pipeline (GPT-3.5/4 & MedAlpaca)
├── consistency/        # Consistency pipeline (sampling, translation, scoring)
├── verifiability/      # Verifiability pipeline (claim authentication)
├── translate/          # Translation utilities (ChatGPT-based)
├── dataloader/         # Dataset loading & preprocessing
├── utils/              # Metrics, data utilities, miscellaneous helpers
├── visual/             # Plots: line, heatmap, boxplot
├── XLingHealth_Dataset/  # Cross-lingual benchmark Excel files
├── outputs/            # Experiment outputs (correctness, consistency, verifiability)
├── static/, media/     # Project page assets
└── index.html          # Project website (open in browser)

Module highlights

  • correctness/: correctness_get_gpt_answer.py, correctness_answer_evaluation.py, MedAlpaca/correctness_medalpaca_get_answers.py
  • consistency/: consistency_get_gpt_answer.py, translate.py, consistency_answer_evaluation.py, statistical_test.py
  • verifiability/: verifiability_get_answer.py, verifiability_answer_evaluation.py, prompts.py
  • utils/: metrics.py, utils_data.py, utils_misc.py
  • visual/: line plots, heatmaps, and boxplots used in the paper

📝 Citation

If you find XLingEval or XLingHealth useful in your research, please cite:

@inproceedings{jin2024better,
  title={Better to ask in english: Cross-lingual evaluation of large language models for healthcare queries},
  author={Jin, Yiqiao and Chandra, Mohit and Verma, Gaurav and Hu, Yibo and De Choudhury, Munmun and Kumar, Srijan},
  booktitle={Proceedings of the ACM Web Conference 2024},
  pages={2627--2638},
  year={2024}
}

🙏 Acknowledgements

This work was supported in part by NSF (CNS-2154118, ITE-2137724, ITE-2230692, CNS-2239879), DARPA (HR00112290102, subcontract PO70745), CDC, and Microsoft. We thank our medical annotators for the cross-lingual evaluation effort.

⚖️ License

This project is released under the Apache License 2.0.

About

Code and Resources for the paper, "Better to Ask in English: Cross-Lingual Evaluation of Large Language Models for Healthcare Queries"
