Yiqiao Jin¹*, Mohit Chandra¹*, Gaurav Verma¹, Yibo Hu¹, Munmun De Choudhury¹, Srijan Kumar¹
¹ Georgia Institute of Technology
* Equal contribution
Large language models (LLMs) are transforming the ways the general public accesses and consumes information. Their influence is particularly pronounced in pivotal sectors like healthcare, where lay individuals are increasingly appropriating LLMs as conversational agents for everyday queries. While LLMs demonstrate impressive language understanding and generation proficiencies, concerns regarding their safety remain paramount in these high-stake domains. Moreover, the development of LLMs is disproportionately focused on English. It remains unclear how these LLMs perform in the context of non-English languages, a gap that is critical for ensuring equity in the real-world use of these systems. This paper provides a framework to investigate the effectiveness of LLMs as multi-lingual dialogue systems for healthcare queries. Our empirically-derived framework XLingEval focuses on three fundamental criteria for evaluating LLM responses to naturalistic human-authored health-related questions: correctness, consistency, and verifiability. Through extensive experiments on four major global languages (English, Spanish, Chinese, and Hindi) spanning three expert-annotated large health Q&A datasets, and through an amalgamation of algorithmic and human-evaluation strategies, we found a pronounced disparity in LLM responses across these languages, indicating a need for enhanced cross-lingual capabilities. We further propose XLingHealth, a cross-lingual benchmark for examining the multilingual capabilities of LLMs in the healthcare context.
- First cross-lingual healthcare benchmark: XLingHealth covers 4 major world languages × 3 expert-annotated Q&A datasets (HealthQA, LiveQA, MedicationQA).
- Pronounced English bias quantified: GPT-3.5 produces 18.12% fewer comprehensive answers and is 5.82× more likely to give an incorrect response in non-English languages.
- Three-axis evaluation framework: XLingEval unifies correctness, consistency, and verifiability, combining algorithmic metrics with expert human evaluation.
- Multi-model coverage: Validated across GPT-3.5, GPT-4, and the open-source MedAlpaca family (7B / 13B / 30B).
- Steep degradation in under-represented languages: Semantic consistency drops 9.1% (es) / 28.3% (zh) / 50.5% (hi) vs. English; verifiability Macro-F1 drops up to 23.4% (hi).
- Generalizable beyond healthcare: The same correctness / consistency / verifiability lens applies to legal, financial, and educational dialogue.
XLingEval evaluates LLM responses along three healthcare-critical axes. Each axis combines automated metrics with human evaluation by medical annotators across all four languages.
| Axis | What it measures | Key metrics |
|---|---|---|
| Correctness | Whether the LLM's answer matches expert ground-truth | LLM-judge comparative analysis (CoT prompting), human evaluation |
| Consistency | Whether the LLM gives stable answers under sampling | n-gram & length (surface), BERTScore & SBERT (semantic), LDA / HDP (topic) |
| Verifiability | Whether the LLM can authenticate medical claims | Macro-Precision, Macro-Recall, Macro-F1, Accuracy, AUC |

| Language | Code | Role | Translation source |
|---|---|---|---|
| 🇬🇧 English | en | Baseline | Native |
| 🇪🇸 Spanish | es | Cross-lingual eval | MT + human verification |
| 🇨🇳 Simplified Chinese | zh | Cross-lingual eval | MT + human verification |
| 🇮🇳 Hindi | hi | Cross-lingual eval | MT + human verification |

| Family | Variants | Access |
|---|---|---|
| GPT | gpt-3.5-turbo, gpt-4 | OpenAI API |
| MedAlpaca | medalpaca-7b, medalpaca-13b, medalpaca-30b | Open source (HF) |

The XLingHealth_Dataset/ folder in the repository root contains the cross-lingual benchmark versions of HealthQA, LiveQA, and MedicationQA as Excel files, with separate tabs for each of the four languages (English, Spanish, Chinese, Hindi).
The dataset is also published on Hugging Face: claws-lab/XLingHealth.

| Dataset | #Examples | #Words (Q) | #Words (A) |
|---|---|---|---|
| HealthQA | 1,134 | 7.72 ± 2.41 | 242.85 ± 221.88 |
| LiveQA | 246 | 41.76 ± 37.38 | 115.25 ± 112.75 |
| MedicationQA | 690 | 6.86 ± 2.83 | 61.50 ± 69.44 |

- `#Words (Q)` and `#Words (A)` are the average word counts in the questions and ground-truth answers, respectively.
- In HealthQA, each question is paired with 1 positive and 9 negative answers, for a total of 11,340 examples.
- LiveQA and MedicationQA do not provide negatives; we sample 4 negatives per question, yielding totals of 1,230 and 3,450 examples respectively.
Create a new conda environment:

    conda create -n xlingeval python=3.9
    conda activate xlingeval

Install dependencies:

    pip install -r requirements.txt

Retrieve answers from GPT-3.5:

    python correctness/correctness_get_gpt_answer.py \
        --dataset_path <path to the dataset> \
        --model gpt-3.5-turbo

Evaluate the LLM answer against the ground-truth answer:

    python correctness/correctness_answer_evaluation.py \
        --dataset_path <path to the dataset> \
        --model gpt-3.5-turbo

Retrieve answers from MedAlpaca:

    python correctness/MedAlpaca/correctness_medalpaca_get_answers.py \
        --dataset_path <path to the dataset> \
        --model medalpaca-30b \
        --batch_size 5

Evaluate the MedAlpaca answers using GPT-3.5 as a judge:

    python correctness/correctness_answer_evaluation.py \
        --dataset_path <path to the dataset with MedAlpaca llm answers> \
        --model gpt-3.5-turbo

Run all commands from the repository root XLingEval/.
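
The two GPT-3.5 correctness steps can also be chained from Python. This is a minimal sketch that only reuses the script paths and flags shown in the commands above; the `correctness_commands` and `run_correctness` helpers and the `path/to/dataset.xlsx` placeholder are hypothetical. It defaults to a dry run that prints the commands without calling the API.

```python
import shlex
import subprocess
import sys


def correctness_commands(dataset_path, model="gpt-3.5-turbo"):
    """Build the answer-retrieval and evaluation commands, in order."""
    scripts = [
        "correctness/correctness_get_gpt_answer.py",
        "correctness/correctness_answer_evaluation.py",
    ]
    return [
        [sys.executable, script, "--dataset_path", dataset_path, "--model", model]
        for script in scripts
    ]


def run_correctness(dataset_path, model="gpt-3.5-turbo", dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in correctness_commands(dataset_path, model):
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the pipeline if a step exits with an error.
            subprocess.run(cmd, check=True)


# Placeholder path: substitute the real dataset file before a live run.
run_correctness("path/to/dataset.xlsx")
```

Call `run_correctness(path, dry_run=False)` from the repository root to execute both steps.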

- Sample multiple answers per question:

      python consistency/consistency_get_gpt_answer.py \
          --dataset <DATASET> --model <MODEL> --num_answers <NUM_ANSWERS>

  - `dataset`: `healthqa` · `liveqa` · `medicationqa`
  - `model`: `gpt35` · `gpt4` · `medalpaca-7b` · `medalpaca-13b` · `medalpaca-30b`
  - `num_answers`: number of samples per question

  Example:

      python consistency/consistency_get_gpt_answer.py \
          --dataset liveqa --model gpt35 --num_answers 10

- Translate the sampled answers back into English:

      python consistency/translate.py \
          --dataset <DATASET> --model <MODEL> --num_answers <NUM_ANSWERS>

- Evaluate consistency metrics:

      python consistency/consistency_answer_evaluation.py \
          --dataset <DATASET> --model <MODEL> --num_answers <NUM_ANSWERS>

  Results are written to outputs/consistency/.

Both GPT-3.5/4 and MedAlpaca share the same code path.
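
To give a feel for the surface-level part of the consistency axis, the sketch below scores a set of sampled answers by their mean pairwise n-gram overlap. It is an illustration only: Jaccard similarity over lowercased, whitespace-split tokens is an assumption, and the n-gram metric implemented in `consistency_answer_evaluation.py` may be defined differently. Whitespace tokenization is only sensible here because the pipeline translates answers back into English before scoring.

```python
from itertools import combinations


def ngrams(text, n):
    """Set of word n-grams from lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def surface_consistency(answers, n=1):
    """Mean pairwise n-gram Jaccard similarity across sampled answers."""
    sets = [ngrams(answer, n) for answer in answers]
    pairs = list(combinations(sets, 2))
    if not pairs:
        raise ValueError("need at least two answers to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Three hypothetical samples for the same question.
samples = [
    "take ibuprofen with food",
    "take ibuprofen with food",
    "take ibuprofen with water",
]
print(round(surface_consistency(samples, n=1), 3))
```

A score of 1.0 means every sample uses the same n-grams; lower values indicate that repeated sampling changes the wording of the answer.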

- Prompt the LLM to verify each (question, answer) pair:

      python verifiability/verifiability_get_answer.py \
          --dataset <DATASET> --model <MODEL>

  By default, all four languages (`en`, `es`, `zh`, `hi`) are evaluated.

- Summarize verifiability metrics:

      python verifiability/verifiability_answer_evaluation.py \
          --dataset <DATASET> --model <MODEL>

  Results are written to outputs/verifiability/.
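
Verifiability is a binary classification task: each (question, answer) pair is either a correct match or not. The sketch below shows how macro-averaged precision, recall, and F1 are conventionally computed for such labels, with Macro-F1 taken as the unweighted mean of per-class F1 scores. It is a plain-Python illustration under that standard definition, not the repository's implementation, and `macro_scores` is a hypothetical helper.

```python
def macro_scores(y_true, y_pred, labels=(0, 1)):
    """Macro-averaged precision, recall, F1, plus accuracy."""
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # Undefined ratios (empty denominators) are scored as 0.0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    k = len(labels)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return {
        "macro_precision": sum(precisions) / k,
        "macro_recall": sum(recalls) / k,
        "macro_f1": sum(f1s) / k,
        "accuracy": accuracy,
    }


# One question with 1 positive and 4 negative answers; the model
# wrongly accepts one of the negatives.
y_true = [1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0]
scores = macro_scores(y_true, y_pred)
print({name: round(value, 3) for name, value in scores.items()})
```

Because negatives outnumber positives in the benchmark, macro averaging gives the positive class the same weight as the negative class, which plain accuracy does not.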

    XLingEval/
    ├── correctness/          # Correctness pipeline (GPT-3.5/4 & MedAlpaca)
    ├── consistency/          # Consistency pipeline (sampling, translation, scoring)
    ├── verifiability/        # Verifiability pipeline (claim authentication)
    ├── translate/            # Translation utilities (ChatGPT-based)
    ├── dataloader/           # Dataset loading & preprocessing
    ├── utils/                # Metrics, data utilities, miscellaneous helpers
    ├── visual/               # Plots: line, heatmap, boxplot
    ├── XLingHealth_Dataset/  # Cross-lingual benchmark Excel files
    ├── outputs/              # Experiment outputs (correctness, consistency, verifiability)
    ├── static/, media/       # Project page assets
    └── index.html            # Project website (open in browser)

- `correctness/`: `correctness_get_gpt_answer.py`, `correctness_answer_evaluation.py`, `MedAlpaca/correctness_medalpaca_get_answers.py`
- `consistency/`: `consistency_get_gpt_answer.py`, `translate.py`, `consistency_answer_evaluation.py`, `statistical_test.py`
- `verifiability/`: `verifiability_get_answer.py`, `verifiability_answer_evaluation.py`, `prompts.py`
- `utils/`: `metrics.py`, `utils_data.py`, `utils_misc.py`
- `visual/`: line plots, heatmaps, and boxplots used in the paper

If you find XLingEval or XLingHealth useful in your research, please cite:

    @inproceedings{jin2024better,
      title={Better to ask in english: Cross-lingual evaluation of large language models for healthcare queries},
      author={Jin, Yiqiao and Chandra, Mohit and Verma, Gaurav and Hu, Yibo and De Choudhury, Munmun and Kumar, Srijan},
      booktitle={Proceedings of the ACM Web Conference 2024},
      pages={2627--2638},
      year={2024}
    }

This work was supported in part by NSF (CNS-2154118, ITE-2137724, ITE-2230692, CNS-2239879), DARPA (HR00112290102, subcontract PO70745), CDC, and Microsoft. We thank our medical annotators for the cross-lingual evaluation effort.

This project is released under the Apache License 2.0.
