
Disentangling Language Medium and Cultural Context for Evaluating Multilingual Large Language Models


License: MIT

This repository contains the implementation code and associated data for our work, "Disentangling Language Medium and Cultural Context for Evaluating Multilingual Large Language Models," published at ACL 2025 (Main Conference).

Brief Introduction

This repository provides the code and data for our paper on multilingual LLM evaluation. We propose a Dual Evaluation Framework that treats the linguistic medium and the cultural context as separate axes, enabling a more nuanced and comprehensive assessment of LLMs across languages and cultures. Our experiments uncover a "Cultural-Linguistic Synergy" phenomenon: LLMs perform better when a question's cultural context matches the language it is asked in. Further analysis suggests that the proportion of activated neurons can serve as an indicator of model performance in multilingual and multicultural settings. Our findings highlight the importance of considering both cultural and linguistic factors when evaluating LLMs.


Prerequisites

Step 1: Clone the Repository and Create Environment

git clone https://github.com/yingjiahao14/Dual-Eval
cd Dual-Eval

The environment setup for this project follows the guidelines and dependencies outlined in BLEnD. Please refer to their documentation for detailed instructions on configuring the environment and installing required evaluation dependencies.

Step 2: Download the Target Models (e.g., Llama-3)

# Download the Llama-3 model from Hugging Face and store it locally
huggingface-cli download --resume-download meta-llama/Meta-Llama-3-8B-Instruct --local-dir model/Llama-3-8B-Instruct

If you have already downloaded the models, you may need to update the corresponding paths in utils.py to ensure they point to the correct locations.
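Before editing utils.py, a quick sanity check that the directory from the download step exists can save a debugging round (the path below assumes the huggingface-cli command above was used unchanged):

```shell
# Verify the local model directory from the download step.
# MODEL_DIR matches the --local-dir used above; adjust if you changed it.
MODEL_DIR="model/Llama-3-8B-Instruct"
if [ -d "$MODEL_DIR" ]; then
  echo "found: $MODEL_DIR"
else
  echo "missing: $MODEL_DIR -- update the path in utils.py"
fi
```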

Usage

Model Inference: Run the following command to perform model inference:

bash model_inference.sh [OPTIONS]

You can customize the inference process by adding the following command-line arguments:

  • --cuda-devices: Specify which GPU(s) to use (e.g., "0", "0,1").

    Example: --cuda-devices "0,1"

  • --model-keys: Provide a comma-separated list of model names to use for inference.

    Example: --model-keys "Llama3-8b-Instruct,gemma-2-9b-it"

  • --country-lang: Specify country-language mappings. Use a comma to separate entries, and a colon to separate a country from its languages (languages are comma-separated if multiple).

    Example: --country-lang "China:China,UK,US:US,China"

  • --prompt-numbers: Specify which prompt numbers to use (comma-separated).

    Example: --prompt-numbers "inst-4"

Note: If you do not specify these parameters, default values defined in the script will be used.
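Putting the options together, a complete invocation might look like this (the values are the examples listed above; adjust them to your setup):

```shell
# Example: two models on GPUs 0 and 1, using the sample
# country-language mapping and prompt variant from above.
bash model_inference.sh \
  --cuda-devices "0,1" \
  --model-keys "Llama3-8b-Instruct,gemma-2-9b-it" \
  --country-lang "China:China,UK,US:US,China" \
  --prompt-numbers "inst-4"
```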

Model Evaluation: To evaluate model outputs, navigate to the evaluation directory and execute:

cd evaluation/get_performance
bash evaluate.sh [OPTIONS]

You can customize the evaluation process by specifying command-line arguments as shown below.

  • --cuda-devices

    Specify which GPU(s) to use (e.g., "0", "0,1").

    Example: --cuda-devices "0"

  • --model-keys

    Comma-separated list of model names to evaluate.

    Example: --model-keys "Llama3-8b-Instruct,gemma-2-9b-it"

  • --country-lang

    Specify country-language mappings. Use a comma to separate entries, and a colon to separate country and languages (languages separated by commas if multiple).

    Example: --country-lang "China:China,UK,US:US,China"

  • --prompt-numbers

    Comma-separated list of prompt numbers to use for evaluation.

    Example: --prompt-numbers "inst-8"
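For example, evaluating a single model on one GPU with the prompt variant above:

```shell
# Example: evaluate one model's outputs on GPU 0.
cd evaluation/get_performance
bash evaluate.sh \
  --cuda-devices "0" \
  --model-keys "Llama3-8b-Instruct" \
  --prompt-numbers "inst-8"
```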

Specialized Neurons Calculation: To calculate specialized neurons for your models, follow the steps below:

# This script extracts key neurons for Q_{i,j}
bash get_neuron.sh 

# Calculate the proportion P_{i,j} of specialized neurons using the specified threshold mode
python get_specialized_neuron.py --mode [MODE]

You can choose different threshold functions for neuron selection by specifying the --mode argument in get_specialized_neuron.py. The available modes are:

  • layer-topk (default)

  • layer-topscore

  • global_topk
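The mode names suggest where the selection threshold is applied. As a rough, hypothetical illustration of per-layer versus global top-k selection (the scores and line format below are made up for the example, not the script's real data format):

```shell
# Toy neuron scores as "layer neuron score" triples (made-up data).
scores="0 a 0.9
0 b 0.8
1 c 0.2
1 d 0.3"

# layer-topk style: keep the single best-scoring neuron in each layer.
echo "$scores" | sort -k1,1n -k3,3nr | awk '!seen[$1]++'

# global_topk style: keep the two best-scoring neurons across all layers.
echo "$scores" | sort -k3,3nr | head -n 2
```

Here the per-layer variant keeps neuron d in layer 1 despite its low global rank, while the global variant keeps both top neurons from layer 0.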

Citation

If you find this work helpful, please consider citing:

@misc{ying2025disentanglinglanguagecultureevaluating,
      title={Disentangling Language and Culture for Evaluating Multilingual Large Language Models}, 
      author={Jiahao Ying and Wei Tang and Yiran Zhao and Yixin Cao and Yu Rong and Wenxuan Zhang},
      year={2025},
      eprint={2505.24635},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.24635}, 
}
