
A Systematic Assessment of Language Models with Linguistic Minimal Pairs in Chinese

Yikang Liu, Yeting Shen, Hongao Zhu, Lilong Xu, Zhiheng Qian, Siyuan Song, Kejia Zhang
Jialong Tang, Pei Zhang, Baosong Yang, Rui Wang, Hai Hu

Shanghai Jiao Tong University & Tongyi Lab

If you have any questions, please contact Yikang Liu and Hai Hu.

📝 Paper 🤗 Models

Overview

Our work follows a long line of research that uses minimal pairs to probe linguistic knowledge in language models.

  • We introduce ZhoBLiMP, a new dataset of Chinese minimal pairs.
  • We train the Zh-Pythia LM suite from scratch on Chinese data to investigate how Chinese syntax is learned.
  • We propose a new linking function, SLLN-LP, which mitigates the bias introduced when the two sentences in a pair have unequal lengths.
  • We build a GUI for semi-automatic minimal pair generation.

ZhoBLiMP dataset

ZhoBLiMP is a dataset for probing Chinese linguistic knowledge, especially syntax, in language models. It contains 35k minimal pairs, each differing in a minimal way so as to isolate a single syntactic or semantic contrast. ZhoBLiMP comprises 118 paradigms covering 15 high-level linguistic phenomena.

We include the following files in this repo:

  • data/human_validation_result.csv: the results of human validation for each paradigm generated
  • ZhoBLiMP.tar.gz: the ZhoBLiMP dataset
  • ZhoBLiMP-excluded.tar.gz: the paradigms excluded for low human agreement
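As a sketch of how the released pairs might be consumed, the snippet below assumes the BLiMP convention of one JSON object per line with `sentence_good` and `sentence_bad` fields (an assumption; check the extracted files for the exact keys). The sample sentences are made up for illustration.

```python
import json
from pathlib import Path

# Hypothetical sample pairs in the assumed BLiMP-style format.
sample = [
    {"sentence_good": "他吃了一个苹果。", "sentence_bad": "他了吃一个苹果。"},
    {"sentence_good": "她看过这本书。", "sentence_bad": "她看这本过书。"},
]

path = Path("example_paradigm.jsonl")
path.write_text(
    "\n".join(json.dumps(p, ensure_ascii=False) for p in sample),
    encoding="utf-8",
)

def load_pairs(jsonl_path):
    """Read one minimal pair per line from a BLiMP-style JSONL file."""
    with open(jsonl_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

pairs = load_pairs(path)
print(len(pairs))  # 2
```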

Zh-Pythia LM suite

You can download the model checkpoints of Zh-Pythia used in our paper from the Hugging Face model hub.

huggingface-cli download SJTU-CL/Zh-Pythia-1.4B --local-dir path/to/save --local-dir-use-symlinks False

You can then evaluate the Zh-Pythia models, or any other open-source model, on the ZhoBLiMP dataset by running the following command:

python src/run.py \
    --data_dir data/ZhoBLiMP \
    --model_name_or_path path/to/model \
    --unigram_prob_file unigram/chinese-llama.json \
    --batch_size 100 \
    --max_length 64 \
    --device cuda \
    --output_dir results

SLLN-LP linking function

We propose to normalize the log probability by a sublinear function of sentence length:

$\text{SLLN-LP}(x)=\frac{\log P(x)}{|x|^\alpha}, \alpha \in \left(0, 1\right).$

We release our code for SLLN-LP and the other linking functions in src/metrics. The function pow_norm_lp is the implementation of SLLN-LP.
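A minimal sketch of the SLLN-LP formula above (the released pow_norm_lp in src/metrics is the reference implementation; this standalone version only illustrates the arithmetic):

```python
def slln_lp(token_logprobs, alpha=0.5):
    """Sublinear-length-normalized log probability:
    SLLN-LP(x) = log P(x) / |x|**alpha, with alpha in (0, 1).

    `token_logprobs` holds the per-token log-probabilities of a sentence,
    so their sum is log P(x) and their count is the length |x|.
    """
    assert 0 < alpha < 1, "alpha must lie in (0, 1)"
    return sum(token_logprobs) / len(token_logprobs) ** alpha

# As alpha -> 0 this approaches the raw log probability (LP);
# as alpha -> 1 it approaches the mean log probability per token.
lp = [-2.0, -1.5, -3.0, -0.5]  # made-up per-token log-probs
print(slln_lp(lp, alpha=0.5))  # -7.0 / 4**0.5 = -3.5
```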

For linking functions with frequency normalization, we release the token unigram frequencies computed over the training data of the Zh-Pythia models in unigram. Use the file matching the tokenizer of the target model, which you can find in our model card.

Data generation interface

The ZhoBLiMP dataset is available in the file ZhoBLiMP.tar.gz. It was generated in the following steps:

  • We annotate lexicons with linguistic properties to make a vocabulary for generation (see assets/vocab.tsv).
  • We craft grammar templates for each linguistic paradigm. Each paradigm is a JSON file (see projects/ZhoBLiMP).
  • We generate minimal pairs by filling in the templates with the vocabulary, using the data_gen module.
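The steps above can be sketched as a toy generator. The template format here is hypothetical (the real grammar for crafting templates is documented in frontend); the point is only that each paradigm pairs a grammatical template with an ungrammatical one and fills both from a shared vocabulary.

```python
from itertools import product

# Hypothetical vocabulary with linguistic annotations reduced to slot names.
vocab = {
    "NP": ["张三", "李四"],
    "V":  ["看见", "喜欢"],
}

# Hypothetical paradigm: the bad template misplaces the aspect marker 了.
paradigm = {
    "good_template": "{NP} {V} 了 他 。",
    "bad_template":  "{NP} 了 {V} 他 。",
}

def generate_pairs(paradigm, vocab):
    """Fill both templates with every NP x V combination."""
    pairs = []
    for np, v in product(vocab["NP"], vocab["V"]):
        fill = {"NP": np, "V": v}
        pairs.append({
            "sentence_good": paradigm["good_template"].format(**fill),
            "sentence_bad":  paradigm["bad_template"].format(**fill),
        })
    return pairs

pairs = generate_pairs(paradigm, vocab)
print(len(pairs))  # 2 NPs x 2 Vs = 4 pairs
```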

You can try the following command to generate the dataset:

python -m data_gen -I projects/ZhoBLiMP -O ZhoBLiMP

You can also launch the web interface to add or modify templates, or to start a new project (please refer to the docs in frontend for the grammar of crafting templates).

cd frontend
python app.py

You can check the following demo video for the web interface:

data-gen-demo.mp4
