Jialong Tang, Pei Zhang, Baosong Yang, Rui Wang, Hai Hu
Shanghai Jiao Tong University & Tongyi Lab
If you have any questions, please contact Yikang Liu and Hai Hu.
Our work follows a long line of using minimal pairs to probe linguistic knowledge in language models.
- We introduce a new dataset, ZhoBLiMP, of Chinese minimal pairs.
- We train the Zh-Pythia LM suite from scratch on Chinese data to investigate the learning of Chinese syntax.
- We propose a new linking function, SLLN-LP, to mitigate the length bias when evaluating LMs on minimal pairs of unequal lengths.
- We build a GUI for semi-automatic minimal pair generation.
ZhoBLiMP is a dataset for probing Chinese linguistic knowledge in language models, especially syntax. It contains 35k minimal pairs, each differing in a minimal way to demonstrate a single syntactic or semantic contrast. ZhoBLiMP has 118 paradigms across 15 high-level linguistic phenomena.
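To illustrate how minimal pairs probe an LM, here is a toy sketch: a model "knows" a contrast if it scores the grammatical sentence above the ungrammatical one. The scores below are made-up numbers standing in for sentence log probabilities, not real model outputs.

```python
# Sketch of minimal-pair evaluation: for each (good, bad) pair, the LM
# is correct if it assigns a higher score to the grammatical sentence.

def accuracy(pairs):
    """Fraction of pairs where the good sentence outscores the bad one."""
    correct = sum(1 for good_lp, bad_lp in pairs if good_lp > bad_lp)
    return correct / len(pairs)

# (good_log_prob, bad_log_prob) for three hypothetical minimal pairs
toy_pairs = [(-12.3, -15.1), (-8.7, -9.2), (-20.4, -18.9)]
print(accuracy(toy_pairs))  # 2 of 3 pairs scored correctly
```

Per-paradigm accuracies computed this way are what the evaluation script below reports.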
We include the following files in this repo:
- data/human_validation_result.csv: the results of human validation for each generated paradigm
- ZhoBLiMP.tar.gz: the ZhoBLiMP dataset
- ZhoBLiMP-excluded.tar.gz: the paradigms excluded due to low human agreement
You can download the model checkpoints of Zh-Pythia used in our paper from the Hugging Face model hub.
huggingface-cli download SJTU-CL/Zh-Pythia-1.4B --local-dir path/to/save --local-dir-use-symlinks False

Then you can evaluate the Zh-Pythia models, or any other open-source models, on the ZhoBLiMP dataset by running the following command:
python src/run.py \
--data_dir data/ZhoBLiMP \
--model_name_or_path path/to/model \
--unigram_prob_file unigram/chinese-llama.json \
--batch_size 100 \
--max_length 64 \
--device cuda \
    --output_dir results

We propose a sublinear length normalization of log probabilities (SLLN-LP).
We release our code for SLLN-LP and other linking functions in src/metrics. The function pow_norm_lp implements SLLN-LP.
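A minimal sketch of the idea behind SLLN-LP follows. The exponent beta below is an illustrative placeholder, not the paper's setting; see pow_norm_lp in src/metrics for the actual implementation.

```python
def pow_norm_lp(log_prob, length, beta=0.8):
    """Normalize a sentence log probability by a sublinear power of its
    length. With beta < 1 the length penalty grows slower than linearly,
    so longer sentences are not over-penalized relative to shorter ones.
    (beta=0.8 is an illustrative value, not the paper's setting.)"""
    return log_prob / (length ** beta)

# A longer sentence with a lower raw log prob can still win after
# sublinear length normalization:
print(pow_norm_lp(-30.0, 20))  # longer sentence
print(pow_norm_lp(-18.0, 8))   # shorter sentence
```

With beta = 1 this reduces to plain per-token normalization; a sublinear exponent sits between no normalization and full length normalization.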
For linking functions with frequency normalization, we release the token unigram frequencies from the training data of the Zh-Pythia models in unigram. Use the file that matches the tokenizer of the target model, which you can find in our model card.
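As one example of what such a unigram table enables, a frequency-normalized score in the SLOR family can be sketched as below. The JSON layout and the exact formula used in src/metrics are assumptions here, shown only to convey the idea.

```python
import math

def slor(log_prob, token_ids, unigram_log_probs):
    """SLOR-style score: subtract the unigram log probability of the
    tokens from the sentence log probability, then divide by length,
    so sentences made of frequent tokens are not unfairly favored."""
    unigram_lp = sum(unigram_log_probs[t] for t in token_ids)
    return (log_prob - unigram_lp) / len(token_ids)

# Toy unigram table keyed by token id; a real one would be loaded from
# a file such as unigram/chinese-llama.json (layout assumed).
unigram_log_probs = {0: math.log(0.01), 1: math.log(0.001), 2: math.log(0.0005)}
print(slor(-15.0, [0, 1, 2], unigram_log_probs))
```

Scores above zero mean the model assigns the sentence more probability than a unigram baseline would.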
The ZhoBLiMP dataset is available in the file ZhoBLiMP.tar.gz. The dataset is generated by the following steps:
- We annotate lexicons with linguistic properties to build a vocabulary for generation (see assets/vocab.tsv).
- We craft grammar templates for each linguistic paradigm. Each paradigm is a JSON file (see projects/ZhoBLiMP).
- We generate minimal pairs by filling in the templates with the vocabulary through the data_gen module.
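The generation step can be pictured as crossing template slots with annotated vocabulary items. The toy sketch below uses invented English fillers and field names for readability; the real JSON schema lives in projects/ZhoBLiMP and the real Chinese lexicon in assets/vocab.tsv.

```python
from itertools import product

# Toy vocabulary annotated by slot (invented for illustration).
vocab = {
    "subject": ["the student", "the teacher"],
    "object_good": ["one book"],
    "object_bad": ["one books"],  # invented ungrammatical filler
}

# A toy paradigm: the good/bad templates differ in exactly one slot.
template_good = "{subject} reads {object_good}."
template_bad = "{subject} reads {object_bad}."

def generate_pairs(vocab, good, bad):
    """Cross all slot fillers to yield (grammatical, ungrammatical) pairs."""
    keys = sorted(vocab)
    pairs = []
    for combo in product(*(vocab[k] for k in keys)):
        fill = dict(zip(keys, combo))
        pairs.append((good.format(**fill), bad.format(**fill)))
    return pairs

for good, bad in generate_pairs(vocab, template_good, template_bad):
    print(good, "|", bad)
```

Because the two templates share every slot but one, each generated pair differs minimally, which is the property the dataset relies on.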
You can try the following command to generate the dataset:
python -m data_gen -I projects/ZhoBLiMP -O ZhoBLiMP

You can also launch the web interface to add or modify templates, or to start a new project of your own (please refer to the docs in frontend for the grammar of crafting templates).
cd frontend
python app.py

You can check the following demo video for the web interface:
