Jialong Tang, Pei Zhang, Baosong Yang, Rui Wang, Hai Hu
Shanghai Jiao Tong University & Tongyi Lab
If you have any questions, please contact Yikang Liu and Hai Hu.
Our work follows a long line of using minimal pairs to probe linguistic knowledge in language models.
- We introduce a new dataset, ZhoBLiMP, of Chinese minimal pairs.
- We train the Zh-Pythia LM suite from scratch on Chinese data to investigate the learning of Chinese syntax.
- We propose a new linking function, SLLN-LP, to mitigate the length bias when evaluating LMs on minimal pairs of unequal lengths.
- We build a GUI for semi-automatic minimal pair generation.
ZhoBLiMP is a dataset for probing Chinese linguistic knowledge in language models, especially syntax. It contains 35k minimal pairs, each differing in a minimal way to demonstrate a single syntactic or semantic contrast. ZhoBLiMP has 118 paradigms across 15 high-level linguistic phenomena.
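To illustrate how minimal pairs probe an LM, here is a toy sketch: a model "knows" a contrast if it scores the grammatical sentence above the ungrammatical one. The scores below are made-up numbers standing in for sentence log probabilities, not real model outputs.

```python
# Sketch of minimal-pair evaluation: for each (good, bad) pair, the LM
# is correct if it assigns a higher score to the grammatical sentence.

def accuracy(pairs):
    """Fraction of pairs where the good sentence outscores the bad one."""
    correct = sum(1 for good_lp, bad_lp in pairs if good_lp > bad_lp)
    return correct / len(pairs)

# (good_log_prob, bad_log_prob) for three hypothetical minimal pairs
toy_pairs = [(-12.3, -15.1), (-8.7, -9.2), (-20.4, -18.9)]
print(accuracy(toy_pairs))  # 2 of 3 pairs scored correctly
```

Per-paradigm accuracies computed this way are what the evaluation script below reports.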
We include the following files in this repo:
- data/human_validation_result.csv: the results of human validation for each generated paradigm
- ZhoBLiMP.tar.gz: the ZhoBLiMP dataset
- ZhoBLiMP-excluded.tar.gz: the paradigms excluded due to low human agreement
You can download the model checkpoints of Zh-Pythia used in our paper from the Hugging Face model hub.
huggingface-cli download SJTU-CL/Zh-Pythia-1.4B --local-dir path/to/save --local-dir-use-symlinks False

Then you can evaluate the Zh-Pythia models, or any other open-source models, on the ZhoBLiMP dataset by running the following command:
python src/run.py \
--data_dir data/ZhoBLiMP \
--model_name_or_path path/to/model \
--unigram_prob_file unigram/chinese-llama.json \
--batch_size 100 \
--max_length 64 \
--device cuda \
    --output_dir results

We propose a sublinear length normalization of log probabilities (SLLN-LP).
We release our code for SLLN-LP and other linking functions in src/metrics. The function pow_norm_lp implements SLLN-LP.
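A minimal sketch of the idea behind SLLN-LP follows. The exponent beta below is an illustrative placeholder, not the paper's setting; see pow_norm_lp in src/metrics for the actual implementation.

```python
def pow_norm_lp(log_prob, length, beta=0.8):
    """Normalize a sentence log probability by a sublinear power of its
    length. With beta < 1 the length penalty grows slower than linearly,
    so longer sentences are not over-penalized relative to shorter ones.
    (beta=0.8 is an illustrative value, not the paper's setting.)"""
    return log_prob / (length ** beta)

# A longer sentence with a lower raw log prob can still win after
# sublinear length normalization:
print(pow_norm_lp(-30.0, 20))  # longer sentence
print(pow_norm_lp(-18.0, 8))   # shorter sentence
```

With beta = 1 this reduces to plain per-token normalization; a sublinear exponent sits between no normalization and full length normalization.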
For linking functions with frequency normalization, we release the token unigram frequencies from the training data of the Zh-Pythia models in unigram. Use the file that matches the tokenizer of the target model, which you can find in our model card.
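As one example of what such a unigram table enables, a frequency-normalized score in the SLOR family can be sketched as below. The JSON layout and the exact formula used in src/metrics are assumptions here, shown only to convey the idea.

```python
import math

def slor(log_prob, token_ids, unigram_log_probs):
    """SLOR-style score: subtract the unigram log probability of the
    tokens from the sentence log probability, then divide by length,
    so sentences made of frequent tokens are not unfairly favored."""
    unigram_lp = sum(unigram_log_probs[t] for t in token_ids)
    return (log_prob - unigram_lp) / len(token_ids)

# Toy unigram table keyed by token id; a real one would be loaded from
# a file such as unigram/chinese-llama.json (layout assumed).
unigram_log_probs = {0: math.log(0.01), 1: math.log(0.001), 2: math.log(0.0005)}
print(slor(-15.0, [0, 1, 2], unigram_log_probs))
```

Scores above zero mean the model assigns the sentence more probability than a unigram baseline would.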
The ZhoBLiMP dataset is available in the file ZhoBLiMP.tar.gz. The dataset is generated by the following steps:
- We annotate lexicons with linguistic properties to build a vocabulary for generation (see assets/vocab.tsv).
- We craft grammar templates for each linguistic paradigm. Each paradigm is a JSON file (see projects/ZhoBLiMP).
- We generate minimal pairs by filling in the templates with the vocabulary through the data_gen module.
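The generation step can be pictured as crossing template slots with annotated vocabulary items. The toy sketch below uses invented English fillers and field names for readability; the real JSON schema lives in projects/ZhoBLiMP and the real Chinese lexicon in assets/vocab.tsv.

```python
from itertools import product

# Toy vocabulary annotated by slot (invented for illustration).
vocab = {
    "subject": ["the student", "the teacher"],
    "object_good": ["one book"],
    "object_bad": ["one books"],  # invented ungrammatical filler
}

# A toy paradigm: the good/bad templates differ in exactly one slot.
template_good = "{subject} reads {object_good}."
template_bad = "{subject} reads {object_bad}."

def generate_pairs(vocab, good, bad):
    """Cross all slot fillers to yield (grammatical, ungrammatical) pairs."""
    keys = sorted(vocab)
    pairs = []
    for combo in product(*(vocab[k] for k in keys)):
        fill = dict(zip(keys, combo))
        pairs.append((good.format(**fill), bad.format(**fill)))
    return pairs

for good, bad in generate_pairs(vocab, template_good, template_bad):
    print(good, "|", bad)
```

Because the two templates share every slot but one, each generated pair differs minimally, which is the property the dataset relies on.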
You can try the following command to generate the dataset:
python -m data_gen -I projects/ZhoBLiMP -O ZhoBLiMP

You can also launch the web interface to add or modify templates, or to start a new project of your own (please refer to the docs in frontend for the grammar of crafting templates).
cd frontend
python app.py

You can check the following demo video for the web interface:
