Hierarchical Language Model

PyTorch implementation of the paper "From Characters to Words: Hierarchical Pre-trained Language Model for Open-vocabulary Language Understanding", accepted at ACL 2023.

[Paper & Supplementary Material]

Abstract

Current state-of-the-art models for natural language understanding require a preprocessing step to convert raw text into discrete tokens. This process, known as tokenization, relies on a pre-built vocabulary of words or sub-word morphemes. This fixed vocabulary limits the model’s robustness to spelling errors and its capacity to adapt to new domains. In this work, we introduce a novel open-vocabulary language model that adopts a hierarchical two-level approach: one at the word level and another at the sequence level. Concretely, we design an intra-word module that uses a shallow Transformer architecture to learn word representations from their characters, and a deep inter-word Transformer module that contextualizes each word representation by attending to the entire word sequence. Our model thus directly operates on character sequences with explicit awareness of word boundaries, but without biased sub-word or word-level vocabulary. Experiments on various downstream tasks show that our method outperforms strong baselines. We also demonstrate that our hierarchical model is robust to textual corruption and domain shift.
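To make the input side of this idea concrete, here is a minimal sketch of how raw text could be arranged as a words-by-characters grid with no vocabulary lookup. It is an illustration only and is not taken from this repository: the function name `encode_words`, the special ids `PAD_ID` and `CLS_ID`, the per-word start marker, and the use of Unicode code points as character ids are all assumptions.

```python
from typing import List

PAD_ID = 0   # assumed id used to pad short words
CLS_ID = 1   # assumed per-word start marker (hypothetical)
OFFSET = 2   # shift Unicode code points past the special ids


def encode_words(text: str, max_chars: int = 8) -> List[List[int]]:
    """Turn raw text into a [num_words, max_chars] grid of character ids."""
    grid = []
    for word in text.split():  # whitespace marks the word boundaries
        ids = [CLS_ID] + [ord(ch) + OFFSET for ch in word]
        ids = ids[:max_chars]                     # truncate long words
        ids += [PAD_ID] * (max_chars - len(ids))  # pad short words
        grid.append(ids)
    return grid


# A misspelled input still maps to valid rows; no unknown-token fallback is needed.
grid = encode_words("helo wrld", max_chars=6)
```

In a layout like this, each row would be the input to the intra-word module, and the resulting sequence of per-word vectors would be the input to the inter-word module.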

Requirements

conda create -n hlm python=3.9 -y
conda activate hlm
pip install -r requirements.txt

Data Pre-Processing

python preprocess_data.py --csv_file ./data/sample_pretrain_data_raw.csv \
                          --out_file ./data/sample_pretrain_data_processed.csv

Pre-Training

torchrun --nproc_per_node 8 pretrain.py --config_json ./config/config_pretrain.json
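The command above starts 8 worker processes on one node. `torchrun` passes each worker its identity through the environment variables `RANK`, `LOCAL_RANK`, and `WORLD_SIZE`. The sketch below shows one way a script might read them; `DistEnv` and `read_dist_env` are hypothetical names, and how `pretrain.py` actually consumes these variables is not shown here.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass
class DistEnv:
    rank: int        # global index of this process
    local_rank: int  # index of this process on its node (usually the GPU index)
    world_size: int  # total number of processes


def read_dist_env(env: Mapping[str, str] = os.environ) -> DistEnv:
    """Read the variables torchrun exports; fall back to a single-process run."""
    return DistEnv(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )


# What one of the 8 workers would see on a single node:
worker = read_dist_env({"RANK": "3", "LOCAL_RANK": "3", "WORLD_SIZE": "8"})
```

On a machine with fewer GPUs, lowering `--nproc_per_node` changes `WORLD_SIZE` accordingly; whether the provided config needs adjusting in that case is not stated here.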

Fine-Tuning

Benchmark on GLUE tasks:

sh scripts/run_glue.sh

Benchmark on CoNLL-2003 NER task:

sh scripts/run_ner.sh

Citation

@inproceedings{sun-etal-2023-characters,
    title = "From Characters to Words: Hierarchical Pre-trained Language Model for Open-vocabulary Language Understanding",
    author = "Sun, Li  and
      Luisier, Florian  and
      Batmanghelich, Kayhan  and
      Florencio, Dinei  and
      Zhang, Cha",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    doi = "10.18653/v1/2023.acl-long.200",
    pages = "3605--3620"
}

Reference

Shiba: https://github.com/octanove/shiba

DeBERTa: https://github.com/microsoft/DeBERTa

