In this directory, we provide the PyTorch scripts for the experiments in QUIK. We focus mainly on the language generation task and provide scripts for the OPT, LLaMA-2, and Falcon models.
- `torch`: tested on v2.0.0
- `transformers`: tested on v4.31.0
- `datasets`: tested on v1.17.0
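Assuming a standard Python environment, the tested versions above can be pinned with pip:

```shell
pip install torch==2.0.0 transformers==4.31.0 datasets==1.17.0
```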
Currently, we have the implementation of the following models:
- OPT in `opt.py`
- LLaMA-2 in `llama.py`
- Falcon in `falcon.py`
You can simply run the above files to reproduce the results in the paper. The main arguments are:
- `--model`: the model name (or path to the weights)
- `--dataset`: the calibration dataset for GPTQ quantization
- `--fp_features`: the number of outliers (the corresponding scales should be saved in the `act_scales` directory)
- `--fp_relative`: whether to use more outliers (proportional to the number of input features) in `down_proj` or `fc2` in the LLaMA-2 and Falcon models
- `--int8_down_proj`: whether to use INT8 quantization for `down_proj` in LLaMA-2 models
- `--int8_fc2`: whether to use INT8 quantization for `fc2` in Falcon models
- `--hf_token`: HuggingFace token for accessing the LLaMA-2 and Falcon models
- `--a_bits`: the number of bits for activation quantization
- `--w_bits`: the number of bits for weight quantization
- `--w_clip`: whether to clip the weights
- `--sparsity`: sparsify weights using SparseGPT (for Falcon models)
- `--prunen`: N for N:M sparsity using SparseGPT (for Falcon models)
- `--prunem`: M for N:M sparsity using SparseGPT (for Falcon models)
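To illustrate `--fp_relative`: layers such as `down_proj`/`fc2` have an input width that is a multiple of the model dimension, so they can keep proportionally more full-precision outlier columns. A minimal sketch, assuming a simple proportional rule (the exact computation in the scripts may differ):

```python
def num_outlier_features(fp_features: int, in_features: int,
                         model_dim: int, fp_relative: bool) -> int:
    """Hypothetical sketch of --fp_relative: scale the outlier count
    by how many times wider the layer input is than the model dim."""
    if fp_relative:
        return (in_features // model_dim) * fp_features
    return fp_features

# e.g. LLaMA-2-70B: hidden size 8192, down_proj input width 28672
print(num_outlier_features(256, 28672, 8192, True))   # -> 768
print(num_outlier_features(256, 28672, 8192, False))  # -> 256
```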
For example, to run the LLaMA2-70B model with 256 outliers, you can run the following command:

```bash
python llama.py --model meta-llama/Llama-2-70b-hf --fp_features 256 --fp_relative --a_bits 4 --w_bits 4 --w_clip --hf_token <your_hf_token>
```
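Conceptually, the outlier mechanism keeps a small set of feature columns in full precision and quantizes the rest to low-bit integers. A minimal NumPy sketch of that split, using simple symmetric per-row quantization (purely illustrative; it is not the repository's kernels):

```python
import numpy as np

def quantize_mixed(W: np.ndarray, fp_cols, bits: int = 4) -> np.ndarray:
    """Illustrative sketch: columns listed in fp_cols (the outliers) stay
    at full precision; the remaining columns are symmetrically quantized
    per row to `bits` bits and dequantized back."""
    q = W.copy()
    mask = np.ones(W.shape[1], dtype=bool)
    mask[fp_cols] = False          # True = column gets quantized
    base = W[:, mask]
    maxq = 2 ** (bits - 1) - 1
    scale = np.maximum(np.abs(base).max(axis=1, keepdims=True), 1e-8) / maxq
    q[:, mask] = np.clip(np.round(base / scale), -maxq - 1, maxq) * scale
    return q

W = np.array([[0.1, 5.0, -0.2],
              [0.3, -4.0, 0.05]])
q = quantize_mixed(W, fp_cols=[1])   # column 1 holds the large outliers
```

Keeping the large-magnitude columns out of the quantization grid is what lets the remaining weights use the full low-bit range without large rounding error.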