Llama-2 just got released by @Meta AI and you can already use it in the @huggingface ecosystem.
How do you fine-tune the model on your own data? We are releasing a simple fine-tuning script for single- and multi-GPU setups to get you started in a few lines of code
gist.github.com/younesbelkada/…
younes
- You asked for it. You can now fine-tune a model that has been loaded in 8-bit. With 8-bit fine-tuning, each 1B parameters costs only about 1 GB of GPU RAM for the weights, making it much easier to fine-tune large models. huggingface.co/blog/peft Colab to fine-tune OPT-6.7B in Int8 below 🧵
- The first trillion parameter model on the Hub 🤯 Today we are proud to announce the release of the first Mixture of Experts (MoE) 🧙 models into @huggingface transformers! You can now easily run, train, and explore this fascinating architecture in the Hugging Face ecosystem! ⬇️
- A huge day for open source! 🔥 You can now load models from @huggingface in 4-bit precision using `load_in_4bit` and the bitsandbytes library, with no performance degradation. Announcement notes here: huggingface.co/blog/4bit-tran… Useful resources below. QLoRA: 4-bit finetuning of LLMs is here! With it comes Guanaco, a chatbot on a single GPU, achieving 99% ChatGPT performance on the Vicuna benchmark: Paper: arxiv.org/abs/2305.14314 Code+Demo: github.com/artidoro/qlora Samples: colab.research.google.com/drive/1kK6xasH… Colab: colab.research.google.com/drive/17XEqL1J…
- Fine-tune a 20B language model with RLHF using a 24GB consumer GPU? 🤯 It is now possible using TRL + PEFT! Check out the blog post that explains, step by step, how we achieve this! Blog post: huggingface.co/blog/trl-peft
- New feature alert in the @huggingface ecosystem! Flash Attention 2 is now natively supported in Hugging Face transformers, and it works with training, PEFT, and quantization (GPTQ, QLoRA, LLM.int8). First pip install flash attention, then pass `use_flash_attention_2=True` when loading the model!
- Interested in applying RLHF (Reinforcement Learning from Human Feedback)? Try out trl! At @huggingface we now officially support RLHF training using PPO (Proximal Policy Optimization). Easily train your model in a single- or multi-GPU setup. 🧵 github.com/lvwerra/trl
- MatCha and DePlot from @GoogleAI ! 🧠 A set of foundation models for plots and charts that can perform complex visual reasoning tasks such as plot summarisation/VQA. When combined with instruction-tuned LMs, you can create interesting demos, such as the one below ↓
- The IPO algorithm, a new method from Google DeepMind (arxiv.org/abs/2310.12036), has just been added to the Hugging Face TRL library! Try it out now by installing TRL from source and passing `loss_type="ipo"` when initializing DPOTrainer: huggingface.co/docs/trl/main/…
- BLIP-2 in 8-bit! 🧠 @salesforce has uploaded the first multi-modal chatbot on the Hugging Face Hub! 🤯 BLIP-2 was released and open-sourced last week by @salesforce. Run the model in 8-bit and start chatting with it in a few lines of code! huggingface.co/spaces/hysts/B…
- Mixtral on a free-tier Google Colab with AQLM 2-bit quantization! 🤯 Similar to QuIP#, the AQLM quantization method makes it possible to squeeze LLMs down into an impressively compact format, with a peak memory of ~13GB for Mixtral! Notebook:
- You liked Flan-T5? 🍮 You'll like Flan-UL2 - now on Hugging Face - even more! Thanks @YiTayML @Google for making the weights of the Flan-UL2 model open-source! Repo: huggingface.co/google/flan-ul2 Spaces: huggingface.co/spaces/ybelkad… Inference endpoint: huggingface.co/inference-endp… 🧵
- Following up on the great work from the community that enabled bitsandbytes 4-bit serialization, I pushed Mixtral-Instruct-bnb-4bit to @huggingface for anyone who wants to easily load the model
- Did you know that you can load @OpenAI's Whisper model in 8-bit using LLM.int8() from bitsandbytes & @TimDettmers? How does this quantization technique affect the performance of the model? @ArthurZucker ran some evaluations with 8-bit models, and here are the results ⬇️
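
Several of the posts above quote memory figures for quantized models (for example, roughly 1 GB of GPU RAM per 1B parameters in 8-bit). The arithmetic behind those figures can be sketched as bytes per parameter times parameter count. This is only a back-of-the-envelope estimate for the weights: it ignores activations, adapter weights, optimizer state, and the small overhead of quantization constants. The helper name `weight_memory_gb` and the parameter counts are illustrative assumptions, not taken from any library.

```python
# Rough weight-only memory estimate at different precisions.
# Assumption: memory = number of parameters * bytes per parameter.
# Activations, optimizer state, and quantization metadata are NOT included.

BYTES_PER_PARAM = {
    "fp32": 4.0,  # 32 bits per weight
    "fp16": 2.0,  # 16 bits per weight
    "int8": 1.0,  # 8 bits per weight
    "4bit": 0.5,  # 4 bits per weight
}


def weight_memory_gb(n_params: float, precision: str) -> float:
    """Approximate memory of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    # Parameter counts are taken from the model names (hypothetical round numbers).
    models = [("1B model", 1e9), ("OPT-6.7B", 6.7e9), ("20B model", 20e9)]
    for name, n_params in models:
        row = ", ".join(
            f"{precision}: {weight_memory_gb(n_params, precision):.1f} GB"
            for precision in BYTES_PER_PARAM
        )
        print(f"{name} -> {row}")
```

Under these assumptions, a 20B model in 8-bit needs about 20 GB for its weights, which is in line with the 24GB consumer GPU mentioned in the TRL + PEFT post, provided the trainable part stays small.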













