A library of foundation models in computer vision, natural language processing, and multi-modal learning. This repo mainly includes pretraining methods, foundation models, fine-tuning methods, and mature projects.
Contributions are welcome!
We also plan to curate a collection of usable open-source models and data resources in the future.
- Awesome-Foundation-Model-Papers
  - Computer Vision
  - NLP Foundation Models
  - Multi-Modal Learning
  - Contributions
  - Citation
- MAE: Masked Autoencoders Are Scalable Vision Learners. [paper] [code]
- EVA: Visual Representation Fantasies from BAAI. [01-paper] [02-paper] [code]
- Scaling Vision Transformers. [paper] [code]
- Scaling Vision Transformers to 22 Billion Parameters. [paper]
- Segment Anything. [paper] [code] [project]
- UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer. [paper] [code]
- DeepFloyd IF. [project]
- Consistency Models. [paper] [code]
- Cold Diffusion: Inverting Arbitrary Image Transforms Without Noise. [paper] [code]
- Edit Anything. [code]
- GigaGAN: Scaling up GANs for Text-to-Image Synthesis. [paper]
- Parti: Scaling Autoregressive Models for Content-Rich Text-to-Image Generation. [paper] [project]
- Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks.
- Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks.
- SegGPT: Segmenting Everything In Context. [paper] [code]
- Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. [paper] [code]
- SEEM: Segment Everything Everywhere All at Once. [paper] [code]
- X-Decoder: Generalized Decoding for Pixel, Image, and Language. [paper] [code]
- Unicorn 🦄: Towards Grand Unification of Object Tracking. [paper] [code]
- UNINEXT: Universal Instance Perception as Object Discovery and Retrieval. [paper] [code]
- OneFormer: One Transformer to Rule Universal Image Segmentation. [paper] [code]
- OpenSeeD: A Simple Framework for Open-Vocabulary Segmentation and Detection. [paper] [code]
- FreeSeg: Unified, Universal and Open-Vocabulary Image Segmentation. [paper] [code]
- Pix2seq: A language modeling framework for object detection. [v1-paper] [v2-paper] [code]
- TaskPrompter: Spatial-Channel Multi-Task Prompting for Dense Scene Understanding. [paper] [supplementary] [code]
- Musketeer (All for One, and One for All): A Generalist Vision-Language Model with Task Explanation Prompts. [paper]
- Fast Segment Anything. [paper] [code]
- GPT: Improving Language Understanding by Generative Pre-Training.
- GPT-2: Language Models are Unsupervised Multitask Learners. [paper]
- GPT-3: Language Models are Few-Shot Learners. [paper]
- GPT-4. [paper]
- LLaMA: Open and Efficient Foundation Language Models. [paper] [code]
- Pythia: Interpreting Autoregressive Transformers Across Time and Scale. [paper] [code]
- PaLM: Scaling Language Modeling with Pathways. [paper]
- RedPajama. [blog]
- LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions. [paper] [code]
- MPT. [blog] [code]
- BiLLa: A Bilingual LLaMA with Enhanced Reasoning Ability. [paper]
- OpenLLaMA: An Open Reproduction of LLaMA. [code]
- InternLM. [code]
- InstructGPT: Training language models to follow instructions with human feedback. [paper] [blog]
- Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision. [paper] [code]
- Scaling Instruction-Finetuned Language Models. [paper]
- Self-Instruct: Aligning Language Model with Self Generated Instructions. [paper] [code]
- LIMA: Less Is More for Alignment. [paper]
- Orca: Progressive Learning from Complex Explanation Traces of GPT-4. [paper]
- WizardLM: An Instruction-following LLM Using Evol-Instruct. [paper] [code]
- QLoRA: Efficient Finetuning of Quantized LLMs. [paper] [code]
- Instruction Tuning with GPT-4. [paper] [code]
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. [paper] [code]
- RRHF: Rank Responses to Align Language Models with Human Feedback without tears. [paper] [code] [blog]
- Beaver. [code]
- MOSS-RLHF. [code]
- Stanford Alpaca: An Instruction-following LLaMA Model. [code]
- Alpaca LoRA. [code]
- Vicuna. [code]
- LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention. [code] [paper] [v2-paper]
- StableVicuna. [project]
- Koala: A Dialogue Model for Academic Research. [paper] [code]
- Open-Assistant. [project]
- GPT4All. [code] [demo]
- ChatPLUG: Open-Domain Generative Dialogue System with Internet-Augmented Instruction Tuning for Digital Human. [paper] [code]
- CAMEL: Communicative Agents for “Mind” Exploration of Large Scale Language Model Society. [paper] [code]
- MPT-Chat. [blog] [code]
- ChatGLM2. [code]
- MOSS. [code]
- Luotuo. [code]
- Linly. [code] [blog]
- FastChat-T5. [code]
- ChatGLM-6B. [code]
- ChatRWKV. [code]
- Baize. [paper] [code]
- CLIP: Learning Transferable Visual Models From Natural Language Supervision. [paper] [code]
- ALBEF: Align before Fuse: Vision and Language Representation Learning with Momentum Distillation. [paper] [code]
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. [paper] [code]
- mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality. [paper] [code] [demo] [blog]
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. [code]
- Kosmos-1: Language Is Not All You Need: Aligning Perception with Language Models. [paper] [code]
- Versatile Diffusion: Text, Images and Variations All in One Diffusion Model. [code]
- LLaVA: Large Language and Vision Assistant. [paper] [project] [blog]
- PaLM-E: An Embodied Multimodal Language Model. [paper] [code]
- BEiT-3: Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks. [paper]
- X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages. [paper]
- IMAGEBIND: One Embedding Space To Bind Them All. [paper] [code]
- PaLM 2. [paper]
- InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. [paper]
- MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. [paper] [code]
- LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention. [code] [paper] [v2-paper]
- MMGPT: MultiModal-GPT: A Vision and Language Model for Dialogue with Humans. [paper] [code]
- InternChat: Solving Vision-Centric Tasks by Interacting with Chatbots Beyond Language. [paper] [code]
- VideoChat: Chat-Centric Video Understanding. [paper]
- Otter: A Multi-Modal Model with In-Context Instruction Tuning. [paper] [code]
- DetGPT: Detect What You Need via Reasoning. [paper]
- VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks. [paper]
- VisualGLM. [code]
- PandaGPT: One Model to Instruction-Follow Them All. [project]
- ChatSpot. [demo]
Several more influential repositories also summarize related work on large models:
Contributions are welcome! Anyone interested in this project can open a pull request, and I will list you as a contributor in this repo.
Please cite the repo if you find it useful.
```
@misc{chunjiang2023tobeawesome,
  author = {Chunjiang Ge},
  title = {Awesome-Foundation-Model-Papers},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/John-Ge/awesome-foundation-models}},
}
```