A library of foundation models in computer vision, natural language processing, and multi-modal learning. This repo mainly includes pretraining methods, foundation models, fine-tuning methods, and mature projects.
Contributions are welcome!
We also plan to curate a collection of usable open-source models and data resources in the future.
- Awesome-Foundation-Model-Papers
  - Computer Vision
  - NLP Foundation Models
  - Multi-Modal Learning
  - Contributions
  - Citation
- MAE: Masked Autoencoders Are Scalable Vision Learners. [paper] [code]
- EVA: Visual Representation Fantasies from BAAI. [01-paper] [02-paper] [code]
- Scaling Vision Transformers. [paper] [code]
- Scaling Vision Transformers to 22 Billion Parameters. [paper]
- Segment Anything. [paper] [code] [project]
- UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer. [paper] [code]
- DeepFloyd IF. [project]
- Consistency Models. [paper] [code]
- Cold Diffusion: Inverting Arbitrary Image Transforms Without Noise. [paper] [code]
- Edit Anything. [code]
- GigaGAN: Scaling up GANs for Text-to-Image Synthesis. [paper]
- Parti: Scaling Autoregressive Models for Content-Rich Text-to-Image Generation. [paper] [project]
- Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks.
- Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks.
- SegGPT: Segmenting Everything In Context. [paper] [code]
- Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. [paper] [code]
- SEEM: Segment Everything Everywhere All at Once. [paper] [code]
- X-Decoder: Generalized Decoding for Pixel, Image, and Language. [paper] [code]
- Unicorn 🦄: Towards Grand Unification of Object Tracking. [paper] [code]
- UNINEXT: Universal Instance Perception as Object Discovery and Retrieval. [paper] [code]
- OneFormer: One Transformer to Rule Universal Image Segmentation. [paper] [code]
- OpenSeeD: A Simple Framework for Open-Vocabulary Segmentation and Detection. [paper] [code]
- FreeSeg: Unified, Universal and Open-Vocabulary Image Segmentation. [paper] [code]
- Pix2seq: A language modeling framework for object detection. [v1-paper] [v2-paper] [code]
- TaskPrompter: Spatial-Channel Multi-Task Prompting for Dense Scene Understanding. [paper] [supplementary] [code]
- Musketeer (All for One, and One for All): A Generalist Vision-Language Model with Task Explanation Prompts. [paper]
- Fast Segment Anything. [paper] [code]
- GPT: Improving Language Understanding by Generative Pre-Training.
- GPT-2: Language Models are Unsupervised Multitask Learners. [paper]
- GPT-3: Language Models are Few-Shot Learners. [paper]
- GPT-4. [paper]
- LLaMA: Open and Efficient Foundation Language Models. [paper] [code]
- Pythia: Interpreting Autoregressive Transformers Across Time and Scale. [paper] [code]
- PaLM: Scaling Language Modeling with Pathways. [paper]
- RedPajama. [blog]
- LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions. [paper] [code]
- MPT. [blog] [code]
- BiLLa: A Bilingual LLaMA with Enhanced Reasoning Ability. [paper]
- OpenLLaMA: An Open Reproduction of LLaMA. [code]
- InternLM. [code]
- InstructGPT: Training language models to follow instructions with human feedback. [paper] [blog]
- Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision. [paper] [code]
- Scaling Instruction-Finetuned Language Models. [paper]
- Self-Instruct: Aligning Language Model with Self Generated Instructions. [paper] [code]
- LIMA: Less Is More for Alignment. [paper]
- Orca: Progressive Learning from Complex Explanation Traces of GPT-4. [paper]
- WizardLM: An Instruction-following LLM Using Evol-Instruct. [paper] [code]
- QLoRA: Efficient Finetuning of Quantized LLMs. [paper] [code]
- Instruction Tuning with GPT-4. [paper] [code]
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. [paper] [code]
- RRHF: Rank Responses to Align Language Models with Human Feedback without tears. [paper] [code] [blog]
- Beaver. [code]
- MOSS-RLHF. [code]
- Stanford Alpaca: An Instruction-following LLaMA Model. [code]
- Alpaca LoRA. [code]
- Vicuna. [code]
- LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention. [code] [paper] [v2-paper]
- StableVicuna. [project]
- Koala: A Dialogue Model for Academic Research. [paper] [code]
- Open-Assistant. [project]
- GPT4All. [code] [demo]
- ChatPLUG: Open-Domain Generative Dialogue System with Internet-Augmented Instruction Tuning for Digital Human. [paper] [code]
- CAMEL: Communicative Agents for “Mind” Exploration of Large Scale Language Model Society. [paper] [code]
- MPT-Chat. [blog] [code]
- ChatGLM2. [code]
- MOSS. [code]
- Luotuo. [code]
- Linly. [code] [blog]
- FastChat-T5. [code]
- ChatGLM-6B. [code]
- ChatRWKV. [code]
- Baize. [paper] [code]
- CLIP: Learning Transferable Visual Models From Natural Language Supervision. [paper] [code]
- ALBEF: Align before Fuse: Vision and Language Representation Learning with Momentum Distillation. [paper] [code]
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. [paper] [code]
- mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality. [paper] [code] [demo] [blog]
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. [code]
- Kosmos-1: Language Is Not All You Need: Aligning Perception with Language Models. [paper] [code]
- Versatile Diffusion: Text, Images and Variations All in One Diffusion Model. [code]
- LLaVA: Large Language and Vision Assistant. [paper] [project] [blog]
- PaLM-E: An Embodied Multimodal Language Model. [paper] [code]
- BEiT-3: Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks. [paper]
- X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages. [paper]
- IMAGEBIND: One Embedding Space To Bind Them All. [paper] [code]
- PaLM 2. [paper]
- InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. [paper]
- MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. [paper] [code]
- LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention. [code] [paper] [v2-paper]
- MMGPT: MultiModal-GPT: A Vision and Language Model for Dialogue with Humans. [paper] [code]
- InternChat: Solving Vision-Centric Tasks by Interacting with Chatbots Beyond Language. [paper] [code]
- VideoChat: Chat-Centric Video Understanding. [paper]
- Otter: A Multi-Modal Model with In-Context Instruction Tuning. [paper] [code]
- DetGPT: Detect What You Need via Reasoning. [paper]
- VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks. [paper]
- VisualGLM. [code]
- PandaGPT: One Model to Instruction-Follow Them All. [project]
- ChatSpot. [demo]
Several more influential repositories also summarize related work on large models:
Contributions are welcome! Anyone interested in this project can open a pull request, and I will list you as a contributor in this repo.
Please cite the repo if you find it useful.
```
@misc{chunjiang2023tobeawesome,
  author = {Chunjiang Ge},
  title = {Awesome-Foundation-Model-Papers},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/John-Ge/awesome-foundation-models}},
}
```