Han Guo (@HanGuo97) / X

3,447 posts

Han Guo

@HanGuo97

PhD Student @MIT_CSAIL | Past: @togethercompute @LTIatCMU @MITIBMLab @UNCNLP, @SFResearch, @BaiduResearch | Machine Learning, NLP.

Joined August 2016

Han Guo
@HanGuo97
Jun 6, 2025
We know Attention and its linear-time variants, such as linear attention and State Space Models. But what lies in between? Introducing Log-Linear Attention with:
- Log-linear time training
- Log-time inference (in both time and memory)
- Hardware-efficient Triton kernels
264K
Han Guo
@HanGuo97
Nov 21, 2023
Introducing LQ-LoRA. Decomposing pretrained matrices into (fixed) quantized + (trainable) low-rank components enables more aggressive quantization. We can quantize LLaMA-2 70B to 2.5 bits with minimal degradation in instruction-tuning performance. arxiv.org/abs/2311.12023 🧵1/n
114K
Han Guo
@HanGuo97
Jul 21, 2024
Introducing FLUTE, a CUDA kernel for non-uniformly quantized (via a lookup table) LLM Inference. It accelerates QLoRA's NormalFloat (NF) out of the box and more. As an application, we extended NF4 and are releasing quantized models for LLaMA-3 (8B/70B) and Gemma-2 (9B/27B).
55K
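For illustration, lookup-table (non-uniform) quantization can be sketched in plain Python. This is not FLUTE's API or kernel: the 16-entry table below holds made-up values rather than the NF4 constants, and `nearest_code`, `quantize`, and `dequantize` are hypothetical helper names. The sketch assumes a single scale for the whole weight group and packs two 4-bit codes per byte.

```python
from bisect import bisect_left

# Hypothetical 4-bit lookup table: 16 non-uniformly spaced levels in [-1, 1].
# Illustrative values only; these are NOT the NF4 constants.
LUT = [-1.0, -0.7, -0.5, -0.35, -0.25, -0.15, -0.08, 0.0,
       0.08, 0.15, 0.25, 0.35, 0.5, 0.7, 0.85, 1.0]


def nearest_code(x):
    """Return the index of the table entry closest to x."""
    i = bisect_left(LUT, x)
    if i == 0:
        return 0
    if i == len(LUT):
        return len(LUT) - 1
    # Pick whichever neighbour is closer (ties go to the lower entry).
    return i if LUT[i] - x < x - LUT[i - 1] else i - 1


def quantize(weights):
    """Map floats to 4-bit table indices, two indices per byte."""
    scale = max(abs(w) for w in weights) or 1.0
    codes = [nearest_code(w / scale) for w in weights]
    if len(codes) % 2:
        codes.append(0)  # padding code, dropped again in dequantize
    packed = bytes((codes[i] << 4) | codes[i + 1]
                   for i in range(0, len(codes), 2))
    return packed, scale, len(weights)


def dequantize(packed, scale, n):
    """Unpack the 4-bit indices and look each one up in the table."""
    out = []
    for byte in packed:
        out.append(LUT[byte >> 4] * scale)
        out.append(LUT[byte & 0xF] * scale)
    return out[:n]


packed, scale, n = quantize([0.5, -1.0, 0.16, 2.0])
restored = dequantize(packed, scale, n)
```

Because dequantization is a table lookup rather than an affine map, the levels can be placed wherever the weight distribution calls for them. A fused kernel would do this lookup on the fly inside the matrix multiply instead of materializing `restored`.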
Han Guo
@HanGuo97
Jul 22, 2025
Since our initial arXiv post, several concurrent papers have introduced new architectures with log-linear properties in various forms. Two personal favorites of mine (among others) are:
- Transformer-PSM by @MorrisYau et al., and
- Radial Attention by Xingyang and @lmxyy1999 et
21K
Han Guo
@HanGuo97
Dec 11, 2022
While I'm not at #EMNLP2022, we have two works on the intersection of RL + NLP. RLPrompt: Optimizing Discrete Text Prompts with Reinforcement Learning (arxiv.org/abs/2205.12548) Efficient (Soft) Q-Learning for Text Generation with Limited Good Data (arxiv.org/abs/2106.07704)
Han Guo
@HanGuo97
Jan 2, 2021
Glad to share our latest work "FastIF: Scalable Influence Functions for Efficient Model Interpretation and Debugging"! Joint work with @nazneenrajani @peterbhase @mohitban47 @caimingxiong (@uncnlp @sfresearch). Paper: arxiv.org/abs/2012.15781 Code: github.com/salesforce/fas… 1/5
Han Guo
@HanGuo97
Oct 14, 2022
Super excited to be among this cohort of amazing people! A huge thanks to @ericxing, @yoonrkim, @ZhitingHu, @mohitban47, and everyone who provided mentorship and advice!!
Microsoft Research
@MSFTResearch
Oct 14, 2022
At Microsoft Research, we aim to empower the next generation of computing related research talent. Today, we're thrilled to announce and congratulate this year's Microsoft Research PhD Fellowship recipients from around the world. Meet the 2022 recipients: aka.ms/phdfellowship
Han Guo
@HanGuo97
Apr 16, 2020
Excited to share that I'll be joining @LTIatCMU as a PhD student this fall after three wonderful undergraduate years at @UNCNLP! Huge thanks to everyone who gave me mentorship and help along the way, especially my advisor Mohit @mohitban47 and collaborator Ram @ramakanth1729! 😀
Han Guo
@HanGuo97
Jun 1, 2024
I've had some chances recently to share what we've been working on. In doing so, I made a few basic background slides that explain `torch.matmul` from GPU/CUDA's point of view, why LLM decoding is memory bound, and how weight-only quantization could speed up decoding. Slides 👇
22K
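A rough back-of-the-envelope sketch of the memory-bound argument, assuming a single weight matrix, fp16 activations, and a simple roofline model. The hardware numbers `PEAK_FLOPS` and `BANDWIDTH` are made-up placeholders, not figures from the slides, and `matmul_cost` and `estimated_time` are hypothetical helper names.

```python
# Placeholder hardware numbers (assumptions, not a real GPU spec).
PEAK_FLOPS = 300e12   # FLOP/s
BANDWIDTH = 2e12      # bytes/s of memory bandwidth


def matmul_cost(d_in, d_out, bits_per_weight, batch=1):
    """FLOPs and bytes moved for y = x @ W with x of shape (batch, d_in)."""
    flops = 2 * batch * d_in * d_out                  # one multiply + one add
    weight_bytes = d_in * d_out * bits_per_weight / 8
    act_bytes = 2 * batch * (d_in + d_out)            # fp16 input and output
    return flops, weight_bytes + act_bytes


def estimated_time(flops, nbytes):
    """Roofline estimate: the slower of compute and memory traffic wins."""
    return max(flops / PEAK_FLOPS, nbytes / BANDWIDTH)


# Decoding processes one token at a time, so batch=1 here.
flops16, bytes16 = matmul_cost(4096, 4096, bits_per_weight=16)
flops4, bytes4 = matmul_cost(4096, 4096, bits_per_weight=4)

intensity = flops16 / bytes16        # FLOPs per byte actually achieved
ridge = PEAK_FLOPS / BANDWIDTH       # FLOPs per byte needed to be compute bound
speedup = estimated_time(flops16, bytes16) / estimated_time(flops4, bytes4)
```

Under these assumptions `intensity` is about 1 FLOP per byte while `ridge` is 150, so the time is set by reading the weights. Shrinking the weights from 16 to 4 bits then gives an estimated `speedup` of just under 4x, which is an upper bound that ignores dequantization overhead.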
Han Guo
@HanGuo97
Jan 18, 2024
Happy to share that LQ-LoRA will appear at #ICLR2024. TLDR: using matrix decomposition to enable more aggressive quantization before LoRA fine-tuning. - Paper (updated): arxiv.org/abs/2311.12023. - Code (with more artifacts uploaded such as models): github.com/HanGuo97/lq-lo….
arxiv.org
LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for...
We propose a simple approach for memory-efficient adaptation of pretrained language models. Our approach uses an iterative algorithm to decompose each pretrained matrix into a high-precision...
25K
Han Guo
@HanGuo97
Jun 17, 2021
Excited to share our latest work with Bowen Tan @waterluffy Eric Xing @ZhitingHu! TL;DR: a new NLG formulation from a soft Q-learning perspective, with applications such as learning from noisy data, text attacks, and prompt generation. Paper: arxiv.org/abs/2106.07704 Code: github.com/HanGuo97/soft-…
Han Guo
@HanGuo97
Apr 29, 2023
Unfortunately, I won't be at #ICLR2023, but please check out our recent works on Machine Learning + Systems!
1. Federated Learning as Variational Inference iclr.cc/virtual/2023/p…
2. MPCFormer: Fast, Performant, and Private Transformer inference with MPC iclr.cc/virtual/2023/p…
17K
Han Guo
@HanGuo97
Jun 6, 2025
Replying to @HanGuo97
There has been much recent work on efficient alternatives with sub-quadratic compute and sub-linear memory, including linear attention, state-space models, and long convolution models. Despite their differences, many of these approaches can be captured by the following equation:
11K
Han Guo
@HanGuo97
Sep 14, 2021
Happy to share that our FastIF paper's been accepted at #EMNLP2021! Thanks to wonderful coauthors @nazneenrajani @peterbhase @mohitban47 @CaimingXiong @uncnlp @SFResearch @LTIatCMU Updated paper/code (w. more exps on ANLI/WILDS): arxiv.org/abs/2012.15781 github.com/salesforce/fas…