Xianjun Yang

463 posts

@xianjun_agi

MTS @reflection_ai Opinions are my own

linkedin.com/in/xianjun-yan…

Joined February 2020

Xianjun Yang
@xianjun_agi
Oct 22, 2025
I was laid off by Meta today. As a Research Scientist, my work was just cited by the legendary @johnschulman2 and Nicholas Carlini yesterday. I’m actively looking for new opportunities — please reach out if you have any openings!
Susan Zhang
@suchenzang
Oct 22, 2025
👀
Xianjun Yang
@xianjun_agi
Oct 23, 2025
As a new grad and early-career researcher, I’m truly overwhelmed and grateful for the incredible support from the community. Within 24 hours, I’ve received hundreds of kind messages and job opportunities— a reminder of how warm and vibrant the AI community is. I’ll take time to
arxiv.org
Verifying Chain-of-Thought Reasoning via Its Computational Graph
Current Chain-of-Thought (CoT) verification methods predict reasoning correctness based on outputs (black-box) or activations (gray-box), but offer limited insight into why a computation fails. We...
Xianjun Yang
@xianjun_agi
Jan 7, 2025
✨My career update: excited to join @Meta as a research scientist to shape the next generation of generative AI safety @AIatMeta
Xianjun Yang
@xianjun_agi
Feb 21, 2025
📢My New Paper: Diversity-driven Data Selection for Language Model Tuning through Sparse Autoencoder TLDR: We propose using features from SAEs as a measure of data diversity and complexity, and show their effectiveness for data selection in LLM tuning. arxiv.org/pdf/2502.14050
Xianjun Yang
@xianjun_agi
Oct 6, 2023
New paper 🚨🚨🚨: arxiv.org/abs/2310.02949 Exciting (yet concerning) discoveries about new vulnerabilities of current LLMs: using only 100 adversarial examples and 1 GPU hour, safely aligned models can be subverted to adapt to harmful tasks without sacrificing helpfulness.
arxiv.org
Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models
Warning: This paper contains examples of harmful language, and reader discretion is recommended. The increasing open release of powerful large language models (LLMs) has facilitated the...
Xianjun Yang
@xianjun_agi
Oct 25, 2023
📢[New paper📚] Detection of LLMs-Generated Content We present the first comprehensive survey of detection methods, datasets, attacks, challenges, and outlooks in the era of LLMs. w/ @PanLiangming, @xuandongzhao, @WilliamWangNLP, and others. Paper link:
arxiv.org
A Survey on Detection of LLMs-Generated Content
The burgeoning capabilities of advanced large language models (LLMs) such as ChatGPT have led to an increase in synthetic content generation with implications across a variety of sectors,...
Xianjun Yang
@xianjun_agi
Nov 17, 2025
It's worth noting that most AI detectors can only return a probability that the text is AI-generated. But a high probability alone cannot serve as practical evidence. Luckily, our previous work published at ICLR 2024 can provide strong text-level EVIDENCE to support the
Graham Neubig
@gneubig
Nov 15, 2025
ICLR authors, want to check if your reviews are likely AI generated? ICLR reviewers, want to check if your paper is likely AI generated? Here are AI detection results for every ICLR paper and review from @pangram! It seems that ~21% of reviews may be AI?
Xianjun Yang
@xianjun_agi
Mar 2, 2025
Real LLM test: It's tax season again! I had 3 employers last year due to internships and relocated 4 times across 2 states. Only ChatGPT-4o gave me correct answers; Claude-3.7 and Grok-3 both failed at calculating short-term capital gains (STCG). #Claude37Sonnet #Grok3
Xianjun Yang
@xianjun_agi
Feb 6, 2024
🚀 New Paper Alert! 🚀 Our latest paper introduces a novel weak-to-strong jailbreaking attack on Large Language Models Paper: arxiv.org/abs/2401.17256 Joint with @xuandongzhao @TianyuPang1 @duchao0726 @lileics @yuxiangw_cs @WilliamWangNLP
Xuandong Zhao
@xuandongzhao
Feb 6, 2024
🚀 New Research Alert! Our latest paper introduces a new method to test the robustness of LLMs against jailbreaking attacks. Discover the "Weak-to-Strong Jailbreaking on Large Language Models". Paper: arxiv.org/abs/2401.17256 Code: github.com/XuandongZhao/w…
Xianjun Yang
@xianjun_agi
Jul 17, 2025
Replying to @WenhuChen
If I were a PhD student in China, I'd prefer Mexico. Traveling to a new country is a better experience than staying in the same city lol
Xianjun Yang
@xianjun_agi
Jul 10, 2024
📢New Paper📢 Happy to introduce our new work on whether multimodal LLMs have achieved PhD-level intelligence across diverse scientific disciplines! It turns out that the most advanced MLLMs still lag far behind! #AI4Science
Zekun Li
@ZekunLi0323
Jul 10, 2024
🚨 Introducing “MMSci: A Multimodal Multi-Discipline Dataset for PhD-Level Scientific Comprehension” arxiv.org/abs/2407.04903 🧐Have current multimodal LLMs achieved PhD-level intelligence across diverse scientific disciplines? Are they ready to become AI scientific assistants?
Xianjun Yang
@xianjun_agi
Oct 23, 2025
Replying to @cryptypo and @johnschulman2
DM opened
Xianjun Yang
@xianjun_agi
Jun 8, 2024
Our previous work (Weak-to-Strong Jailbreaking: arxiv.org/abs/2401.17256) also found that "LLM safety alignment is only a few tokens deep." We focus on utilizing this vulnerability to attack a strong model. Happy to see another work shares the same finding but focuses on defense.
Xiangyu Qi
@xiangyuqi_pton
Jun 8, 2024
Our recent paper shows: 1. Current LLM safety alignment is only a few tokens deep. 2. Deepening the safety alignment can make it more robust against multiple jailbreak attacks. 3. Protecting initial token positions can make the alignment more robust against fine-tuning attacks.
arxiv.org
Weak-to-Strong Jailbreaking on Large Language Models
Large language models (LLMs) are vulnerable to jailbreak attacks - resulting in harmful, unethical, or biased text generations. However, existing jailbreaking methods are computationally costly....
Xianjun Yang
@xianjun_agi
Apr 4, 2025
I spent one hour reading this. As an AI safety researcher, here is my comment: good sci-fi, but not a serious scientific prediction.
Daniel Kokotajlo
@DKokotajlo
Apr 3, 2025
"How, exactly, could AI take over by 2027?" Introducing AI 2027: a deeply-researched scenario forecast I wrote alongside @slatestarcodex, @eli_lifland, and @thlarsen