alphaXiv

Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation
27 Apr 2026
Zhiheng Liu
Weiming Ren
Xiaoke Huang

Meta AI researchers introduced Tuna-2, a unified multimodal model that performs visual understanding and generation directly from pixel embeddings, eliminating the need for pretrained vision encoders. This encoder-free architecture achieved competitive performance across nine VQA benchmarks and state-of-the-art results among native UMMs for image generation and editing tasks.

#computer-science #computer-vision-and-pattern-recognition #image-generation

There Will Be a Scientific Theory of Deep Learning
23 Apr 2026
Jamie Simon
Daniel Kunin
Alexander Atanasov

A collaborative group of researchers presents a synthetic argument for the emergence of 'learning mechanics,' a scientific theory of deep learning that aims to explain and predict neural network behavior through mathematical, first-principles calculations. It consolidates diverse theoretical and empirical evidence, suggesting a path toward a unified understanding of phenomena like scaling laws, training dynamics, and universal representations.

#computer-science #machine-learning #machine-psychology

World-R1: Reinforcing 3D Constraints for Text-to-Video Generation
27 Apr 2026
Weijie Wang
Xiaoxuan He
Youping Gu

Researchers from Zhejiang University and Microsoft Research developed World-R1, a reinforcement learning framework that imbues existing text-to-video foundation models with robust 3D geometric consistency without architectural modifications. This approach led to a PSNR improvement of up to 10.23dB over baseline models and a 92% user preference for geometric consistency in generated videos.
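
The PSNR figure quoted above comes from the standard peak signal-to-noise ratio formula. As a minimal sketch (this `psnr` helper and the flat pixel lists are illustrative, not the paper's evaluation code), it also shows why a ~10 dB gain corresponds to roughly a tenfold reduction in mean squared error:

```python
import math

def psnr(reference, generated, max_val=255.0):
    """Peak signal-to-noise ratio (dB) between two equal-length pixel sequences."""
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return float("inf")  # identical inputs
    # PSNR = 10 * log10(MAX^2 / MSE), so dividing MSE by 10 adds exactly 10 dB.
    return 10.0 * math.log10(max_val ** 2 / mse)
```

Because PSNR is logarithmic in MSE, a 10.23 dB improvement over a baseline implies the baseline's per-pixel squared error was more than ten times larger.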

#computer-science #computer-vision-and-pattern-recognition #deep-reinforcement-learning

DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence
24 Apr 2026
DeepSeek-AI

DeepSeek-V4 introduces models that efficiently process contexts up to one million tokens through a hybrid attention architecture and optimized infrastructure, reducing single-token inference FLOPs by up to 73% and KV cache usage by up to 90% compared to its predecessor. The models achieve competitive performance across reasoning, coding, and long-context tasks, establishing new open-source benchmarks.

#computer-science #artificial-intelligence #computation-and-language

Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought
27 Apr 2026
Keshav Ramji
Tahira Naseem
Ramón Fernandez Astudillo

Researchers at IBM Research AI developed Abstract Chain-of-Thought (Abstract-CoT), a post-training framework that replaces lengthy verbalized rationales with a short sequence of discrete, abstract tokens. This method achieves comparable or improved performance over traditional Chain-of-Thought while reducing reasoning token usage by up to 12 times across various benchmarks and language models.

#chain-of-thought #computer-science #computation-and-language

Kwai Summary Attention Technical Report
27 Apr 2026
Chenglong Chu
Guorui Zhou
Guowang Zhang

Kuaishou's OneRec Team developed Kwai Summary Attention (KSA), a hybrid attention mechanism that compresses historical context into learnable summary tokens to enable efficient processing of long input sequences in Large Language Models. This approach reduces KV cache cost to O(N/k) and enhances long-context retrieval, achieving a 5.81-point gain over full attention on RULER-128K and decreasing KV cache memory by 2.5 times at 128K context length.
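
The O(N/k) cache behavior can be sketched in a few lines. This toy `SummaryKVCache` (a hypothetical name) mean-pools every k cached entries into one summary entry; KSA learns the compression with trainable summary tokens rather than pooling, but the storage arithmetic is the same:

```python
class SummaryKVCache:
    """Toy cache that keeps O(N/k) entries after N appended tokens.

    Every k appended entries are collapsed into a single summary entry,
    standing in for KSA's learnable summary tokens (mean-pooled here)."""

    def __init__(self, k: int):
        self.k = k
        self.summaries = []  # compressed history: one entry per k tokens
        self.recent = []     # uncompressed tail: fewer than k entries

    def append(self, kv_vector):
        self.recent.append(kv_vector)
        if len(self.recent) == self.k:
            dim = len(self.recent[0])
            # Collapse the k most recent entries into one summary vector.
            self.summaries.append(
                [sum(v[d] for v in self.recent) / self.k for d in range(dim)]
            )
            self.recent = []

    def __len__(self):
        return len(self.summaries) + len(self.recent)
```

After 128 appended tokens with k=8, the cache holds 16 entries instead of 128, which is where the reported memory reduction at long context lengths comes from.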

#attention-mechanisms #computer-science #artificial-intelligence

Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
24 Apr 2026
Meng Chu
Xuan Billy Zhang
Kevin Qinghong Lin

This research introduces a "levels × laws" taxonomy for agentic world modeling, categorizing capabilities into Predictor, Simulator, and Evolver levels and world types into physical, digital, social, and scientific regimes. The framework unifies disparate research efforts, providing criteria for decision-usable simulation and autonomous model revision, and highlights the shift from passive prediction to active, adaptable environmental understanding for AI agents.

#agentic-frameworks #agents #computer-science

SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning
26 Apr 2026
Alexis Limozin
Eduard Durech
Torsten Hoefler

This research re-evaluates mixed-policy optimization methods for large language model (LLM) reasoning by identifying and correcting bugs in widely used supervised fine-tuning (SFT) training frameworks. It demonstrates that a correctly implemented SFT-then-reinforcement learning (RL) pipeline consistently outperforms current mixed-policy approaches on mathematical reasoning benchmarks, often with greater computational efficiency.

#computer-science #artificial-intelligence #computation-and-language

EgoLive: A Large-Scale Egocentric Dataset from Real-World Human Tasks
26 Apr 2026
Yihang Li
Xuelong Wei
Jingzhou Luo

Joy Future Academy introduced EgoLive, an extensive open-source egocentric dataset featuring 1,680 hours of high-resolution stereo video and multi-modal annotations from 65,866 real-world human task episodes. This dataset provides 6-DoF motion tracking, semantic segmentation, 3D scene reconstruction, and hierarchical language descriptions to foster generalizable robot manipulation models, demonstrating broader semantic coverage and more accurate depth estimation compared to prior work.

#computer-science #robotics

TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents
28 Apr 2026
Jiaqi Wang
Wenhao Zhang
Weijie Shi

Researchers from Alibaba Group and The Chinese University of Hong Kong introduced Temporal Curriculum On-Policy Distillation (TCOD), a framework designed to stabilize on-policy distillation for multi-turn autonomous agents by controlling trajectory depth. TCOD improved success rates by up to 15.71 points over vanilla OPD on ALFWorld and reduced total training time by nearly 32%.

#agents #computer-science #artificial-intelligence

GS-Playground: A High-Throughput Photorealistic Simulator for Vision-Informed Robot Learning
28 Apr 2026
Yufei Jia
Heng Zhang
Ziheng Zhang

GS-Playground is a high-throughput simulator combining a parallel physics engine with a memory-efficient 3D Gaussian Splatting renderer to create scalable photorealistic environments for vision-informed robot learning. It achieved 10,000 FPS rendering at 640x480 resolution and enabled successful zero-shot sim-to-real transfer for various tasks, including manipulation with a 90% success rate.

#computer-science #robotics

Hyperloop Transformers
25 Apr 2026
Abbas Zeitoun
Lucas Torroba-Hennigen
Yoon Kim

Researchers at the Massachusetts Institute of Technology developed Hyperloop Transformers, an architecture that integrates looped Transformers with strategic hyper-connections for parameter efficiency. This approach used approximately 50% fewer parameters than depth-matched baselines while achieving lower perplexity, improved downstream task performance, and strong robustness to INT4 quantization.

#computer-science #computation-and-language #machine-learning

Image Generators are Generalist Vision Learners
22 Apr 2026
Valentin Gabeur
Shangbang Long
Songyou Peng

Google's Vision Banana model, created by instruction-tuning a pretrained image generator, demonstrates that generative models can achieve state-of-the-art performance in both visual understanding and generation. It surpasses existing specialized models on tasks like semantic segmentation and metric depth estimation while maintaining high-quality image generation capabilities.

#computer-science #artificial-intelligence #computer-vision-and-pattern-recognition

ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents
26 Apr 2026
Fanqing Meng
Lingxiao Du
Zijian Wu

ClawMark is presented as a benchmark designed to evaluate language model agents functioning as persistent coworkers across multi-day, multi-turn workflows in dynamic environments requiring raw multimodal evidence. Evaluation of frontier models on ClawMark revealed that while partial progress is achievable, strict task success rates remain low, with models struggling particularly with detecting unannounced environmental changes and committing backend writebacks.

#agentic-frameworks #agents #computer-science

ReVSI: Rebuilding Visual Spatial Intelligence Evaluation for Accurate Assessment of VLM 3D Reasoning
27 Apr 2026
Yiming Zhang
Jiacheng Chen
Jiaqi Tan

Researchers at Simon Fraser University and collaborators developed ReVSI, a meticulously rebuilt benchmark for evaluating Visual Spatial Intelligence in Vision-Language Models (VLMs), addressing critical validity flaws in previous evaluation datasets. ReVSI revealed that proprietary VLMs generally outperform open-source models on numerical tasks, while specialized fine-tuned 3D models often exhibit reduced or even negative performance gains, frequently hallucinating due to noisy annotations and data biases in prior benchmarks.

#computer-science #computer-vision-and-pattern-recognition #data-curation

JURY-RL: Votes Propose, Proofs Dispose for Label-Free RLVR
28 Apr 2026
Xinjie Chen
Biao Fu
Jing Wu

Reinforcement learning with verifiable rewards (RLVR) enhances the reasoning of large language models (LLMs), but standard RLVR often depends on human-annotated answers or carefully curated reward specifications. In machine-checkable domains, label-free alternatives such as majority voting or LLM-as-a-judge remove annotation cost but can introduce false positives that destabilize training. We introduce JURY-RL, a label-free RLVR framework that decouples answer proposal from reward disposal: votes from model rollouts propose a candidate answer, and a formal verifier determines whether that candidate can receive positive reward. Concretely, only rollouts matching the plurality-voted answer are rewarded when that answer is successfully verified in Lean. When verification is inconclusive, we invoke ResZero (Residual-Zero), a fallback reward that discards the unverified plurality proposal and redistributes a zero-mean, variance-preserving signal over the residual answers. This design maintains a stable optimization gradient without reinforcing unverifiable consensus. Across three backbone models trained on mathematical data, JURY-RL consistently outperforms other label-free baselines on mathematical reasoning benchmarks and transfers competitively to code generation and general benchmarks. It attains pass@1 performance comparable to supervised ground-truth training, with superior generalization demonstrated by higher pass@k and response diversity.
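
The votes-propose/proofs-dispose step can be sketched as follows; `jury_rewards` is a hypothetical name, the verifier is a callable stub standing in for Lean, and the ResZero fallback is simplified here to a mean-centered signal over the residual answers (the paper's version is also variance-preserving):

```python
from collections import Counter

def jury_rewards(answers, verifier):
    """Sketch of JURY-RL reward assignment over one batch of rollout answers.

    Rollout votes propose the plurality answer; a formal verifier disposes.
    If the plurality answer verifies, matching rollouts get reward 1.
    Otherwise a ResZero-style fallback assigns a zero-mean signal over the
    residual (non-plurality) answers, so an unverified consensus is never
    reinforced while the optimization gradient stays non-degenerate."""
    proposal, _ = Counter(answers).most_common(1)[0]
    if verifier(proposal):
        return [1.0 if a == proposal else 0.0 for a in answers]
    # Fallback: credit residual rollouts, centered so the batch mean is zero.
    residual = [0.0 if a == proposal else 1.0 for a in answers]
    mean = sum(residual) / len(residual)
    return [r - mean for r in residual]
```

When the plurality answer verifies, only matching rollouts are rewarded; when verification is inconclusive, the unverified consensus receives no positive signal while training still gets a zero-mean gradient from the residual answers.
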
#agents #computer-science #artificial-intelligence

Recursive Multi-Agent Systems
28 Apr 2026
Xiyuan Yang
Jiaru Zou
Rui Pan

RecursiveMAS introduces a framework that integrates recursive computation into multi-agent systems, enabling agents to refine collaborative reasoning through iterative latent-space interactions rather than explicit text. This approach leads to average accuracy improvements of up to 20.2% and inference speedups of up to 2.4x compared to text-based recursive multi-agent systems, significantly reducing token usage.

#agentic-frameworks #agents #chain-of-thought

MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models
25 Apr 2026
Haohang Huang
Xuan Lu
Mingyi Su

This research introduces MMEB-V3, a comprehensive benchmark spanning text, image, video, and audio modalities, and OmniSET, a diagnostic framework designed to evaluate omni-modality embedding models' ability to interpret and enforce explicit modality constraints. Experiments demonstrate that current models frequently fail to reliably retrieve content in the instructed target modality, often exhibiting strong modality biases and insufficient, misaligned instruction-induced embedding shifts.

#computer-science #information-retrieval

The Last Human-Written Paper: Agent-Native Research Artifacts
27 Apr 2026
Jiachen Liu
Jiaxin Pei
Jintao Huang

A protocol called Agent-Native Research Artifact (ARA) is introduced, reframing research contributions as machine-executable knowledge packages. This framework quantifiably improves knowledge extraction by 21.3%, boosts reproduction success by 7.0%, and accelerates early-phase research extensions for AI agents.

#agentic-frameworks #agents #computer-science

Skill Retrieval Augmentation for Agentic AI
27 Apr 2026
Weihang Su
Jianming Long
Qingyao Ai

A new paradigm, Skill Retrieval Augmentation (SRA), is proposed to address the scalability of external skill utilization in agentic Large Language Models. Researchers at Tsinghua University developed SRA-Bench, a benchmark featuring 26,262 skills, and demonstrated that while external skills enhance agent performance, current LLMs struggle to incorporate and apply these skills under noisy conditions, lacking awareness of both which skills are relevant and when a skill is needed.

#agentic-frameworks #agents #computer-science