ResearchGym: Evaluating Language Model Agents on Real-World AI Research Paper • 2602.15112 • Published 2 days ago • 13
BrowseComp-V^3: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents Paper • 2602.12876 • Published 5 days ago • 6
REDSearcher: A Scalable and Cost-Efficient Framework for Long-Horizon Search Agents Paper • 2602.14234 • Published 3 days ago • 17
Multimodal Fact-Level Attribution for Verifiable Reasoning Paper • 2602.11509 • Published 7 days ago • 4
P-GenRM: Personalized Generative Reward Model with Test-time User-based Scaling Paper • 2602.12116 • Published 6 days ago • 4
Sci-CoE: Co-evolving Scientific Reasoning LLMs via Geometric Consensus with Sparse Supervision Paper • 2602.12164 • Published 6 days ago • 4
Gaia2: Benchmarking LLM Agents on Dynamic and Asynchronous Environments Paper • 2602.11964 • Published 6 days ago • 12
LawThinker: A Deep Research Legal Agent in Dynamic Environments Paper • 2602.12056 • Published 6 days ago • 33
The Devil Behind Moltbook: Anthropic Safety is Always Vanishing in Self-Evolving AI Societies Paper • 2602.09877 • Published 8 days ago • 187
On Robustness and Chain-of-Thought Consistency of RL-Finetuned VLMs Paper • 2602.12506 • Published 6 days ago • 3
SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents Paper • 2602.12984 • Published 5 days ago • 5
What does RL improve for Visual Reasoning? A Frankenstein-Style Analysis Paper • 2602.12395 • Published 6 days ago • 14
Reasoning Cache: Continual Improvement Over Long Horizons via Short-Horizon RL Paper • 2602.03773 • Published 15 days ago • 9
Blockwise Advantage Estimation for Multi-Objective RL with Verifiable Rewards Paper • 2602.10231 • Published 8 days ago • 12
InternAgent-1.5: A Unified Agentic Framework for Long-Horizon Autonomous Scientific Discovery Paper • 2602.08990 • Published 9 days ago • 69