Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts Paper • 2606.05922 • Published 17 days ago • 67
Benchmarking AI Agents for Addressing Scientific Challenges Across Scales Paper • 2606.12736 • Published 12 days ago • 5
LoopCoder-v2: Only Loop Once for Efficient Test-Time Computation Scaling Paper • 2606.18023 • Published 6 days ago • 197
Moebius: 0.2B Lightweight Image Inpainting Framework with 10B-Level Performance Paper • 2606.19195 • Published 5 days ago • 111
WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces Paper • 2606.09426 • Published 14 days ago • 102
Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding? Paper • 2606.08063 • Published 16 days ago • 79
From Correctness to Utility: Gain-Based Prefix Evaluation for LLM Reasoning Paper • 2606.07190 • Published 17 days ago • 35
Crafter: A Multi-Agent Harness for Editable Scientific Figure Generation from Diverse Inputs Paper • 2605.30611 • Published 25 days ago • 246