Xirui Li PRO

AIcell

https://xirui-li.github.io/

AI & ML interests

Multi-Modality

Recent Activity

upvoted a paper 1 day ago

AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation

authored a paper 2 days ago

ClawEnvKit: Automatic Environment Generation for Claw-Like Agents

updated a collection 2 days ago

ClawEnvKit

upvoted a paper 1 day ago

AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation

Paper • 2604.18240 • Published 4 days ago • 14

upvoted 2 papers 2 days ago

ClawEnvKit: Automatic Environment Generation for Claw-Like Agents

Paper • 2604.18543 • Published 4 days ago • 26

Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

Paper • 2604.18292 • Published 4 days ago • 77

upvoted a paper 6 days ago

HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds

Paper • 2604.14268 • Published 9 days ago • 110

upvoted a paper 9 days ago

CocoaBench: Evaluating Unified Digital Agents in the Wild

Paper • 2604.11201 • Published 11 days ago • 35

upvoted a paper 10 days ago

FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios

Paper • 2604.07413 • Published 16 days ago • 94

upvoted 2 papers 11 days ago

ClawBench: Can AI Agents Complete Everyday Online Tasks?

Paper • 2604.08523 • Published 15 days ago • 259

Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability

Paper • 2604.06628 • Published 16 days ago • 319

upvoted a paper 12 days ago

Imagination Helps Visual Reasoning, But Not Yet in Latent Space

Paper • 2602.22766 • Published Feb 26 • 44

upvoted a paper 14 days ago

RAGEN-2: Reasoning Collapse in Agentic RL

Paper • 2604.06268 • Published 17 days ago • 65

upvoted 6 papers 15 days ago

FileGram: Grounding Agent Personalization in File-System Behavioral Traces

Paper • 2604.04901 • Published 18 days ago • 40

Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

Paper • 2604.05015 • Published 18 days ago • 233

ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces

Paper • 2604.05172 • Published 18 days ago • 24

How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings

Paper • 2604.04323 • Published 18 days ago • 41

ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation

Paper • 2604.03922 • Published 19 days ago • 53

Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

Paper • 2604.06132 • Published 17 days ago • 117

upvoted a paper 16 days ago

ClawArena: Benchmarking AI Agents in Evolving Information Environments

Paper • 2604.04202 • Published 19 days ago • 37

upvoted a paper 19 days ago

ClawKeeper: Comprehensive Safety Protection for OpenClaw Agents Through Skills, Plugins, and Watchers

Paper • 2603.24414 • Published 29 days ago • 183

upvoted 2 papers 29 days ago

WildWorld: A Large-Scale Dataset for Dynamic World Modeling with Actions and Explicit State toward Generative ARPG

Paper • 2603.23497 • Published about 1 month ago • 91

Omni-WorldBench: Towards a Comprehensive Interaction-Centric Evaluation for World Models

Paper • 2603.22212 • Published Mar 23 • 126
