The space of possible improvements for your AI model is large while evaluation is costly.
So I was excited to discover the ICML 2026 paper from Kobalczyk, Lin, Letham, Zhao, Balandat, and Bakshy titled "LILO: Bayesian Optimization with Natural Language Feedback."
The method learns efficiently from expert preferences, balancing exploration and exploitation in a principled way with Bayesian Optimization for expensive-to-evaluate black-box objectives.
Experimenting with the technique, I trained a Gaussian Process proxy model on the implicit preferences in my code repo's commit history at VQASynth.
The result: I used the model's preference scores to re-rank candidate papers recommended based on my interests in spatial reasoning and multimodal data synthesis.
Semantic relevance is a high-recall method for finding arXiv papers personalized to your interests. Adding contributor preferences, extracted from the merge history of your code offers a high-precision filter.
So what's next? I'm using the model to synthesize a larger volume of preference data to finetune an open-weight coding model with DPO and LoRA. Tuning Coding Agents via Implicit Preference Distillation
It's an open-source Python project for evaluating and replaying robotics and world-model workflows.
The useful part is not only calling a model. WorldForge records the run, validates action shapes, translates outputs into actions, and keeps replay artifacts you can inspect later.
The current demo uses LeRobot + LeWorldModel on PushT through the official loader:
The harness also has replay-only paths for Cosmos-Policy and GR00T-style outputs, so you can inspect the provider contract from saved artifacts without keeping a GPU server online.
Try it:
pip install worldforge-ai uv run --extra harness worldforge-harness --flow robotics-compare