Understanding vs. Generation: Navigating Optimization Dilemma in Multimodal Models
Paper • 2602.15772 • Published • 7
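| File | Size (approx.) | Contents | Main weights / key artifacts |
|---|---|---|---|
| Bagel_caption_thinking.tar.gz | ~54 GB | SFT project: supervised fine-tuning code, configs, and scripts for Nano250K Reasoning Edit | results/250K/checkpoints/0000500/ (step 500) |
| R3_odd_data_1K.tar | ~72 GB | Understanding-side RL: optimizes only text reasoning/understanding; no gradients are backpropagated through image generation | ckpt-000350 (online_rl_und8k_edit_7p6k) |
| R3_odd_data_1K-img-extract.tar | ~59 GB | Generation-side RL: optimizes diffusion-based image editing, with the MoT understanding branch detached | ckpt-000150 (pairscore_remix_scope_gate) |
| Bagel_eval.tar | ~25 GB | Evaluation project: ImgEdit / understanding benchmark scripts, benchmark data, and past evaluation results | No training weights; contains eval/, scripts/eval/, results/ |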
Bagel_caption_thinking.tar.gz: SFT
tar -xzf Bagel_caption_thinking.tar.gz
cd Bagel_caption_thinking
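After extraction it can help to confirm that the step-500 checkpoint is where the file table says it is. The following is a minimal sketch, assuming the path `results/250K/checkpoints/0000500/` is relative to the extracted package root and that `ema.safetensors` (the only weight file this README names) sits directly inside it. The helper names `sft_checkpoint_dir` and `missing_files` are hypothetical and are not part of the package.

```python
from pathlib import Path


def sft_checkpoint_dir(package_root):
    """Build the step-500 checkpoint path listed in the file table.

    Assumption: the path is relative to the extracted package root.
    """
    return Path(package_root) / "results" / "250K" / "checkpoints" / "0000500"


def missing_files(ckpt_dir, required=("ema.safetensors",)):
    """Return the names in `required` that are not regular files in ckpt_dir.

    Only ema.safetensors is checked by default, because the exact config and
    tokenizer file names are not given in this README.
    """
    ckpt_dir = Path(ckpt_dir)
    return [name for name in required if not (ckpt_dir / name).is_file()]
```

Calling `missing_files(sft_checkpoint_dir("."))` from inside `Bagel_caption_thinking` returns an empty list when the weight file is present under these assumptions.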
See README_SFT.md in the package for details.
R3_odd_data_1K.tar: understanding-side RL
- skip_image_gen=True, reward_fn=entity_diff_vlm: trains only understanding / text CoT
- Und8K-Edit-7.6K.tar (~7.6K edit pairs, included separately inside the package)
hf download wyjlu/Youtu-SFT R3_odd_data_1K.tar
tar -xf R3_odd_data_1K.tar
tar -xf R3_odd_data_1K/Und8K-Edit-7.6K.tar -C R3_odd_data_1K/data/rl_train/
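The two `tar` steps above unpack the package first and then the dataset tarball shipped inside it. For readers who prefer to script this, here is a rough Python equivalent of the second step using the standard-library `tarfile` module. The function name `extract_nested` is hypothetical, and unlike `tar -C` this sketch creates the destination directory if it does not exist yet.

```python
import tarfile
from pathlib import Path


def extract_nested(outer_dir, inner_tar_name, dest_subdir="data/rl_train"):
    """Unpack a dataset tarball found inside an already-extracted package.

    outer_dir      : extracted package directory, e.g. "R3_odd_data_1K"
    inner_tar_name : tarball inside it, e.g. "Und8K-Edit-7.6K.tar"
    dest_subdir    : target directory relative to outer_dir
    """
    outer_dir = Path(outer_dir)
    dest = outer_dir / dest_subdir
    # Differs from `tar -C`, which expects the directory to exist already.
    dest.mkdir(parents=True, exist_ok=True)
    # "r:*" lets tarfile detect plain or compressed archives on its own.
    with tarfile.open(outer_dir / inner_tar_name, "r:*") as archive:
        archive.extractall(dest)
    return dest
```

For example, `extract_nested("R3_odd_data_1K", "Und8K-Edit-7.6K.tar")` mirrors the last command above.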
See README_RL_understanding.md in the package for details.
R3_odd_data_1K-img-extract.tar: generation-side RL
- skip_image_gen=False, train_generation_only=True, reward_fn=edit_pair_score_vlm
- Und8K-Edit-7.6K.tar (~7.6K edit pairs, included separately inside the package)
hf download wyjlu/Youtu-SFT R3_odd_data_1K-img-extract.tar
tar -xf R3_odd_data_1K-img-extract.tar
tar -xf R3_odd_data_1K-img-extract/ThinkEdit-ge7-weakuniq-plus-new-add.tar \
-C R3_odd_data_1K-img-extract/data/rl_train/
export THINKEDIT_DATA_ROOT="$(pwd)/R3_odd_data_1K-img-extract/data/rl_train/ThinkEdit-ge7-weakuniq-plus-new-add"
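The two RL packages differ mainly in the flags quoted in this README. The sketch below records them side by side and derives which branch each stage is meant to update. It is illustrative only: `StageFlags` and `trained_branch` are hypothetical names, the real training code may organise its configuration differently, and `train_generation_only=False` for the understanding stage is an assumption (the README does not state that flag for it).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageFlags:
    skip_image_gen: bool
    train_generation_only: bool
    reward_fn: str


# Values quoted in this README; train_generation_only for the understanding
# stage is assumed, since the README does not list it there.
UNDERSTANDING_RL = StageFlags(
    skip_image_gen=True,
    train_generation_only=False,
    reward_fn="entity_diff_vlm",
)
GENERATION_RL = StageFlags(
    skip_image_gen=False,
    train_generation_only=True,
    reward_fn="edit_pair_score_vlm",
)


def trained_branch(flags):
    """Name the branch a stage optimizes, under the reading described above."""
    if flags.skip_image_gen:
        # No image is generated, so only text reasoning receives gradients.
        return "understanding"
    if flags.train_generation_only:
        # The understanding branch is detached; only image editing is trained.
        return "generation"
    # Hypothetical case that this README does not describe.
    return "both"
```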
See README_RL_generation.md in the package for details.
Bagel_eval.tar: evaluation
- eval/vlm/, eval/gen/: evaluation code and benchmark data
- scripts/eval/: unified launch scripts (e.g. run_imgedit_ckpt000150.sh, run_eval_vlm.sh)
- results/: inference outputs from each evaluation run
hf download wyjlu/Youtu-SFT Bagel_eval.tar
tar -xf Bagel_eval.tar
cd eval  # top-level directory name inside the archive
MODEL_PATH / BAGEL_ROOT
BAGEL-7B-MoT
↓ Bagel_caption_thinking (SFT step-500)
↓ R3_odd_data_1K (understanding-side RL → ckpt-350)
↓ R3_odd_data_1K-img-extract (generation-side RL → ckpt-150)
↳ Bagel_eval (evaluates the checkpoint from each stage)
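The chain above can be written down as data, which makes it easy to look up what a given stage starts from. This is a minimal sketch that reads each arrow as "initialised from the previous stage's output"; that reading, the `PIPELINE` structure, and the `init_checkpoint` helper are assumptions for illustration and are not taken from the packages.

```python
# Stages in the order shown in the diagram; `output` is the artifact each
# stage produces (the base model has no package of its own in this repo).
PIPELINE = [
    {"package": None, "stage": "base model", "output": "BAGEL-7B-MoT"},
    {"package": "Bagel_caption_thinking", "stage": "SFT", "output": "step-500"},
    {"package": "R3_odd_data_1K", "stage": "understanding-side RL", "output": "ckpt-350"},
    {"package": "R3_odd_data_1K-img-extract", "stage": "generation-side RL", "output": "ckpt-150"},
]


def init_checkpoint(package):
    """Return the output of the stage that precedes `package` in the chain."""
    names = [entry["package"] for entry in PIPELINE]
    index = names.index(package)  # raises ValueError for unknown packages
    if index == 0:
        raise ValueError("the base model has no preceding stage")
    return PIPELINE[index - 1]["output"]
```

Under this reading, `init_checkpoint("R3_odd_data_1K")` gives `"step-500"`, the SFT result.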
- ema.safetensors + config + tokenizer
- MASTER_ADDR, REWARD_SERVER_URLS, etc. (see each package's README)
pip install -U huggingface_hub
hf download wyjlu/Youtu-SFT <filename>
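To fetch several packages in one go, the command pattern above can be generated from the file names in the table. The sketch below only builds the argument lists and does not touch the network. `PACKAGES`, `download_command`, and `download_all` are hypothetical helper names, and `download_all` assumes the `hf` CLI installed by `huggingface_hub` is on `PATH`.

```python
import shlex
import subprocess

REPO_ID = "wyjlu/Youtu-SFT"

# File names as listed in the table at the top of this README.
PACKAGES = [
    "Bagel_caption_thinking.tar.gz",
    "R3_odd_data_1K.tar",
    "R3_odd_data_1K-img-extract.tar",
    "Bagel_eval.tar",
]


def download_command(filename):
    """Build the `hf download` argument list for one file of the repo."""
    return ["hf", "download", REPO_ID, filename]


def download_all(filenames=PACKAGES):
    """Run the downloads one after another; stops at the first failure."""
    for name in filenames:
        command = download_command(name)
        print("running:", shlex.join(command))
        subprocess.run(command, check=True)
```

The archives total roughly 210 GB according to the table, so it is worth checking free disk space before calling `download_all()`.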
@misc{ye2026understandingvsgenerationnavigating,
title={Understanding vs. Generation: Navigating Optimization Dilemma in Multimodal Models},
author={Sen Ye and Mengde Xu and Shuyang Gu and Di He and Liwei Wang and Han Hu},
year={2026},
eprint={2602.15772},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2602.15772},
}