Datasets
  MMInstruction/Clevr_CoGenT_TrainA_R1 • Viewer • Updated Feb 13, 2025 • 37.8k • 318 • 48
Embedding
  nvidia/MM-Embed • 8B • Updated Nov 6, 2024 • 831 • 66
  jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images • Paper • 2412.08802 • Published Dec 11, 2024 • 7
  nvidia/NV-Embed-v2 • Feature Extraction • 8B • Updated Jul 21, 2025 • 38.8k • 511
  Qwen/Qwen3-VL-Embedding-2B • Sentence Similarity • 2B • Updated Apr 16 • 1.2M • 412
VLMs
  Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution • Paper • 2409.12191 • Published Sep 18, 2024 • 80
  Multimodal Latent Language Modeling with Next-Token Diffusion • Paper • 2412.08635 • Published Dec 11, 2024 • 49
  AIDC-AI/Ovis2-2B • Image-Text-to-Text • 2B • Updated Aug 15, 2025 • 337 • 60
  DAMO-NLP-SG/VideoLLaMA3-2B • Video-Text-to-Text • 2B • Updated Sep 3, 2025 • 2.81k • 21
Text-to-Image
  black-forest-labs/FLUX.1-dev • Text-to-Image • Updated Jun 27, 2025 • 716k • 12.9k
CLIP series
  Wasserstein Contrastive Representation Distillation • Paper • 2012.08674 • Published Dec 15, 2020
  nvidia/MM-Embed • 8B • Updated Nov 6, 2024 • 831 • 66
  google/siglip2-base-patch16-224 • Zero-Shot Image Classification • 0.4B • Updated Feb 21, 2025 • 447k • 101
LLMs
  Phi-4 Technical Report • Paper • 2412.08905 • Published Dec 12, 2024 • 123
  Qwen/Qwen2.5-Omni-7B • Any-to-Any • 11B • Updated Apr 30, 2025 • 912k • 1.9k