E5-V: Universal Embeddings with Multimodal Large Language Models
E5-V is fine-tuned from lmms-lab/llama3-llava-next-8b.
Overview
We propose a framework, E5-V, that adapts MLLMs to produce universal multimodal embeddings. E5-V effectively bridges the modality gap between different types of inputs, achieving strong performance on multimodal embedding tasks even without fine-tuning. We also propose a single-modality training approach for E5-V, in which the model is trained exclusively on text pairs, and which outperforms multimodal training.
More details can be found at https://github.com/kongds/E5-V.
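For intuition, the single-modality objective can be thought of as standard in-batch contrastive learning over prompt-based text embeddings. The sketch below is illustrative only; the actual loss, data, and hyperparameters are those described in the paper and repository linked above, and the function name contrastive_text_loss and the temperature value are our own placeholders.
import torch
import torch.nn.functional as F

def contrastive_text_loss(anchor_embs, positive_embs, temperature=0.05):
    # anchor_embs, positive_embs: (batch, dim) sentence embeddings produced by the MLLM
    # with the one-word-summary prompt (last hidden state of the final token).
    anchor = F.normalize(anchor_embs, dim=-1)
    positive = F.normalize(positive_embs, dim=-1)
    # In-batch negatives: each anchor should score highest with its own positive.
    logits = anchor @ positive.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)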
Usage
Using Sentence Transformers
Install Sentence Transformers:
pip install "sentence_transformers[image]"
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("royokong/e5-v")
# Encode text inputs
texts = [
    "A dog sitting in the grass.",
    "A dog standing in the snow.",
    "A cat sitting in the grass.",
    "A cat standing in the snow.",
]
text_embeddings = model.encode(texts)
print(text_embeddings.shape)
# (4, 4096)
# Encode image inputs
images = [
    "https://huggingface.co/royokong/e5-v/resolve/main/assets/dog.jpg",
    "https://huggingface.co/royokong/e5-v/resolve/main/assets/cat.jpg",
]
image_embeddings = model.encode(images)
print(image_embeddings.shape)
# (2, 4096)
# Compute text-image similarities
similarities = model.similarity(text_embeddings, image_embeddings)
print(similarities)
# tensor([[0.7183, 0.3579],
# [0.5806, 0.5522],
# [0.4714, 0.6479],
# [0.4150, 0.8081]])
The model uses a custom chat template that automatically wraps text inputs with the instruction "Summary above sentence in one word:" and image inputs with "Summary above image in one word:".
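To turn the similarity matrix into retrieval results, the highest-scoring image can be picked for each caption. This is a minimal sketch reusing the texts, images, and similarities variables defined above:
best_image = similarities.argmax(dim=1).tolist()
for text, idx in zip(texts, best_image):
    # Maps each caption to dog.jpg or cat.jpg according to the scores above.
    print(f"{text} -> {images[idx]}")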
Using transformers
import torch
import torch.nn.functional as F
import requests
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
llama3_template = '<|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n \n'
processor = LlavaNextProcessor.from_pretrained('royokong/e5-v')
model = LlavaNextForConditionalGeneration.from_pretrained('royokong/e5-v', torch_dtype=torch.float16).cuda()
img_prompt = llama3_template.format('<image>\nSummary above image in one word: ')
text_prompt = llama3_template.format('<sent>\nSummary above sentence in one word: ')
urls = [
    'https://huggingface.co/royokong/e5-v/resolve/main/assets/dog.jpg',
    'https://huggingface.co/royokong/e5-v/resolve/main/assets/cat.jpg',
]
images = [Image.open(requests.get(url, stream=True).raw) for url in urls]
texts = [
    'A dog sitting in the grass.',
    'A dog standing in the snow.',
    'A cat sitting in the grass.',
    'A cat standing in the snow.',
]
text_inputs = processor(text=[text_prompt.replace('<sent>', text) for text in texts], return_tensors="pt", padding=True).to('cuda')
img_inputs = processor(text=[img_prompt]*len(images), images=images, return_tensors="pt", padding=True).to('cuda')
with torch.no_grad():
    # The embedding is the last hidden state of the final token (the position after "word: ").
    text_embs = model(**text_inputs, output_hidden_states=True, return_dict=True).hidden_states[-1][:, -1, :]
    img_embs = model(**img_inputs, output_hidden_states=True, return_dict=True).hidden_states[-1][:, -1, :]
text_embs = F.normalize(text_embs, dim=-1)
img_embs = F.normalize(img_embs, dim=-1)
print(text_embs @ img_embs.t())
# tensor([[0.7275, 0.3630],
# [0.5957, 0.5522],
# [0.4709, 0.6406],
# [0.4202, 0.7974]])
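For larger corpora, the same forward pass can be wrapped in a small batched helper. The sketch below is not part of the original example; it reuses processor, model, text_prompt, and F from above, and the helper name embed_texts is our own placeholder:
def embed_texts(texts, batch_size=8):
    embs = []
    for i in range(0, len(texts), batch_size):
        # Wrap each sentence in the one-word-summary prompt, then embed the batch.
        batch = [text_prompt.replace('<sent>', t) for t in texts[i:i + batch_size]]
        inputs = processor(text=batch, return_tensors="pt", padding=True).to('cuda')
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True, return_dict=True)
        embs.append(F.normalize(out.hidden_states[-1][:, -1, :], dim=-1))
    return torch.cat(embs)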