Integrate with Sentence Transformers v5.4
Hello!
Pull Request overview
- Integrate nomic-embed-vision-v1.5 with Sentence Transformers v5.4+
Details
The integration uses a Transformer -> Pooling(cls) -> Normalize pipeline with `modality_config` set to `{"image": {"method": "forward", "method_output_name": "last_hidden_state"}}` so the model accepts image inputs. A `processor_config.json` was added to ensure `AutoProcessor` loads the `CLIPImageProcessor` instead of falling back to a tokenizer (since `model_type: nomic_bert` would otherwise resolve to `BertTokenizer`).
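For reference, the two configuration pieces described above might look roughly like this. The `modality_config` value is quoted from the description; the surrounding key layout, and the use of a `processor_class` key to redirect `AutoProcessor`, are assumptions about the file contents, not verbatim copies of the PR's files.

`sentence_bert_config.json` (sketch):

```json
{
  "modality_config": {
    "image": {
      "method": "forward",
      "method_output_name": "last_hidden_state"
    }
  }
}
```

`processor_config.json` (sketch):

```json
{
  "processor_class": "CLIPImageProcessor"
}
```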
Note: this model requires https://huggingface.co/nomic-ai/nomic-bert-2048/discussions/23 to fix three transformers v5 compatibility issues in the shared modeling code:
- Adding `self.post_init()` to `NomicVisionModel.__init__` (required for `all_tied_weights_keys`)
- Lazy recomputation of rotary position embeddings in `NomicVisionRotaryEmbeddingCat.get_embed` (non-persistent buffers are not materialized when `from_pretrained` initializes on `torch.device("meta")` in v5)
- Replacing the `self.norm_factor` buffer with inline `math.sqrt(self.head_dim)` in `NomicAttentionPooling` and `NomicBertAttention` (same meta-device issue)
Added files:
- `modules.json`: Defines the Transformer -> Pooling -> Normalize pipeline
- `config_sentence_transformers.json`: ST model config with cosine similarity
- `sentence_bert_config.json`: Transformer config with image `modality_config`
- `1_Pooling/config.json`: CLS pooling mode, 768-dim embeddings
- `processor_config.json`: Ensures `AutoProcessor` loads `CLIPImageProcessor`
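As an illustration of what the pipeline files contain, here is a sketch of `modules.json` and `1_Pooling/config.json` following the standard Sentence Transformers layout. The module types and the CLS/768-dim settings come from the description above; the exact field values are illustrative, not copied from the PR's files.

```json
[
  {"idx": 0, "name": "0", "path": "", "type": "sentence_transformers.models.Transformer"},
  {"idx": 1, "name": "1", "path": "1_Pooling", "type": "sentence_transformers.models.Pooling"},
  {"idx": 2, "name": "2", "path": "2_Normalize", "type": "sentence_transformers.models.Normalize"}
]
```

```json
{
  "word_embedding_dimension": 768,
  "pooling_mode_cls_token": true,
  "pooling_mode_mean_tokens": false,
  "pooling_mode_max_tokens": false
}
```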
Modified files:
- `config.json`: Fixed `n_inner` from `2048.0` (float) to `2048` (int) for transformers v5 strict validation
- `README.md`: Added `sentence-transformers` library tag, and a "Using Sentence Transformers" usage section
Here's a script that uses both this PR and the companion PR:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "nomic-ai/nomic-embed-vision-v1.5",
    revision="refs/pr/10",
    model_kwargs={"code_revision": "refs/pr/23"},
    trust_remote_code=True,
)
embeddings = model.encode("http://images.cocodataset.org/val2017/000000039769.jpg")
print(embeddings.shape)
# (768,)
```
Once both PRs are merged, the `revision` and `model_kwargs` arguments can be omitted.
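As a side note on the cosine-similarity setting in `config_sentence_transformers.json`: because the pipeline ends with a `Normalize` module, the embeddings come out unit-length, so cosine similarity reduces to a plain dot product. A minimal sketch with made-up vectors (not real model outputs):

```python
import math

# Two made-up raw embedding vectors, standing in for model outputs.
a = [3.0, 4.0, 0.0]
b = [0.0, 4.0, 3.0]

def l2(v):
    return math.sqrt(sum(x * x for x in v))

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

# What the Normalize module does: scale each vector to unit L2 norm.
a_n = [x / l2(a) for x in a]
b_n = [x / l2(b) for x in b]

# Cosine similarity of the raw vectors equals the dot product of the
# normalized vectors.
cos = dot(a, b) / (l2(a) * l2(b))
print(round(cos, 6), round(dot(a_n, b_n), 6))  # 0.64 0.64
```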
Note that none of the existing behaviour is affected or changed; this only adds an additional way to run this model in a familiar and common format.
- Tom Aarsen
Hello!
This is what I get, for reference:
```python
>>> from sentence_transformers import SentenceTransformer
>>> model = SentenceTransformer("nomic-ai/nomic-embed-vision-v1.5", revision="refs/pr/10", model_kwargs={"code_revision": "refs/pr/23"}, trust_remote_code=True)
>>> embeddings = model.encode("http://images.cocodataset.org/val2017/000000039769.jpg")
>>> print(embeddings)
Loading weights: 100%|██████████████████████████████████████████████████████| 211/211 [00:00<00:00, 5870.40it/s]
[ 4.71330713e-03 -2.53534522e-02  6.63616322e-03 -2.95666978e-02
 -4.34983559e-02 -1.22364080e-02  2.38989759e-03 -3.60762812e-02
 . . .
 -4.39791530e-02 -3.05440221e-02 -1.93784963e-02 -1.76065695e-02
 -3.54587808e-02 -4.97163460e-02  7.33873341e-03 -3.87372449e-02]
```
Can you share your torch and transformers versions?
I'm using torch 2.10.0+cu128 and transformers 5.5.0. I know some older torch versions had issues producing NaN outputs here, although you shouldn't need a version as new as 2.10.0.
- Tom Aarsen
