Instructions to use SeanScripts/NVLM-D-72B-nf4 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use SeanScripts/NVLM-D-72B-nf4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="SeanScripts/NVLM-D-72B-nf4", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

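# Load model directly
from transformers import AutoModel

# NVLM_D is defined in the repository's custom code, not in transformers itself,
# so it is loaded through AutoModel with trust_remote_code=True.
model = AutoModel.from_pretrained("SeanScripts/NVLM-D-72B-nf4", trust_remote_code=True, dtype="auto")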

Local Apps

vLLM

How to use SeanScripts/NVLM-D-72B-nf4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "SeanScripts/NVLM-D-72B-nf4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "SeanScripts/NVLM-D-72B-nf4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
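
For scripted use, the same request can be sent from Python. This is a minimal sketch using only the standard library; the helper names build_chat_request and send_chat_request are made up for illustration, and it assumes the server started above is listening on localhost:8000.

```python
import json
import urllib.request


def build_chat_request(model, prompt, image_url):
    """Build the JSON body for /v1/chat/completions: one text part, one image part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def send_chat_request(payload, base_url="http://localhost:8000"):
    """POST the payload to a running server and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_chat_request(
    "SeanScripts/NVLM-D-72B-nf4",
    "Describe this image in one sentence.",
    "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
)

# Uncomment once the server is running:
# reply = send_chat_request(payload)
# print(reply["choices"][0]["message"]["content"])
```

Passing a different base_url (for example http://localhost:30000) should work for any other server that exposes the same OpenAI-compatible endpoint.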

Use Docker

docker model run hf.co/SeanScripts/NVLM-D-72B-nf4

SGLang

How to use SeanScripts/NVLM-D-72B-nf4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "SeanScripts/NVLM-D-72B-nf4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "SeanScripts/NVLM-D-72B-nf4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "SeanScripts/NVLM-D-72B-nf4" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "SeanScripts/NVLM-D-72B-nf4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use SeanScripts/NVLM-D-72B-nf4 with Docker Model Runner:
```
docker model run hf.co/SeanScripts/NVLM-D-72B-nf4
```

Converted from nvidia/NVLM-D-72B to NF4 (with double quantization) using BitsAndBytes. The original model belongs to NVIDIA and is released under the Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0) license.

This quantization seems to work fine for text-only prompts, but I haven't been able to get coherent responses when an image is included. This is a work in progress, and I could use some help figuring it out.

I made a slight modification to modeling_intern_vit.py, replacing a few occurrences of torch.matmul(x, linearmodule.weight.t()) + linearmodule.bias with linearmodule(x). I'm not sure why these linear layers were applied this way. The two forms are equivalent for an unquantized module, but the original fails once the module is quantized, because it reads the weight tensor directly instead of calling the module. With this change the model "works" in the sense that it runs without errors, but I still haven't been able to get coherent outputs when sending images.
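
To illustrate the replacement described above, here is a small sketch with an ordinary (unquantized) torch.nn.Linear layer. The layer sizes and variable names are arbitrary stand-ins, not taken from modeling_intern_vit.py.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

linear = nn.Linear(8, 4)  # arbitrary stand-in for a layer in the vision encoder
x = torch.randn(2, 8)

# Original style: reads the raw weight tensor, so it assumes a dense (out, in) matrix.
manual = torch.matmul(x, linear.weight.t()) + linear.bias

# Replacement: calls the module, which decides how to apply its own weights.
via_module = linear(x)

# For a plain nn.Linear the two results agree up to floating-point error.
outputs_match = torch.allclose(manual, via_module, atol=1e-6)
print(outputs_match)
```

Only the second form still makes sense after quantization: a quantized layer keeps its weights in a packed representation, so code that treats .weight as a dense matrix breaks, while calling the module goes through the layer's own forward pass.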

The problem might be related to how the Q, K, and V projections are packed into a single QKV module, which may not play well with quantization. I'll look into splitting them into regular Q, K, and V tensors later. Or maybe someone else would like to help.
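
One way such a split could look is sketched below. It is an untested idea rather than a confirmed fix: it assumes the fused layer is a plain nn.Linear whose output rows are ordered [Q; K; V], and that the split is applied to the unquantized weights before quantization. The function name split_qkv and the dimension are hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

dim = 16  # hypothetical embedding size
qkv = nn.Linear(dim, 3 * dim)  # hypothetical fused QKV projection


def split_qkv(fused):
    """Return separate Q, K, V layers, assuming output rows are ordered [Q; K; V]."""
    out_features = fused.out_features // 3
    parts = []
    for weight, bias in zip(fused.weight.chunk(3, dim=0), fused.bias.chunk(3, dim=0)):
        layer = nn.Linear(fused.in_features, out_features)
        with torch.no_grad():
            layer.weight.copy_(weight)
            layer.bias.copy_(bias)
        parts.append(layer)
    return parts


q_proj, k_proj, v_proj = split_qkv(qkv)

# Check that the split layers reproduce the corresponding slices of the fused output.
x = torch.randn(2, 5, dim)
fused_q, fused_k, fused_v = qkv(x).chunk(3, dim=-1)
split_matches = all(
    torch.allclose(a, b, atol=1e-5)
    for a, b in [(q_proj(x), fused_q), (k_proj(x), fused_k), (v_proj(x), fused_v)]
)
print(split_matches)
```

If the real layer interleaves heads differently, the chunk boundaries would have to follow that layout instead, and the attention code would need to call the three layers in place of the fused one.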

I also slightly modified the generate call in modeling_nvlm_d.py so that it no longer forces use_cache=True. Forcing it caused cache tensors to end up on the wrong GPU whenever I tried to use the model more than once.

Requires at least 48 GB of VRAM. Even then, 48 GB probably leaves little room for very long contexts.
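
A rough back-of-the-envelope estimate shows why 48 GB is tight. The sketch below assumes about 72 billion quantized weights (the figure in the model name), 4 bits per weight, and the usual double-quantization overhead of an 8-bit constant per 64-weight block plus a 32-bit constant per 256 blocks. Layers left unquantized, activations, and the KV cache are not counted, so the real footprint is higher.

```python
def nf4_weight_bytes(num_params, block_size=64, second_level_block=256):
    """Estimate storage for NF4 weights with double quantization, in bytes."""
    bits_per_param = (
        4                                            # the 4-bit NF4 code itself
        + 8 / block_size                             # 8-bit scale per block of weights
        + 32 / (block_size * second_level_block)     # 32-bit scale per group of blocks
    )
    return num_params * bits_per_param / 8


weights_gb = nf4_weight_bytes(72e9) / 1e9  # assumed parameter count
headroom_gb = 48 - weights_gb              # what is left on a 48 GB budget

print(f"weights: about {weights_gb:.1f} GB, headroom: about {headroom_gb:.1f} GB")
```

Under these assumptions the weights alone take roughly three quarters of the 48 GB, which is consistent with the limited context length noted above.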

Safetensors
Model size: 82B params
Tensor type: F32, BF16

Model tree for SeanScripts/NVLM-D-72B-nf4
Base model: nvidia/NVLM-D-72B
Quantized (1): this model