text-embeddings-inference documentation
Serving private and gated models
If the model you wish to serve is gated or resides in a private repository on the Hugging Face Hub, you will need a Hugging Face access token with permission to read it.
Once you have confirmed that you have access to the model:
- Navigate to your account’s Profile | Settings | Access Tokens page.
- Generate and copy a read token.
If you’re using the CLI, set the HF_TOKEN environment variable. For example:
export HF_TOKEN=<YOUR READ TOKEN>
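TEI reads the token from the HF_TOKEN environment variable when it downloads model weights. As a quick sanity check before launching, you can confirm the variable is visible to the process that will start the server; the helper below is a hypothetical illustration, not part of TEI itself.

```python
import os

def hf_token_configured(env=os.environ) -> bool:
    # True when HF_TOKEN is present and non-empty, i.e. authenticated
    # Hub downloads can be attempted. (Illustrative helper, not TEI code.)
    return bool(env.get("HF_TOKEN"))

print(hf_token_configured())
```

If this prints False, gated or private model downloads will fail with a 401/403 error from the Hub.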
Alternatively, you can provide the token when deploying the model with Docker:
model=<your private model>
volume=$PWD/data
token=<your Hugging Face Hub read token>

docker run --gpus all -e HF_TOKEN=$token -p 8080:80 \
    -v $volume:/data --pull always \
    ghcr.io/huggingface/text-embeddings-inference:cuda-1.9 \
    --model-id $model
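Once the container is up, you can verify that the private model was loaded by sending a test request to TEI's embed endpoint; the input string here is arbitrary.

```shell
# Query the locally running TEI instance (assumes the container above
# is running and listening on port 8080).
curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs":"What is Deep Learning?"}' \
    -H 'Content-Type: application/json'
```

A successful response is a JSON array containing the embedding vector; an authentication failure at startup would instead have prevented the model from downloading.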