MistralLite-AWQ Model
MistralLite-AWQ is a version of the MistralLite model that was quantized using the AWQ method developed by Lin et al. (2023). The MistralLite-AWQ models are approximately 70% smaller than those of MistralLite whilst maintaining comparable performance.
Please refer to the original MistralLite model card for details about the model preparation and training processes.
MistralLite-AWQ Variants
| Branch | Approx. Model Size | `q_group_size` | `w_bit` | `version` |
|---|---|---|---|---|
| main | 3.9 GB | 128 | 4 | GEMM |
| MistralLite-AWQ-64g-4b-GEMM | 4.0 GB | 64 | 4 | GEMM |
| MistralLite-AWQ-32g-4b-GEMM | 4.3 GB | 32 | 4 | GEMM |
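The size differences between the branches follow from the group size: each group of `q_group_size` weights carries its own quantization metadata, so smaller groups add more overhead per weight. The sketch below is a rough estimate only. It assumes one 16-bit scale and one `w_bit`-wide zero point per group (an assumption about the storage layout, not a description of AutoAWQ's exact file format) and ignores any layers left unquantized.

```python
def effective_bits_per_weight(w_bit: int, q_group_size: int, scale_bits: int = 16) -> float:
    """Estimate storage cost per quantized weight, including per-group metadata.

    Assumption: each group stores one scale (scale_bits wide) and one
    zero point (w_bit wide). Real checkpoints may pack data differently.
    """
    metadata_bits = scale_bits + w_bit
    return w_bit + metadata_bits / q_group_size


# Branch names and group sizes are taken from the variants table.
for branch, group in [
    ("main", 128),
    ("MistralLite-AWQ-64g-4b-GEMM", 64),
    ("MistralLite-AWQ-32g-4b-GEMM", 32),
]:
    print(f"{branch}: ~{effective_bits_per_weight(4, group):.3f} bits per weight")
```

Under these assumptions the estimate grows as the group size shrinks, which matches the ordering of the approximate model sizes in the table.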
Dependencies
- `autoawq==0.2.5`: AutoAWQ was used to quantize the MistralLite model.
- `vllm==0.4.2`: vLLM was used to host models for benchmarking.
Evaluations
Long Context
The following benchmark results are shown as accuracy (%) values, unless stated otherwise.
Topic Retrieval
See https://lmsys.org/blog/2023-06-29-longchat/
| Model Name | n_topics=05 | n_topics=10 | n_topics=15 | n_topics=20 | n_topics=25 |
|---|---|---|---|---|---|
| n_tokens (approx.) = | 3048 | 5966 | 8903 | 11832 | 14757 |
| MistralLite | 100 | 100 | 100 | 100 | 98 |
| MistralLite-AWQ | 100 | 100 | 100 | 100 | 98 |
| MistralLite-AWQ-64g-4b-GEMM | 100 | 100 | 100 | 100 | 98 |
| MistralLite-AWQ-32g-4b-GEMM | 100 | 100 | 100 | 100 | 98 |
| Mistral-7B-Instruct-v0.1 | 96 | 52 | 2 | 0 | 0 |
| Mistral-7B-Instruct-v0.2 | 100 | 100 | 100 | 100 | 100 |
| Mixtral-8x7B-v0.1 | 0 | 0 | 0 | 0 | 0 |
| Mixtral-8x7B-Instruct-v0.1 | 100 | 100 | 100 | 100 | 100 |
Line Retrieval
See https://lmsys.org/blog/2023-06-29-longchat/#longeval-results
| Model Name | n_lines=200 | n_lines=300 | n_lines=400 | n_lines=500 | n_lines=600 | n_lines=680 |
|---|---|---|---|---|---|---|
| n_tokens (approx.) = | 4317 | 6415 | 8510 | 10610 | 12698 | 14373 |
| MistralLite | 100 | 94 | 86 | 82 | 76 | 66 |
| MistralLite-AWQ | 96 | 94 | 88 | 80 | 70 | 62 |
| MistralLite-AWQ-64g-4b-GEMM | 96 | 96 | 90 | 70 | 72 | 60 |
| MistralLite-AWQ-32g-4b-GEMM | 98 | 96 | 84 | 76 | 70 | 62 |
| Mistral-7B-Instruct-v0.1 | 96 | 56 | 38 | 36 | 30 | 30 |
| Mistral-7B-Instruct-v0.2 | 100 | 100 | 96 | 98 | 96 | 84 |
| Mixtral-8x7B-v0.1 | 54 | 38 | 56 | 66 | 62 | 38 |
| Mixtral-8x7B-Instruct-v0.1 | 100 | 100 | 100 | 100 | 100 | 100 |
Pass Key Retrieval
See https://github.com/epfml/landmark-attention/blob/main/llama/run_test.py#L101
| Model Name | n_garbage=12000 | n_garbage=20000 | n_garbage=31000 | n_garbage=38000 | n_garbage=45000 | n_garbage=60000 |
|---|---|---|---|---|---|---|
| n_tokens (approx.) = | 3272 | 5405 | 8338 | 10205 | 12071 | 16072 |
| MistralLite | 100 | 100 | 100 | 100 | 100 | 100 |
| MistralLite-AWQ | 100 | 100 | 100 | 100 | 100 | 100 |
| MistralLite-AWQ-64g-4b-GEMM | 100 | 100 | 100 | 100 | 100 | 100 |
| MistralLite-AWQ-32g-4b-GEMM | 100 | 100 | 100 | 100 | 100 | 100 |
| Mistral-7B-Instruct-v0.1 | 100 | 50 | 30 | 20 | 10 | 10 |
| Mistral-7B-Instruct-v0.2 | 100 | 100 | 100 | 100 | 100 | 100 |
| Mixtral-8x7B-v0.1 | 100 | 100 | 100 | 100 | 100 | 100 |
| Mixtral-8x7B-Instruct-v0.1 | 100 | 100 | 100 | 90 | 100 | 100 |
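To make the `n_garbage` column headers concrete, here is a minimal sketch of how a pass key prompt of this kind can be built. It is loosely modeled on the linked script, not a copy of it: the function name `build_passkey_prompt`, the `seed` parameter, and the exact wording of the instruction and filler text are assumptions. `n_garbage` is treated as a character count, which is consistent with the table (12000 characters correspond to roughly 3272 tokens).

```python
import random


def build_passkey_prompt(n_garbage: int, seed: int = 0) -> tuple[str, int]:
    """Hide a random pass key between n_garbage characters of filler text."""
    rng = random.Random(seed)

    # Split the filler budget randomly between text before and after the key.
    n_prefix = rng.randint(0, n_garbage)
    n_suffix = n_garbage - n_prefix

    # Illustrative filler sentence, repeated until it covers n_garbage characters.
    filler = "The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again. "
    filler_stream = filler * (n_garbage // len(filler) + 1)

    pass_key = rng.randint(1, 50000)
    lines = [
        "There is important information hidden inside a lot of irrelevant text. Find it and memorize it.",
        filler_stream[:n_prefix],
        f"The pass key is {pass_key}. Remember it. {pass_key} is the pass key.",
        filler_stream[:n_suffix],
        "What is the pass key? The pass key is",
    ]
    return "\n".join(lines), pass_key


prompt, pass_key = build_passkey_prompt(n_garbage=12000)
print(f"{len(prompt)} characters, pass key {pass_key}")
```

A model answers correctly when its completion contains the hidden pass key, so accuracy here measures retrieval from an arbitrary position in a long input.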
QuALITY (Question Answering with Long Input Texts, Yes!)
See https://nyu-mll.github.io/quality/
| Model Name | Test set Accuracy | Hard subset Accuracy |
|---|---|---|
| MistralLite | 56.8 | 74.5 |
| MistralLite-AWQ | 55.3 | 71.8 |
| MistralLite-AWQ-64g-4b-GEMM | 55.2 | 72.9 |
| MistralLite-AWQ-32g-4b-GEMM | 56.6 | 72.8 |
| Mistral-7B-Instruct-v0.1 | 45.2 | 58.9 |
| Mistral-7B-Instruct-v0.2 | 55.5 | 74 |
| Mixtral-8x7B-v0.1 | 75 | 74.1 |
| Mixtral-8x7B-Instruct-v0.1 | 68.7 | 83.3 |
Usage
Inference via vLLM HTTP Host
Launch Host
python -m vllm.entrypoints.openai.api_server \
    --model amazon/MistralLite-AWQ \
    --quantization awq
Query Host
curl -X POST http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "amazon/MistralLite-AWQ",
        "prompt": "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>",
        "temperature": 0,
        "echo": false
    }'
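The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl payload field by field. The helper names `build_completion_request` and `query_host` are illustrative, and the host address assumes the server was launched locally on the default port as shown above.

```python
import json
import urllib.request


def build_completion_request(prompt: str, host: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the same POST request as the curl example."""
    payload = {
        "model": "amazon/MistralLite-AWQ",
        "prompt": prompt,
        "temperature": 0,
        "echo": False,
    }
    return urllib.request.Request(
        url=f"{host}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def query_host(prompt: str) -> str:
    """Send the request and return the first completion's text.

    Requires a running host; nothing is sent until this function is called.
    """
    with urllib.request.urlopen(build_completion_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


request = build_completion_request(
    "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>"
)
print(request.full_url)
```

Once the host is up, `query_host(...)` with the same prompt string returns the generated text.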
Inference via vLLM Offline Inference
from vllm import LLM, SamplingParams

prompts = [
    "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>",
]
sampling_params = SamplingParams(temperature=0, max_tokens=100)
llm = LLM(model="amazon/MistralLite-AWQ")
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
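Both usage examples wrap the question in the same template, `<|prompter|>...</s><|assistant|>`. A small helper keeps that template in one place when preparing several prompts. The name `format_prompt` and the second question are illustrative and not part of the model's tooling.

```python
def format_prompt(question: str) -> str:
    """Wrap a user question in the prompt template used in the examples."""
    return f"<|prompter|>{question}</s><|assistant|>"


questions = [
    "What are the main challenges to support a long context for LLM?",
    "How does quantization affect model size?",  # hypothetical extra question
]
prompts = [format_prompt(question) for question in questions]

for prompt in prompts:
    print(prompt)
```

The resulting `prompts` list can be passed to `llm.generate` in place of the hand-written list.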
License
Apache 2.0
Limitations
Before using the MistralLite-AWQ model, perform your own independent assessment and take measures to ensure that your use complies with your own quality control practices and standards, and with the local rules, laws, regulations, licenses, and terms that apply to you and your content.