MistralLite-AWQ Model

MistralLite-AWQ is a version of the MistralLite model that was quantized using the AWQ method developed by Lin et al. (2023). The MistralLite-AWQ models are approximately 70% smaller than the original MistralLite model while maintaining comparable performance on the long-context benchmarks reported under Evaluations.

Please refer to the original MistralLite model card for details about the model preparation and training processes.

MistralLite-AWQ Variants

Branch	Approx. Model Size	`q_group_size`	`w_bit`	`version`
main	3.9 GB	128	4	GEMM
MistralLite-AWQ-64g-4b-GEMM	4.0 GB	64	4	GEMM
MistralLite-AWQ-32g-4b-GEMM	4.3 GB	32	4	GEMM
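
The size ordering in the table is consistent with how group-wise quantization stores metadata: a smaller `q_group_size` means more groups, and each group carries its own scale and zero point. The following is a rough back-of-the-envelope sketch, not the packing code used by AutoAWQ. It assumes one 16-bit scale and one `w_bit`-bit zero point per group, and it ignores layers that are left unquantized (such as embeddings), so it explains the trend rather than the exact file sizes.

```python
def bits_per_weight(w_bit: int, q_group_size: int, scale_bits: int = 16) -> float:
    """Estimate the average storage cost of one quantized weight, in bits.

    Assumption (not from the model card): every group of q_group_size weights
    stores one scale of scale_bits bits and one zero point of w_bit bits.
    """
    per_group_overhead = scale_bits + w_bit
    return w_bit + per_group_overhead / q_group_size


# Compare the three group sizes listed in the variants table.
for group_size in (128, 64, 32):
    print(group_size, bits_per_weight(4, group_size))
```

Under these assumptions the per-weight cost grows as the group size shrinks, which matches the direction of the sizes listed above (3.9 GB, 4.0 GB, 4.3 GB).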

Dependencies

autoawq==0.2.5 – AutoAWQ was used to quantize the MistralLite model.
vllm==0.4.2 – vLLM was used to host models for benchmarking.
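
As an illustration of how the variant parameters map onto AutoAWQ, here is a minimal sketch of a quantization run. The model card does not publish the exact script, so several details are assumptions: the `zero_point` setting, the use of AutoAWQ's default calibration data, the source repository id `amazon/MistralLite`, and the output directory name.

```python
def make_quant_config(q_group_size: int, w_bit: int = 4, version: str = "GEMM") -> dict:
    """Build an AutoAWQ-style quantization config for one variant."""
    return {
        "zero_point": True,  # assumption: not stated in the model card
        "q_group_size": q_group_size,
        "w_bit": w_bit,
        "version": version,
    }


def quantize(source_model: str, output_dir: str, quant_config: dict) -> None:
    """Quantize a model and save it (needs autoawq, transformers, and a GPU)."""
    # Imported lazily so the config helper above works without these packages.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model = AutoAWQForCausalLM.from_pretrained(source_model)
    tokenizer = AutoTokenizer.from_pretrained(source_model)
    # Assumption: AutoAWQ's default calibration dataset is used.
    model.quantize(tokenizer, quant_config=quant_config)
    model.save_quantized(output_dir)
    tokenizer.save_pretrained(output_dir)


# Hypothetical usage; the output directory name is illustrative.
# quantize("amazon/MistralLite", "MistralLite-AWQ-64g", make_quant_config(64))
```

Only `make_quant_config` runs without the heavy dependencies. The `quantize` function is shown to indicate where the config would be passed.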

Evaluations

Long Context

The following benchmark results are shown as accuracy (%) values, unless stated otherwise.

Topic Retrieval

See https://lmsys.org/blog/2023-06-29-longchat/

Model Name	n_topics=05	n_topics=10	n_topics=15	n_topics=20	n_topics=25
n_tokens (approx.) =	3048	5966	8903	11832	14757
MistralLite	100	100	100	100	98
MistralLite-AWQ	100	100	100	100	98
MistralLite-AWQ-64g-4b-GEMM	100	100	100	100	98
MistralLite-AWQ-32g-4b-GEMM	100	100	100	100	98
Mistral-7B-Instruct-v0.1	96	52	2	0	0
Mistral-7B-Instruct-v0.2	100	100	100	100	100
Mixtral-8x7B-v0.1	0	0	0	0	0
Mixtral-8x7B-Instruct-v0.1	100	100	100	100	100

Line Retrieval

See https://lmsys.org/blog/2023-06-29-longchat/#longeval-results

Model Name	n_lines=200	n_lines=300	n_lines=400	n_lines=500	n_lines=600	n_lines=680
n_tokens (approx.) =	4317	6415	8510	10610	12698	14373
MistralLite	100	94	86	82	76	66
MistralLite-AWQ	96	94	88	80	70	62
MistralLite-AWQ-64g-4b-GEMM	96	96	90	70	72	60
MistralLite-AWQ-32g-4b-GEMM	98	96	84	76	70	62
Mistral-7B-Instruct-v0.1	96	56	38	36	30	30
Mistral-7B-Instruct-v0.2	100	100	96	98	96	84
Mixtral-8x7B-v0.1	54	38	56	66	62	38
Mixtral-8x7B-Instruct-v0.1	100	100	100	100	100	100
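
Line Retrieval is the benchmark where the quantized variants differ most from the original model, so a single summary number can help when comparing them. The sketch below averages the accuracies copied from the table above. The unweighted mean across context lengths is a convenience chosen here, not a metric reported by the benchmark authors.

```python
from statistics import mean

# Line Retrieval accuracy (%) copied from the table above,
# ordered by n_lines = 200, 300, 400, 500, 600, 680.
line_retrieval = {
    "MistralLite": [100, 94, 86, 82, 76, 66],
    "MistralLite-AWQ": [96, 94, 88, 80, 70, 62],
    "MistralLite-AWQ-64g-4b-GEMM": [96, 96, 90, 70, 72, 60],
    "MistralLite-AWQ-32g-4b-GEMM": [98, 96, 84, 76, 70, 62],
}

# Compare each variant's unweighted mean against the unquantized model.
baseline = mean(line_retrieval["MistralLite"])
for name, scores in line_retrieval.items():
    avg = mean(scores)
    print(f"{name}: mean={avg:.1f}, delta vs MistralLite={avg - baseline:+.1f}")
```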

Pass Key Retrieval

See https://github.com/epfml/landmark-attention/blob/main/llama/run_test.py#L101

Model Name	n_garbage=12000	n_garbage=20000	n_garbage=31000	n_garbage=38000	n_garbage=45000	n_garbage=60000
n_tokens (approx.) =	3272	5405	8338	10205	12071	16072
MistralLite	100	100	100	100	100	100
MistralLite-AWQ	100	100	100	100	100	100
MistralLite-AWQ-64g-4b-GEMM	100	100	100	100	100	100
MistralLite-AWQ-32g-4b-GEMM	100	100	100	100	100	100
Mistral-7B-Instruct-v0.1	100	50	30	20	10	10
Mistral-7B-Instruct-v0.2	100	100	100	100	100	100
Mixtral-8x7B-v0.1	100	100	100	100	100	100
Mixtral-8x7B-Instruct-v0.1	100	100	100	90	100	100
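
To make the `n_garbage` parameter concrete, here is a minimal sketch of how a pass key prompt can be constructed. It is loosely modeled on the linked landmark-attention script and is not a copy of it: the instruction wording, the filler sentence, and the function name are illustrative. `n_garbage` is treated as the number of filler characters split at random before and after the pass key.

```python
import random

FILLER = "The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again."


def generate_passkey_prompt(n_garbage: int, rng: random.Random) -> tuple[str, int]:
    """Hide a random pass key inside n_garbage characters of filler text."""
    n_prefix = rng.randint(0, n_garbage)
    n_suffix = n_garbage - n_prefix
    # Repeat the filler enough times to cover n_garbage characters.
    filler_text = " ".join([FILLER] * (n_garbage // len(FILLER) + 1))
    pass_key = rng.randint(1, 50000)
    lines = [
        "There is important information hidden inside a lot of irrelevant text. "
        "Find it and memorize it.",
        filler_text[:n_prefix],
        f"The pass key is {pass_key}. Remember it. {pass_key} is the pass key.",
        filler_text[:n_suffix],
        "What is the pass key? The pass key is",
    ]
    return "\n".join(lines), pass_key


prompt, key = generate_passkey_prompt(12000, random.Random(0))
print(len(prompt), key)
```

A response would then be counted as correct if it contains the pass key. That scoring rule is also an assumption of this sketch.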

QuALITY (Question Answering with Long Input Texts, Yes!)

See https://nyu-mll.github.io/quality/

Model Name	Test set Accuracy	Hard subset Accuracy
MistralLite	56.8	74.5
MistralLite-AWQ	55.3	71.8
MistralLite-AWQ-64g-4b-GEMM	55.2	72.9
MistralLite-AWQ-32g-4b-GEMM	56.6	72.8
Mistral-7B-Instruct-v0.1	45.2	58.9
Mistral-7B-Instruct-v0.2	55.5	74
Mixtral-8x7B-v0.1	75	74.1
Mixtral-8x7B-Instruct-v0.1	68.7	83.3

Usage

Inference via vLLM HTTP Host

Launch Host

python -m vllm.entrypoints.openai.api_server \
    --model amazon/MistralLite-AWQ \
    --quantization awq
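
The command above serves the `main` branch. To serve one of the other variants, vLLM's `--revision` option can select a branch by name. The sketch below only builds the command line. The helper name `build_server_command` is hypothetical, and it assumes the branch names listed under MistralLite-AWQ Variants.

```python
import shlex


def build_server_command(branch: str = "main") -> list[str]:
    """Build the vLLM server command for one branch of amazon/MistralLite-AWQ."""
    return [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", "amazon/MistralLite-AWQ",
        "--revision", branch,  # branch name, tag, or commit id
        "--quantization", "awq",
    ]


# Print a shell-ready command for the 64-group variant.
print(shlex.join(build_server_command("MistralLite-AWQ-64g-4b-GEMM")))
```

The returned list can be passed to `subprocess.run` directly, which avoids shell quoting issues.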

Query Host

curl -X POST http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{ "model": "amazon/MistralLite-AWQ",
          "prompt": "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>",
          "temperature": 0,
          "echo": false
    }'
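
The same request can be issued from Python. The following sketch uses only the standard library and assumes the server launched above is listening on `localhost:8000`. The helper names are hypothetical, and `max_tokens` is an added assumption that mirrors the limit used in the offline example.

```python
import json
import urllib.request


def build_prompt(question: str) -> str:
    """Wrap a question in the prompt template used in the curl example."""
    return f"<|prompter|>{question}</s><|assistant|>"


def build_request(question: str, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Create (but do not send) a POST request for the /v1/completions endpoint."""
    payload = {
        "model": "amazon/MistralLite-AWQ",
        "prompt": build_prompt(question),
        "temperature": 0,
        "max_tokens": 100,  # assumption: not part of the curl example
        "echo": False,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(question: str) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(question)) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

Keeping the template in `build_prompt` helps ensure that every request uses the same `<|prompter|>` and `<|assistant|>` markers.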

Inference via vLLM Offline Inference

from vllm import LLM, SamplingParams

prompts = [
   "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>",
]
sampling_params = SamplingParams(temperature=0, max_tokens=100)

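# Set the quantization method explicitly, matching the --quantization flag used for the HTTP host.
llm = LLM(model="amazon/MistralLite-AWQ", quantization="awq")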

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

License

Apache 2.0

Limitations

Before using the MistralLite-AWQ model, perform your own independent assessment and take measures to ensure that your use complies with your own quality control practices and standards, and with the local rules, laws, regulations, licenses, and terms that apply to you and your content.

Reference

Lin et al. (2023). AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. arXiv:2306.00978, published Jun 1, 2023.