Ability to run model quants on dedicated
Z T
We support quantized models. Models quantized with llm-compressor (https://github.com/vllm-project/llm-compressor) are the best supported, e.g. all the INT8, FP8, and INT4 models under https://huggingface.co/RedHatAI.
GGUF models are not optimized at the moment.
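For reference, here is a minimal sketch of loading one of those llm-compressor-quantized checkpoints with vLLM. This assumes vLLM is the serving backend (not stated explicitly above), and the FP8 model name is only an illustrative example of the RedHatAI checkpoints mentioned:

```python
# Minimal sketch: serving an llm-compressor-quantized model with vLLM.
# The model name below is an illustrative example; substitute any
# INT8/FP8/INT4 checkpoint from https://huggingface.co/RedHatAI.
from vllm import LLM, SamplingParams

# vLLM reads the quantization config embedded in the checkpoint,
# so no extra quantization flags are needed for llm-compressor models.
llm = LLM(model="RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8")

params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Explain quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```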
Anonymous
EXL2 support through exllamav2 is much faster in many cases.