Ability to run model quants on dedicated
Z T
We support quantized models. Models quantized with llm-compressor (https://github.com/vllm-project/llm-compressor) are the best supported, e.g. all the INT8, FP8, and INT4 models under https://huggingface.co/RedHatAI.
GGUF models are not optimized at the moment.
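For reference, here is a minimal sketch of loading one of those llm-compressor-quantized checkpoints with vLLM. This assumes vLLM is the serving backend (not stated explicitly above), and the FP8 model name is only an illustrative example of the RedHatAI checkpoints mentioned:

```python
# Minimal sketch: serving an llm-compressor-quantized model with vLLM.
# The model name below is an illustrative example; substitute any
# INT8/FP8/INT4 checkpoint from https://huggingface.co/RedHatAI.
from vllm import LLM, SamplingParams

# vLLM reads the quantization config embedded in the checkpoint,
# so no extra quantization flags are needed for llm-compressor models.
llm = LLM(model="RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8")

params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Explain quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```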
Anonymous
EXL2 support through exllamav2 is much faster in many cases.