More Reasonable Billing for Batched Inference
in progress
Qingyun Li
Currently, downloading the model used for inference can be expensive. For example, downloading Llama 3.3 70B takes over an hour (it's still running as I write this, so it may end up taking several hours).
It would be great if preparation steps like this weren't billed and only the actual compute time was.
Mike Henry
Appreciate the feedback. We're going to switch billing so you're only charged while the GPU is actively processing.
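Roughly, the idea is to meter only the compute phase and leave setup (downloads, weight loading) unmetered. Here's a minimal sketch of that separation; `UsageMeter`, `download_model`, and `run_batched_inference` are hypothetical placeholders, not our actual billing code:

```python
import time
from contextlib import contextmanager

class UsageMeter:
    """Accrues billable seconds only inside billable() contexts."""

    def __init__(self) -> None:
        self.billable_seconds = 0.0

    @contextmanager
    def billable(self):
        # Only time spent inside this context counts toward the bill;
        # everything outside it (e.g. model downloads) does not.
        start = time.monotonic()
        try:
            yield
        finally:
            self.billable_seconds += time.monotonic() - start

# Placeholder steps standing in for real download / inference calls.
def download_model(name: str) -> None:
    time.sleep(0.1)  # unbilled preparation

def run_batched_inference(prompts: list) -> None:
    time.sleep(0.1)  # billed GPU compute

meter = UsageMeter()
download_model("llama-3.3-70b")       # setup: not billed
with meter.billable():
    run_batched_inference(["hello"])  # compute: billed
print(f"billable GPU time: {meter.billable_seconds:.2f}s")
```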