Currently, downloading the model used for inference can be expensive. For example, downloading Llama 3.3 70B has already taken over an hour (it is still running, so it may end up taking multiple hours). It would be great if preparation steps like this weren't billed, and only the actual compute time was.
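For scale, a rough back-of-envelope estimate shows why a download like this can run over an hour. The figures here are assumptions, not from the report above: roughly 70B parameters at 2 bytes each (bf16) is about 140 GB of weights, and a sustained download speed of around 30 MB/s is assumed for illustration:

```python
# Rough estimate of download time for a 70B-parameter model.
# Assumed values (for illustration only): bf16 weights (2 bytes/param)
# and a sustained ~30 MB/s download speed.
PARAMS = 70e9                # ~70 billion parameters
BYTES_PER_PARAM = 2          # bf16 precision
BANDWIDTH = 30e6             # bytes per second, assumed

size_bytes = PARAMS * BYTES_PER_PARAM
hours = size_bytes / BANDWIDTH / 3600
print(f"~{size_bytes / 1e9:.0f} GB, ~{hours:.1f} hours")  # ~140 GB, ~1.3 hours
```

At faster link speeds the wall-clock time drops, but for a model this size the download is still long enough that billing it the same as GPU compute is noticeable.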