API Parameters Support

Completion API

Supported Parameters

  • model - ID of the model to use

  • prompt - The prompt(s) to generate completions for

  • suffix - The suffix that comes after a completion of inserted text

  • max_tokens - Maximum number of tokens to generate

  • temperature - Sampling temperature (0-2)

  • top_p - Nucleus sampling parameter

  • n - Number of completions to generate

  • stream - Whether to stream back partial progress

  • logprobs - Include log probabilities of the most likely tokens

  • echo - Echo back the prompt

  • stop - Sequences where generation should stop

  • presence_penalty - Penalize new tokens based on whether they already appear in the text

  • frequency_penalty - Penalize new tokens based on their frequency in the text so far

  • best_of - Generate multiple completions server-side and return the best one

  • logit_bias - Modify likelihood of specified tokens

  • user - A unique identifier representing your end-user
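
For concreteness, here is one way to call the Completion endpoint with a few of these parameters from Python, using the requests library. This is a minimal sketch assuming an OpenAI-compatible HTTP API; the endpoint URL, API key, and model are placeholders.

import requests

response = requests.post(
    "https://openrouter.ai/api/v1/completions",  # illustrative endpoint
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "mistralai/mistral-small-24b-instruct-2501",
        "prompt": "Write a haiku about the sea.",
        "max_tokens": 64,    # cap on generated tokens
        "temperature": 0.7,  # sampling temperature (0-2)
        "stop": ["\n\n"],    # stop at the first blank line
    },
)
print(response.json())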

Chat API

Supported Parameters

  • model - ID of the model to use

  • messages - List of conversation messages

  • temperature - Sampling temperature (0-2)

  • top_p - Nucleus sampling parameter

  • n - Number of chat completion choices to generate

  • stream - Stream partial message deltas

  • stop - Sequences where generation should stop

  • max_tokens - Maximum number of tokens to generate

  • presence_penalty - Penalize new tokens based on whether they already appear in the text

  • frequency_penalty - Penalize new tokens based on their frequency in the text so far

  • logit_bias - Modify likelihood of specified tokens

  • user - A unique identifier representing your end-user

  • transforms - Array of transformation strategies to apply when needed (see below)
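
A minimal Chat request looks similar; again a sketch assuming an OpenAI-compatible HTTP API, with a placeholder endpoint and key.

import requests

response = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",  # illustrative endpoint
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "mistralai/mistral-small-24b-instruct-2501",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello!"},
        ],
        "max_tokens": 128,
    },
)
print(response.json()["choices"][0]["message"]["content"])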

Message Transforms

The Chat API supports message transformations through the transforms parameter to handle situations where conversations exceed the model's maximum context size or message count limits.

Available Transforms

  • middle-out - Intelligently compresses conversations by removing messages from the middle while preserving context from the beginning and end.

How Middle-Out Transform Works

When middle-out compression is enabled, the API will:

  1. Check if the conversation exceeds the model's context length or message count limits

  2. If within limits, no transformation is applied

  3. If limits are exceeded:

    • Preserve all system messages to maintain important instructions

    • Keep messages from the beginning of the conversation for essential context

    • Keep recent messages from the end of the conversation

    • Remove messages from the middle

    • Add a basic summary message indicating what was removed

The algorithm targets about 80% of the model's maximum context size or message count limit, providing a buffer while preserving as much context as possible.
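
The sketch below illustrates this flow in Python. It is an approximation of the behavior described above, not the service's actual implementation; the 4-characters-per-token estimate and the wording of the summary message are assumptions.

def estimate_tokens(message):
    # Crude assumption for the sketch: roughly 4 characters per token.
    return max(1, len(message["content"]) // 4)

def middle_out(messages, max_context_tokens):
    # 1. Check whether the conversation exceeds the context limit.
    total = sum(estimate_tokens(m) for m in messages)
    if total <= max_context_tokens:
        return messages  # 2. Within limits: no transformation.

    # 3. Limits exceeded: target ~80% of the maximum to leave a buffer.
    target = int(max_context_tokens * 0.8)
    system = [m for m in messages if m["role"] == "system"]  # always preserved
    rest = [m for m in messages if m["role"] != "system"]

    # Estimate how many non-system messages fit in the remaining budget.
    avg = max(1, total // max(1, len(messages)))
    budget = max(2, (target - sum(estimate_tokens(m) for m in system)) // avg)
    if budget >= len(rest):
        return messages  # nothing would actually be removed

    head = budget // 2    # messages kept from the beginning
    tail = budget - head  # messages kept from the end
    summary = {
        "role": "user",  # role and wording of the summary are assumptions
        "content": f"[{len(rest) - head - tail} earlier messages were removed]",
    }
    return system + rest[:head] + [summary] + rest[-tail:]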

Token Limits

For token context limits, the algorithm:

  • Estimates the average tokens per message

  • Calculates how many messages to remove to reach the target token count

  • Distributes the kept messages evenly between the start and end of the conversation
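
As a concrete illustration of that arithmetic, with assumed numbers throughout:

# Worked example of the token-based calculation (all values assumed).
max_context = 8192                        # model context size
target = int(max_context * 0.8)           # 6553-token target
total_tokens = 12000                      # estimated conversation size
n_messages = 200
avg = total_tokens // n_messages          # 60 tokens per message on average
keep = target // avg                      # 109 messages fit the target
remove = n_messages - keep                # 91 messages dropped from the middle
head, tail = keep // 2, keep - keep // 2  # 54 kept from the start, 55 from the end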

Message Count Limits

For models with message count limits (e.g., Claude, with a ~1000-message limit):

  • Targets 80% of the maximum allowed messages

  • Keeps a balanced number of messages from the beginning and end

  • Adds a single summary message to maintain continuity
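
The message-count case follows the same 80% targeting, again with assumed numbers:

# Worked example for a message-count limit (all values assumed).
max_messages = 1000               # e.g. a ~1000-message model limit
target = int(max_messages * 0.8)  # 800 messages kept
n = 2500                          # current conversation length
head = target // 2                # 400 kept from the beginning
tail = target - head              # 400 kept from the end
removed = n - target              # 1700 removed, replaced by one summary message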

Usage Example

{
  "model": "mistralai/mistral-small-24b-instruct-2501",
  "transforms": ["middle-out"],
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
    // ... many more messages ...
    {"role": "user", "content": "What's the weather today?"}
  ]
}

When this feature is enabled with the Chat API, you can work with very long conversations that would otherwise exceed model limits. The API will intelligently compress the conversation while keeping the most relevant parts, ensuring continuity and context preservation.
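
To send such a conversation end to end, a request like the following would work; the endpoint and API key are again placeholders assuming an OpenAI-compatible HTTP API.

import requests

# Build a long conversation that would exceed most context limits.
history = [{"role": "system", "content": "You are a helpful assistant."}]
for i in range(1500):
    history.append({"role": "user", "content": f"Question {i}"})
    history.append({"role": "assistant", "content": f"Answer {i}"})
history.append({"role": "user", "content": "What's the weather today?"})

response = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",  # illustrative endpoint
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "mistralai/mistral-small-24b-instruct-2501",
        "transforms": ["middle-out"],  # let the API compress the history
        "messages": history,
    },
)
print(response.json()["choices"][0]["message"]["content"])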
