OpenAI-Compatible API
rbee exposes an OpenAI-compatible HTTP API, allowing you to use existing OpenAI client libraries and tools with your self-hosted infrastructure.
The OpenAI-compatible endpoints are at /openai/v1/*, not /v1/*.
❌ Wrong: http://localhost:7833/v1/chat/completions
✅ Right: http://localhost:7833/openai/v1/chat/completions
Base URL
http://localhost:7833/openai/v1
Replace localhost:7833 with your queen’s address.
Note the /openai prefix! This distinguishes OpenAI-compatible endpoints from rbee’s native job-based API.
Authentication
Open source queen: No authentication by default.
Premium Queen: API key authentication via Authorization header:
curl -X POST http://localhost:7833/openai/v1/chat/completions \
  -H "Authorization: Bearer your-api-key-here" \
  -H "Content-Type: application/json" \
  -d '...'
Endpoints
Chat completions
Endpoint: POST /openai/v1/chat/completions
Description: Generate a chat completion using a language model.
Request body:
{
"model": "llama-3.1-8b",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain quantum computing in simple terms."}
],
"temperature": 0.7,
"max_tokens": 500,
"stream": false
}
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model name (must match a running worker) |
| messages | array | Yes | Array of message objects with role and content |
| temperature | float | No | Sampling temperature (0.0 to 2.0, default: 1.0) |
| max_tokens | integer | No | Maximum tokens to generate (default: model max) |
| stream | boolean | No | Enable streaming response (default: false) |
| top_p | float | No | Nucleus sampling (0.0 to 1.0, default: 1.0) |
| frequency_penalty | float | No | Frequency penalty (-2.0 to 2.0, default: 0.0) |
| presence_penalty | float | No | Presence penalty (-2.0 to 2.0, default: 0.0) |
| stop | array | No | Stop sequences (array of strings) |
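As a quick sketch using the OpenAI Python SDK (set up as in the Client libraries section below), these parameters map directly onto keyword arguments of chat.completions.create; the top_p and stop values here are illustrative:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:7833/openai/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Explain quantum computing in simple terms."}],
    temperature=0.7,   # sampling temperature (0.0 to 2.0)
    max_tokens=500,    # cap on generated tokens
    top_p=0.9,         # nucleus sampling
    stop=["\n\n"],     # stop at the first blank line
)
print(response.choices[0].message.content)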
Response (non-streaming):
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"created": 1704067200,
"model": "llama-3.1-8b",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Quantum computing is..."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 25,
"completion_tokens": 150,
"total_tokens": 175
}
}
Response (streaming):
Server-Sent Events (SSE) stream with chunks:
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1704067200,"model":"llama-3.1-8b","choices":[{"index":0,"delta":{"role":"assistant","content":"Quantum"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1704067200,"model":"llama-3.1-8b","choices":[{"index":0,"delta":{"content":" computing"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1704067200,"model":"llama-3.1-8b","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
List models
Endpoint: GET /openai/v1/models
Description: List all available models (running workers).
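For example, with the OpenAI Python SDK (a sketch, reusing the client from the chat completions sketch above):
# Each entry corresponds to a model with at least one running worker
for model in client.models.list():
    print(model.id)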
Response:
{
"object": "list",
"data": [
{
"id": "llama-3.1-8b",
"object": "model",
"created": 1704067200,
"owned_by": "rbee",
"permission": [],
"root": "llama-3.1-8b",
"parent": null
},
{
"id": "llama-3.2-1b",
"object": "model",
"created": 1704067200,
"owned_by": "rbee",
"permission": [],
"root": "llama-3.2-1b",
"parent": null
}
]
}
Retrieve model
Endpoint: GET /openai/v1/models/{model}
Description: Get details about a specific model.
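A sketch with the same client:
model = client.models.retrieve("llama-3.1-8b")
print(model.id, model.owned_by)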
Response:
{
"id": "llama-3.1-8b",
"object": "model",
"created": 1704067200,
"owned_by": "rbee",
"permission": [],
"root": "llama-3.1-8b",
"parent": null
}
Client libraries
rbee works with any OpenAI-compatible client library. Just change the base URL to http://localhost:7833/openai/v1 (note the /openai prefix).
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:7833/openai/v1",  # note the /openai prefix!
    api_key="not-needed"  # or your API key for Premium Queen
)

response = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)

print(response.choices[0].message.content)
Streaming example
stream = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
Error responses
rbee returns standard HTTP error codes; each code is described below.
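If you call rbee through the OpenAI Python SDK, these status codes surface as the SDK's exception classes. A minimal sketch, reusing the client from the Client libraries example above (the exception names come from openai>=1.x, not from rbee):
import openai

try:
    response = client.chat.completions.create(
        model="llama-3.1-8b",
        messages=[{"role": "user", "content": "Hello!"}],
    )
except openai.RateLimitError as e:       # 429: quota exceeded (Premium Queen)
    print("Rate limited:", e)
except openai.InternalServerError as e:  # 5xx: worker errors, no workers available
    print("Server-side error:", e)
except openai.APIStatusError as e:       # other non-2xx, e.g. 400 invalid request
    print(f"HTTP {e.status_code}: {e.message}")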
400 Bad Request
Invalid request parameters:
{
"error": {
"message": "Invalid model: model-xyz not found",
"type": "invalid_request_error",
"code": "model_not_found"
}
}
429 Too Many Requests
Premium Queen only. Quota exceeded:
{
"error": {
"message": "Rate limit exceeded. Try again in 60 seconds.",
"type": "rate_limit_error",
"code": "quota_exceeded"
}
}
500 Internal Server Error
Worker error or system failure:
{
"error": {
"message": "Worker crashed during inference",
"type": "server_error",
"code": "worker_error"
}
}
503 Service Unavailable
No workers available for the requested model:
{
"error": {
"message": "No workers available for model: llama-3.1-70b",
"type": "service_unavailable",
"code": "no_workers"
}
}
Rate limiting
Premium Queen only.
Rate limit headers are included in responses:
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 95
X-RateLimit-Reset: 1704067260
Request tracing
Premium Queen only.
Each request gets a unique trace ID for debugging:
X-Trace-ID: trace-abc-123-def-456
Include this ID when reporting issues.
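To read this header (or the rate limit headers above) with the OpenAI Python SDK, you can use the raw-response accessor. A sketch, reusing the client from the Client libraries example:
raw = client.chat.completions.with_raw_response.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(raw.headers.get("X-Trace-ID"))
print(raw.headers.get("X-RateLimit-Remaining"))
response = raw.parse()  # the usual ChatCompletion object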
Differences from OpenAI API
rbee aims for compatibility but has some differences:
Supported features
- ✅ Chat completions (streaming and non-streaming)
- ✅ Model listing
- ✅ Temperature, top_p, max_tokens
- ✅ Stop sequences
- ✅ System/user/assistant roles
Not supported (yet)
- ❌ Function calling / tool use (roadmap: v0.2)
- ❌ Vision models (roadmap: v0.3)
- ❌ Audio transcription via API (use dedicated worker)
- ❌ Embeddings endpoint (roadmap: v0.2)
- ❌ Fine-tuning API
- ❌ Moderation endpoint
rbee-specific extensions
Custom headers:
- X-Hive-Preference - Prefer a specific hive for the request
- X-Priority - Request priority (Premium Worker only)
- X-Trace-ID - Request tracing (Premium Queen only)
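These can also be sent through the OpenAI Python SDK via the per-request extra_headers argument. A sketch, reusing the client from the Client libraries example:
response = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_headers={"X-Hive-Preference": "gpu-01", "X-Priority": "high"},
)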
Example with curl:
curl -X POST http://localhost:7833/openai/v1/chat/completions \
  -H "X-Hive-Preference: gpu-01" \
  -H "X-Priority: high" \
  -H "Content-Type: application/json" \
  -d '...'
Performance tips
Use streaming for long responses
Streaming reduces perceived latency:
stream = client.chat.completions.create(
model="llama-3.1-70b",
messages=[...],
stream=True # Enable streaming
)
Batch similar requests
Premium Worker only. Send multiple requests quickly to benefit from automatic batching:
import asyncio
from openai import AsyncOpenAI  # await requires the async client

client = AsyncOpenAI(base_url="http://localhost:7833/openai/v1", api_key="not-needed")

async def send_request(prompt):
    return await client.chat.completions.create(
        model="llama-3.1-8b",
        messages=[{"role": "user", "content": prompt}]
    )

async def main():
    # Send 10 requests concurrently
    return await asyncio.gather(*[send_request(f"Question {i}") for i in range(10)])

results = asyncio.run(main())
Prefer smaller models when possible
- llama-3.2-1b - Fast, good for simple tasks
- llama-3.1-8b - Balanced performance
- llama-3.1-70b - High quality, slower
Set appropriate max_tokens
Don’t request more tokens than needed:
response = client.chat.completions.create(
model="llama-3.1-8b",
messages=[...],
max_tokens=100 # Limit response length
)
Next steps
- Getting started - Set up rbee
- Premium modules - Advanced API features
- Architecture - How requests are routed