OpenAI-Compatible API

rbee exposes an OpenAI-compatible HTTP API, allowing you to use existing OpenAI client libraries and tools with your self-hosted infrastructure.

Base URL

http://localhost:7833/openai/v1

Replace localhost:7833 with your queen’s address.

Authentication

Open source queen: No authentication by default.

Premium Queen: API key authentication via Authorization header:

curl -X POST http://localhost:7833/openai/v1/chat/completions \
  -H "Authorization: Bearer your-api-key-here" \
  -H "Content-Type: application/json" \
  -d '...'

Endpoints

Chat completions

Endpoint: POST /openai/v1/chat/completions

Description: Generate a chat completion using a language model.

Request body:

{ "model": "llama-3.1-8b", "messages": [ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Explain quantum computing in simple terms."} ], "temperature": 0.7, "max_tokens": 500, "stream": false }

Parameters:

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| model | string | Yes | Model name (must match a running worker) |
| messages | array | Yes | Array of message objects with role and content |
| temperature | float | No | Sampling temperature (0.0 to 2.0, default: 1.0) |
| max_tokens | integer | No | Maximum tokens to generate (default: model max) |
| stream | boolean | No | Enable streaming response (default: false) |
| top_p | float | No | Nucleus sampling (0.0 to 1.0, default: 1.0) |
| frequency_penalty | float | No | Frequency penalty (-2.0 to 2.0, default: 0.0) |
| presence_penalty | float | No | Presence penalty (-2.0 to 2.0, default: 0.0) |
| stop | array | No | Stop sequences (array of strings) |

Response (non-streaming):

{ "id": "chatcmpl-abc123", "object": "chat.completion", "created": 1704067200, "model": "llama-3.1-8b", "choices": [ { "index": 0, "message": { "role": "assistant", "content": "Quantum computing is..." }, "finish_reason": "stop" } ], "usage": { "prompt_tokens": 25, "completion_tokens": 150, "total_tokens": 175 } }

Response (streaming):

Server-Sent Events (SSE) stream with chunks:

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1704067200,"model":"llama-3.1-8b","choices":[{"index":0,"delta":{"role":"assistant","content":"Quantum"},"finish_reason":null}]} data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1704067200,"model":"llama-3.1-8b","choices":[{"index":0,"delta":{"content":" computing"},"finish_reason":null}]} data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1704067200,"model":"llama-3.1-8b","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]} data: [DONE]

List models

Endpoint: GET /openai/v1/models

Description: List all available models (running workers).

Response:

{ "object": "list", "data": [ { "id": "llama-3.1-8b", "object": "model", "created": 1704067200, "owned_by": "rbee", "permission": [], "root": "llama-3.1-8b", "parent": null }, { "id": "llama-3.2-1b", "object": "model", "created": 1704067200, "owned_by": "rbee", "permission": [], "root": "llama-3.2-1b", "parent": null } ] }

Retrieve model

Endpoint: GET /openai/v1/models/{model}

Description: Get details about a specific model.

Response:

{ "id": "llama-3.1-8b", "object": "model", "created": 1704067200, "owned_by": "rbee", "permission": [], "root": "llama-3.1-8b", "parent": null }

Client libraries

rbee works with any OpenAI-compatible client library. Just point the client's base URL at the /openai/v1 prefix.

python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:7833/openai/v1",  # ← note the /openai/v1 prefix
    api_key="not-needed"  # Or your API key for Premium Queen
)

response = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)

print(response.choices[0].message.content)

Streaming example

python
stream = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Error responses

rbee returns standard HTTP error codes:

400 Bad Request

Invalid request parameters:

{ "error": { "message": "Invalid model: model-xyz not found", "type": "invalid_request_error", "code": "model_not_found" } }

429 Too Many Requests

Premium Queen only. Quota exceeded:

{ "error": { "message": "Rate limit exceeded. Try again in 60 seconds.", "type": "rate_limit_error", "code": "quota_exceeded" } }

500 Internal Server Error

Worker error or system failure:

{ "error": { "message": "Worker crashed during inference", "type": "server_error", "code": "worker_error" } }

503 Service Unavailable

No workers available for the requested model:

{ "error": { "message": "No workers available for model: llama-3.1-70b", "type": "service_unavailable", "code": "no_workers" } }

Rate limiting

Premium Queen only.

Rate limit headers are included in responses:

X-RateLimit-Limit: 100
X-RateLimit-Remaining: 95
X-RateLimit-Reset: 1704067260
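From the Python client, these headers can be read with the library's raw-response wrapper. A sketch (the header names are the ones listed above):

python
raw = client.chat.completions.with_raw_response.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print("Remaining:", raw.headers.get("X-RateLimit-Remaining"))
print("Resets at:", raw.headers.get("X-RateLimit-Reset"))

completion = raw.parse()  # Regular ChatCompletion object
print(completion.choices[0].message.content)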

Request tracing

Premium Queen only.

Each request gets a unique trace ID for debugging:

X-Trace-ID: trace-abc-123-def-456

Include this ID when reporting issues.

Differences from OpenAI API

rbee aims for compatibility but has some differences:

Supported features

  • ✅ Chat completions (streaming and non-streaming)
  • ✅ Model listing
  • ✅ Temperature, top_p, max_tokens
  • ✅ Stop sequences
  • ✅ System/user/assistant roles

Not supported (yet)

  • ❌ Function calling / tool use (roadmap: v0.2)
  • ❌ Vision models (roadmap: v0.3)
  • ❌ Audio transcription via API (use dedicated worker)
  • ❌ Embeddings endpoint (roadmap: v0.2)
  • ❌ Fine-tuning API
  • ❌ Moderation endpoint

rbee-specific extensions

Custom headers:

  • X-Hive-Preference - Prefer specific hive for request
  • X-Priority - Request priority (Premium Worker only)
  • X-Trace-ID - Request tracing (Premium Queen only)

Example:

curl -X POST http://localhost:7833/openai/v1/chat/completions \
  -H "X-Hive-Preference: gpu-01" \
  -H "X-Priority: high" \
  -H "Content-Type: application/json" \
  -d '...'
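From the Python client, the same headers can be attached per request with extra_headers. A sketch (the header values are illustrative):

python
response = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_headers={
        "X-Hive-Preference": "gpu-01",  # Prefer a specific hive
        "X-Priority": "high",           # Premium Worker only
    },
)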

Performance tips

Use streaming for long responses

Streaming reduces perceived latency:

stream = client.chat.completions.create(
    model="llama-3.1-70b",
    messages=[...],
    stream=True  # Enable streaming
)

Batch similar requests

Premium Worker only. Send multiple requests quickly to benefit from automatic batching:

import asyncio
from openai import AsyncOpenAI

# Batching requires the async client so requests can overlap
async_client = AsyncOpenAI(base_url="http://localhost:7833/openai/v1", api_key="not-needed")

async def send_request(prompt):
    return await async_client.chat.completions.create(
        model="llama-3.1-8b",
        messages=[{"role": "user", "content": prompt}]
    )

async def main():
    # Send 10 requests concurrently
    return await asyncio.gather(*[
        send_request(f"Question {i}") for i in range(10)
    ])

results = asyncio.run(main())

Prefer smaller models when possible

  • llama-3.2-1b - Fast, good for simple tasks
  • llama-3.1-8b - Balanced performance
  • llama-3.1-70b - High quality, slower

Set appropriate max_tokens

Don’t request more tokens than needed:

response = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[...],
    max_tokens=100  # Limit response length
)
