Single Machine Setup
This guide walks you through setting up rbee on a single machine. This is the simplest configuration and perfect for getting started, testing, or running AI workloads on one computer with one or more GPUs.
What you'll build
By the end of this guide, you'll have:
- A running rbee colony on one machine
- The keeper GUI connected to your local queen
- At least one worker running an LLM model
- The ability to send requests through the OpenAI-compatible API
Prerequisites
- rbee installed (see Installation)
- At least one GPU (or CPU for testing)
- 16GB+ RAM recommended
- 20GB+ free disk space for models
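If you want to confirm the hardware prerequisites before continuing, a quick check on a Linux machine might look like the following (a sketch assuming an NVIDIA GPU and that nvidia-smi, free, and df are available):
# GPU visibility and VRAM (NVIDIA-specific; skip for CPU-only or other vendors)
nvidia-smi
# Available RAM
free -h
# Free disk space in your home directory (models are cached under ~/.cache/rbee)
df -h ~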
Step 1: Start the queen
The queen is the orchestrator that manages your colony. On a single machine, it runs locally:
# Start the queen on default port (7833)
rbee queen start
# Note: Port configuration is handled via queen-rbee daemon args
# See architecture docs for advanced configuration
The queen will:
- Start an HTTP API server
- Initialize the job registry
- Wait for hives to connect
You should see output like:
Queen started on http://localhost:7833
Waiting for hives to register...
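Before moving on, you can confirm the queen's HTTP API is reachable. The same health endpoint used in the troubleshooting section below works here (assuming the default port):
# Should respond if the queen is up and listening
curl http://localhost:7833/health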
Step 2: Start the hive
The hive runs on the same machine and hosts workers:
# Start the hive (auto-detects queen on localhost)
rbee hive start
# Or specify the queen URL explicitly
rbee hive start --host localhost
The hive will:
- Detect available GPUs
- Register with the queen
- Start sending heartbeats with capability information
You should see:
Hive started
Detected GPUs: NVIDIA RTX 3090 (24GB VRAM)
Registered with queen at http://localhost:7833
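If you want to double-check that registration succeeded, the hive status command from the verification section below can also be run at this point:
# Show the status of the local hive as seen by the queen
rbee hive status --host localhost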
Step 3: Download a model
Before spawning a worker, you need a model. rbee can download models from HuggingFace:
# Download a small model for testing (1.3GB)
rbee model download llama-3.2-1b
# Or a larger, more capable model (8GB)
rbee model download llama-3.1-8b
Models are stored in ~/.cache/rbee/models/ and shared across all workers.
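To confirm a download completed, you can list the cache directory directly (the default location mentioned above):
# Downloaded model files live here and are shared by all workers
ls -lh ~/.cache/rbee/models/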
Step 4: Spawn a worker
Now spawn a worker to run the model:
# Spawn an LLM worker (CUDA)
rbee worker spawn \
--model llama-3.2-1b \
--worker cuda \
--device 0
# For CPU-only (slower)
rbee worker spawn \
--model llama-3.2-1b \
--worker cpu \
--device 0
The worker will:
- Load the model into GPU/CPU memory
- Start an inference server
You should see:
Worker spawned: worker-abc123
Model: llama-3.2-1b
Device: cuda:0
Status: ready
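You can also confirm that the worker registered with the queen by listing workers (the same command used in the verification step later in this guide):
# List all workers known to the queen
rbee worker list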
Step 5: Send a test request
Now test the system with a chat completion request:
curl -X POST http://localhost:7833/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2-1b",
"messages": [
{"role": "user", "content": "Explain what rbee does in one sentence."}
]
}'
You should get a streaming response with the model's answer.
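Because the endpoint follows the OpenAI chat completions format, you should also be able to request a single, non-streamed JSON response by setting the standard stream field. This is a sketch that assumes the compatible endpoint honours that field:
# Request a non-streaming completion (standard OpenAI-style "stream" flag; assumed to be supported)
curl -X POST http://localhost:7833/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2-1b",
"stream": false,
"messages": [
{"role": "user", "content": "Say hello in five words."}
]
}'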
Step 6: Open the keeper GUI (optional)
For a visual interface, start the keeper:
rbee-keeper
The keeper GUI will open and automatically connect to your local queen. You'll see:
- Your hive with GPU information
- Active workers and their status
- Real-time GPU utilization
- A chat interface to interact with models
Verify your setup
Check that everything is running:
# List all workers
rbee worker list
# Check worker status (view details)
rbee status
# View hive status
rbee hive status --host localhost
What you've built
You now have a complete single-machine rbee colony:
┌─────────────────────────────────┐
│          Your Machine           │
│                                 │
│  ┌──────────┐                   │
│  │  Keeper  │ (GUI)             │
│  └────┬─────┘                   │
│       │                         │
│  ┌────▼─────┐                   │
│  │  Queen   │ (Orchestrator)    │
│  └────┬─────┘                   │
│       │                         │
│  ┌────▼─────┐                   │
│  │   Hive   │                   │
│  │          │                   │
│  │ ┌──────┐ │                   │
│  │ │Worker│ │ (LLM on GPU)      │
│  │ └──────┘ │                   │
│  └──────────┘                   │
└─────────────────────────────────┘
Next steps
- Add more workers - Run multiple models simultaneously
- Scale to multiple machines - Connect other computers
- Use the API - Integrate with applications
- Monitor performance - Track GPU usage and throughput
Troubleshooting
Worker fails to spawn
Check GPU availability:
nvidia-smi # For NVIDIA GPUs
Ensure the model is downloaded:
rbee model list
Connection refused errors
Verify the queen is running:
curl http://localhost:7833/health
Check firewall settings if using a custom port.
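If the health check fails, it can also help to confirm that something is actually listening on the queen's port. On Linux, one way is (assuming ss is installed):
# Look for a TCP listener on the default queen port
ss -ltn | grep 7833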
Out of memory errors
The model may be too large for your GPU. Try:
- A smaller model (llama-3.2-1b instead of llama-3.1-70b)
- CPU inference (slower but no VRAM limit)
- Quantized models (4-bit or 8-bit versions)