Single Machine Setup
This guide walks you through setting up rbee on a single machine. This is the simplest configuration and perfect for getting started, testing, or running AI workloads on one computer with one or more GPUs.
What you'll build
By the end of this guide, you'll have:
- A running rbee colony on one machine
- The keeper GUI connected to your local queen
- At least one worker running an LLM model
- The ability to send requests through the OpenAI-compatible API
Prerequisites
- rbee installed (see Installation)
- At least one GPU (or CPU for testing)
- 16GB+ RAM recommended
- 20GB+ free disk space for models
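If you want to confirm the hardware prerequisites before continuing, a quick check on a Linux machine might look like the following (a sketch assuming an NVIDIA GPU and that nvidia-smi, free, and df are available):
# GPU visibility and VRAM (NVIDIA-specific; skip for CPU-only or other vendors)
nvidia-smi
# Available RAM
free -h
# Free disk space in your home directory (models are cached under ~/.cache/rbee)
df -h ~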
Step 1: Start the queen
The queen is the orchestrator that manages your colony. On a single machine, it runs locally:
# Start the queen on default port (7833)
rbee queen start
# Note: Port configuration is handled via queen-rbee daemon args
# See architecture docs for advanced configuration
The queen will:
- Start an HTTP API server
- Initialize the job registry
- Wait for hives to connect
You should see output like:
Queen started on http://localhost:7833
Waiting for hives to register...
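Before moving on, you can confirm the queen's HTTP API is reachable. The same health endpoint used in the troubleshooting section below works here (assuming the default port):
# Should respond if the queen is up and listening
curl http://localhost:7833/health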
Step 2: Start the hive
The hive runs on the same machine and hosts workers:
# Start the hive (auto-detects queen on localhost)
rbee hive start
# Or specify the queen URL explicitly
rbee hive start --host localhost
The hive will:
- Detect available GPUs
- Register with the queen
- Start sending heartbeats with capability information
You should see:
Hive started
Detected GPUs: NVIDIA RTX 3090 (24GB VRAM)
Registered with queen at http://localhost:7833
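If you want to double-check that registration succeeded, the hive status command from the verification section below can also be run at this point:
# Show the status of the local hive as seen by the queen
rbee hive status --host localhost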
Step 3: Download a model
Before spawning a worker, you need a model. rbee can download models from HuggingFace:
# Download a small model for testing (1.3GB)
rbee model download llama-3.2-1b
# Or a larger, more capable model (8GB)
rbee model download llama-3.1-8b
Models are stored in ~/.cache/rbee/models/ and shared across all workers.
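To confirm a download completed, you can list the cache directory directly (the default location mentioned above):
# Downloaded model files live here and are shared by all workers
ls -lh ~/.cache/rbee/models/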
Step 4: Spawn a worker
Now spawn a worker to run the model:
# Spawn an LLM worker (CUDA)
rbee worker spawn \
--model llama-3.2-1b \
--worker cuda \
--device 0
# For CPU-only (slower)
rbee worker spawn \
--model llama-3.2-1b \
--worker cpu \
--device 0
The worker will:
- Load the model into GPU/CPU memory
- Start an inference server
You should see:
Worker spawned: worker-abc123
Model: llama-3.2-1b
Device: cuda:0
Status: ready
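You can also confirm that the worker registered with the queen by listing workers (the same command used in the verification step later in this guide):
# List all workers known to the queen
rbee worker list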
Step 5: Send a test request
Now test the system with a chat completion request:
curl -X POST http://localhost:7833/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2-1b",
"messages": [
{"role": "user", "content": "Explain what rbee does in one sentence."}
]
}'
You should get a streaming response with the model's answer.
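Because the endpoint follows the OpenAI chat completions format, you should also be able to request a single, non-streamed JSON response by setting the standard stream field. This is a sketch that assumes the compatible endpoint honours that field:
# Request a non-streaming completion (standard OpenAI-style "stream" flag; assumed to be supported)
curl -X POST http://localhost:7833/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2-1b",
"stream": false,
"messages": [
{"role": "user", "content": "Say hello in five words."}
]
}'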
Step 6: Open the keeper GUI (optional)
For a visual interface, start the keeper:
rbee-keeper
The keeper GUI will open and automatically connect to your local queen. You'll see:
- Your hive with GPU information
- Active workers and their status
- Real-time GPU utilization
- A chat interface to interact with models
Verify your setup
Check that everything is running:
# List all workers
rbee worker list
# Check worker status (view details)
rbee status
# View hive status
rbee hive status --host localhost
What you've built
You now have a complete single-machine rbee colony:
┌─────────────────────────────────┐
│          Your Machine           │
│                                 │
│  ┌──────────┐                   │
│  │  Keeper  │ (GUI)             │
│  └────┬─────┘                   │
│       │                         │
│  ┌────▼─────┐                   │
│  │  Queen   │ (Orchestrator)    │
│  └────┬─────┘                   │
│       │                         │
│  ┌────▼─────┐                   │
│  │   Hive   │                   │
│  │          │                   │
│  │ ┌──────┐ │                   │
│  │ │Worker│ │ (LLM on GPU)      │
│  │ └──────┘ │                   │
│  └──────────┘                   │
└─────────────────────────────────┘
Next steps
- Add more workers - Run multiple models simultaneously
- Scale to multiple machines - Connect other computers
- Use the API - Integrate with applications
- Monitor performance - Track GPU usage and throughput
Troubleshooting
Worker fails to spawn
Check GPU availability:
nvidia-smi # For NVIDIA GPUs
Ensure the model is downloaded:
rbee model list
Connection refused errors
Verify the queen is running:
curl http://localhost:7833/health
Check firewall settings if using a custom port.
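If the health check fails, it can also help to confirm that something is actually listening on the queen's port. On Linux, one way is (assuming ss is installed):
# Look for a TCP listener on the default queen port
ss -ltn | grep 7833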
Out of memory errors
The model may be too large for your GPU. Try:
- A smaller model (llama-3.2-1b instead of llama-3.1-70b)
- CPU inference (slower but no VRAM limit)
- Quantized models (4-bit or 8-bit versions)