Common Issues & Troubleshooting
Comprehensive troubleshooting guide for rbee deployment and operation issues.
For security-related issues, see the Security Configuration page.
Quick Diagnostics
Health Check All Services
# Check Queen
curl http://localhost:7833/health
# Check Hive
curl http://localhost:7835/health
# Check Worker (if running - port assigned dynamically by hive)
# Example: curl http://localhost:8080/health
# Use 'ps aux | grep worker' to find actual port

Check Service Status
# Check if services are running
ps aux | grep -E "(queen-rbee|rbee-hive)"
# Check listening ports
ss -tlnp | grep -E "(7833|7835|8080|8081|8082)"
# Check systemd services
sudo systemctl status queen-rbee
sudo systemctl status rbee-hive

Connection Issues
Queen Won’t Start
Symptom: Queen fails to start or exits immediately
Common Causes:
- Port already in use

  # Check what's using port 7833
  lsof -i :7833
  # Kill the process
  kill $(lsof -t -i:7833)
  # Or use a different port
  queen-rbee --port 8080

- Permission denied

  # Don't run as root
  queen-rbee --port 7833
  # If using systemd, check service user
  sudo systemctl status queen-rbee

- Missing dependencies

  # Check binary
  ldd $(which queen-rbee)
  # Rebuild if needed
  cargo build --release --bin queen-rbee
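If Queen runs under systemd, permission and restart problems usually trace back to the unit file. A minimal illustrative unit is sketched below; the dedicated `rbee` user and the `/usr/local/bin/queen-rbee` path are assumptions, not part of the rbee distribution, so adjust them to your install:

```ini
# /etc/systemd/system/queen-rbee.service -- illustrative example only
[Unit]
Description=rbee Queen
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
# Run as an unprivileged user, never as root (assumed user name)
User=rbee
ExecStart=/usr/local/bin/queen-rbee --port 7833
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```

After editing the unit, run `sudo systemctl daemon-reload` before restarting the service.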
Hive Can’t Connect to Queen
Symptom: Hive starts but doesn’t appear in Queen’s telemetry
Solutions:
- Verify Queen is reachable

  # From hive machine
  curl http://localhost:7833/health
  # If remote
  curl http://queen-host:7833/health

- Check Queen URL configuration

  # Ensure correct Queen URL
  rbee-hive --queen-url http://localhost:7833 --hive-id my-hive
  # For remote Queen
  rbee-hive --queen-url http://queen-host:7833 --hive-id my-hive

- Check firewall rules

  # On Queen machine, allow port 7833
  sudo ufw allow 7833/tcp
  # Test connectivity
  telnet queen-host 7833

- Check hive discovery logs

  # Look for discovery errors
  sudo journalctl -u rbee-hive | grep discovery
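The reachability checks above can be combined into one small script. This is a sketch, not part of rbee; `QUEEN_HOST` is a placeholder you set to your Queen's hostname or IP:

```shell
#!/bin/sh
# Reachability check from a hive machine toward Queen.
# QUEEN_HOST is a placeholder; defaults to localhost.

check_queen() {
  host=${1:-localhost}
  port=${2:-7833}
  # One step covers DNS, TCP connect, and the HTTP health endpoint.
  if curl -s --max-time 3 "http://$host:$port/health" >/dev/null 2>&1; then
    echo "reachable"
  else
    echo "unreachable"
  fi
}

check_queen "${QUEEN_HOST:-localhost}" 7833
```

If the result is `unreachable`, work through the firewall and `--queen-url` checks above.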
Worker Issues
Worker Won’t Spawn
Symptom: Worker spawn job fails or worker exits immediately
Common Causes:
- Worker binary not found

  # Check if worker binary exists
  which llm-worker-cpu
  which llm-worker-cuda
  which llm-worker-metal
  # Install if missing
  cargo build --release --bin llm-worker-cpu

- GPU not available (CUDA workers)

  # Check GPU status
  nvidia-smi
  # Check CUDA installation
  nvcc --version
  # Check device permissions
  ls -l /dev/nvidia*

- Model not found

  # Check model catalog
  ls ~/.cache/rbee/models/
  # Download model first
  # (Use Hive API to download model)

- Port already in use

  # Workers use dynamic ports (8080-9999)
  # Hive automatically assigns available ports
  # Check what's using a specific port
  lsof -i :8080
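To see which ports in the worker range are still available, you can scan it yourself. This is purely illustrative; rbee-hive performs its own port assignment internally and does not use this script:

```shell
#!/bin/sh
# Find the first free TCP port in the worker range (8080-9999).
# Illustrative only: hive assigns worker ports itself.

first_free_port() {
  start=$1
  end=$2
  port=$start
  while [ "$port" -le "$end" ]; do
    # ss exits 0 whether or not it finds listeners, so test its output.
    if ! ss -tln 2>/dev/null | grep -q ":$port "; then
      echo "$port"
      return 0
    fi
    port=$((port + 1))
  done
  return 1
}

first_free_port 8080 9999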
Worker Crashes After Spawn
Symptom: Worker starts but crashes during inference
Solutions:
- Out of memory

  # Check available RAM/VRAM
  free -h
  nvidia-smi  # For GPU memory
  # Use smaller model or reduce batch size

- Model file corrupted

  # Remove and re-download model
  rm ~/.cache/rbee/models/model-name.gguf
  # Re-download via Hive API

- Check worker logs

  # Worker logs go to stdout
  # If using systemd (port assigned dynamically):
  # sudo journalctl -u llm-worker@<PORT>
  # Find worker PID:
  ps aux | grep worker
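When hunting memory-related crashes, it helps to see every worker's resident memory in one listing. The `llm-worker` name pattern below is an assumption based on the binary names above; adjust it if yours differ:

```shell
#!/bin/sh
# List rbee worker processes with their memory use (RSS, in MB).
# The "llm-worker" pattern is an assumption; adjust to your binaries.

# Filter a `ps -eo pid,rss,comm` listing down to worker rows, RSS in MB.
format_workers() {
  awk '/llm-worker/ { printf "%s %.0f MB\n", $1, $2/1024 }'
}

ps -eo pid,rss,comm | format_workers
```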
Model Download Issues
Download Fails or Hangs
Symptom: Model download doesn’t complete
Solutions:
- Check network connectivity

  # Test HuggingFace access
  curl -I https://huggingface.co
  # Check DNS
  nslookup huggingface.co

- Check disk space

  # Models can be large (1-50GB)
  df -h ~/.cache/rbee/
  # Clean up old models
  rm ~/.cache/rbee/models/*.gguf

- Check download progress

  # Monitor download directory
  watch -n 1 'ls -lh ~/.cache/rbee/models/'
  # Check Hive logs
  sudo journalctl -u rbee-hive -f

- Retry download

  # Model provisioner supports resume
  # Just retry the download operation
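Because models can run to tens of gigabytes, it is worth checking free space before starting a download rather than after it hangs. A minimal precheck sketch (the 10 GB threshold is an arbitrary example; the function works on any directory):

```shell
#!/bin/sh
# Check that a directory has at least the requested free space (in GB)
# before starting a large model download.

has_free_gb() {
  dir=$1
  need_gb=$2
  # df -Pk prints available space in 1K blocks in column 4.
  avail_kb=$(df -Pk "$dir" | awk 'NR==2 { print $4 }')
  [ "$avail_kb" -ge $((need_gb * 1024 * 1024)) ]
}

# $HOME is used here because ~/.cache/rbee/ may not exist yet.
if has_free_gb "$HOME" 10; then
  echo "enough space for a ~10GB model"
else
  echo "low disk space: clean up ~/.cache/rbee/models/ first"
fi
```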
GPU Detection Issues
CUDA Workers Can’t Find GPU
Symptom: CUDA worker fails with “no CUDA devices found”
Solutions:
- Verify GPU is detected

  # Check GPU status
  nvidia-smi
  # Check CUDA version
  nvcc --version

- Check NVIDIA driver

  # Check driver version
  nvidia-smi | head -n 3
  # Update driver if needed
  sudo ubuntu-drivers autoinstall  # Ubuntu

- Check device permissions

  # Add user to video group
  sudo usermod -a -G video $USER
  # Logout and login for changes to take effect

- Check CUDA library path

  # Ensure CUDA libraries are in LD_LIBRARY_PATH
  export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
  # Add to ~/.bashrc for persistence
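The four checks above can be summarized in one pass. This is a convenience sketch, not an rbee tool; each probe degrades gracefully when the underlying command is not installed:

```shell
#!/bin/sh
# Summarize CUDA prerequisites in one pass.

check() {
  # $1 = label, remaining args = command to probe
  label=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "$label: ok"
  else
    echo "$label: missing"
  fi
}

check "nvidia driver" nvidia-smi
check "cuda toolkit" nvcc --version
check "device nodes" ls /dev/nvidia0
```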
Metal Workers (macOS)
Symptom: Metal worker fails on macOS
Solutions:
- Check Metal support

  # Metal requires macOS 10.13+
  system_profiler SPSoftwareDataType | grep "System Version"

- Check GPU availability

  # List GPUs
  system_profiler SPDisplaysDataType | grep "Chipset Model"
Performance Issues
Slow Inference
Symptom: Inference takes longer than expected
Solutions:
- Check resource usage

  # Monitor CPU/GPU usage
  htop
  nvidia-smi -l 1  # For GPU

- Check model size vs hardware

  # Ensure model fits in VRAM/RAM
  # Use smaller quantization (Q4 instead of Q8)

- Check for CPU throttling

  # Check CPU frequency
  watch -n 1 'cat /proc/cpuinfo | grep MHz'
High Memory Usage
Symptom: System runs out of memory
Solutions:
- Monitor memory usage

  # Check memory
  free -h
  # Check per-process memory
  ps aux --sort=-%mem | head

- Reduce worker count

  # Spawn fewer workers
  # Each worker loads model into memory

- Use smaller models

  # Use quantized models (Q4_K_M instead of F16)
  # Use smaller parameter counts (7B instead of 13B)
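To judge whether a model will fit before spawning a worker, a common rule of thumb is parameters times bits-per-weight divided by 8, plus roughly 20% overhead for the KV cache and runtime buffers. This is an approximation, not an exact figure:

```shell
#!/bin/sh
# Rough memory estimate for a quantized model:
#   params * bits / 8, plus ~20% overhead (rule of thumb only).

model_gb() {
  params_b=$1   # parameters, in billions
  bits=$2       # bits per weight (e.g. 4 for Q4, 16 for F16)
  # Integer math in MB so no bc is needed.
  mb=$(( params_b * bits * 1000 / 8 * 120 / 100 ))
  echo "$(( mb / 1000 )) GB (approx)"
}

model_gb 7 4    # 7B model at Q4
model_gb 13 16  # 13B model at F16
```

By this estimate a 7B model at Q4 needs roughly 4 GB, while a 13B model at F16 needs about 31 GB, which is why quantized models are the usual fix for memory pressure.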
Network Issues
SSE Streams Not Working
Symptom: Heartbeat or job streams don’t receive events
Solutions:
- Check reverse proxy configuration

  # Nginx needs these settings for SSE
  proxy_buffering off;
  proxy_cache off;
  proxy_set_header Connection '';

- Test SSE directly

  # Bypass proxy
  curl -N http://localhost:7833/v1/heartbeats/stream

- Check firewall

  # Ensure long-lived connections are allowed
  sudo ufw status
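Putting the SSE-related nginx directives together, a complete location block might look like the following. The upstream address and the `/v1/` prefix are assumptions for a typical single-host deployment; adapt them to yours:

```nginx
# Illustrative nginx location for proxying Queen's SSE endpoints.
location /v1/ {
    proxy_pass http://127.0.0.1:7833;
    proxy_http_version 1.1;
    proxy_set_header Connection '';
    proxy_buffering off;
    proxy_cache off;
    # SSE streams are long-lived; raise the read timeout.
    proxy_read_timeout 3600s;
}
```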
CORS Errors in Browser
Symptom: Browser console shows CORS errors
Solutions:
- Check CORS configuration

  # Queen/Hive include CORS middleware
  # Should work by default

- Use reverse proxy

  # Add CORS headers in nginx
  add_header Access-Control-Allow-Origin *;
  add_header Access-Control-Allow-Methods "GET, POST, DELETE";
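If you do add CORS headers at the proxy, the browser's preflight `OPTIONS` request must be answered too. A sketch of a fuller nginx block, with example values that should be tightened for production (the upstream address is an assumption):

```nginx
# Illustrative CORS handling in nginx in front of Queen.
location /v1/ {
    add_header Access-Control-Allow-Origin * always;
    add_header Access-Control-Allow-Methods "GET, POST, DELETE" always;
    add_header Access-Control-Allow-Headers "Authorization, Content-Type" always;

    # Answer preflight requests without hitting the backend.
    if ($request_method = OPTIONS) {
        return 204;
    }

    proxy_pass http://127.0.0.1:7833;
}
```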
Build Issues
Compilation Fails
Symptom: cargo build fails
Solutions:
- Update Rust toolchain

  rustup update
  rustup default stable

- Clean build cache

  cargo clean
  cargo build --release

- Check dependencies

  # Install system dependencies
  sudo apt install build-essential pkg-config libssl-dev  # Ubuntu
Catalog Issues
Catalog Initialization Failed
Symptom: Hive fails to start with catalog error
Solutions:
- Check catalog directory

  ls -la ~/.cache/rbee/
  # Should contain:
  # - models.db
  # - workers.db

- Recreate catalogs

  # Remove corrupted databases
  rm ~/.cache/rbee/*.db
  # Restart hive (will recreate)
  rbee-hive

- Check permissions

  # Ensure user can write to cache directory
  chmod 755 ~/.cache/rbee/
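The three checks above can be run as one script before restarting the hive. A sketch, assuming the default `~/.cache/rbee/` location (override with `RBEE_CACHE` if yours differs):

```shell
#!/bin/sh
# Verify the rbee cache directory is usable: it exists, is writable,
# and holds the expected catalog databases.

check_cache_dir() {
  dir=$1
  [ -d "$dir" ] || { echo "missing: $dir"; return 1; }
  [ -w "$dir" ] || { echo "not writable: $dir"; return 1; }
  for db in models.db workers.db; do
    # A missing catalog is not fatal: hive recreates it on start.
    [ -f "$dir/$db" ] || echo "missing catalog: $db (hive will recreate it)"
  done
  echo "cache dir ok"
}

# || true: a missing directory is reported, not treated as a fatal error.
check_cache_dir "${RBEE_CACHE:-$HOME/.cache/rbee}" || true
```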
Getting Help
Collect Diagnostic Information
# System info
uname -a
# Service status
systemctl status queen-rbee rbee-hive
# Logs
sudo journalctl -u queen-rbee -n 100
sudo journalctl -u rbee-hive -n 100
# Network (workers use dynamic ports 8080+)
ss -tlnp | grep -E "(7833|7835|8080|8081|8082)"
# GPU (if applicable)
nvidia-smi
# Disk space
df -h ~/.cache/rbee/

Related Documentation
Queen Configuration
Configure queen-rbee
Hive Configuration
Configure rbee-hive
Security Configuration
Authentication and security
Completed by: TEAM-427
Based on: Production deployment experience and codebase analysis