Common Issues & Troubleshooting
Comprehensive troubleshooting guide for rbee deployment and operation issues.
For security-related issues, see the Security Configuration page.
Quick Diagnostics
Health Check All Services
# Check Queen
curl http://localhost:7833/health
# Check Hive
curl http://localhost:7835/health
# Check Worker (if running - port assigned dynamically by hive)
# Example: curl http://localhost:8080/health
# Use 'ps aux | grep worker' to find actual port

Check Service Status
# Check if services are running
ps aux | grep -E "(queen-rbee|rbee-hive)"
# Check listening ports
ss -tlnp | grep -E "(7833|7835|8080|8081|8082)"
# Check systemd services
sudo systemctl status queen-rbee
sudo systemctl status rbee-hive

Connection Issues
Queen Won’t Start
Symptom: Queen fails to start or exits immediately
Common Causes:
- Port already in use

  # Check what's using port 7833
  lsof -i :7833
  # Kill the process
  kill $(lsof -t -i:7833)
  # Or use a different port
  queen-rbee --port 8080

- Permission denied

  # Don't run as root
  queen-rbee --port 7833
  # If using systemd, check service user
  sudo systemctl status queen-rbee

- Missing dependencies

  # Check binary
  ldd $(which queen-rbee)
  # Rebuild if needed
  cargo build --release --bin queen-rbee
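If Queen runs under systemd, permission and restart problems usually trace back to the unit file. A minimal illustrative unit is sketched below; the dedicated `rbee` user and the `/usr/local/bin/queen-rbee` path are assumptions, not part of the rbee distribution, so adjust them to your install:

```ini
# /etc/systemd/system/queen-rbee.service -- illustrative example only
[Unit]
Description=rbee Queen
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
# Run as an unprivileged user, never as root (assumed user name)
User=rbee
ExecStart=/usr/local/bin/queen-rbee --port 7833
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```

After editing the unit, run `sudo systemctl daemon-reload` before restarting the service.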
Hive Can’t Connect to Queen
Symptom: Hive starts but doesn’t appear in Queen’s telemetry
Solutions:
- Verify Queen is reachable

  # From hive machine
  curl http://localhost:7833/health
  # If remote
  curl http://queen-host:7833/health

- Check Queen URL configuration

  # Ensure correct Queen URL
  rbee-hive --queen-url http://localhost:7833 --hive-id my-hive
  # For remote Queen
  rbee-hive --queen-url http://queen-host:7833 --hive-id my-hive

- Check firewall rules

  # On Queen machine, allow port 7833
  sudo ufw allow 7833/tcp
  # Test connectivity
  telnet queen-host 7833

- Check hive discovery logs

  # Look for discovery errors
  sudo journalctl -u rbee-hive | grep discovery
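The reachability checks above can be combined into one small script. This is a sketch, not part of rbee; `QUEEN_HOST` is a placeholder you set to your Queen's hostname or IP:

```shell
#!/bin/sh
# Reachability check from a hive machine toward Queen.
# QUEEN_HOST is a placeholder; defaults to localhost.

check_queen() {
  host=${1:-localhost}
  port=${2:-7833}
  # One step covers DNS, TCP connect, and the HTTP health endpoint.
  if curl -s --max-time 3 "http://$host:$port/health" >/dev/null 2>&1; then
    echo "reachable"
  else
    echo "unreachable"
  fi
}

check_queen "${QUEEN_HOST:-localhost}" 7833
```

If the result is `unreachable`, work through the firewall and `--queen-url` checks above.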
Worker Issues
Worker Won’t Spawn
Symptom: Worker spawn job fails or worker exits immediately
Common Causes:
- Worker binary not found

  # Check if worker binary exists
  which llm-worker-cpu
  which llm-worker-cuda
  which llm-worker-metal
  # Install if missing
  cargo build --release --bin llm-worker-cpu

- GPU not available (CUDA workers)

  # Check GPU status
  nvidia-smi
  # Check CUDA installation
  nvcc --version
  # Check device permissions
  ls -l /dev/nvidia*

- Model not found

  # Check model catalog
  ls ~/.cache/rbee/models/
  # Download model first
  # (Use Hive API to download model)

- Port already in use

  # Workers use dynamic ports (8080-9999)
  # Hive automatically assigns available ports
  # Check what's using a specific port
  lsof -i :8080
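To see which ports in the worker range are still available, you can scan it yourself. This is purely illustrative; rbee-hive performs its own port assignment internally and does not use this script:

```shell
#!/bin/sh
# Find the first free TCP port in the worker range (8080-9999).
# Illustrative only: hive assigns worker ports itself.

first_free_port() {
  start=$1
  end=$2
  port=$start
  while [ "$port" -le "$end" ]; do
    # ss exits 0 whether or not it finds listeners, so test its output.
    if ! ss -tln 2>/dev/null | grep -q ":$port "; then
      echo "$port"
      return 0
    fi
    port=$((port + 1))
  done
  return 1
}

first_free_port 8080 9999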
Worker Crashes After Spawn
Symptom: Worker starts but crashes during inference
Solutions:
- Out of memory

  # Check available RAM/VRAM
  free -h
  nvidia-smi  # For GPU memory
  # Use smaller model or reduce batch size

- Model file corrupted

  # Remove and re-download model
  rm ~/.cache/rbee/models/model-name.gguf
  # Re-download via Hive API

- Check worker logs

  # Worker logs go to stdout
  # If using systemd (port assigned dynamically):
  # sudo journalctl -u llm-worker@<PORT>
  # Find worker PID:
  ps aux | grep worker
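When hunting memory-related crashes, it helps to see every worker's resident memory in one listing. The `llm-worker` name pattern below is an assumption based on the binary names above; adjust it if yours differ:

```shell
#!/bin/sh
# List rbee worker processes with their memory use (RSS, in MB).
# The "llm-worker" pattern is an assumption; adjust to your binaries.

# Filter a `ps -eo pid,rss,comm` listing down to worker rows, RSS in MB.
format_workers() {
  awk '/llm-worker/ { printf "%s %.0f MB\n", $1, $2/1024 }'
}

ps -eo pid,rss,comm | format_workers
```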
Model Download Issues
Download Fails or Hangs
Symptom: Model download doesn’t complete
Solutions:
- Check network connectivity

  # Test HuggingFace access
  curl -I https://huggingface.co
  # Check DNS
  nslookup huggingface.co

- Check disk space

  # Models can be large (1-50GB)
  df -h ~/.cache/rbee/
  # Clean up old models
  rm ~/.cache/rbee/models/*.gguf

- Check download progress

  # Monitor download directory
  watch -n 1 'ls -lh ~/.cache/rbee/models/'
  # Check Hive logs
  sudo journalctl -u rbee-hive -f

- Retry download

  # Model provisioner supports resume
  # Just retry the download operation
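Because models can run to tens of gigabytes, it is worth checking free space before starting a download rather than after it hangs. A minimal precheck sketch (the 10 GB threshold is an arbitrary example; the function works on any directory):

```shell
#!/bin/sh
# Check that a directory has at least the requested free space (in GB)
# before starting a large model download.

has_free_gb() {
  dir=$1
  need_gb=$2
  # df -Pk prints available space in 1K blocks in column 4.
  avail_kb=$(df -Pk "$dir" | awk 'NR==2 { print $4 }')
  [ "$avail_kb" -ge $((need_gb * 1024 * 1024)) ]
}

# $HOME is used here because ~/.cache/rbee/ may not exist yet.
if has_free_gb "$HOME" 10; then
  echo "enough space for a ~10GB model"
else
  echo "low disk space: clean up ~/.cache/rbee/models/ first"
fi
```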
GPU Detection Issues
CUDA Workers Can’t Find GPU
Symptom: CUDA worker fails with “no CUDA devices found”
Solutions:
- Verify GPU is detected

  # Check GPU status
  nvidia-smi
  # Check CUDA version
  nvcc --version

- Check NVIDIA driver

  # Check driver version
  nvidia-smi | head -n 3
  # Update driver if needed
  sudo ubuntu-drivers autoinstall  # Ubuntu

- Check device permissions

  # Add user to video group
  sudo usermod -a -G video $USER
  # Logout and login for changes to take effect

- Check CUDA library path

  # Ensure CUDA libraries are in LD_LIBRARY_PATH
  export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
  # Add to ~/.bashrc for persistence
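The four checks above can be summarized in one pass. This is a convenience sketch, not an rbee tool; each probe degrades gracefully when the underlying command is not installed:

```shell
#!/bin/sh
# Summarize CUDA prerequisites in one pass.

check() {
  # $1 = label, remaining args = command to probe
  label=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "$label: ok"
  else
    echo "$label: missing"
  fi
}

check "nvidia driver" nvidia-smi
check "cuda toolkit" nvcc --version
check "device nodes" ls /dev/nvidia0
```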
Metal Workers (macOS)
Symptom: Metal worker fails on macOS
Solutions:
- Check Metal support

  # Metal requires macOS 10.13+
  system_profiler SPSoftwareDataType | grep "System Version"

- Check GPU availability

  # List GPUs
  system_profiler SPDisplaysDataType | grep "Chipset Model"
Performance Issues
Slow Inference
Symptom: Inference takes longer than expected
Solutions:
- Check resource usage

  # Monitor CPU/GPU usage
  htop
  nvidia-smi -l 1  # For GPU

- Check model size vs hardware

  # Ensure model fits in VRAM/RAM
  # Use smaller quantization (Q4 instead of Q8)

- Check for CPU throttling

  # Check CPU frequency
  watch -n 1 'cat /proc/cpuinfo | grep MHz'
High Memory Usage
Symptom: System runs out of memory
Solutions:
- Monitor memory usage

  # Check memory
  free -h
  # Check per-process memory
  ps aux --sort=-%mem | head

- Reduce worker count

  # Spawn fewer workers
  # Each worker loads model into memory

- Use smaller models

  # Use quantized models (Q4_K_M instead of F16)
  # Use smaller parameter counts (7B instead of 13B)
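To judge whether a model will fit before spawning a worker, a common rule of thumb is parameters times bits-per-weight divided by 8, plus roughly 20% overhead for the KV cache and runtime buffers. This is an approximation, not an exact figure:

```shell
#!/bin/sh
# Rough memory estimate for a quantized model:
#   params * bits / 8, plus ~20% overhead (rule of thumb only).

model_gb() {
  params_b=$1   # parameters, in billions
  bits=$2       # bits per weight (e.g. 4 for Q4, 16 for F16)
  # Integer math in MB so no bc is needed.
  mb=$(( params_b * bits * 1000 / 8 * 120 / 100 ))
  echo "$(( mb / 1000 )) GB (approx)"
}

model_gb 7 4    # 7B model at Q4
model_gb 13 16  # 13B model at F16
```

By this estimate a 7B model at Q4 needs roughly 4 GB, while a 13B model at F16 needs about 31 GB, which is why quantized models are the usual fix for memory pressure.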
Network Issues
SSE Streams Not Working
Symptom: Heartbeat or job streams don’t receive events
Solutions:
- Check reverse proxy configuration

  # Nginx needs these settings for SSE
  proxy_buffering off;
  proxy_cache off;
  proxy_set_header Connection '';

- Test SSE directly

  # Bypass proxy
  curl -N http://localhost:7833/v1/heartbeats/stream

- Check firewall

  # Ensure long-lived connections are allowed
  sudo ufw status
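Putting the SSE-related nginx directives together, a complete location block might look like the following. The upstream address and the `/v1/` prefix are assumptions for a typical single-host deployment; adapt them to yours:

```nginx
# Illustrative nginx location for proxying Queen's SSE endpoints.
location /v1/ {
    proxy_pass http://127.0.0.1:7833;
    proxy_http_version 1.1;
    proxy_set_header Connection '';
    proxy_buffering off;
    proxy_cache off;
    # SSE streams are long-lived; raise the read timeout.
    proxy_read_timeout 3600s;
}
```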
CORS Errors in Browser
Symptom: Browser console shows CORS errors
Solutions:
- Check CORS configuration

  # Queen/Hive include CORS middleware
  # Should work by default

- Use reverse proxy

  # Add CORS headers in nginx
  add_header Access-Control-Allow-Origin *;
  add_header Access-Control-Allow-Methods "GET, POST, DELETE";
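If you do add CORS headers at the proxy, the browser's preflight `OPTIONS` request must be answered too. A sketch of a fuller nginx block, with example values that should be tightened for production (the upstream address is an assumption):

```nginx
# Illustrative CORS handling in nginx in front of Queen.
location /v1/ {
    add_header Access-Control-Allow-Origin * always;
    add_header Access-Control-Allow-Methods "GET, POST, DELETE" always;
    add_header Access-Control-Allow-Headers "Authorization, Content-Type" always;

    # Answer preflight requests without hitting the backend.
    if ($request_method = OPTIONS) {
        return 204;
    }

    proxy_pass http://127.0.0.1:7833;
}
```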
Build Issues
Compilation Fails
Symptom: cargo build fails
Solutions:
- Update Rust toolchain

  rustup update
  rustup default stable

- Clean build cache

  cargo clean
  cargo build --release

- Check dependencies

  # Install system dependencies
  sudo apt install build-essential pkg-config libssl-dev  # Ubuntu
Catalog Issues
Catalog Initialization Failed
Symptom: Hive fails to start with catalog error
Solutions:
- Check catalog directory

  ls -la ~/.cache/rbee/
  # Should contain:
  # - models.db
  # - workers.db

- Recreate catalogs

  # Remove corrupted databases
  rm ~/.cache/rbee/*.db
  # Restart hive (will recreate)
  rbee-hive

- Check permissions

  # Ensure user can write to cache directory
  chmod 755 ~/.cache/rbee/
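The three checks above can be run as one script before restarting the hive. A sketch, assuming the default `~/.cache/rbee/` location (override with `RBEE_CACHE` if yours differs):

```shell
#!/bin/sh
# Verify the rbee cache directory is usable: it exists, is writable,
# and holds the expected catalog databases.

check_cache_dir() {
  dir=$1
  [ -d "$dir" ] || { echo "missing: $dir"; return 1; }
  [ -w "$dir" ] || { echo "not writable: $dir"; return 1; }
  for db in models.db workers.db; do
    # A missing catalog is not fatal: hive recreates it on start.
    [ -f "$dir/$db" ] || echo "missing catalog: $db (hive will recreate it)"
  done
  echo "cache dir ok"
}

# || true: a missing directory is reported, not treated as a fatal error.
check_cache_dir "${RBEE_CACHE:-$HOME/.cache/rbee}" || true
```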
Getting Help
Collect Diagnostic Information
# System info
uname -a
# Service status
systemctl status queen-rbee rbee-hive
# Logs
sudo journalctl -u queen-rbee -n 100
sudo journalctl -u rbee-hive -n 100
# Network (workers use dynamic ports 8080+)
ss -tlnp | grep -E "(7833|7835|8080|8081|8082)"
# GPU (if applicable)
nvidia-smi
# Disk space
df -h ~/.cache/rbee/

Related Documentation
Queen Configuration
Configure queen-rbee
Hive Configuration
Configure rbee-hive
Security Configuration
Authentication and security
Completed by: TEAM-427
Based on: Production deployment experience and codebase analysis