
Common Issues & Troubleshooting

Comprehensive troubleshooting guide for rbee deployment and operation issues.

Quick Diagnostics

Health Check All Services

# Check Queen
curl http://localhost:7833/health

# Check Hive
curl http://localhost:7835/health

# Check Worker (if running - port assigned dynamically by hive)
# Example: curl http://localhost:8080/health
# Use 'ps aux | grep worker' to find actual port
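
For a quick overview of everything at once, the commands above can be wrapped in a short script. This is a minimal sketch: it assumes the default ports (7833 for the Queen, 7835 for the Hive) and probes a small range of worker ports as a best effort, since worker ports are assigned dynamically.

#!/usr/bin/env bash
# Minimal health-check sketch. Assumes default Queen/Hive ports; worker
# ports are dynamic, so the 8080-8090 range below is only a best-effort probe.
set -u

check() {
  local name="$1" url="$2"
  if curl -fsS --max-time 5 "$url" > /dev/null; then
    echo "OK    $name ($url)"
  else
    echo "FAIL  $name ($url)"
  fi
}

check "queen-rbee" "http://localhost:7833/health"
check "rbee-hive"  "http://localhost:7835/health"

for port in $(seq 8080 8090); do
  if curl -fsS --max-time 2 "http://localhost:${port}/health" > /dev/null 2>&1; then
    echo "OK    worker (http://localhost:${port}/health)"
  fi
done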

Check Service Status

# Check if services are running
ps aux | grep -E "(queen-rbee|rbee-hive)"

# Check listening ports
ss -tlnp | grep -E "(7833|7835|8080|8081|8082)"

# Check systemd services
sudo systemctl status queen-rbee
sudo systemctl status rbee-hive

Connection Issues

Queen Won’t Start

Symptom: Queen fails to start or exits immediately

Common Causes:

  1. Port already in use

    # Check what's using port 7833
    lsof -i :7833

    # Kill the process
    kill $(lsof -t -i:7833)

    # Or use a different port
    queen-rbee --port 8080
  2. Permission denied

    # Don't run as root
    queen-rbee --port 7833

    # If using systemd, check service user
    sudo systemctl status queen-rbee
  3. Missing dependencies

    # Check binary
    ldd $(which queen-rbee)

    # Rebuild if needed
    cargo build --release --bin queen-rbee
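
The three causes above can also be checked in one pass before starting the Queen. The snippet below is a sketch and assumes the default port 7833; adjust it if you start the Queen with --port.

# Queen pre-start checks (sketch; assumes default port 7833)
PORT=7833

lsof -i :"$PORT" && echo "Port $PORT is already in use"
[ "$(id -u)" -eq 0 ] && echo "Warning: running as root is not recommended"
ldd "$(which queen-rbee)" | grep -i "not found" && echo "Missing shared libraries"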

Hive Can’t Connect to Queen

Symptom: Hive starts but doesn’t appear in Queen’s telemetry

Solutions:

  1. Verify Queen is reachable

    # From hive machine
    curl http://localhost:7833/health

    # If remote
    curl http://queen-host:7833/health
  2. Check Queen URL configuration

    # Ensure correct Queen URL
    rbee-hive --queen-url http://localhost:7833 --hive-id my-hive

    # For remote Queen
    rbee-hive --queen-url http://queen-host:7833 --hive-id my-hive
  3. Check firewall rules

    # On Queen machine, allow port 7833
    sudo ufw allow 7833/tcp

    # Test connectivity
    telnet queen-host 7833
  4. Check hive discovery logs

    # Look for discovery errors
    sudo journalctl -u rbee-hive | grep discovery
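
To narrow down whether the problem is reachability or configuration, a quick check like the sketch below can be run from the hive machine. QUEEN_URL is an assumption; set it to the same value you pass via --queen-url.

#!/usr/bin/env bash
# Reachability check from the hive machine (sketch).
QUEEN_URL="${QUEEN_URL:-http://localhost:7833}"

if curl -fsS --max-time 5 "${QUEEN_URL}/health"; then
  echo "Queen is reachable; re-check the --queen-url and --hive-id flags and the discovery logs."
else
  echo "Queen is not reachable; check DNS, the firewall (port 7833), and that queen-rbee is running."
fi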

Worker Issues

Worker Won’t Spawn

Symptom: Worker spawn job fails or worker exits immediately

Common Causes:

  1. Worker binary not found

    # Check if worker binary exists
    which llm-worker-cpu
    which llm-worker-cuda
    which llm-worker-metal

    # Install if missing
    cargo build --release --bin llm-worker-cpu
  2. GPU not available (CUDA workers)

    # Check GPU status
    nvidia-smi

    # Check CUDA installation
    nvcc --version

    # Check device permissions
    ls -l /dev/nvidia*
  3. Model not found

    # Check model catalog
    ls ~/.cache/rbee/models/

    # Download model first
    # (Use Hive API to download model)
  4. Port already in use

    # Workers use dynamic ports (8080-9999)
    # Hive automatically assigns available ports

    # Check what's using a specific port
    lsof -i :8080
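
A small pre-spawn check covering the causes above can save a round trip through the spawn job. This sketch assumes a CUDA worker and the default model cache path; drop the GPU checks for CPU workers.

# Worker pre-spawn checks (sketch; CUDA worker assumed)
command -v llm-worker-cuda >/dev/null || echo "llm-worker-cuda not on PATH"
command -v nvidia-smi >/dev/null && nvidia-smi -L || echo "nvidia-smi not available"
ls ~/.cache/rbee/models/*.gguf 2>/dev/null || echo "No models in ~/.cache/rbee/models/"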

Worker Crashes After Spawn

Symptom: Worker starts but crashes during inference

Solutions:

  1. Out of memory

    # Check available RAM/VRAM
    free -h
    nvidia-smi  # For GPU memory

    # Use smaller model or reduce batch size
  2. Model file corrupted

    # Remove and re-download model
    rm ~/.cache/rbee/models/model-name.gguf

    # Re-download via Hive API
  3. Check worker logs

    # Worker logs go to stdout
    # If using systemd (port assigned dynamically):
    # sudo journalctl -u llm-worker@<PORT>

    # Find worker PID:
    ps aux | grep worker
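
On Linux, a crash right after the model loads is often the kernel OOM killer rather than the worker itself. Checking the kernel log, as in the sketch below, confirms or rules that out.

# Check whether the kernel OOM killer terminated the worker (Linux)
sudo dmesg -T | grep -iE "out of memory|oom-kill" | tail -n 5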

Model Download Issues

Download Fails or Hangs

Symptom: Model download doesn’t complete

Solutions:

  1. Check network connectivity

    # Test HuggingFace access
    curl -I https://huggingface.co

    # Check DNS
    nslookup huggingface.co
  2. Check disk space

    # Models can be large (1-50GB)
    df -h ~/.cache/rbee/

    # Clean up old models
    rm ~/.cache/rbee/models/*.gguf
  3. Check download progress

    # Monitor download directory
    watch -n 1 'ls -lh ~/.cache/rbee/models/'

    # Check Hive logs
    sudo journalctl -u rbee-hive -f
  4. Retry download

    # Model provisioner supports resume
    # Just retry the download operation
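
Before retrying, it is worth confirming connectivity and free space in one step, as in the sketch below (paths match the default cache location used above).

# Pre-retry checks (sketch): connectivity, DNS, and free space
curl -fsI https://huggingface.co > /dev/null && echo "huggingface.co reachable" || echo "Cannot reach huggingface.co"
df -h ~/.cache/rbee/ | tail -n 1
du -sh ~/.cache/rbee/models/ 2>/dev/null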

GPU Detection Issues

CUDA Workers Can’t Find GPU

Symptom: CUDA worker fails with “no CUDA devices found”

Solutions:

  1. Verify GPU is detected

    # Check GPU status
    nvidia-smi

    # Check CUDA version
    nvcc --version
  2. Check NVIDIA driver

    # Check driver version
    nvidia-smi | head -n 3

    # Update driver if needed
    sudo ubuntu-drivers autoinstall  # Ubuntu
  3. Check device permissions

    # Add user to video group
    sudo usermod -a -G video $USER

    # Logout and login for changes to take effect
  4. Check CUDA library path

    # Ensure CUDA libraries are in LD_LIBRARY_PATH
    export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

    # Add to ~/.bashrc for persistence
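
The checks above can be combined into a single CUDA environment report. The sketch below assumes the standard /usr/local/cuda install location; adjust the path for custom installs.

# CUDA environment report (sketch; assumes /usr/local/cuda)
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv
echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-<unset>}"
ls /usr/local/cuda/lib64/libcudart.so* 2>/dev/null || echo "CUDA runtime not found under /usr/local/cuda"
ls -l /dev/nvidia* 2>/dev/null || echo "No /dev/nvidia* device nodes"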

Metal Workers (macOS)

Symptom: Metal worker fails on macOS

Solutions:

  1. Check Metal support

    # Metal requires macOS 10.13+
    system_profiler SPSoftwareDataType | grep "System Version"
  2. Check GPU availability

    # List GPUs
    system_profiler SPDisplaysDataType | grep "Chipset Model"

Performance Issues

Slow Inference

Symptom: Inference takes longer than expected

Solutions:

  1. Check resource usage

    # Monitor CPU/GPU usage
    htop
    nvidia-smi -l 1  # For GPU
  2. Check model size vs hardware

    # Ensure model fits in VRAM/RAM
    # Use smaller quantization (Q4 instead of Q8)
  3. Check for CPU throttling

    # Check CPU frequency
    watch -n 1 'cat /proc/cpuinfo | grep MHz'

High Memory Usage

Symptom: System runs out of memory

Solutions:

  1. Monitor memory usage

    # Check memory
    free -h

    # Check per-process memory
    ps aux --sort=-%mem | head
  2. Reduce worker count

    # Spawn fewer workers
    # Each worker loads model into memory
  3. Use smaller models

    # Use quantized models (Q4_K_M instead of F16)
    # Use smaller parameter counts (7B instead of 13B)

Network Issues

SSE Streams Not Working

Symptom: Heartbeat or job streams don’t receive events

Solutions:

  1. Check reverse proxy configuration

    # Nginx needs these settings for SSE
    proxy_buffering off;
    proxy_cache off;
    proxy_set_header Connection '';
  2. Test SSE directly

    # Bypass proxy
    curl -N http://localhost:7833/v1/heartbeats/stream
  3. Check firewall

    # Ensure long-lived connections are allowed
    sudo ufw status
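
Comparing the stream fetched directly against the stream fetched through the proxy usually identifies the culprit. The sketch below uses the heartbeat endpoint from step 2 and a placeholder proxy-host; substitute your proxy's address.

# Compare SSE direct vs. through the proxy (sketch; proxy-host is a placeholder)
curl -N --max-time 10 http://localhost:7833/v1/heartbeats/stream | head -n 5
curl -N --max-time 10 http://proxy-host/v1/heartbeats/stream | head -n 5

# If the direct stream emits events but the proxied one stays silent,
# the proxy is buffering; apply the nginx settings from step 1.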

CORS Errors in Browser

Symptom: Browser console shows CORS errors

Solutions:

  1. Check CORS configuration

    # Queen/Hive include CORS middleware
    # Should work by default
  2. Use reverse proxy

    # Add CORS headers in nginx
    add_header Access-Control-Allow-Origin *;
    add_header Access-Control-Allow-Methods "GET, POST, DELETE";
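
To see which CORS headers actually reach the browser, send a preflight-style OPTIONS request with curl. The origin and endpoint below are illustrative assumptions; use the origin your frontend runs on and an endpoint your frontend calls.

# Inspect CORS response headers (sketch; origin and endpoint are illustrative)
curl -s -D - -o /dev/null \
  -X OPTIONS \
  -H "Origin: http://localhost:3000" \
  -H "Access-Control-Request-Method: POST" \
  http://localhost:7833/health

# Look for Access-Control-Allow-Origin in the response headers.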

Build Issues

Compilation Fails

Symptom: cargo build fails

Solutions:

  1. Update Rust toolchain

    rustup update
    rustup default stable
  2. Clean build cache

    cargo clean
    cargo build --release
  3. Check dependencies

    # Install system dependencies
    sudo apt install build-essential pkg-config libssl-dev  # Ubuntu

Catalog Issues

Catalog Initialization Failed

Symptom: Hive fails to start with catalog error

Solutions:

  1. Check catalog directory

    ls -la ~/.cache/rbee/

    # Should contain:
    # - models.db
    # - workers.db
  2. Recreate catalogs

    # Remove corrupted databases
    rm ~/.cache/rbee/*.db

    # Restart hive (will recreate)
    rbee-hive
  3. Check permissions

    # Ensure user can write to cache directory
    chmod 755 ~/.cache/rbee/

Getting Help

Collect Diagnostic Information

# System info
uname -a

# Service status
systemctl status queen-rbee rbee-hive

# Logs
sudo journalctl -u queen-rbee -n 100
sudo journalctl -u rbee-hive -n 100

# Network (workers use dynamic ports 8080+)
ss -tlnp | grep -E "(7833|7835|8080|8081|8082)"

# GPU (if applicable)
nvidia-smi

# Disk space
df -h ~/.cache/rbee/
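
To share these with others in one go, the commands can be bundled into a single report file. This is a sketch; the output filename is arbitrary and the set of commands mirrors the list above.

#!/usr/bin/env bash
# Collect diagnostics into one file (sketch; mirrors the commands above).
OUT="rbee-diagnostics-$(date +%Y%m%d-%H%M%S).txt"
{
  echo "== system ==";     uname -a
  echo "== services ==";   systemctl status queen-rbee rbee-hive --no-pager 2>&1
  echo "== queen logs =="; sudo journalctl -u queen-rbee -n 100 --no-pager 2>&1
  echo "== hive logs ==";  sudo journalctl -u rbee-hive -n 100 --no-pager 2>&1
  echo "== ports ==";      ss -tlnp | grep -E "(7833|7835|808[0-9])"
  echo "== gpu ==";        nvidia-smi 2>&1 || echo "no NVIDIA GPU"
  echo "== disk ==";       df -h ~/.cache/rbee/
} > "$OUT"
echo "Wrote $OUT"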

Completed by: TEAM-427
Based on: Production deployment experience and codebase analysis
