
Performance Tuning

Optimize rbee for maximum performance and efficiency in production deployments.

Hardware Selection

CPU Workers

Best for:

  • Development and testing
  • Low-latency requirements
  • Universal compatibility

Requirements:

  • Modern CPU (4+ cores recommended)
  • 8GB+ RAM per worker
  • Fast storage (SSD recommended)

CUDA Workers

Best for:

  • Production inference
  • High throughput
  • Large models

Requirements:

  • NVIDIA GPU (8GB+ VRAM)
  • CUDA 11.8+ or 12.x
  • PCIe 3.0 x16 or better
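
A quick way to confirm a host meets these requirements is to query the GPU directly with standard NVIDIA tooling; the minimum values shown in the comments mirror the list above:

# VRAM per GPU (expect 8192 MiB or more)
nvidia-smi --query-gpu=name,memory.total --format=csv,noheader

# Driver-supported CUDA version (expect 11.8+ or 12.x)
nvidia-smi | grep "CUDA Version"

# Current PCIe generation and link width
nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current --format=csv,noheader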

Metal Workers

Best for:

  • Apple Silicon (M1/M2/M3)
  • macOS development

Requirements:

  • macOS 10.13+ (macOS 11+ on Apple Silicon)
  • Apple Silicon or AMD GPU

Model Selection

Quantization

Choose appropriate quantization for your use case:

Quantization  Size     Quality     Speed    Use Case
F16           Largest  Best        Slowest  Research, evaluation
Q8_0          Large    Excellent   Slow     High-quality production
Q4_K_M        Medium   Good        Fast     Recommended for production
Q4_0          Small    Acceptable  Fastest  Resource-constrained

Recommendation: Use Q4_K_M for the best balance of quality and performance.

Model Size

Parameters  VRAM (Q4)  RAM (CPU)  Tokens/sec (RTX 3090)
7B          4-5 GB     8 GB       80-100
13B         8-9 GB     16 GB      40-50
34B         20-22 GB   40 GB      15-20
70B         40-45 GB   80 GB      8-10
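
As a rough rule of thumb consistent with the table above, Q4-quantized weights need about 0.6 GB of VRAM per billion parameters plus roughly 1 GB of overhead for the KV cache and runtime. A minimal sketch of that estimate (the factor and overhead are approximations, not exact rbee figures):

# Estimate Q4 VRAM need in GB for a given parameter count (in billions)
estimate_q4_vram() {
  local params_b=$1
  # ~0.6 GB per billion parameters at Q4, plus ~1 GB runtime/KV-cache overhead
  echo "$params_b * 0.6 + 1" | bc
}

estimate_q4_vram 7    # ~5.2 GB, in line with the 4-5 GB listed for 7B
estimate_q4_vram 70   # ~43 GB, in line with the 40-45 GB listed for 70B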

Resource Limits

Memory Management

Monitor memory usage:

# System memory
free -h

# GPU memory
nvidia-smi

# Per-process memory
ps aux --sort=-%mem | head

Best practices:

  • Leave 2GB system RAM free
  • Leave 1GB VRAM free for OS
  • Use swap only as emergency buffer
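
A small check script can enforce these headroom targets before spawning another worker; this is a sketch assuming nvidia-smi is available, with thresholds taken from the guidance above:

# Abort if free system RAM or free VRAM drops below the recommended headroom
free_ram_mb=$(free -m | awk '/^Mem:/ {print $7}')   # "available" column
free_vram_mb=$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits | head -n 1)

[ "$free_ram_mb" -ge 2048 ] || { echo "Less than 2GB system RAM free"; exit 1; }
[ "$free_vram_mb" -ge 1024 ] || { echo "Less than 1GB VRAM free"; exit 1; }
echo "Enough headroom to start another worker"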

CPU Allocation

Single worker:

# Dedicate 4 cores to the worker
taskset -c 0-3 llm-worker-cpu --port 8080

Multiple workers:

# Worker 1: cores 0-3
taskset -c 0-3 llm-worker-cpu --port 8080 &

# Worker 2: cores 4-7
taskset -c 4-7 llm-worker-cpu --port 8081 &
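
For more than two workers, the same pattern can be scripted; here is a sketch that pins one worker per 4-core block, reusing the worker binary and --port flag from the examples above:

# Spawn one CPU worker per 4-core block, pinned with taskset
cores_per_worker=4
num_workers=4
for i in $(seq 0 $((num_workers - 1))); do
  start=$((i * cores_per_worker))
  end=$((start + cores_per_worker - 1))
  taskset -c "${start}-${end}" llm-worker-cpu --port $((8080 + i)) &
done
wait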

Monitoring

Real-Time Telemetry

Watch heartbeat stream:

curl -N http://localhost:7833/v1/heartbeats/stream

Monitor job progress:

curl -N http://localhost:7833/v1/jobs/{job_id}/stream
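
To watch a single worker without a dashboard, the heartbeat stream can be filtered on the command line. This sketch assumes standard SSE framing ("data:" lines) and that heartbeat payloads contain the worker ID, which may differ in practice; the worker ID shown is illustrative:

# Follow the heartbeat stream and show only events mentioning one worker
curl -sN http://localhost:7833/v1/heartbeats/stream \
  | grep --line-buffered '^data:' \
  | grep --line-buffered 'worker-1'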

System Monitoring

CPU and memory:

htop

GPU monitoring:

# Continuous monitoring
nvidia-smi -l 1

# Detailed stats
nvidia-smi --query-gpu=timestamp,name,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.used,memory.free --format=csv -l 1

Disk I/O:

iostat -x 1

Network Optimization

Localhost Communication

Use localhost for same-machine communication:

# Connecting over localhost is faster than going through an external interface
queen-rbee --port 7833   # binds to 0.0.0.0, but same-machine clients should use localhost
rbee-hive --queen-url http://localhost:7833

SSE Connection Pooling

Limit concurrent SSE connections:

  • Each SSE stream holds a connection open
  • Limit to 10-20 concurrent streams per client
  • Use connection pooling in clients
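
One simple way to enforce a client-side cap from the shell is xargs with a bounded process count; this sketch assumes job IDs are listed one per line in jobs.txt, a hypothetical file:

# Follow at most 10 job streams concurrently
mkdir -p logs
xargs -a jobs.txt -P 10 -I{} \
  curl -sN "http://localhost:7833/v1/jobs/{}/stream" -o "logs/{}.log"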

Reverse Proxy

Use nginx for connection pooling:

upstream queen {
    server 127.0.0.1:7833;
    keepalive 32;
}

server {
    location / {
        proxy_pass http://queen;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
}

Inference Optimization

Batch Processing

Process multiple requests together:

  • Increases throughput
  • Amortizes per-request overhead
  • Better GPU utilization

Trade-off: Higher latency for individual requests

Context Caching

Reuse KV cache for similar prompts:

  • Faster subsequent requests
  • Reduced memory usage
  • Better for chat applications

Parallel Decoding

Use multiple workers for the same model:

# Spawn 3 workers with the same model
# Distribute requests via Queen
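
A sketch of that layout on a single machine, using the CPU worker binary from earlier; the --model flag and model name are assumptions, so check the worker's actual CLI before relying on them:

# Three workers serving the same model on different ports (--model is hypothetical)
for port in 8080 8081 8082; do
  llm-worker-cpu --port "$port" --model llama-3-8b-q4_k_m &
done
wait
# The Queen then distributes incoming jobs across the registered workers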

Storage Optimization

Model Storage

Use fast storage for models:

# Move models to an NVMe SSD and symlink the default cache path
mkdir -p /mnt/nvme/rbee/models
mv ~/.cache/rbee/models/* /mnt/nvme/rbee/models/ 2>/dev/null || true
rm -rf ~/.cache/rbee/models
ln -s /mnt/nvme/rbee/models ~/.cache/rbee/models

Deduplicate models:

# Use hard links for duplicate models
# (Not implemented yet - manual for now)
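
Until deduplication lands, duplicates can be found and hard-linked by hand. The following sketch hashes model files and links byte-identical ones; the .gguf extension is an assumption, it only handles filenames without spaces, and hard links require all files to live on one filesystem, so test on a copy first:

# Hash all model files, then hard-link any byte-identical duplicates
cd ~/.cache/rbee/models
sha256sum *.gguf | sort | awk '
  $1 == prev_hash { print "ln -f \"" prev_file "\" \"" $2 "\"" }
  { prev_hash = $1; prev_file = $2 }
' | sh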

Database Optimization

SQLite tuning:

-- Add to catalog initialization
PRAGMA journal_mode = WAL;
PRAGMA synchronous = NORMAL;
PRAGMA cache_size = 10000;
PRAGMA temp_store = MEMORY;
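
These pragmas can also be applied to an existing catalog database with the sqlite3 CLI. Note that only journal_mode persists in the database file; the others are per-connection settings that the process opening the database must set at startup. The catalog path below is an assumption:

# Enable WAL persistently on the catalog database (path is illustrative)
sqlite3 ~/.cache/rbee/catalog.db 'PRAGMA journal_mode = WAL;'

# synchronous, cache_size, and temp_store are per-connection and must be set
# by the process that opens the database, not via a one-off CLI call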

Benchmarking

Measure Throughput

Tokens per second:

# Run inference, measure wall-clock time, and count output words (a rough proxy for tokens)
time (echo '{"prompt": "Tell me a story"}' | \
  curl -s -X POST http://localhost:8080/v1/infer \
  -H 'Content-Type: application/json' \
  -d @- | wc -w)

Divide the word count by the elapsed seconds for an approximate tokens-per-second figure.

Measure Latency

Time to first token:

# Measure TTFT
time curl -N http://localhost:8080/v1/infer \
  -d '{"prompt": "Hello"}' \
  | head -n 1
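
curl's built-in timing variables give a cleaner TTFT number than wrapping the command in time, since they isolate time-to-first-byte from total transfer time:

# time_starttransfer ~ time to first token byte; time_total = full response time
curl -sN -o /dev/null \
  -w 'TTFB: %{time_starttransfer}s  total: %{time_total}s\n' \
  -X POST http://localhost:8080/v1/infer \
  -d '{"prompt": "Hello"}'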

Load Testing

Use Apache Bench:

# 100 requests, 10 concurrent
ab -n 100 -c 10 -p request.json \
  -T application/json \
  http://localhost:7833/v1/jobs
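
ab reads the POST body from a file. The exact job schema is not shown here, so the payload below is a placeholder to adapt to the actual /v1/jobs request format:

# Create the request body used by the ab command above (fields are illustrative)
cat > request.json <<'EOF'
{
  "prompt": "Hello",
  "max_tokens": 64
}
EOF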

Production Checklist

  • Use Q4_K_M quantization for models
  • Allocate sufficient VRAM/RAM per worker
  • Monitor system resources (CPU, memory, GPU)
  • Use SSD for model storage
  • Configure reverse proxy for TLS
  • Set up monitoring and alerting
  • Test under expected load
  • Document performance baselines
  • Plan for scaling (add more hives)

Completed by: TEAM-427
Based on: Production deployment best practices
