# Performance Tuning
Optimize rbee for maximum performance and efficiency in production deployments.
## Hardware Selection

### CPU Workers

**Best for:**
- Development and testing
- Low-latency requirements
- Universal compatibility
**Requirements:**
- Modern CPU (4+ cores recommended)
- 8GB+ RAM per worker
- Fast storage (SSD recommended)
### CUDA Workers

**Best for:**
- Production inference
- High throughput
- Large models
**Requirements:**
- NVIDIA GPU (8GB+ VRAM)
- CUDA 11.8+ or 12.x
- PCIe 3.0 x16 or better
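A quick way to confirm a machine meets these requirements, using standard NVIDIA tooling (no rbee-specific commands involved):

```bash
# GPU model, total VRAM, and driver version
nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv
# Installed CUDA toolkit version, if the toolkit is present
nvcc --version
```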
### Metal Workers

**Best for:**
- Apple Silicon (M1/M2/M3)
- macOS development
**Requirements:**
- macOS 10.13+
- Apple Silicon or AMD GPU
## Model Selection

### Quantization

Choose the quantization that fits your use case:
| Quantization | Size | Quality | Speed | Use Case |
|---|---|---|---|---|
| F16 | Largest | Best | Slowest | Research, evaluation |
| Q8_0 | Large | Excellent | Slow | High-quality production |
| Q4_K_M | Medium | Good | Fast | Recommended for production |
| Q4_0 | Small | Acceptable | Fastest | Resource-constrained |
**Recommendation:** Use Q4_K_M for the best balance of quality and performance.
### Model Size
| Parameters | VRAM (Q4) | RAM (CPU) | Tokens/sec (RTX 3090) |
|---|---|---|---|
| 7B | 4-5 GB | 8 GB | 80-100 |
| 13B | 8-9 GB | 16 GB | 40-50 |
| 34B | 20-22 GB | 40 GB | 15-20 |
| 70B | 40-45 GB | 80 GB | 8-10 |
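For sizes not in the table, a rough rule of thumb, derived from the rows above rather than from measured rbee figures, is about 0.63 GB of VRAM per billion parameters at Q4, excluding KV-cache growth with context length:

```bash
# Estimate Q4 VRAM for a given parameter count (in billions)
params_billion=13
awk -v p="$params_billion" 'BEGIN { printf "~%.1f GB VRAM\n", p * 0.63 }'
# prints "~8.2 GB VRAM", consistent with the 13B row above
```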
## Resource Limits

### Memory Management

Monitor memory usage:
```bash
# System memory
free -h
# GPU memory
nvidia-smi
# Per-process memory
ps aux --sort=-%mem | head
```

Best practices:
- Leave 2GB system RAM free
- Leave 1GB VRAM free for OS
- Use swap only as emergency buffer
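A startup guard can enforce the VRAM headroom rule automatically. A minimal sketch, assuming a CUDA worker binary named `llm-worker-cuda` (only the CPU variant `llm-worker-cpu` appears elsewhere in this guide):

```bash
# Refuse to start a worker unless at least 1 GB of VRAM is free
free_mib=$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits | head -n 1)
if [ "$free_mib" -lt 1024 ]; then
  echo "Only ${free_mib} MiB VRAM free; not starting worker" >&2
  exit 1
fi
llm-worker-cuda --port 8080   # binary name assumed
```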
### CPU Allocation

Single worker:

```bash
# Dedicate 4 cores to worker
taskset -c 0-3 llm-worker-cpu --port 8080
```

Multiple workers:
```bash
# Worker 1: cores 0-3
taskset -c 0-3 llm-worker-cpu --port 8080 &
# Worker 2: cores 4-7
taskset -c 4-7 llm-worker-cpu --port 8081 &
```
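On hosts with more cores, the same pattern generalizes. A sketch that pins one worker per four-core group (port numbering and core math are illustrative):

```bash
# Spawn one pinned worker per 4-core group
cores_per_worker=4
total=$(nproc)
port=8080
for start in $(seq 0 "$cores_per_worker" $((total - cores_per_worker))); do
  end=$((start + cores_per_worker - 1))
  taskset -c "${start}-${end}" llm-worker-cpu --port "$port" &
  port=$((port + 1))
done
wait
```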
## Monitoring

### Real-Time Telemetry

Watch heartbeat stream:

```bash
curl -N http://localhost:7833/v1/heartbeats/stream
```

Monitor job progress:

```bash
curl -N http://localhost:7833/v1/jobs/{job_id}/stream
```

### System Monitoring
CPU and memory:

```bash
htop
```

GPU monitoring:

```bash
# Continuous monitoring
nvidia-smi -l 1
# Detailed stats
nvidia-smi --query-gpu=timestamp,name,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.used,memory.free --format=csv -l 1
```

Disk I/O:

```bash
iostat -x 1
```

## Network Optimization
### Localhost Communication

Use localhost for same-machine communication:

```bash
# Faster than routing through an external interface
queen-rbee --port 7833   # binds to 0.0.0.0, but same-machine clients should use localhost
rbee-hive --queen-url http://localhost:7833
```

### SSE Connection Pooling
Limit concurrent SSE connections:
- Each SSE stream holds a connection open
- Limit to 10-20 concurrent streams per client
- Use connection pooling in clients
### Reverse Proxy

Use nginx for connection pooling:

```nginx
upstream queen {
    server 127.0.0.1:7833;
    keepalive 32;
}
server {
    location / {
        proxy_pass http://queen;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_buffering off;   # required so SSE streams are not buffered
    }
}
```

## Inference Optimization
### Batch Processing

Process multiple requests together:
- Increases throughput
- Amortizes per-request overhead
- Better GPU utilization

Trade-off: higher latency for individual requests.
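Batching happens inside the worker, but clients can help by issuing requests concurrently instead of serially. A sketch against the inference endpoint used in the benchmarking section below (payload shape assumed):

```bash
# Fire several requests at once so the worker can batch them
for p in "Hello" "Tell me a joke" "Summarize this log"; do
  curl -s -X POST http://localhost:8080/v1/infer \
    -H 'Content-Type: application/json' \
    -d "{\"prompt\": \"$p\"}" &
done
wait
```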
### Context Caching

Reuse the KV cache for similar prompts:
- Faster subsequent requests
- Lower memory use when prompt prefixes are shared
- Well suited to chat applications
### Parallel Decoding

Use multiple workers for the same model:

```bash
# Spawn 3 workers with same model
# Distribute requests via Queen
```
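Concretely, this might look like the following sketch; the `--model` flag and the `llm-worker-cuda` binary name are assumptions, mirroring the `--port` flag used elsewhere in this guide:

```bash
# Three replicas of the same model on separate ports;
# the Queen distributes jobs across registered workers
for port in 8080 8081 8082; do
  llm-worker-cuda --model llama-7b-q4_k_m --port "$port" &   # flags assumed
done
wait
```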
## Storage Optimization

### Model Storage

Use fast storage for models:

```bash
# Move models to NVMe SSD (mv/rm handle an existing cache dir; a bare ln -s would fail)
mkdir -p /mnt/nvme/rbee/models
mv ~/.cache/rbee/models/* /mnt/nvme/rbee/models/ 2>/dev/null || true
rm -rf ~/.cache/rbee/models
ln -s /mnt/nvme/rbee/models ~/.cache/rbee/models
```

Deduplicate models:
```bash
# Use hard links for duplicate models
# (Not implemented yet - manual for now)
```
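Until built-in deduplication lands, a manual sketch (paths are illustrative; hard links only work within a single filesystem):

```bash
# Replace a duplicate model file with a hard link after verifying contents match
a=~/.cache/rbee/models/llama-7b-q4_k_m.gguf    # original (hypothetical path)
b=/srv/shared/models/llama-7b-q4_k_m.gguf      # duplicate (hypothetical path)
if [ "$(sha256sum < "$a")" = "$(sha256sum < "$b")" ]; then
  ln -f "$a" "$b"   # the duplicate now shares the original's disk blocks
fi
```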
### Database Optimization

SQLite tuning:

```sql
-- Add to catalog initialization
PRAGMA journal_mode = WAL;
PRAGMA synchronous = NORMAL;
PRAGMA cache_size = 10000;
PRAGMA temp_store = MEMORY;
```

Note that `journal_mode = WAL` persists in the database file, while the other pragmas apply per connection and must be set each time the catalog is opened.

## Benchmarking
### Measure Throughput

Tokens per second:

```bash
# Time a full inference and count output words (a rough proxy for tokens)
time curl -s -X POST http://localhost:8080/v1/infer \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "Tell me a story"}' | wc -w
```

### Measure Latency
Time to first token:

```bash
# Measure TTFT
time curl -N http://localhost:8080/v1/infer \
  -d '{"prompt": "Hello"}' \
  | head -n 1
```
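curl can also report time-to-first-byte directly, which is often easier to read than `time` output (same endpoint and payload as above):

```bash
# %{time_starttransfer} prints seconds until the first response byte
curl -s -o /dev/null -w 'TTFT: %{time_starttransfer}s\n' \
  -X POST http://localhost:8080/v1/infer \
  -d '{"prompt": "Hello"}'
```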
### Load Testing
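Create a request body first; the exact job payload schema isn't shown in this guide, so this minimal file is an assumption modeled on the inference examples above:

```bash
cat > request.json <<'EOF'
{"prompt": "Hello"}
EOF
```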
Use Apache Bench:

```bash
# 100 requests, 10 concurrent
ab -n 100 -c 10 -p request.json \
  -T application/json \
  http://localhost:7833/v1/jobs
```

## Production Checklist
- Use Q4_K_M quantization for models
- Allocate sufficient VRAM/RAM per worker
- Monitor system resources (CPU, memory, GPU)
- Use SSD for model storage
- Configure reverse proxy for TLS
- Set up monitoring and alerting
- Test under expected load
- Document performance baselines
- Plan for scaling (add more hives)
## Related Documentation

- Worker Types: hardware selection guide
- Heartbeat Architecture: telemetry system
- Troubleshooting: performance issues
Completed by: TEAM-427
Based on: Production deployment best practices