Premium Modules
Premium modules add production-grade features to rbee for businesses running AI infrastructure as a product. This page provides a detailed comparison and feature breakdown.
Note: All premium features are clearly marked throughout the documentation with a Premium badge.
Feature comparison matrix
| Feature | Open Source | Premium Queen | Premium Worker | GDPR Auditing |
|---|---|---|---|---|
| Routing & Scheduling | ||||
| Basic round-robin routing | ✅ | ✅ | ✅ | ✅ |
| Weighted least-loaded routing | ❌ | ✅ | - | - |
| Latency-aware routing | ❌ | ✅ | - | - |
| Priority queue support | ❌ | ✅ | ✅ | - |
| Automatic failover | ❌ | ✅ | - | - |
| Custom routing policies | ❌ | ✅ | - | - |
| Quotas & Limits | ||||
| Basic rate limiting | ❌ | ✅ | - | - |
| Per-customer quotas | ❌ | ✅ | - | - |
| Per-tier quotas | ❌ | ✅ | - | - |
| Token-based billing | ❌ | ✅ | ✅ | - |
| Time-based billing | ❌ | ✅ | ✅ | - |
| Telemetry & Metrics | ||||
| Basic worker status | ✅ | ✅ | ✅ | ✅ |
| Detailed performance metrics | ❌ | ✅ | ✅ | - |
| Prometheus export | ❌ | ✅ | ✅ | - |
| Per-request metrics | ❌ | - | ✅ | - |
| Cost tracking | ❌ | ✅ | ✅ | - |
| Compliance & Auditing | ||||
| Basic request logging | ✅ | ✅ | ✅ | ✅ |
| Complete audit trail | ❌ | - | - | ✅ |
| Data lineage tracking | ❌ | - | - | ✅ |
| Right-to-erasure support | ❌ | - | - | ✅ |
| Automated compliance reports | ❌ | - | - | ✅ |
| 7-year log retention | ❌ | - | - | ✅ |
| Multi-tenancy | ||||
| Single-user mode | ✅ | ✅ | ✅ | ✅ |
| Multi-user support | ❌ | ✅ | - | ✅ |
| User authentication | ❌ | ✅ | - | ✅ |
| API key management | ❌ | ✅ | - | - |
| Tenant isolation | ❌ | ✅ | - | ✅ |
| Performance | ||||
| Basic batching | ✅ | ✅ | ✅ | ✅ |
| Advanced batching | ❌ | - | ✅ | - |
| Memory optimization | ❌ | - | ✅ | - |
| Custom model loading | ❌ | - | ✅ | - |
Premium Queen
Premium Queen replaces the open source queen with production-grade orchestration features.
Advanced routing
Weighted least-loaded routing: Distributes requests based on current worker load, weighted by GPU capability.
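The following Python sketch illustrates one way a weighted least-loaded selection could work. It is not rbee's implementation: the `Worker` fields, the single capability `weight`, and the scoring formula (requests in flight divided by weight) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    # All fields are hypothetical; rbee's real worker state may differ.
    name: str
    active_requests: int  # requests currently in flight
    weight: float         # relative GPU capability (> 0), e.g. derived from VRAM or compute


def pick_worker(workers: list[Worker]) -> Worker:
    """Return the worker with the lowest load relative to its capability."""
    if not workers:
        raise ValueError("no workers available")
    # A worker with twice the weight can carry twice the load
    # before it counts as equally busy.
    return min(workers, key=lambda w: w.active_requests / w.weight)


workers = [
    Worker("gpu-small", active_requests=2, weight=1.0),  # score 2.0
    Worker("gpu-large", active_requests=3, weight=2.0),  # score 1.5
]
chosen = pick_worker(workers)
print(chosen.name)
```

In this example gpu-large is selected even though it has more requests in flight, because its load relative to capability (1.5) is lower than that of gpu-small (2.0).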
# Configure weighted routing
premium-queen routing set-strategy weighted-least-loaded \
  --weight-by-vram true \
  --weight-by-compute true
Latency-aware routing: Routes requests to the workers with the lowest historical latency.
# Enable latency-aware routing
premium-queen routing set-strategy latency-aware \
  --latency-window-seconds 300 \
  --prefer-local-hives true
Custom routing policies: Define complex routing rules based on model, customer tier, time of day, and similar criteria.
# Route expensive models to premium GPUs
premium-queen routing add-rule \
  --model-pattern "llama-3.1-405b" \
  --hive-tier premium,ultra \
  --priority high
# Route free-tier customers to budget GPUs
premium-queen routing add-rule \
  --customer-tier free \
  --hive-tier budget \
  --max-concurrent-requests 5
Quota management
Per-customer quotas:
# Set quota for a specific customer
premium-queen quota set \
  --customer acme-corp \
  --max-requests-per-minute 100 \
  --max-tokens-per-day 1000000 \
  --max-concurrent-requests 10
Per-tier quotas:
# Set quota for free tier
premium-queen quota set \
  --tier free \
  --max-requests-per-minute 10 \
  --max-tokens-per-day 10000
# Set quota for paid tier
premium-queen quota set \
  --tier paid \
  --max-requests-per-minute 1000 \
  --max-tokens-per-day 10000000
Quota enforcement:
- Requests that exceed a quota return HTTP 429 (Too Many Requests)
- Quotas reset automatically at the end of each window (per minute, per hour, per day)
- Soft limits (warnings) and hard limits (rejections)
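The enforcement rules above can be sketched as a fixed-window counter. This is a minimal illustration, not Premium Queen's actual logic: the class name, the fixed 60-second window, and the way soft and hard limits interact are all assumptions.

```python
import time


class MinuteQuota:
    """Fixed-window request counter for one customer (illustrative only)."""

    def __init__(self, hard_limit: int, soft_limit: int, clock=time.monotonic):
        self.hard_limit = hard_limit
        self.soft_limit = soft_limit
        self.clock = clock              # injectable so the example can fake time
        self.window_start = clock()
        self.count = 0

    def check(self) -> tuple[int, str]:
        """Return (HTTP status, note) for one incoming request."""
        now = self.clock()
        if now - self.window_start >= 60:
            # A new window has begun, so the quota resets automatically.
            self.window_start = now
            self.count = 0
        if self.count >= self.hard_limit:
            return 429, "hard limit reached"
        self.count += 1
        if self.count > self.soft_limit:
            return 200, "warning: soft limit exceeded"
        return 200, "ok"


fake_now = [0.0]
quota = MinuteQuota(hard_limit=3, soft_limit=2, clock=lambda: fake_now[0])
results = [quota.check() for _ in range(4)]
fake_now[0] = 61.0          # move past the end of the window
after_reset = quota.check()
```

With a hard limit of 3 and a soft limit of 2, the third request is accepted with a warning, the fourth is rejected with 429, and the first request after the window ends is accepted again.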
Automatic failover
When a worker crashes or becomes unresponsive:
# Configure failover
premium-queen routing set-failover \
  --enable true \
  --retry-attempts 3 \
  --retry-delay-ms 1000 \
  --fallback-to-cpu false
Premium Queen will:
- Detect worker failure (timeout or error)
- Retry on another worker with the same model
- Queue the request if no workers are available
- Return error after max retries
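The retry behaviour described above can be sketched as follows. This is an assumption-laden illustration rather than Premium Queen's code: `WorkerError`, `send`, and the worker names are hypothetical, queueing when no workers are available is left out, and counting one initial attempt plus `retry_attempts` retries is a guess, since the source does not say how attempts are counted.

```python
import time


class WorkerError(Exception):
    """Raised by a hypothetical worker call on timeout or failure."""


def run_with_failover(request, workers, send, retry_attempts=3,
                      retry_delay_s=1.0, sleep=time.sleep):
    """Try the request on successive workers; raise once retries are exhausted."""
    last_error = None
    for attempt in range(retry_attempts + 1):
        if not workers:
            break  # a real system might queue the request here instead
        # Rotate through the workers so each retry goes to a different one
        # (with a single worker, the same worker is retried).
        worker = workers[attempt % len(workers)]
        try:
            return send(worker, request)
        except WorkerError as err:
            last_error = err
            if attempt < retry_attempts:
                sleep(retry_delay_s)
    raise RuntimeError("request failed after retries") from last_error


def flaky_send(worker, request):
    # Stand-in for a real inference call: worker-a always times out.
    if worker == "worker-a":
        raise WorkerError("timeout")
    return f"{worker} handled {request}"


result = run_with_failover("req-1", ["worker-a", "worker-b"], flaky_send,
                           sleep=lambda seconds: None)
print(result)
```

The `sleep` parameter is injected only so the example runs without waiting; a caller would normally leave it at its default.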
Billing integration
Token-based billing:
# Set token prices per tier
premium-queen billing set-rate \
  --tier budget \
  --price-per-million-tokens 0.50
premium-queen billing set-rate \
  --tier premium \
  --price-per-million-tokens 2.00
# Export billing data
premium-queen billing export \
  --period 2024-01 \
  --format csv \
  --output billing-jan-2024.csv
Time-based billing:
# Set GPU-hour prices
premium-queen billing set-rate \
  --tier budget \
  --price-per-gpu-hour 0.50
# View cost breakdown
premium-queen billing report \
  --customer acme-corp \
  --period last-month
Telemetry
Prometheus metrics export:
# Enable Prometheus endpoint
premium-queen metrics enable-prometheus \
  --port 9090 \
  --path /metrics
Exported metrics include:
- rbee_requests_total - Total requests by model, customer, status
- rbee_request_duration_seconds - Request latency histogram
- rbee_tokens_generated_total - Total tokens by model, customer
- rbee_worker_utilization - Worker busy percentage
- rbee_quota_remaining - Remaining quota by customer, tier
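For orientation, the sketch below renders one counter in the Prometheus text exposition format, which is what a scrape of such an endpoint returns. The label names (`model`, `customer`, `status`) and the sample value are assumptions inferred from the descriptions above, not confirmed rbee output, and label-value escaping is omitted.

```python
def render_counter(name, help_text, samples):
    """Render one counter family in the Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        # Labels are sorted only to make the output deterministic.
        label_str = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


text = render_counter(
    "rbee_requests_total",
    "Total requests by model, customer, status",
    [({"model": "llama-3.1-70b", "customer": "acme-corp", "status": "200"}, 42)],
)
print(text, end="")
```

Each sample line has the form `name{label="value",...} value`, preceded by `# HELP` and `# TYPE` comment lines for the metric family.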
Real-time dashboard:
# Open built-in dashboard
premium-queen metrics dashboard
Web UI showing:
- Real-time request rate
- Worker utilization across all hives
- Cost breakdown by customer
- Quota usage
- Error rates
Premium Worker
Premium Worker replaces the open source worker with performance-optimized inference.
Advanced batching
Dynamic batching: Automatically batch multiple requests for better GPU utilization.
# Spawn Premium Worker with batching
premium-queen worker spawn \
  --hive gpu-01 \
  --model llama-3.1-70b \
  --worker-type premium \
  --max-batch-size 32 \
  --batch-timeout-ms 100
Benefits:
- 2-5x higher throughput on the same GPU
- Lower latency for batched requests
- Automatic batch size tuning
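A common way to implement dynamic batching is to wait for the first request and then keep collecting until the batch is full or a timeout expires. The Python sketch below shows that pattern under stated assumptions: it is not Premium Worker's implementation, and the function name and queue-based design are illustrative. The two parameters mirror the meaning of `--max-batch-size` and `--batch-timeout-ms`.

```python
import queue
import time


def collect_batch(requests: queue.Queue, max_batch_size: int = 32,
                  batch_timeout_s: float = 0.1) -> list:
    """Block for the first request, then gather more until full or timed out."""
    batch = [requests.get()]  # wait for at least one request
    deadline = time.monotonic() + batch_timeout_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # timeout expired with no further requests
    return batch


pending = queue.Queue()
for i in range(5):
    pending.put(f"req-{i}")

first = collect_batch(pending, max_batch_size=4, batch_timeout_s=0.05)
second = collect_batch(pending, max_batch_size=4, batch_timeout_s=0.05)
print(first, second)
```

The first call returns as soon as four requests are available; the second returns the single remaining request once the timeout passes. The timeout therefore bounds how much extra latency batching can add to any one request.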
Memory optimization
Model quantization: Load models in 4-bit or 8-bit precision for lower VRAM usage.
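As a rough guide to the savings, weight memory scales with bits per weight. The sketch below estimates memory for the weights alone, assuming 70 billion parameters for a 70B model; it ignores the KV cache, activations, and any per-format quantization overhead, so real usage will be higher.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights only, in GB (10^9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{weight_memory_gb(70, bits):.0f} GB")
# 16-bit: ~140 GB
# 8-bit: ~70 GB
# 4-bit: ~35 GB
```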
# Spawn worker with 4-bit quantization
premium-queen worker spawn \
  --hive gpu-01 \
  --model llama-3.1-70b \
  --worker-type premium \
  --quantization 4bit
KV cache optimization: Efficient key-value cache management for longer contexts.
# Enable KV cache optimization
premium-queen worker spawn \
  --hive gpu-01 \
  --model llama-3.1-70b \
  --worker-type premium \
  --kv-cache-strategy optimized \
  --max-context-length 32768
Per-request metrics
Track detailed metrics for every inference request:
# Enable detailed metrics
premium-queen worker spawn \
  --hive gpu-01 \
  --model llama-3.1-70b \
  --worker-type premium \
  --enable-detailed-metrics
Metrics include:
- Time to first token (TTFT)
- Tokens per second (TPS)
- GPU memory usage
- Batch size used
- Queue wait time
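The timing metrics above can be derived from a handful of timestamps. The sketch below shows one plausible set of definitions; the field names are hypothetical, and measuring TTFT from the start of processing (rather than from enqueue time) and TPS over the whole processing time are assumptions, since the source does not define them.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    # Timestamps in seconds; all field names are hypothetical.
    enqueued_at: float
    started_at: float
    first_token_at: float
    finished_at: float
    tokens_generated: int

    @property
    def queue_wait(self) -> float:
        """Time spent waiting before processing began."""
        return self.started_at - self.enqueued_at

    @property
    def ttft(self) -> float:
        """Time to first token, measured from the start of processing."""
        return self.first_token_at - self.started_at

    @property
    def tps(self) -> float:
        """Tokens per second over the whole processing time."""
        return self.tokens_generated / (self.finished_at - self.started_at)


timing = RequestTiming(enqueued_at=0.0, started_at=0.5, first_token_at=0.75,
                       finished_at=4.5, tokens_generated=200)
print(timing.queue_wait, timing.ttft, timing.tps)
```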
Priority queue
Process high-priority requests first:
# Send high-priority request
curl -X POST http://localhost:7833/v1/chat/completions \
  -H "X-Priority: high" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.1-70b", "messages": [...]}'
Premium Worker will:
- Queue low-priority requests
- Process high-priority requests immediately
- Preempt low-priority requests if needed
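The ordering part of this behaviour can be sketched with a binary heap. This is an illustration only: the source shows just the `high` value of `X-Priority`, so the `normal` and `low` levels are assumed, and preemption of requests that are already running is not modelled.

```python
import heapq
import itertools

# Lower number is served first. Only "high" appears in the source;
# the other levels are assumptions.
PRIORITY = {"high": 0, "normal": 1, "low": 2}


class PriorityRequestQueue:
    def __init__(self):
        self._heap = []
        # Monotonic counter as a tie-breaker keeps FIFO order within a priority.
        self._counter = itertools.count()

    def push(self, request_id: str, priority: str = "normal") -> None:
        heapq.heappush(self._heap, (PRIORITY[priority], next(self._counter), request_id))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


q = PriorityRequestQueue()
q.push("req-a", "low")
q.push("req-b", "normal")
q.push("req-c", "high")
q.push("req-d", "high")
order = [q.pop() for _ in range(len(q))]
print(order)
```

Both high-priority requests are served before the earlier normal and low ones, and they keep their arrival order relative to each other.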
GDPR Auditing Module
The GDPR Auditing Module adds comprehensive compliance features for EU-based deployments.
Complete audit trail
Every inference request is logged with:
- User identity (if authenticated)
- Timestamp (with timezone)
- Model used
- Input data (optional, configurable)
- Output data (optional, configurable)
- Processing location (which hive/worker)
- Duration and resource usage
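To make the shape of such a log entry concrete, the sketch below models one record and checks it against the 7-year retention window. The field names, the JSON serialization, and the `worker-1` identifier are assumptions for illustration; the module's real log schema is not documented here.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class AuditRecord:
    # Hypothetical field names mirroring the list above.
    user: Optional[str]                 # None if the request was unauthenticated
    timestamp: str                      # ISO 8601 with timezone offset
    model: str
    hive: str
    worker: str
    duration_ms: int
    input_text: Optional[str] = None    # stored only if request logging is enabled
    output_text: Optional[str] = None   # stored only if response logging is enabled


def is_expired(record: AuditRecord, now: datetime, retention_days: int = 2555) -> bool:
    """True once the record is older than the retention window."""
    recorded = datetime.fromisoformat(record.timestamp)
    return now - recorded > timedelta(days=retention_days)


record = AuditRecord(user="alice.smith", timestamp="2024-01-15T10:30:00+00:00",
                     model="llama-3.1-70b", hive="gpu-01", worker="worker-1",
                     duration_ms=840)
line = json.dumps(asdict(record))  # one JSON object per log line
print(line)
```

Keeping input and output optional reflects that logging them is configurable; a record without them still identifies who ran what, where, and when.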
# Enable audit logging
premium-queen audit enable \
  --log-level detailed \
  --log-requests true \
  --log-responses true \
  --retention-days 2555  # 7 years (7 x 365 days)
Data lineage tracking
Track where data has been processed:
# View data lineage for a request
premium-queen audit lineage --request-id req-abc-123
Output shows:
- Which hive processed the request
- Which worker ran the inference
- Which GPU was used
- Data transformations applied
- Timestamps for each step
Right-to-erasure support
Comply with GDPR Article 17 (right to be forgotten):
# Delete all data for a user
premium-queen audit erase-user-data \
  --user alice.smith \
  --confirm
# Delete specific requests
premium-queen audit erase-request \
  --request-id req-abc-123 \
  --confirm
Automated compliance reports
Generate compliance reports automatically:
# Schedule monthly reports
premium-queen audit schedule-report \
  --frequency monthly \
  --format pdf \
  --email compliance@company.com \
  --include-summary true \
  --include-details false
Reports include:
- Total requests processed
- Data retention status
- Erasure requests fulfilled
- Processing locations
- User access logs
PII detection
Automatically detect and flag personally identifiable information:
# Enable PII detection
premium-queen audit enable-pii-detection \
  --anonymize-logs true \
  --alert-on-pii true
Detects:
- Email addresses
- Phone numbers
- Social security numbers
- Credit card numbers
- IP addresses
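A simple pattern-based detector gives a feel for how flagging and log anonymization might work. The sketch below is deliberately minimal and is not the module's detection logic: the regular expressions are assumptions, cover only three of the listed categories (email, US-style social security numbers, IPv4 addresses), and will miss many real-world formats. A production detector would also need validation steps such as checksum checks for card numbers.

```python
import re

# Illustrative patterns only; each is a simplification.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def find_pii(text: str) -> dict[str, list[str]]:
    """Return the matches found for each PII category that occurs in the text."""
    found = {}
    for kind, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[kind] = matches
    return found


def anonymize(text: str) -> str:
    """Replace every match with a placeholder naming its category."""
    for kind, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{kind.upper()}]", text)
    return text


sample = "Contact alice@example.com from 192.168.1.10, SSN 123-45-6789."
hits = find_pii(sample)
redacted = anonymize(sample)
print(hits)
print(redacted)
```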
Pricing
Note: Premium modules are planned for M2 launch (target Q2 2026). Pricing shown is the planned structure.
Planned Pricing (M2 Launch)
| Product | Price | Notes |
|---|---|---|
| Premium Queen | €129 | Standalone - Advanced RHAI scheduling |
| GDPR Auditing | €249 | Standalone - Full compliance module |
| Queen + Worker Bundle | €279 | ⭐ MOST POPULAR - Full smart scheduling |
| Queen + Audit Bundle | €349 | Scheduling + compliance |
| Complete Bundle | €499 | ⭐⭐ BEST VALUE - Everything included |
Important: Premium Worker is NOT sold standalone. It is only available in bundles because it requires Premium Queen for telemetry processing.
Discounts
- Academic/research: 50% off
- Open source contributors: 30% off
- Startups (< 2 years old): Free for the first year
Purchase (available at M2 launch)
- Email: sales@rbee.dev
- Include: Use case, number of GPUs, expected scale
- Payment: Bank transfer, crypto, credit card
Next steps
- Licensing - Understand license terms
- GPU provider setup - Deploy with premium modules
- GDPR compliance - Deep dive into auditing features
- Contact sales - Questions or trial request