
GPU Providers & Platforms

This guide is for businesses and individuals who want to turn their GPU fleet into an API product. Learn how to use rbee to expose heterogeneous hardware through one OpenAI-compatible endpoint with production-grade routing and telemetry.

Who this is for

  • GPU rental platforms - Turn spare capacity into revenue
  • AI API providers - Build a competitive inference service
  • ML infrastructure teams - Offer internal AI services to other teams
  • Ex-crypto miners - Monetize idle GPUs from the crypto era

What you’ll build

A production-ready API platform that:

  • Exposes multiple GPU types through one stable endpoint
  • Routes requests based on model, load, and quotas
  • Tracks usage and costs per customer/project
  • Provides detailed telemetry for optimization
  • Handles failures gracefully with automatic retries

Prerequisites

  • Multiple machines with GPUs (or access to cloud GPU instances)
  • rbee installed on all machines (see Installation)
  • Basic understanding of API products and pricing

Note on Premium Features: This guide describes using Premium Queen and Premium Worker for production deployments. These modules are planned for M2 launch (target Q2 2026). The current M0 release supports basic multi-machine orchestration with manual routing. Premium features (advanced routing, quotas, telemetry, billing) will be available in M2.

Architecture for GPU providers

                ┌─────────────────┐
                │ Your Customers  │
                └────────┬────────┘
                         │
                ┌────────▼────────┐
                │  Load Balancer  │   (Optional: Cloudflare, nginx)
                └────────┬────────┘
                         │
                ┌────────▼────────┐
                │  Premium Queen  │   (Routing, quotas, telemetry)
                └────────┬────────┘
                         │
     ┌───────────────────┼───────────────────┐
     │                   │                   │
┌────▼─────┐        ┌────▼─────┐        ┌────▼─────┐
│  Hive 1  │        │  Hive 2  │        │  Hive 3  │
│ RTX 4090 │        │ A100 80GB│        │ H100 SXM │
│          │        │          │        │          │
│ ┌──────┐ │        │ ┌──────┐ │        │ ┌──────┐ │
│ │Worker│ │        │ │Worker│ │        │ │Worker│ │
│ └──────┘ │        │ └──────┘ │        │ └──────┘ │
└──────────┘        └──────────┘        └──────────┘

Step 1: Plan your GPU fleet

Inventory your hardware and decide on pricing tiers:

Example fleet configuration

Tier     | GPU Type  | VRAM | Models Supported  | Price/1M tokens
---------|-----------|------|-------------------|----------------
Budget   | RTX 4090  | 24GB | Up to 8B params   | $0.50
Standard | A100 40GB | 40GB | Up to 70B params  | $1.00
Premium  | A100 80GB | 80GB | Up to 405B params | $2.00
Ultra    | H100 SXM  | 80GB | Any model         | $4.00
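When planning tiers, a rough VRAM estimate per model helps decide which hardware can serve which models. The heuristic below (weights ≈ parameters × bytes per parameter, plus roughly 20% overhead for KV cache and activations) is an illustrative rule of thumb, not an rbee API; the tier VRAM figures come from the example table above.

```python
def min_vram_gb(params_billions: float, bytes_per_param: float = 2.0,
                overhead: float = 0.2) -> float:
    """Estimate the minimum VRAM (GB) needed to serve a model."""
    weights_gb = params_billions * bytes_per_param  # 1B params ≈ 1 GB per byte/param
    return weights_gb * (1 + overhead)             # ~20% headroom for KV cache

def pick_tier(params_billions: float, quantized_4bit: bool = False) -> str:
    """Map an estimated VRAM need onto the example tiers above."""
    bpp = 0.5 if quantized_4bit else 2.0  # 4-bit ≈ 0.5 bytes/param, fp16 ≈ 2
    need = min_vram_gb(params_billions, bpp)
    for tier, vram in [("budget", 24), ("standard", 40), ("premium", 80)]:
        if need <= vram:
            return tier
    return "ultra"  # the largest models land on the top tier

print(pick_tier(8))         # 8B fp16 ≈ 19 GB → budget
print(pick_tier(70, True))  # 70B 4-bit ≈ 42 GB → premium
```

Numbers like these are only a starting point; actual requirements depend on context length, batch size, and quantization format.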

Step 2: Set up the queen with Premium features

M2 Planned: Premium Queen will add production-grade routing and quotas.

Current M0 setup:

# Start queen (M0 - basic routing only)
rbee queen start

Planned M2 capabilities (CLI syntax subject to change):

  • Advanced routing strategies (weighted-least-loaded, latency-aware)
  • Per-customer quotas and rate limiting
  • Detailed telemetry and metrics export
  • Automatic failover and retry logic

Note: Premium Queen is a paid module. See Premium modules for licensing.

Step 3: Configure hive fleet

Create a comprehensive hive catalog at ~/.rbee/hives.conf:

# Budget tier - RTX 4090 fleet
[[hive]]
alias = "rtx-4090-01"
host = "10.0.1.10"
ssh_user = "rbee"
tier = "budget"
cost_per_hour = 0.50

[[hive]]
alias = "rtx-4090-02"
host = "10.0.1.11"
ssh_user = "rbee"
tier = "budget"
cost_per_hour = 0.50

# Standard tier - A100 40GB
[[hive]]
alias = "a100-40gb-01"
host = "10.0.2.10"
ssh_user = "rbee"
tier = "standard"
cost_per_hour = 1.50

# Premium tier - A100 80GB
[[hive]]
alias = "a100-80gb-01"
host = "10.0.3.10"
ssh_user = "rbee"
tier = "premium"
cost_per_hour = 3.00

# Ultra tier - H100
[[hive]]
alias = "h100-01"
host = "10.0.4.10"
ssh_user = "rbee"
tier = "ultra"
cost_per_hour = 6.00

Step 4: Install and start all hives

# Install rbee on all hives
for hive in rtx-4090-01 rtx-4090-02 a100-40gb-01 a100-80gb-01 h100-01; do
  premium-queen hive install $hive
done

# Start all hives
for hive in rtx-4090-01 rtx-4090-02 a100-40gb-01 a100-80gb-01 h100-01; do
  premium-queen hive start $hive
done

# Verify all hives are connected
premium-queen hive list

Step 5: Deploy models across tiers

Deploy appropriate models to each tier:

# Budget tier - small/medium models
premium-queen model download llama-3.2-3b --hive rtx-4090-01
premium-queen model download llama-3.1-8b --hive rtx-4090-02

# Standard tier - large models
premium-queen model download llama-3.1-70b --hive a100-40gb-01

# Premium tier - very large models
premium-queen model download llama-3.1-405b --hive a100-80gb-01

# Ultra tier - any model with maximum performance
premium-queen model download llama-3.1-405b --hive h100-01
premium-queen model download stable-diffusion-xl --hive h100-01

Step 6: Spawn workers with Premium Worker

Premium Worker adds advanced telemetry and resource management:

# Spawn Premium Workers on each hive
premium-queen worker spawn \
  --hive rtx-4090-01 \
  --model llama-3.1-8b \
  --device cuda:0 \
  --worker-type premium \
  --max-batch-size 32 \
  --enable-metrics

# Repeat for other hives...

Note: Premium Worker is a paid module. See Premium modules.

Step 7: Configure routing policies

Set up intelligent routing based on your business logic:

# Route by model size
premium-queen routing add-rule \
  --model-pattern "llama-3.2-*" \
  --tier budget

premium-queen routing add-rule \
  --model-pattern "llama-3.1-70b" \
  --tier standard,premium

premium-queen routing add-rule \
  --model-pattern "llama-3.1-405b" \
  --tier premium,ultra

# Route by customer tier
premium-queen routing add-rule \
  --customer-tier free \
  --hive-tier budget \
  --max-tokens-per-minute 1000

premium-queen routing add-rule \
  --customer-tier paid \
  --hive-tier standard,premium \
  --max-tokens-per-minute 10000
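The model-pattern rules above are glob-style: the first matching rule decides which tiers may serve a model. A minimal sketch of that matching logic (the rule shapes mirror the CLI flags, but the code is illustrative, not rbee's implementation):

```python
from fnmatch import fnmatch

# Rules are evaluated in order; first match wins
RULES = [
    {"model_pattern": "llama-3.2-*",    "tiers": ["budget"]},
    {"model_pattern": "llama-3.1-70b",  "tiers": ["standard", "premium"]},
    {"model_pattern": "llama-3.1-405b", "tiers": ["premium", "ultra"]},
]

def eligible_tiers(model: str) -> list[str]:
    """Return the hive tiers allowed to serve `model`."""
    for rule in RULES:
        if fnmatch(model, rule["model_pattern"]):
            return rule["tiers"]
    return ["budget"]  # fallback: route unknown models to the cheapest tier

print(eligible_tiers("llama-3.2-3b"))    # ['budget']
print(eligible_tiers("llama-3.1-405b"))  # ['premium', 'ultra']
```

Keeping rules ordered from most specific to most general avoids a broad pattern shadowing a narrow one.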

Step 8: Set up quotas and rate limiting

Protect your infrastructure with quotas:

# Per-customer quotas
premium-queen quota set \
  --customer acme-corp \
  --max-requests-per-minute 100 \
  --max-tokens-per-day 1000000

# Per-tier quotas
premium-queen quota set \
  --tier free \
  --max-concurrent-requests 5 \
  --max-tokens-per-minute 1000

premium-queen quota set \
  --tier paid \
  --max-concurrent-requests 50 \
  --max-tokens-per-minute 50000
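Limits like --max-requests-per-minute are typically enforced with a token bucket: each customer gets a bucket that refills at the configured rate and allows short bursts. A sketch of the mechanism (illustrative only, not rbee's implementation):

```python
import time

class TokenBucket:
    def __init__(self, rate_per_minute: float, burst: int):
        self.capacity = burst               # maximum burst size
        self.tokens = float(burst)          # start full
        self.refill_per_sec = rate_per_minute / 60.0
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Consume one token if available; False means the request is throttled."""
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate_per_minute=100, burst=5)
print([bucket.allow() for _ in range(7)])  # the first 5 pass, then requests are throttled
```

In production you would keep one bucket per (customer, limit) pair, ideally in shared storage so limits hold across queen replicas.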

Step 9: Expose the API

Your queen is now an OpenAI-compatible API endpoint:

# Test the endpoint
curl -X POST https://api.yourdomain.com/v1/chat/completions \
  -H "Authorization: Bearer your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-70b",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
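Because the endpoint is OpenAI-compatible, customers can use any OpenAI client by pointing it at your base URL. The same request built in Python with only the standard library (the endpoint URL and API key are placeholders):

```python
import json
import urllib.request

def chat_request(base_url: str, api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = chat_request("https://api.yourdomain.com", "your-api-key", "llama-3.1-70b", "Hello!")
print(req.full_url)  # https://api.yourdomain.com/v1/chat/completions
# urllib.request.urlopen(req) would send it; the response shape follows the OpenAI API.
```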

Step 10: Monitor and optimize

Use Premium Queen’s telemetry to optimize your fleet:

# View real-time metrics
premium-queen metrics dashboard

# Export metrics for analysis
premium-queen metrics export --format prometheus

# View cost breakdown
premium-queen billing report --period last-30-days

Production deployment checklist

  • SSL/TLS - Use HTTPS with valid certificates (Let’s Encrypt, Cloudflare)
  • Load balancer - Put nginx or Cloudflare in front of the queen
  • Authentication - Implement API key management (Premium Queen includes this)
  • Monitoring - Set up Prometheus + Grafana for metrics
  • Backups - Regular backups of queen configuration and state
  • Alerting - Alerts for worker failures, quota breaches, high latency
  • DDoS protection - Cloudflare or similar
  • Logging - Centralized logging (ELK stack, Loki, etc.)
  • Documentation - API docs for your customers

Pricing strategies

Token-based pricing

Charge per million tokens (input + output):

# Configure token-based billing
premium-queen billing set-rate \
  --tier budget \
  --price-per-million-tokens 0.50

premium-queen billing set-rate \
  --tier premium \
  --price-per-million-tokens 2.00
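The billing arithmetic itself is simple: cost = (input tokens + output tokens) × rate per million. A sketch using the example tier rates from the table above (the function is illustrative):

```python
# Example rates per 1M tokens, matching the tier table above
RATES_PER_MILLION = {"budget": 0.50, "standard": 1.00, "premium": 2.00, "ultra": 4.00}

def request_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in dollars for one request on a given tier."""
    total = input_tokens + output_tokens
    return total / 1_000_000 * RATES_PER_MILLION[tier]

# 2,000 input + 500 output tokens on the premium tier:
print(round(request_cost("premium", 2_000, 500), 4))  # 0.005
```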

Time-based pricing

Charge per GPU-hour:

# Configure time-based billing
premium-queen billing set-rate \
  --tier budget \
  --price-per-gpu-hour 0.50

premium-queen billing set-rate \
  --tier ultra \
  --price-per-gpu-hour 6.00

Hybrid pricing

Combine both for maximum revenue:

  • Base fee per request
  • Token-based pricing for usage
  • Minimum monthly commitment for enterprise customers
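The three components combine straightforwardly: bill usage (base fees plus tokens), then floor the invoice at the monthly minimum for committed accounts. A sketch with illustrative numbers (all rates here are assumptions, not recommended prices):

```python
def monthly_invoice(requests: int, total_tokens: int,
                    base_fee_per_request: float = 0.001,
                    price_per_million_tokens: float = 2.00,
                    monthly_minimum: float = 0.0) -> float:
    """Hybrid billing: per-request base fee + token usage, floored at a minimum."""
    usage = (requests * base_fee_per_request
             + total_tokens / 1_000_000 * price_per_million_tokens)
    # Enterprise accounts with a commitment never bill below the minimum
    return max(usage, monthly_minimum)

# 100k requests and 50M tokens against a $500 monthly commitment:
print(monthly_invoice(100_000, 50_000_000, monthly_minimum=500.0))  # 500.0
```

Here usage comes to roughly $200 ($100 in base fees plus $100 in tokens), so the $500 commitment applies.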

Troubleshooting

High latency

Check worker utilization:

premium-queen worker list --show-metrics

Scale horizontally by adding more hives or workers.

Quota breaches

Review quota settings:

premium-queen quota list

Adjust based on actual usage patterns.

Worker failures

Enable automatic failover:

premium-queen routing set-failover \
  --enable \
  --retry-attempts 3 \
  --retry-delay-ms 1000
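The retry policy above amounts to: attempt the request up to N times with a delay between attempts, and only surface the failure (triggering failover to another hive) once all attempts are exhausted. A minimal sketch of that logic (illustrative, not rbee's implementation):

```python
import time

def with_retries(fn, attempts: int = 3, delay_ms: int = 1000):
    """Call fn(); on exception, retry up to `attempts` times with a fixed delay."""
    last_err = None
    for i in range(attempts):
        try:
            return fn()
        except Exception as err:
            last_err = err
            if i < attempts - 1:
                time.sleep(delay_ms / 1000)
    raise last_err  # all attempts failed; the caller fails over to another hive

# A hypothetical flaky worker call that succeeds on the third attempt
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("worker unavailable")
    return "ok"

print(with_retries(flaky, attempts=3, delay_ms=10))  # succeeds on the third attempt
```

Exponential backoff (doubling the delay each attempt) is a common refinement that reduces pressure on an already-struggling worker.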

Cost optimization

Analyze which models/tiers are most profitable:

premium-queen billing analyze --period last-30-days

Adjust pricing or retire unprofitable tiers.

2025 © rbee. Your private AI cloud, in one command.