Rate Limits

Models Rate Limits

Per-model concurrency and request-per-minute limits for inference on the Skytells API.

Each AI model has its own concurrency and throughput limits based on the compute resources required to run it. Limits vary by model and account type, and are enforced per account across all API keys.


How Model Limits Work

When you create a prediction, it counts against two separate limits:

  • Concurrent predictions — how many predictions for this model can be actively processing at the same time.
  • Requests per minute (RPM) — how many POST /v1/predictions calls for this model you can make within a rolling 60-second window.
[Diagram] A POST /v1/predictions call first passes the concurrency check, then the RPM check: if either limit is exceeded, the API returns 429 (concurrency or rate limit, respectively); if both are under their limits, the prediction is created.
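The decision flow above suggests a simple retry loop on the client side. Here is a minimal sketch; `send` stands in for whatever function actually POSTs to /v1/predictions, and the backoff parameters are illustrative, not part of any official SDK:

```python
import time

def create_with_retry(send, payload, max_attempts=5, base_delay=1.0):
    """Call `send` (which POSTs to /v1/predictions and returns
    (status, body)) and back off exponentially on 429 responses,
    whether they come from the concurrency limit or the RPM limit."""
    for attempt in range(max_attempts):
        status, body = send(payload)
        if status != 429:
            return status, body
        # 429: wait base_delay, 2x, 4x, ... capped at 30 seconds
        time.sleep(min(base_delay * (2 ** attempt), 30))
    return status, body
```

Both 429 variants are transient, so a single backoff loop covers them; a production client would also inspect the rate-limit headers described below.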

Limits by Account Tier

Your rate limits are determined by your account tier, which is based on cumulative monthly spend. All limits apply per account across all API keys.

Tier         Monthly Spend    Requests / min
Tier 1       $0 – $100        25
Tier 2       $100 – $500      50
Tier 3       $500 – $2,000    150
Tier 4       $2,000+          Higher limits
Enterprise   Per contract     Custom

Checking Current Usage

Every prediction endpoint returns response headers showing your current usage against each limit:

X-RateLimit-Limit-RPM: 60
X-RateLimit-Remaining-RPM: 47
X-RateLimit-Limit-Concurrent: 5
X-RateLimit-Remaining-Concurrent: 3
X-RateLimit-Reset: 1741910220

Header                             Meaning
X-RateLimit-Limit-RPM              Your RPM ceiling for this model
X-RateLimit-Remaining-RPM          Requests remaining in the current window
X-RateLimit-Limit-Concurrent       Max simultaneous active predictions
X-RateLimit-Remaining-Concurrent   Available prediction slots right now
X-RateLimit-Reset                  Unix timestamp when the RPM window resets
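These headers let a client decide whether it can submit another request now or should wait. A minimal sketch, assuming the header names shown above; the `seconds_until_budget` helper is hypothetical, and it conservatively treats the RPM reset time as the wait for either limit:

```python
import time

def seconds_until_budget(headers, now=None):
    """Return how long to wait before the next POST, based on the
    rate-limit headers. 0.0 means a request can go out immediately."""
    now = time.time() if now is None else now
    remaining_rpm = int(headers.get("X-RateLimit-Remaining-RPM", "1"))
    remaining_conc = int(headers.get("X-RateLimit-Remaining-Concurrent", "1"))
    if remaining_rpm > 0 and remaining_conc > 0:
        return 0.0
    # Out of budget: wait until the RPM window resets.
    reset = int(headers.get("X-RateLimit-Reset", now))
    return max(0.0, reset - now)
```

Note that concurrency slots free up when predictions finish, not at the RPM reset, so a real client would re-check the headers after each completed prediction.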

Reducing Prediction Volume

If you frequently hit model limits, consider these strategies:

Strategy                             Impact
Use webhooks instead of polling      Eliminates GET /v1/predictions/:id requests entirely
Queue predictions client-side        Submit new predictions only after the previous one completes
Cache outputs for identical inputs   Identical prompts often produce equivalent results
Use lower-tier models for drafts     Run fast/cheap models first, upsample only approved outputs
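Two of these strategies, client-side queueing and caching identical inputs, can be combined in a small wrapper. This is a sketch under assumptions: `run` is any callable that performs the actual (blocking) prediction, and the cache key is a hash of the JSON-serialized inputs:

```python
import hashlib
import json

class SequentialClient:
    """Runs predictions one at a time and reuses cached outputs
    for identical inputs, so repeated prompts cost zero requests."""

    def __init__(self, run):
        self._run = run      # callable: inputs dict -> result
        self._cache = {}

    def predict(self, inputs):
        # Stable key: identical inputs always hash the same way.
        key = hashlib.sha256(
            json.dumps(inputs, sort_keys=True).encode()
        ).hexdigest()
        if key not in self._cache:
            self._cache[key] = self._run(inputs)
        return self._cache[key]
```

Because `predict` blocks until `run` returns, at most one prediction is in flight per client instance, which keeps usage under the concurrency limit by construction.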
