Models Rate Limits
How per-model rate limits work on the Skytells API — concurrency, throughput, and headers.
Inference traffic is limited per model as well as by your account tier. The same account can run different models in parallel; each model’s limits are tracked in separate buckets so a burst on one model does not consume another model’s concurrency or request budget.
How the rate limiter applies per model
When you call POST /v1/predictions (or equivalent inference routes) with a given model id:
- Concurrency — The gateway counts how many predictions for that model are actively processing (or otherwise in-flight, per product rules). If you are at the concurrency ceiling for that model, new requests for that model are rejected with 429 until a slot frees up. Other models are unaffected.
- Throughput (RPM) — Within a rolling window (see x-skytells-ratelimit-window on Rate limits), the number of requests for that model is capped. Hitting this limit returns 429 even if concurrency slots are available.
- Tokens — When your tier includes input/output token budgets, those dimensions are enforced with the same window semantics as requests and appear in x-skytells-ratelimit-limit-tokens-in, x-skytells-ratelimit-remaining-tokens-in, tokens-out, etc., as documented in the overview. Heavy prompts or long completions consume token quota in addition to request quota where applicable.
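The bucketing described above can be sketched as a toy in-process limiter. This is illustrative only — the real ceilings and window semantics come from your tier and from the response headers, and the numbers here are made up:

```python
import time
from collections import defaultdict, deque

class ModelLimiter:
    """Toy per-model limiter: each model id gets its own concurrency
    count and its own rolling request window, so one model's burst
    never consumes another model's budget."""

    def __init__(self, max_concurrent=5, rpm=60, window_seconds=60):
        self.max_concurrent = max_concurrent
        self.rpm = rpm
        self.window = window_seconds
        self.in_flight = defaultdict(int)   # model id -> active predictions
        self.requests = defaultdict(deque)  # model id -> request timestamps

    def try_acquire(self, model, now=None):
        now = time.monotonic() if now is None else now
        q = self.requests[model]
        # Drop timestamps that fell outside the rolling window.
        while q and now - q[0] >= self.window:
            q.popleft()
        if self.in_flight[model] >= self.max_concurrent:
            return False  # would be a 429: concurrency ceiling for this model
        if len(q) >= self.rpm:
            return False  # would be a 429: RPM ceiling for this model
        q.append(now)
        self.in_flight[model] += 1
        return True

    def release(self, model):
        self.in_flight[model] -= 1

limiter = ModelLimiter(max_concurrent=2, rpm=60)
assert limiter.try_acquire("video-xl", now=0.0)
assert limiter.try_acquire("video-xl", now=0.1)
assert not limiter.try_acquire("video-xl", now=0.2)  # concurrency full
assert limiter.try_acquire("text-small", now=0.2)    # separate bucket, unaffected
```

The last two assertions show the key point: exhausting one model's slots leaves other models' buckets untouched.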
In short, limits are scoped by model (and by account), not by API key alone: all keys on an account share the same account-level buckets, and within that, each model has its own concurrency and throughput accounting.
Limits vary by model — compute-intensive models (for example video or very large context) may have lower effective ceilings than lighter models within the same tier. Use GET /v1/models and response headers rather than assuming a single number applies to every model.
Checking usage
Read x-skytells-ratelimit-* on 2xx responses — see Response headers (limits and usage). On 429, use Retry-After when present and error.details (reset, window_seconds, metric, etc.) as described under Retry-After (HTTP 429 only).
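As a sketch, a client might pick a retry delay like this. Header and field names follow the docs above; treating reset in error.details as seconds-until-reset is an assumption, and the 1-second fallback default is arbitrary:

```python
def backoff_delay(status, headers, details=None, default=1.0):
    """Return how long to wait before retrying a request.

    On 429, prefer Retry-After when present, then the (assumed
    seconds-until-reset) `reset` field from error.details, then a
    fallback default. Non-429 responses need no wait."""
    if status != 429:
        return 0.0
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        try:
            return float(retry_after)  # delta-seconds form
        except ValueError:
            pass  # HTTP-date form not handled in this sketch
    if details and "reset" in details:
        return max(0.0, float(details["reset"]))
    return default

assert backoff_delay(200, {}) == 0.0
assert backoff_delay(429, {"Retry-After": "7"}) == 7.0
assert backoff_delay(429, {}, {"reset": 3}) == 3.0
assert backoff_delay(429, {}) == 1.0
```

A production client would also handle the HTTP-date form of Retry-After and add jitter before retrying.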
Reducing prediction volume
If you frequently hit model limits, consider these strategies:
| Strategy | Impact |
|---|---|
| Use webhooks instead of polling | Eliminates GET /v1/predictions/:id requests entirely |
| Queue predictions client-side | Submit new predictions only after the previous one completes |
| Cache outputs for identical inputs | Identical prompts often produce equivalent results |
| Use lower-tier models for drafts | Run fast/cheap models first, upsample only approved outputs |
Creating many predictions simultaneously and polling each one aggressively is the fastest way to exhaust both RPM and concurrency limits. Use webhooks for production workloads.
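A minimal sketch of the caching strategy from the table, assuming a hypothetical backend callable that wraps POST /v1/predictions and blocks until the result is ready (for example via webhook delivery):

```python
import hashlib
import json

class PredictionClient:
    """Cache outputs for identical inputs so repeated prompts do not
    spend request quota. `backend` is a stand-in for your actual
    prediction call and is an assumption of this sketch."""

    def __init__(self, backend):
        self.backend = backend
        self.cache = {}

    def _key(self, model, payload):
        # Key on model id plus a stable hash of the input payload.
        blob = json.dumps(payload, sort_keys=True).encode()
        return (model, hashlib.sha256(blob).hexdigest())

    def predict(self, model, payload):
        key = self._key(model, payload)
        if key not in self.cache:  # identical input -> reuse prior output
            self.cache[key] = self.backend(model, payload)
        return self.cache[key]

calls = []
def fake_backend(model, payload):
    calls.append(model)
    return {"output": payload["prompt"].upper()}

client = PredictionClient(fake_backend)
a = client.predict("text-small", {"prompt": "hi"})
b = client.predict("text-small", {"prompt": "hi"})  # served from cache
assert a == b and len(calls) == 1
```

Whether caching is appropriate depends on the model: it works for deterministic or near-deterministic outputs, less so when you want fresh samples per call.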
Related documentation
- Account tiers — spend bands and typical Standard vs Edge ceilings
- Rate limits overview — full header list and client behavior
- Edge rate limits — streaming and Edge-specific behavior