Models Rate Limits
How per-model rate limits work on the Skytells API — concurrency, throughput, and headers.
Inference traffic is limited per model as well as by your account tier. The same account can run different models in parallel; each model’s limits are tracked in separate buckets so a burst on one model does not consume another model’s concurrency or request budget.
How the rate limiter applies per model
When you call POST /v1/predictions (or equivalent inference routes) with a given model id:
- Concurrency — The gateway counts how many predictions for that model are actively processing (or otherwise in-flight, per product rules). If you are at the concurrency ceiling for that model, new requests for that model are rejected with 429 until a slot frees up. Other models are unaffected.
- Throughput (RPM) — Within a rolling window (see x-skytells-ratelimit-window on Rate limits), the number of requests for that model is capped. Hitting this limit returns 429 even if concurrency slots are available.
- Tokens — When your tier includes input/output token budgets, those dimensions are enforced with the same window semantics as requests and appear in x-skytells-ratelimit-limit-tokens-in, x-skytells-ratelimit-remaining-tokens-in, tokens-out, etc., as documented in the overview. Heavy prompts or long completions consume token quota in addition to request quota where applicable.
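The bucketing described above can be sketched as a toy in-process limiter. This is illustrative only — the real ceilings and window semantics come from your tier and from the response headers, and the numbers here are made up:

```python
import time
from collections import defaultdict, deque

class ModelLimiter:
    """Toy per-model limiter: each model id gets its own concurrency
    count and its own rolling request window, so one model's burst
    never consumes another model's budget."""

    def __init__(self, max_concurrent=5, rpm=60, window_seconds=60):
        self.max_concurrent = max_concurrent
        self.rpm = rpm
        self.window = window_seconds
        self.in_flight = defaultdict(int)   # model id -> active predictions
        self.requests = defaultdict(deque)  # model id -> request timestamps

    def try_acquire(self, model, now=None):
        now = time.monotonic() if now is None else now
        q = self.requests[model]
        # Drop timestamps that fell outside the rolling window.
        while q and now - q[0] >= self.window:
            q.popleft()
        if self.in_flight[model] >= self.max_concurrent:
            return False  # would be a 429: concurrency ceiling for this model
        if len(q) >= self.rpm:
            return False  # would be a 429: RPM ceiling for this model
        q.append(now)
        self.in_flight[model] += 1
        return True

    def release(self, model):
        self.in_flight[model] -= 1

limiter = ModelLimiter(max_concurrent=2, rpm=60)
assert limiter.try_acquire("video-xl", now=0.0)
assert limiter.try_acquire("video-xl", now=0.1)
assert not limiter.try_acquire("video-xl", now=0.2)  # concurrency full
assert limiter.try_acquire("text-small", now=0.2)    # separate bucket, unaffected
```

The last two assertions show the key point: exhausting one model's slots leaves other models' buckets untouched.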
In short, limits are scoped by model (and by account), not by API key alone: all keys on an account share the same account-level buckets, and within that, each model has its own concurrency and throughput accounting.
Limits vary by model — compute-intensive models (for example video or very large context) may have lower effective ceilings than lighter models within the same tier. Use GET /v1/models and response headers rather than assuming a single number applies to every model.
Checking usage
Read x-skytells-ratelimit-* on 2xx responses — see Response headers (limits and usage). On 429, use Retry-After when present and error.details (reset, window_seconds, metric, etc.) as described under Retry-After (HTTP 429 only).
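As a sketch, a client might pick a retry delay like this. Header and field names follow the docs above; treating reset in error.details as seconds-until-reset is an assumption, and the 1-second fallback default is arbitrary:

```python
def backoff_delay(status, headers, details=None, default=1.0):
    """Return how long to wait before retrying a request.

    On 429, prefer Retry-After when present, then the (assumed
    seconds-until-reset) `reset` field from error.details, then a
    fallback default. Non-429 responses need no wait."""
    if status != 429:
        return 0.0
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        try:
            return float(retry_after)  # delta-seconds form
        except ValueError:
            pass  # HTTP-date form not handled in this sketch
    if details and "reset" in details:
        return max(0.0, float(details["reset"]))
    return default

assert backoff_delay(200, {}) == 0.0
assert backoff_delay(429, {"Retry-After": "7"}) == 7.0
assert backoff_delay(429, {}, {"reset": 3}) == 3.0
assert backoff_delay(429, {}) == 1.0
```

A production client would also handle the HTTP-date form of Retry-After and add jitter before retrying.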
Reducing prediction volume
If you frequently hit model limits, consider these strategies:
| Strategy | Impact |
|---|---|
| Use webhooks instead of polling | Eliminates GET /v1/predictions/:id requests entirely |
| Queue predictions client-side | Submit new predictions only after the previous one completes |
| Cache outputs for identical inputs | Identical prompts often produce equivalent results |
| Use lower-tier models for drafts | Run fast/cheap models first, upsample only approved outputs |
Creating many predictions simultaneously and polling each one aggressively is the fastest way to exhaust both RPM and concurrency limits. Use webhooks for production workloads.
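A minimal sketch of the caching strategy from the table, assuming a hypothetical backend callable that wraps POST /v1/predictions and blocks until the result is ready (for example via webhook delivery):

```python
import hashlib
import json

class PredictionClient:
    """Cache outputs for identical inputs so repeated prompts do not
    spend request quota. `backend` is a stand-in for your actual
    prediction call and is an assumption of this sketch."""

    def __init__(self, backend):
        self.backend = backend
        self.cache = {}

    def _key(self, model, payload):
        # Key on model id plus a stable hash of the input payload.
        blob = json.dumps(payload, sort_keys=True).encode()
        return (model, hashlib.sha256(blob).hexdigest())

    def predict(self, model, payload):
        key = self._key(model, payload)
        if key not in self.cache:  # identical input -> reuse prior output
            self.cache[key] = self.backend(model, payload)
        return self.cache[key]

calls = []
def fake_backend(model, payload):
    calls.append(model)
    return {"output": payload["prompt"].upper()}

client = PredictionClient(fake_backend)
a = client.predict("text-small", {"prompt": "hi"})
b = client.predict("text-small", {"prompt": "hi"})  # served from cache
assert a == b and len(calls) == 1
```

Whether caching is appropriate depends on the model: it works for deterministic or near-deterministic outputs, less so when you want fresh samples per call.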
Related documentation
- Account tiers — spend bands and typical Standard vs Edge ceilings
- Rate limits overview — full header list and client behavior
- Edge rate limits — streaming and Edge-specific behavior