Models Rate Limits

How per-model rate limits work on the Skytells API — concurrency, throughput, and headers.

Inference traffic is limited per model as well as by your account tier. The same account can run different models in parallel; each model’s limits are tracked in separate buckets so a burst on one model does not consume another model’s concurrency or request budget.


How the rate limiter applies per model

When you call POST /v1/predictions (or an equivalent inference route) with a given model id, the gateway applies three checks in order (a retry-aware client sketch follows this list):

  1. Concurrency — The gateway counts how many predictions for that model are actively processing (or otherwise in-flight, per product rules). If you are at the concurrent ceiling for that model, new requests for that model are rejected with 429 until a slot frees up. Other models are unaffected.
  2. Throughput (RPM) — Within a rolling window (see x-skytells-ratelimit-window in the Rate limits overview), the number of requests for that model is capped. Hitting this limit returns 429 even if concurrency slots are available.
  3. Tokens — When your tier includes input/output token budgets, those dimensions are enforced in the same window semantics as requests and appear in x-skytells-ratelimit-limit-tokens-in, x-skytells-ratelimit-remaining-tokens-in, tokens-out, etc., as documented in the overview. Heavy prompts or long completions consume token quota in addition to request quota where applicable.
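
A minimal Python sketch of handling these checks from the client side, assuming an illustrative base URL of https://api.skytells.ai, a hypothetical model id, and the requests library; the 429 handling and Retry-After header follow the behavior described in the list above, while the payload shape is a guess:

```python
import os
import time

import requests

API_URL = "https://api.skytells.ai/v1/predictions"  # assumed base URL, for illustration
HEADERS = {"Authorization": f"Bearer {os.environ['SKYTELLS_API_KEY']}"}

def create_prediction(payload: dict, max_attempts: int = 5) -> dict:
    """Submit a prediction, backing off while this model's buckets are exhausted."""
    for attempt in range(max_attempts):
        resp = requests.post(API_URL, json=payload, headers=HEADERS, timeout=30)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        # Prefer Retry-After when the gateway sends it (see "Checking usage"
        # below); otherwise fall back to exponential backoff.
        delay = float(resp.headers.get("Retry-After", 2 ** attempt))
        time.sleep(delay)
    raise RuntimeError("still rate-limited after retries")

# Hypothetical model id. Because buckets are per model, retries here leave
# other models' concurrency and request budgets untouched.
result = create_prediction({"model": "acme/example-model", "input": {"prompt": "hello"}})
```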

In short, limits are scoped by model (and by account), not by API key alone: all keys on the account share the same account-level buckets, and within that, each model has its own concurrency and throughput accounting.

Request flow: POST /v1/predictions → under the per-model concurrency ceiling? No → 429 (concurrency). Yes → under per-model RPM / token limits? No → 429 (rate_limit_exceeded). Yes → prediction created.

Limits vary by model — compute-intensive models (for example video or very large context) may have lower effective ceilings than lighter models within the same tier. Use GET /v1/models and response headers rather than assuming one number for every model.
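
For example, a hedged sketch of discovering per-model ceilings at runtime; GET /v1/models is the endpoint named above, but the response envelope (a data array whose entries carry a limits object) is an assumption made here for illustration:

```python
import os

import requests

resp = requests.get(
    "https://api.skytells.ai/v1/models",  # assumed base URL, as above
    headers={"Authorization": f"Bearer {os.environ['SKYTELLS_API_KEY']}"},
    timeout=30,
)
resp.raise_for_status()
for model in resp.json().get("data", []):
    # The "limits" key and its contents are illustrative guesses; inspect the
    # real GET /v1/models response to see how per-model ceilings are exposed.
    print(model.get("id"), model.get("limits"))
```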


Checking usage

Read x-skytells-ratelimit-* on 2xx responses — see Response headers (limits and usage). On 429, use Retry-After when present and error.details (reset, window_seconds, metric, etc.) as described under Retry-After (HTTP 429 only).
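
Continuing the requests-based sketch above, two small helpers show where that state lives; the error.details parsing assumes the JSON error envelope described in the overview, and the 5-second fallback is arbitrary:

```python
import requests

def log_rate_limit_state(resp: requests.Response) -> None:
    """Print every x-skytells-ratelimit-* header on a response."""
    for name, value in resp.headers.items():
        if name.lower().startswith("x-skytells-ratelimit-"):
            print(f"{name}: {value}")

def retry_delay(resp: requests.Response) -> float:
    """For a 429, prefer Retry-After, then error.details, then a default."""
    if "Retry-After" in resp.headers:
        return float(resp.headers["Retry-After"])
    details = resp.json().get("error", {}).get("details", {})
    # window_seconds and reset are the fields documented under
    # "Retry-After (HTTP 429 only)"; default to 5 seconds if both are absent.
    return float(details.get("window_seconds", 5))
```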


Reducing prediction volume

If you frequently hit model limits, consider these strategies:

| Strategy | Impact |
| --- | --- |
| Use webhooks instead of polling | Eliminates GET /v1/predictions/:id requests entirely |
| Queue predictions client-side | Submit new predictions only after the previous one completes |
| Cache outputs for identical inputs | Identical prompts often produce equivalent results |
| Use lower-tier models for drafts | Run fast/cheap models first, upsample only approved outputs |
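
As one concrete example, the caching row above can be a few lines of client code; this sketch reuses the hypothetical create_prediction helper from earlier and assumes your workload tolerates reusing outputs for byte-identical inputs:

```python
import hashlib
import json

_cache: dict[str, dict] = {}

def cached_prediction(payload: dict) -> dict:
    """Serve byte-identical inputs from memory, spending no request quota."""
    key = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
    if key not in _cache:
        # create_prediction is the retry-aware helper sketched earlier.
        _cache[key] = create_prediction(payload)
    return _cache[key]
```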
