Models Rate Limits
Per-model concurrency and request-per-minute limits for inference on the Skytells API.
Each AI model has its own concurrency and throughput limits based on the compute resources required to run it. Limits vary by model and account type, and are enforced per account across all API keys.
How Model Limits Work
When you create a prediction, it counts against two separate limits:
- Concurrent predictions — how many predictions for this model can be actively processing at the same time.
- Requests per minute (RPM) — how many POST /v1/predictions calls for this model you can make within a rolling 60-second window.
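As a sketch of how the two limits interact, here is a hypothetical client-side guard (the class name and structure are illustrative, not part of the Skytells API) that tracks both an active-prediction count and a rolling 60-second request window:

```python
import time
from collections import deque

class ModelLimitGuard:
    """Client-side tracker for a model's two limits:
    concurrent predictions and requests per minute."""

    def __init__(self, max_concurrent, max_rpm):
        self.max_concurrent = max_concurrent
        self.max_rpm = max_rpm
        self.active = 0               # predictions currently processing
        self.request_times = deque()  # timestamps of recent POSTs

    def can_submit(self, now=None):
        now = time.time() if now is None else now
        # Drop requests that have aged out of the rolling 60-second window.
        while self.request_times and now - self.request_times[0] >= 60:
            self.request_times.popleft()
        return (self.active < self.max_concurrent
                and len(self.request_times) < self.max_rpm)

    def on_submit(self, now=None):
        now = time.time() if now is None else now
        self.request_times.append(now)  # counts against RPM
        self.active += 1                # counts against concurrency

    def on_complete(self):
        self.active -= 1  # frees a concurrency slot only
```

Note the asymmetry: a completed prediction frees its concurrency slot immediately, but its POST still counts against RPM until it ages out of the 60-second window.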
Limits by Account Tier
Your rate limits are determined by your account tier, which is based on cumulative monthly spend. All limits apply per account across all API keys.
| Tier | Monthly Spend | Requests / min |
|---|---|---|
| Tier 1 | $0 – $100 | 25 |
| Tier 2 | $100 – $500 | 50 |
| Tier 3 | $500 – $2,000 | 150 |
| Tier 4 | $2,000+ | Higher limits |
| Enterprise | Per contract | Custom |
Tier upgrades are applied automatically as your monthly spend crosses each threshold. Limits also vary by model — compute-intensive models (e.g. video) may have lower effective concurrency within your tier's RPM allowance.
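The tier table above can be read as a simple lookup from cumulative monthly spend to RPM ceiling. A minimal sketch follows; treating each threshold as the exact upgrade point ($100 moves you to Tier 2, and so on) is an assumption, since the table's ranges share boundary values:

```python
def rpm_for_monthly_spend(spend: float):
    """Map cumulative monthly spend (USD) to the RPM ceiling from the
    tier table. Returns None where limits are not fixed numbers
    (Tier 4 and Enterprise)."""
    if spend < 100:
        return 25    # Tier 1
    if spend < 500:
        return 50    # Tier 2
    if spend < 2000:
        return 150   # Tier 3
    return None      # Tier 4+ / Enterprise: higher or custom limits
```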
Enterprise limits are negotiated per contract and are not subject to the spending tiers above. Contact Skytells Support to discuss Enterprise pricing.
Checking Current Usage
The response headers on every prediction endpoint include your current usage against each limit:
```
X-RateLimit-Limit-RPM: 60
X-RateLimit-Remaining-RPM: 47
X-RateLimit-Limit-Concurrent: 5
X-RateLimit-Remaining-Concurrent: 3
X-RateLimit-Reset: 1741910220
```

| Header | Meaning |
|---|---|
| X-RateLimit-Limit-RPM | Your RPM ceiling for this model |
| X-RateLimit-Remaining-RPM | Requests remaining in the current window |
| X-RateLimit-Limit-Concurrent | Max simultaneous active predictions |
| X-RateLimit-Remaining-Concurrent | Available prediction slots right now |
| X-RateLimit-Reset | Unix timestamp when the RPM window resets |
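These headers are enough to compute how long to pause before the next request. A small sketch (the helper function is hypothetical, not part of any SDK):

```python
def rpm_wait_seconds(headers: dict, now: float) -> float:
    """Given the X-RateLimit-* response headers, return how long to
    wait before the next POST: 0 if requests remain in the current
    window, otherwise the seconds until the RPM window resets."""
    remaining = int(headers["X-RateLimit-Remaining-RPM"])
    if remaining > 0:
        return 0.0
    reset = int(headers["X-RateLimit-Reset"])  # Unix timestamp
    return max(0.0, reset - now)
```

Check X-RateLimit-Remaining-Concurrent separately: an open concurrency slot does not help if the RPM window is exhausted, and vice versa.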
Reducing Prediction Volume
If you frequently hit model limits, consider these strategies:
| Strategy | Impact |
|---|---|
| Use webhooks instead of polling | Eliminates GET /v1/predictions/:id requests entirely |
| Queue predictions client-side | Submit new predictions only after the previous one completes |
| Cache outputs for identical inputs | Identical prompts often produce equivalent results |
| Use lower-tier models for drafts | Run fast/cheap models first, upsample only approved outputs |
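The caching strategy above hinges on a stable key for "identical inputs". A minimal in-memory sketch, assuming a `create_prediction` callable that stands in for whatever client call issues POST /v1/predictions:

```python
import hashlib
import json

def prediction_cache_key(model: str, inputs: dict) -> str:
    """Derive a stable cache key from a model name and its input
    parameters. Keys are sorted before hashing so {"a": 1, "b": 2}
    and {"b": 2, "a": 1} produce the same key."""
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{model}:{canonical}".encode()).hexdigest()

_cache: dict = {}  # in-memory; a real service would use Redis or similar

def cached_predict(model: str, inputs: dict, create_prediction):
    """Return a cached output for identical inputs; otherwise submit
    one new prediction and store its result."""
    key = prediction_cache_key(model, inputs)
    if key not in _cache:
        _cache[key] = create_prediction(model, inputs)
    return _cache[key]
```

Every cache hit is one fewer POST counted against both the RPM and concurrency limits.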
Creating many predictions simultaneously and polling each one aggressively is the fastest way to exhaust both the RPM and concurrency limits. Use webhooks for all production workloads.