
Rate Limits

ai& enforces rate limits per organization to keep the platform healthy under shared load. Limits depend on your tier and apply to inference endpoints; management APIs have their own (looser) limits.

| Tier | Description |
| --- | --- |
| Tier 0 | Evaluation tier. Lower per-minute caps. Suitable for development and small projects. New orgs start here. |
| Tier 1 | Production tier. Higher caps and access to higher-throughput models. Orgs are promoted on their first successful payment. |

Each request is checked against six buckets, four per-model and two organization-wide:

  • Per-model RPM — requests per minute against one specific model.
  • Per-model Input TPM — input tokens per minute (estimated up-front from the request body).
  • Per-model Output TPM — output tokens per minute (charged after the response completes).
  • Per-model Concurrency — max in-flight requests against one model.
  • Global RPM — requests per minute across all models you call.
  • Global Concurrency — max in-flight requests across all models.

Whichever bucket fills first triggers throttling.
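As a client-side mental model only, the six buckets can be sketched as a table of caps that current usage is compared against. The bucket names below reuse the X-RateLimit-Policy values documented on this page. The numeric caps in `EXAMPLE_LIMITS` and the helper `exhausted_buckets` are hypothetical, and the server's actual evaluation order is not specified here.

```python
from typing import List, Mapping

# Bucket names follow the documented X-RateLimit-Policy values.
BUCKETS = (
    "rpm", "input_tpm", "output_tpm", "concurrency",  # per-model
    "global_rpm", "global_concurrency",               # organization-wide
)

# Hypothetical caps for illustration; real values depend on your tier and model.
EXAMPLE_LIMITS = {
    "rpm": 60,
    "input_tpm": 100_000,
    "output_tpm": 20_000,
    "concurrency": 8,
    "global_rpm": 120,
    "global_concurrency": 16,
}


def exhausted_buckets(usage: Mapping[str, int], limits: Mapping[str, int]) -> List[str]:
    """Return every bucket whose usage has reached its cap, in BUCKETS order."""
    return [name for name in BUCKETS if usage.get(name, 0) >= limits[name]]


if __name__ == "__main__":
    # Only 10 requests this minute, but 8 are still in flight against one model.
    usage = {"rpm": 10, "global_rpm": 10, "concurrency": 8, "global_concurrency": 8}
    print(exhausted_buckets(usage, EXAMPLE_LIMITS))
```

In this made-up scenario the request rate is well under its cap, yet the per-model concurrency bucket is full, so that bucket is the one that would throttle.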

Every response carries:

| Header | Meaning |
| --- | --- |
| X-RateLimit-Limit | Your effective RPM cap: whichever of the per-model or global RPM bucket is currently more constrained. |
| X-RateLimit-Remaining | Requests left in that same bucket. |
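A minimal sketch of reading these two headers on the client, assuming the response headers are available as a plain string mapping and that both values are integers. The `RpmStatus` type and `read_rpm_status` helper are illustrative names, not part of any ai& SDK.

```python
from typing import Mapping, NamedTuple, Optional


class RpmStatus(NamedTuple):
    limit: int      # value of X-RateLimit-Limit
    remaining: int  # value of X-RateLimit-Remaining


def read_rpm_status(headers: Mapping[str, str]) -> Optional[RpmStatus]:
    """Parse the RPM headers; return None if either is missing or not an integer."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {key.lower(): value for key, value in headers.items()}
    try:
        return RpmStatus(
            limit=int(lowered["x-ratelimit-limit"]),
            remaining=int(lowered["x-ratelimit-remaining"]),
        )
    except (KeyError, ValueError):
        return None


if __name__ == "__main__":
    status = read_rpm_status({"X-RateLimit-Limit": "60", "X-RateLimit-Remaining": "12"})
    print(status)
```

A client could use `remaining` to slow down before it reaches zero rather than waiting for a 429.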

On a 429 Too Many Requests, two more headers are added:

| Header | Meaning |
| --- | --- |
| X-RateLimit-Policy | Which bucket denied the request: rpm, global_rpm, input_tpm, output_tpm, concurrency, or global_concurrency. |
| Retry-After | Seconds until the offending bucket has capacity again. Omitted for concurrency and global_concurrency rejects; finish or cancel in-flight requests instead. |
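One way a client might interpret these two headers on a 429 is sketched below. It assumes headers arrive as a string mapping and that Retry-After is a number of seconds, as described above. The helper name `retry_after_seconds` is hypothetical.

```python
from typing import Mapping, Optional

# Policies for which Retry-After is omitted, per the table above.
CONCURRENCY_POLICIES = {"concurrency", "global_concurrency"}


def retry_after_seconds(headers: Mapping[str, str]) -> Optional[float]:
    """Return the wait in seconds for a time-based reject.

    Returns None for concurrency rejects, or when Retry-After is missing or
    unparseable; the caller should then wait for in-flight requests to finish.
    """
    lowered = {key.lower(): value for key, value in headers.items()}
    if lowered.get("x-ratelimit-policy", "") in CONCURRENCY_POLICIES:
        return None
    raw = lowered.get("retry-after")
    if raw is None:
        return None
    try:
        return max(0.0, float(raw))
    except ValueError:
        return None


if __name__ == "__main__":
    print(retry_after_seconds({"X-RateLimit-Policy": "input_tpm", "Retry-After": "7"}))
    print(retry_after_seconds({"X-RateLimit-Policy": "global_concurrency"}))
```

Returning None for concurrency rejects keeps the two cases apart: a timer helps with time-based buckets, while concurrency buckets only free up when in-flight requests complete or are cancelled.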

See Response Headers for non-rate-limit headers.

When throttled, ai& returns 429 Too Many Requests. Back off and retry; exponential backoff with jitter is recommended. For time-based rejects, Retry-After gives the minimum safe delay.
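The recommended delay calculation could look like the sketch below, which uses the common "full jitter" variant of exponential backoff and adds the jitter on top of Retry-After so the delay never drops below that minimum. The function name `backoff_delay` and the defaults (`base`, `cap`) are assumptions for illustration, not values published by ai&.

```python
import random
from typing import Callable, Optional


def backoff_delay(
    attempt: int,
    retry_after: Optional[float] = None,
    base: float = 0.5,
    cap: float = 30.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Seconds to sleep before retry number `attempt` (0-based).

    The exponential ceiling is min(cap, base * 2**attempt); a uniformly random
    fraction of it is used as jitter. If the server sent Retry-After, the jitter
    is added on top so clients do not all retry at the same instant.
    """
    ceiling = min(cap, base * (2 ** attempt))
    jitter = rng() * ceiling
    floor = retry_after if retry_after is not None else 0.0
    return floor + jitter


if __name__ == "__main__":
    for attempt in range(4):
        print(attempt, round(backoff_delay(attempt, retry_after=2.0), 2))
```

Passing `rng` as a parameter keeps the function deterministic under test. In a real client, the returned value would be handed to `time.sleep` or an async equivalent before the request is re-sent.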