Rate Limits

ai& enforces rate limits per organization to keep the platform healthy under shared load. Limits depend on your tier and apply to inference endpoints; management APIs have their own (looser) limits.

Tiers

Tier	Description
Tier 0	Evaluation tier. Lower per-minute caps. Suitable for development and small projects. New orgs start here.
Tier 1	Production tier. Higher caps and access to higher-throughput models. Orgs are promoted on their first successful payment.

What’s measured

Each request is checked against six buckets, four per model and two global (per organization):

Per-model RPM — requests per minute against one specific model.
Per-model Input TPM — input tokens per minute (estimated up-front from the request body).
Per-model Output TPM — output tokens per minute (charged after the response completes).
Per-model Concurrency — max in-flight requests against one model.
Global RPM — requests per minute across all models you call.
Global Concurrency — max in-flight requests across all models.

Throttling is triggered by whichever bucket fills first; the other buckets do not need to be full.
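As a rough client-side mental model of the six checks above, the sketch below walks the buckets and reports the first one a request would overflow. It is illustrative only: the function name `denying_bucket`, the check order, and every number in the usage note are assumptions, not ai&'s actual enforcement logic or real tier limits. The bucket names reuse the `X-RateLimit-Policy` values.

```python
from typing import Dict, Optional

# Bucket names reuse the X-RateLimit-Policy values.
# The order in which they are checked here is an assumption.
BUCKETS = (
    "rpm", "input_tpm", "output_tpm", "concurrency",
    "global_rpm", "global_concurrency",
)


def denying_bucket(
    usage: Dict[str, int],
    limits: Dict[str, int],
    cost: Dict[str, int],
) -> Optional[str]:
    """Return the first bucket this request would overflow, or None if it fits.

    usage  -- what each bucket currently holds (missing keys count as 0)
    limits -- the cap for each bucket (hypothetical values, all six required)
    cost   -- what this request adds up front; output tokens are charged
              after the response completes, so "output_tpm" is normally absent
    """
    for name in BUCKETS:
        if usage.get(name, 0) + cost.get(name, 0) > limits[name]:
            return name
    return None
```

For example, with made-up limits of `rpm=60` and `input_tpm=1000`, a request estimated at 200 input tokens is denied by `input_tpm` if that bucket already holds 900, even though only 10 of the 60 requests have been used.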

Headers

Every response carries:

Header	Meaning
`X-RateLimit-Limit`	Your effective RPM cap: the limit of whichever RPM bucket, per-model or global, is currently more constrained.
`X-RateLimit-Remaining`	Requests left in that same bucket.
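One way a client might use these two headers is to compute how much RPM headroom is left and slow down before a 429 occurs. The sketch below assumes both values arrive as plain decimal integers, which this page does not state explicitly; the helper name `rpm_headroom` is hypothetical.

```python
from typing import Mapping, Tuple


def rpm_headroom(headers: Mapping[str, str]) -> Tuple[int, int, float]:
    """Return (limit, remaining, fraction_remaining) from response headers.

    Assumes X-RateLimit-Limit and X-RateLimit-Remaining hold decimal integers.
    HTTP header names are case-insensitive, so keys are lowercased first.
    """
    lowered = {key.lower(): value for key, value in headers.items()}
    limit = int(lowered["x-ratelimit-limit"])
    remaining = int(lowered["x-ratelimit-remaining"])
    fraction = remaining / limit if limit > 0 else 0.0
    return limit, remaining, fraction
```

A caller could, for instance, pause new submissions once the returned fraction drops below a threshold of its own choosing.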

On a 429 Too Many Requests, two more headers are added:

Header	Meaning
`X-RateLimit-Policy`	Which bucket denied the request: `rpm`, `global_rpm`, `input_tpm`, `output_tpm`, `concurrency`, or `global_concurrency`.
`Retry-After`	Seconds until the offending bucket has capacity again. Omitted for `concurrency` and `global_concurrency` rejections; for those, finish or cancel in-flight requests instead.
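The two 429 headers suggest a simple branch on the client: wait for time-based rejections, and drain in-flight work for concurrency rejections. The sketch below is a minimal illustration of that branch. The function name `plan_retry` and its return shape are hypothetical, and it assumes `Retry-After` parses as a number of seconds.

```python
from typing import Mapping, Optional, Tuple

# Policies for which Retry-After is omitted, per the table above.
CONCURRENCY_POLICIES = {"concurrency", "global_concurrency"}


def plan_retry(headers: Mapping[str, str]) -> Tuple[str, Optional[float]]:
    """Decide how to react to a 429 based on its rate-limit headers.

    Returns ("wait", seconds) for time-based rejections, or ("drain", None)
    when the caller should let in-flight requests finish (or cancel them).
    """
    lowered = {key.lower(): value for key, value in headers.items()}
    policy = lowered.get("x-ratelimit-policy", "")
    retry_after = lowered.get("retry-after")
    if policy in CONCURRENCY_POLICIES or retry_after is None:
        return ("drain", None)
    # Assumption: the value is a number of seconds, not an HTTP date.
    return ("wait", float(retry_after))
```

Treating a missing `Retry-After` as "drain" is a conservative choice made for this sketch, not documented behavior.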

See Response Headers for non-rate-limit headers.

429 responses

When throttled, ai& returns 429 Too Many Requests. Back off and retry; exponential backoff with jitter is recommended. For time-based rejections, `Retry-After` gives the minimum safe delay.
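A minimal sketch of exponential backoff with full jitter that treats `Retry-After` as a floor on the delay. Everything here is an assumption for illustration: the `send` callable stands in for whatever HTTP client you use and is expected to return `(status, headers, body)`, and the `base`, `cap`, and `max_attempts` defaults are arbitrary values rather than ai& recommendations.

```python
import random
import time
from typing import Any, Callable, Mapping, Optional, Tuple


def backoff_delay(
    attempt: int,
    retry_after: Optional[float] = None,
    base: float = 0.5,
    cap: float = 30.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Full-jitter delay: random in [0, min(cap, base * 2**attempt)),
    but never shorter than Retry-After when the server supplied one."""
    ceiling = min(cap, base * (2 ** attempt))
    delay = rng() * ceiling
    if retry_after is not None:
        delay = max(delay, retry_after)
    return delay


def call_with_retries(
    send: Callable[[], Tuple[int, Mapping[str, str], Any]],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, Any]:
    """Call send() until it returns a non-429 status or attempts run out."""
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt == max_attempts - 1:
            break  # no point sleeping after the final attempt
        lowered = {key.lower(): value for key, value in headers.items()}
        raw = lowered.get("retry-after")
        # Assumption: Retry-After is a number of seconds when present.
        retry_after = float(raw) if raw is not None else None
        sleep(backoff_delay(attempt, retry_after))
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")
```

Jitter spreads retries from many clients over time so they do not all return at the same instant. For concurrency rejections, where `Retry-After` is omitted, this loop still backs off, but freeing in-flight requests is what actually restores capacity.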