The Rate Limits & Quotas page displays the call quotas and rate limits for each model type in the AI service. Use this page to check the monthly token quota, maximum tokens per minute (TPM), and maximum requests per minute (RPM) for each model type when you integrate the AI service into your application, design your calling strategy, or troubleshoot rate limit issues.
Click your organization name in the top navigation bar to go to the Rate Limits & Quotas page.
About rate limits & quotas
- Organization-level model rate limits are the upper bound for all projects in the organization.
- API usage is limited by the maximum tokens per minute (TPM) and the maximum requests per minute (RPM). If the number of tokens or requests exceeds the limit within a minute, subsequent requests in that minute will be restricted to ensure fair resource usage and system stability. To request a higher organizational rate limit, submit a ticket.
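
The per-minute check described above can be mirrored on the client side. The following is a minimal sketch, assuming a trailing 60-second window; this page does not state whether the platform uses a fixed or a sliding window, so treat the windowing as an assumption. `MinuteWindow` and its methods are hypothetical names, not part of the AI service.

```python
import time
from collections import deque


class MinuteWindow:
    """Tracks tokens and requests sent during the trailing 60 seconds."""

    def __init__(self, tpm_limit, rpm_limit, clock=time.monotonic):
        self.tpm_limit = tpm_limit
        self.rpm_limit = rpm_limit
        self.clock = clock
        self.events = deque()  # one (timestamp, tokens) entry per recorded request

    def _evict(self, now):
        # Drop requests that are 60 seconds old or older.
        while self.events and now - self.events[0][0] >= 60:
            self.events.popleft()

    def would_exceed(self, tokens):
        """Return True if one more request of `tokens` would break TPM or RPM."""
        self._evict(self.clock())
        used_tokens = sum(t for _, t in self.events)
        return (len(self.events) + 1 > self.rpm_limit
                or used_tokens + tokens > self.tpm_limit)

    def record(self, tokens):
        """Record a request that was actually sent."""
        self.events.append((self.clock(), tokens))
```

Call `would_exceed(tokens)` before sending a request and `record(tokens)` after sending it. Either limit alone is enough to block the request, which is why both counters are checked.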
The page displays the rate limits & quotas under the current organization by model type. The list includes the following fields:
| Field | Description |
|---|---|
| Model type | The task type of the model, such as text generation, text embedding, multimodal, image generation, or video generation. |
| Monthly quota (M Tokens) | The maximum number of tokens that can be consumed by this model type within a month, in millions of tokens. |
| Tokens per minute (TPM) | The maximum number of tokens that can be consumed by this model type in one minute. |
| Requests per minute (RPM) | The maximum number of requests that can be sent to this model type in one minute. |
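
To illustrate how the monthly quota relates to TPM, the following sketch converts a quota expressed in millions of tokens into the number of minutes of sustained usage it covers. The function name and the example values (1,000 M tokens, 100,000 tokens per minute) are hypothetical and are not values from the page.

```python
def minutes_until_quota_exhausted(monthly_quota_m_tokens, tokens_per_minute):
    """Minutes of sustained usage at `tokens_per_minute` that the monthly quota covers."""
    quota_tokens = monthly_quota_m_tokens * 1_000_000  # M tokens -> tokens
    return quota_tokens / tokens_per_minute


# Hypothetical example: 1,000 M tokens per month, consumed at 100,000 tokens per minute.
minutes = minutes_until_quota_exhausted(monthly_quota_m_tokens=1_000, tokens_per_minute=100_000)
print(f"{minutes:.0f} minutes, about {minutes / 1440:.1f} days")
# prints "10000 minutes, about 6.9 days"
```

In this example, running continuously at the TPM ceiling would use up the monthly quota in about a week, so the monthly quota and the per-minute limits are worth checking together.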
Scenarios
Rate limit and quota information is useful in the following scenarios:
- Application integration assessment: Before integrating your application with the AI service, assess whether the current rate limits and quotas cover your peak demand.
- Troubleshooting rate limit issues: When a call fails, a request is rejected, or a response is abnormal, determine whether the TPM or RPM limit has been reached.
- Optimizing calling strategies: Tune request frequency, batching, and retry strategies based on the rate limits and quotas of each model type.
- Capacity planning: Before your traffic grows, assess whether you need to request a higher rate limit.
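
For the integration assessment and capacity planning scenarios, a rough check is to compare your estimated peak load with the TPM and RPM values shown on the page. The sketch below assumes you can estimate your peak requests per minute and the average number of tokens one request counts toward TPM; how the platform counts tokens per request is not stated here, so that figure is an assumption. `Limits` and `assess_peak` are hypothetical names, and the numbers are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class Limits:
    tpm: int  # tokens per minute, as shown on the page
    rpm: int  # requests per minute, as shown on the page


def assess_peak(peak_requests_per_minute, avg_tokens_per_request, limits):
    """Compare estimated peak load with the limits and return utilization ratios."""
    peak_tpm = peak_requests_per_minute * avg_tokens_per_request
    return {
        "rpm_utilization": peak_requests_per_minute / limits.rpm,
        "tpm_utilization": peak_tpm / limits.tpm,
        "fits": peak_requests_per_minute <= limits.rpm and peak_tpm <= limits.tpm,
    }


# Hypothetical example: 50 requests per minute at 2,500 tokens each.
result = assess_peak(50, 2_500, Limits(tpm=100_000, rpm=60))
```

In this example the request rate stays under the RPM limit, but the token volume (125,000 per minute) is 1.25 times the TPM limit, so `result["fits"]` is `False`. A utilization above 1.0 for either limit suggests reducing the load or requesting a higher limit.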
Recommendations
- Monitor both the TPM and RPM limits, not only the number of requests.
- For high-concurrency workloads, add local rate limiting, queuing, or retry mechanisms so that sudden traffic spikes do not trigger platform throttling.
- If individual requests have long inputs or large outputs, pay particular attention to TPM consumption.
- Plan peak call volume at the organization level, because organization-level limits cap all projects in the organization.
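
One common way to implement the retry mechanism recommended above is exponential backoff with jitter. The following is a minimal sketch, not the platform's prescribed method. `RateLimitError` is a hypothetical exception that your own client wrapper would raise when a call is throttled; this page does not specify how the AI service signals throttling, so map it to whatever error your client actually returns. The delay values are illustrative defaults.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical: raised by your client wrapper when the platform throttles a call."""


def call_with_backoff(send, max_attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call `send()` and retry with exponential backoff plus jitter on throttling."""
    for attempt in range(max_attempts):
        try:
            return send()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads retries out so that clients do not retry in lockstep.
            sleep(delay * random.uniform(0.5, 1.0))
```

Pass a zero-argument callable that performs one request, for example `call_with_backoff(lambda: client.generate(prompt))`, where `client.generate` stands in for your own call. Capping the number of attempts keeps a sustained limit breach from turning into an unbounded retry loop.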
