The Model Rate page displays the call rate limits for each model type in the AI service. Use this page to check the maximum tokens per minute (TPM) and maximum requests per minute (RPM) for each model type when you integrate the AI service into your application, design your calling strategy, or troubleshoot rate limit issues.
Click your organization name in the top navigation bar to go to the Model Rate page.
Note
- Model rate limits: These are the maximum limits at the organization level.
- Calling limits: The AI service typically limits calls based on both the maximum tokens per minute (TPM) and the maximum requests per minute (RPM).
- Excess limit handling: If the number of tokens or requests exceeds the limit within a minute, subsequent requests in the remaining minutes may be restricted to ensure fair resource usage and system stability.
- Request limit increase: If your business requires a higher rate limit, contact the support team by following the instructions on the page.
The list area displays the available models and their corresponding rate limits by model type, typically including the following fields:
| Field | Description |
|---|---|
| Type | The task type of the model, such as chat, embedding, rerank, vl-embedding, or vl-rerank. |
| Model | The name of the specific model available for this type. |
| Tokens per minute (TPM) | The maximum number of tokens that can be consumed by this type of model in one minute. If your business has long single requests or models that generate a lot of output, even with a low number of requests, you may reach the TPM limit first. |
| Requests per minute (RPM) | The maximum number of requests that can be sent to this type of model in one minute. If your business has lightweight requests but a high calling frequency, you may reach the RPM limit first. |
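
To make the TPM and RPM descriptions concrete, the following minimal sketch estimates which limit a workload reaches first. The function name `binding_limit` and the limit values are hypothetical and chosen for illustration only; read the real limits from the Model Rate page.

```python
from typing import Tuple


def binding_limit(tpm_limit: int, rpm_limit: int, avg_tokens_per_request: int) -> Tuple[str, int]:
    """Return which limit is reached first and the sustainable requests per minute."""
    if avg_tokens_per_request <= 0:
        raise ValueError("avg_tokens_per_request must be positive")
    # Requests per minute that the token budget alone would allow.
    requests_allowed_by_tpm = tpm_limit // avg_tokens_per_request
    if requests_allowed_by_tpm < rpm_limit:
        return "TPM", requests_allowed_by_tpm
    return "RPM", rpm_limit


# Hypothetical limits (not taken from the service): 100,000 TPM and 600 RPM.
print(binding_limit(100_000, 600, 4_000))  # long requests: ('TPM', 25)
print(binding_limit(100_000, 600, 50))     # lightweight requests: ('RPM', 600)
```

With long requests the token budget runs out after 25 calls, well before the request limit, while lightweight requests hit the request limit first.
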
Scenarios
Model rate limit information applies to the following scenarios:
- Application integration assessment: Before integrating your application with the AI service, assess whether the current model rate limits meet your business's peak demand.
- Troubleshooting rate limit issues: When a call fails, a request is rejected, or a response is abnormal, determine whether the TPM or RPM limit has been triggered.
- Optimizing calling strategies: Optimize request frequency, batch processing methods, and retry strategies based on the rate limits of different model types.
- Capacity planning: Before your business grows, assess whether you need to apply for a higher rate limit.
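
As one possible retry strategy for throttled calls, the sketch below retries with exponential backoff and jitter. `RateLimitError` and `call_with_backoff` are hypothetical names, not part of the AI service; map them to whatever error your client raises when a TPM or RPM limit is triggered.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical error raised by a client when the service throttles a call."""


def call_with_backoff(send, max_attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call send() and retry with exponential backoff plus jitter on RateLimitError."""
    for attempt in range(max_attempts):
        try:
            return send()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # The delay cap doubles on each attempt, up to max_delay seconds.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter spreads retries so that clients do not retry in lockstep.
            sleep(random.uniform(0, delay))
```

The `sleep` parameter exists only so that the waiting behavior can be replaced in tests. The delay values are placeholders to tune against your own limits.
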
Recommendations
- We recommend that you pay attention to both the TPM and RPM limits, rather than just the number of requests.
- We recommend that you add local rate limiting, queuing, or retry mechanisms for high-concurrency workloads to avoid triggering platform rate throttling due to sudden traffic spikes.
- If your individual requests have long inputs or large outputs, we recommend that you focus on TPM consumption.
- We recommend that you plan the calling peak at the organization level to avoid triggering platform rate throttling.
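
The local rate limiting recommended above could be sketched as a sliding-window limiter that tracks both requests and tokens. This is a minimal, single-threaded illustration: the class `LocalRateLimiter` is hypothetical, the token count passed to `try_acquire` is assumed to be your own estimate of input plus output tokens, and the platform may measure its window differently.

```python
import time
from collections import deque


class LocalRateLimiter:
    """Sliding window that tracks both request count and token count."""

    def __init__(self, rpm_limit, tpm_limit, window=60.0, clock=time.monotonic):
        self.rpm_limit = rpm_limit
        self.tpm_limit = tpm_limit
        self.window = window  # window length in seconds
        self.clock = clock
        self.events = deque()  # (timestamp, tokens) for each admitted request
        self.tokens_in_window = 0

    def _evict_expired(self, now):
        # Drop requests that have aged out of the window and release their tokens.
        while self.events and now - self.events[0][0] >= self.window:
            _, tokens = self.events.popleft()
            self.tokens_in_window -= tokens

    def try_acquire(self, tokens):
        """Admit the request if it fits under both limits; otherwise return False."""
        now = self.clock()
        self._evict_expired(now)
        if len(self.events) + 1 > self.rpm_limit:
            return False  # would exceed the local RPM budget
        if self.tokens_in_window + tokens > self.tpm_limit:
            return False  # would exceed the local TPM budget
        self.events.append((now, tokens))
        self.tokens_in_window += tokens
        return True


# Hypothetical limits for illustration only.
limiter = LocalRateLimiter(rpm_limit=600, tpm_limit=100_000)
if limiter.try_acquire(tokens=1_500):
    pass  # send the request here; otherwise queue it or wait before retrying
```

Because the limits apply at the organization level, a limiter like this only helps if all callers in the organization share it or if each caller is given a fraction of the total budget.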
