As organizations scale their Azure OpenAI workloads, throttling (HTTP 429 errors) becomes a critical operational concern. These errors indicate that your requests exceed the provisioned capacity, leading to degraded user experience, failed completions, and potential revenue loss.
This post introduces the Azure Quick Review openai-throttling plugin, which helps you identify throttling patterns, analyze affected deployments, and make data-driven decisions for capacity planning.
Understanding when, where, and why throttling occurs is the first step toward building resilient AI-powered applications.
Prerequisites#
Before you begin, ensure you have:
- An Azure subscription with Azure OpenAI or AI Services resources
- Azure CLI installed and authenticated
- Latest version of azqr installed
- Reader permissions on the target subscription(s)
The Throttling Problem#
Azure OpenAI enforces rate limits based on your deployment’s provisioned throughput (PTU) or tokens-per-minute (TPM) quota. When requests exceed these limits, Azure returns HTTP 429 (Too Many Requests) errors.
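Resilient clients treat a 429 as a signal to back off rather than a hard failure. Here is a minimal bash sketch, with placeholder account, deployment, and api-version values and an API key assumed in AZURE_OPENAI_KEY; honoring the Retry-After header reflects documented Azure behavior, but verify it on your own responses:
# Minimal client-side backoff sketch (account, deployment, api-version are placeholders)
while :; do
  status=$(curl -s -o response.json -D headers.txt -w '%{http_code}' \
    -H "api-key: $AZURE_OPENAI_KEY" -H "Content-Type: application/json" \
    -d '{"messages":[{"role":"user","content":"ping"}]}' \
    "https://<account>.openai.azure.com/openai/deployments/<deployment>/chat/completions?api-version=<api-version>")
  [ "$status" != "429" ] && break
  # Azure suggests a wait time via the Retry-After header on 429 responses
  wait=$(awk -F': ' 'tolower($1)=="retry-after"{print $2+0}' headers.txt)
  sleep "${wait:-2}"  # fall back to 2 seconds if the header is missing
done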
Common throttling scenarios include:
- Traffic spikes: Unexpected surges during peak hours
- Undersized deployments: Insufficient capacity for workload demands
- Model contention: Multiple applications sharing the same deployment
- Missing spillover: No backup deployment to handle overflow
The Risk: Without visibility into throttling patterns, you're flying blind, potentially losing requests and frustrating users without knowing the root cause.
Using the azqr openai-throttling Command#
The azqr openai-throttling command scans your Azure OpenAI and Cognitive Services accounts, querying Azure Monitor metrics to detect 429 errors over the past 7 days.
Install the Latest azqr#
bash -c "$(curl -fsSL https://raw.githubusercontent.com/azure/azqr/main/scripts/install.sh)"
Run the openai-throttling Command#
# Standalone (scans all subscriptions)
azqr openai-throttling
# Or for a specific subscription
azqr openai-throttling -s <subscription-id>
The plugin queries the AzureOpenAIRequests metric with status code dimensions, providing hourly granularity for the past week.
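The same data is available from Azure Monitor directly, so you can approximate the query outside azqr. A rough sketch with the Azure CLI, where the resource ID is a placeholder and the StatusCode dimension name and Total aggregation are assumptions based on the plugin's output columns:
# Rough manual equivalent of the plugin's metric query
# (dimension name and aggregation are assumptions; adjust to the metric definition)
az monitor metrics list \
  --resource "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.CognitiveServices/accounts/<account>" \
  --metric AzureOpenAIRequests \
  --interval 1h \
  --offset 7d \
  --aggregation Total \
  --filter "StatusCode eq '*'"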
Analyzing the Results#
The output includes detailed information for each hour, deployment, and model:
| Column | Description |
|---|---|
| Subscription | The Azure subscription name |
| Resource Group | Resource group containing the account |
| Account Name | Azure OpenAI or AI Services account |
| Kind | Account type (OpenAI, AIServices) |
| SKU | Pricing tier |
| Deployment Name | The model deployment experiencing requests |
| Model Name | The underlying model (gpt-4, gpt-35-turbo, etc.) |
| Spillover Enabled | Whether spillover is configured |
| Spillover Deployment | Target deployment for overflow traffic |
| Hour | The hour when requests occurred |
| Status Code | HTTP status code (200, 429, etc.) |
| Request Count | Number of requests with that status |
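A quick way to confirm the exact JSON field names behind these columns is to print a single record from the JSON export (the --json and --stdout flags appear again in the capacity-planning section below):
# Peek at one record to see the exact field names used in the JSON export
azqr openai-throttling --json --stdout | jq '.externalPlugins."openai-throttling".data[0]'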
Key Areas to Focus On#
When analyzing results, pay attention to:
1. Instances Experiencing Throttling
Filter for rows where Status Code = 429. These are your problem areas (a jq sketch follows this list). Look for:
- Which accounts have the highest 429 counts
- Whether throttling is isolated to specific deployments or widespread
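As an example, the following sketch totals 429s per account; statusCode and requestCount match the JSON export used later in this post, while accountName is an assumption inferred from the Account Name column:
azqr openai-throttling --json --stdout | jq '
  .externalPlugins."openai-throttling".data
  | map(select(.statusCode == "429"))
  | group_by(.accountName)          # accountName is assumed from the Account Name column
  | map({account: .[0].accountName, throttled: (map(.requestCount | tonumber) | add)})
  | sort_by(-.throttled)
'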
2. Deployments and Models Affected
Identify which model deployments are hitting limits:
- High-demand models like gpt-4 may need more capacity
- Shared deployments serving multiple applications are throttling candidates
3. Time Patterns of Throttling (Peak Hours)
Look for temporal patterns (a sketch for ranking peak hours follows this list):
- Business hours throttling suggests production workload issues
- Batch processing windows may create predictable spikes
- Overnight throttling could indicate scheduled jobs or global users
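The same export can be pivoted by hour to rank your peaks (the hour field name is an assumption based on the Hour column):
azqr openai-throttling --json --stdout | jq '
  .externalPlugins."openai-throttling".data
  | map(select(.statusCode == "429"))
  | group_by(.hour)                 # hour field name assumed from the Hour column
  | map({hour: .[0].hour, throttled: (map(.requestCount | tonumber) | add)})
  | sort_by(-.throttled) | .[:5]    # five busiest hours
'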
4. Spillover Configuration Status
Check the Spillover Enabled column (a verification command follows this list):
- Deployments showing No with high 429 counts are candidates for spillover configuration
- Verify spillover deployments have sufficient capacity themselves
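To verify what a deployment currently has configured, you can inspect it directly. The exact spillover property name in the response may vary by API version, so the query below simply dumps sku and properties:
# Inspect a deployment's configuration to confirm spillover settings and capacity
az cognitiveservices account deployment show \
  --name <account-name> \
  --resource-group <resource-group> \
  --deployment-name <deployment-name> \
  --query "{sku: sku, properties: properties}"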
Recommendations#
Based on your analysis, consider the following strategies:
Capacity Planning and Scaling#
# Export data for capacity analysis
azqr openai-throttling --json --stdout > throttling-data.json
# Analyze peak hours and calculate required capacity
jq '
  .externalPlugins."openai-throttling".data
  | [.[] | select(.statusCode == "429")]
  | group_by(.deploymentName)
  | map({deployment: .[0].deploymentName,
         total_requests: (map(.requestCount | tonumber) | add)})
' throttling-data.json
Actions:
- Increase PTU allocation for consistently throttled deployments
- Consider provisioned throughput for predictable workloads
- Request quota increases for TPM-limited deployments (check current consumption first, as shown below)
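Before filing a quota request, it helps to see how much of the current quota each model is already consuming in a region. A minimal sketch:
# Current quota consumption per model in a region
az cognitiveservices usage list --location <region> --output table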
Load Distribution with Azure API Management#
Azure API Management (APIM) can intelligently distribute load across multiple Azure OpenAI instances:
<!-- APIM policy for round-robin load balancing -->
<policies>
<inbound>
<set-backend-service backend-id="openai-backend-pool" />
</inbound>
</policies>
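Note that this policy assumes a load-balanced backend pool (here named openai-backend-pool, a placeholder) has already been defined in your APIM instance, with your Azure OpenAI endpoints as its members.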
Benefits:
- Distribute requests across multiple deployments or regions
- Implement retry logic with automatic failover
- Add caching to reduce redundant requests
- Monitor and rate-limit by application or user
Spillover Configuration Optimization#
For deployments without spillover, configure a backup:
# Using Azure CLI to create a deployment with a spillover target
az cognitiveservices account deployment create \
--name <account-name> \
--resource-group <resource-group> \
--deployment-name <deployment-name> \
--model-format OpenAI \
--model-name gpt-4 \
--model-version "0613" \
--sku-capacity 10 \
--sku-name Standard \
--spillover-deployment-name <backup-deployment-name>
Best practices:
- Ensure spillover deployments have adequate capacity
- Consider using different regions for spillover (resilience)
- Monitor spillover usage to detect capacity planning issues
Integrating with Full Scans#
You can combine throttling analysis with a comprehensive Azure review:
# Run full scan with throttling plugin
azqr scan --plugin openai-throttling --output-name openai-analysis
# View results in interactive dashboard
azqr show -f openai-analysis.xlsx --open
Conclusion#
Throttling is inevitable at scale, but it doesn’t have to be unpredictable. The azqr openai-throttling plugin gives you visibility into your Azure OpenAI capacity constraints, enabling proactive capacity planning instead of reactive firefighting.
Use the data to right-size your deployments, implement intelligent load distribution with APIM, and configure spillover for resilience.
Hope it helps!