
Diagnose Azure OpenAI Throttling with azqr

azure azqr azure-openai throttling capacity-planning apim
Carlos Mendible

As organizations scale their Azure OpenAI workloads, throttling (HTTP 429 errors) becomes a critical operational concern. These errors indicate that your requests exceed the provisioned capacity, leading to degraded user experience, failed completions, and potential revenue loss.

This post introduces the Azure Quick Review openai-throttling plugin, which helps you identify throttling patterns, analyze affected deployments, and make data-driven decisions for capacity planning.

Understanding when, where, and why throttling occurs is the first step toward building resilient AI-powered applications.

Prerequisites

Before you begin, ensure you have:

  • An Azure subscription with Azure OpenAI or AI Services resources
  • Azure CLI installed and authenticated
  • Latest version of azqr installed
  • Reader permissions on the target subscription(s)

The Throttling Problem

Azure OpenAI enforces rate limits based on your deployment’s provisioned throughput units (PTUs) or tokens-per-minute (TPM) quota. When requests exceed these limits, Azure returns HTTP 429 (Too Many Requests) errors.

Common throttling scenarios include:

  • Traffic spikes: Unexpected surges during peak hours
  • Undersized deployments: Insufficient capacity for workload demands
  • Model contention: Multiple applications sharing the same deployment
  • Missing spillover: No backup deployment to handle overflow

The Risk: Without visibility into throttling patterns, you’re flying blind, losing requests and frustrating users without knowing the root cause.

Using the azqr openai-throttling Command

The azqr openai-throttling command scans your Azure OpenAI and Cognitive Services accounts, querying Azure Monitor metrics to detect 429 errors over the past 7 days.

Install the Latest azqr

bash -c "$(curl -fsSL https://raw.githubusercontent.com/azure/azqr/main/scripts/install.sh)"
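
If the install succeeds, the binary should be on your PATH. A quick sanity check (assuming azqr exposes the usual --version flag):

# Confirm the CLI is installed and reachable
azqr --version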

Run the openai-throttling Command

# Standalone (scans all subscriptions)
azqr openai-throttling

# Or for a specific subscription
azqr openai-throttling -s <subscription-id>

The plugin queries the AzureOpenAIRequests metric with status code dimensions, providing hourly granularity for the past week.
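
You can also spot-check a single account directly with the Azure CLI. A minimal sketch, assuming you know the account’s resource ID and that StatusCode is exposed as a dimension on the metric:

# Query AzureOpenAIRequests for one account, split by status code
# (StatusCode as the dimension name is an assumption; adjust if needed)
az monitor metrics list \
  --resource "<account-resource-id>" \
  --metric "AzureOpenAIRequests" \
  --interval PT1H \
  --offset 7d \
  --aggregation Total \
  --filter "StatusCode eq '*'"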

Analyzing the Results

The output includes detailed information for each hour, deployment, and model:

  • Subscription: The Azure subscription name
  • Resource Group: Resource group containing the account
  • Account Name: Azure OpenAI or AI Services account
  • Kind: Account type (OpenAI, AIServices)
  • SKU: Pricing tier
  • Deployment Name: The model deployment experiencing requests
  • Model Name: The underlying model (gpt-4, gpt-35-turbo, etc.)
  • Spillover Enabled: Whether spillover is configured
  • Spillover Deployment: Target deployment for overflow traffic
  • Hour: The hour when requests occurred
  • Status Code: HTTP status code (200, 429, etc.)
  • Request Count: Number of requests with that status

Key Areas to Focus On

When analyzing results, pay attention to:

1. Instances Experiencing Throttling

Filter for rows where Status Code = 429. These are your problem areas. Look for:

  • Which accounts have the highest 429 counts
  • Whether throttling is isolated to specific deployments or widespread
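
A quick way to surface these rows is to export the results as JSON and aggregate 429s per account. A rough jq sketch; statusCode, deploymentName, and requestCount appear in the plugin output used later in this post, while accountName is an assumed field name based on the columns above:

# Export once, then rank accounts by total throttled requests
azqr openai-throttling --json --stdout > throttling-data.json

# accountName is an assumed field name; see the column list above
jq '.externalPlugins."openai-throttling".data
    | [.[] | select(.statusCode == "429")]
    | group_by(.accountName)
    | map({account: .[0].accountName,
           throttled: map(.requestCount | tonumber) | add})
    | sort_by(-.throttled)' throttling-data.json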

2. Deployments and Models Affected

Identify which model deployments are hitting limits:

  • High-demand models like gpt-4 may need more capacity
  • Shared deployments serving multiple applications are throttling candidates

3. Time Patterns of Throttling (Peak Hours)

Look for temporal patterns:

  • Business hours throttling suggests production workload issues
  • Batch processing windows may create predictable spikes
  • Overnight throttling could indicate scheduled jobs or global users
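
With the same exported JSON, bucketing 429s by hour makes these patterns easy to spot (the hour field name is an assumption mirroring the Hour column):

# Total 429s per hour; peaks show when capacity runs out
jq '.externalPlugins."openai-throttling".data
    | [.[] | select(.statusCode == "429")]
    | group_by(.hour)
    | map({hour: .[0].hour,
           throttled: map(.requestCount | tonumber) | add})' throttling-data.json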

4. Spillover Configuration Status

Check the Spillover Enabled column:

  • Deployments showing No with high 429 counts are candidates for spillover configuration
  • Verify spillover deployments have sufficient capacity themselves
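
A hedged jq sketch that lists throttled deployments lacking spillover (the spilloverEnabled field name and its No value are assumptions based on the table above):

# Throttled deployments with no spillover configured
jq '.externalPlugins."openai-throttling".data
    | [.[] | select(.statusCode == "429" and .spilloverEnabled == "No")]
    | group_by(.deploymentName)
    | map({deployment: .[0].deploymentName,
           throttled: map(.requestCount | tonumber) | add})' throttling-data.json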

Recommendations

Based on your analysis, consider the following strategies:

Capacity Planning and Scaling

# Export data for capacity analysis
azqr openai-throttling --json --stdout > throttling-data.json

# Analyze peak hours and calculate required capacity
jq '.externalPlugins."openai-throttling".data
    | [.[] | select(.statusCode == "429")]
    | group_by(.deploymentName)
    | map({deployment: .[0].deploymentName,
           total_requests: map(.requestCount | tonumber) | add})' throttling-data.json

Actions:

  • Increase PTU allocation for consistently throttled deployments
  • Consider provisioned throughput for predictable workloads
  • Request quota increases for TPM-limited deployments
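
Before filing a quota request, check how much headroom you have left. The Azure CLI can list current usage against limits per region:

# Current quota usage vs. limits in your account’s region
az cognitiveservices usage list --location <region> --output table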

Load Distribution with Azure API Management

Azure API Management (APIM) can intelligently distribute load across multiple Azure OpenAI instances:

<!-- APIM policy for round-robin load balancing -->
<policies>
    <inbound>
        <set-backend-service backend-id="openai-backend-pool" />
    </inbound>
</policies>

Benefits:

  • Distribute requests across multiple deployments or regions
  • Implement retry logic with automatic failover
  • Add caching to reduce redundant requests
  • Monitor and rate-limit by application or user

Spillover Configuration Optimization

For deployments without spillover, configure a backup:

# Use the Azure CLI to create (or re-create) a deployment with a spillover target
az cognitiveservices account deployment create \
  --name <account-name> \
  --resource-group <resource-group> \
  --deployment-name <deployment-name> \
  --model-format OpenAI \
  --model-name gpt-4 \
  --model-version "0613" \
  --sku-capacity 10 \
  --sku-name Standard \
  --spillover-deployment-name <backup-deployment-name>

Best practices:

  • Ensure spillover deployments have adequate capacity
  • Consider using different regions for spillover (resilience)
  • Monitor spillover usage to detect capacity planning issues
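
To confirm spillover is actually absorbing overflow, you can query the request metric for the backup deployment (ModelDeploymentName as the dimension name is an assumption; adjust to match your account’s metric dimensions):

# Requests landing on the spillover deployment over the past week
az monitor metrics list \
  --resource "<account-resource-id>" \
  --metric "AzureOpenAIRequests" \
  --interval PT1H \
  --offset 7d \
  --aggregation Total \
  --filter "ModelDeploymentName eq '<backup-deployment-name>'"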

Integrating with Full Scans

You can combine throttling analysis with a comprehensive Azure review:

# Run full scan with throttling plugin
azqr scan --plugin openai-throttling --output-name openai-analysis

# View results in interactive dashboard
azqr show -f openai-analysis.xlsx --open

Conclusion

Throttling is inevitable at scale, but it doesn’t have to be unpredictable. The azqr openai-throttling plugin gives you visibility into your Azure OpenAI capacity constraints, enabling proactive capacity planning instead of reactive firefighting.

Use the data to right-size your deployments, implement intelligent load distribution with APIM, and configure spillover for resilience.

Hope it helps!
