
Is Azure Down? Microsoft Azure Status Guide & Diagnostics 2026

Is Microsoft Azure Down Right Now?

Microsoft Azure is the second-largest cloud platform globally, powering enterprise workloads across App Service, AKS, Azure SQL, Event Hubs, Cosmos DB, and hundreds of other services. Azure outages can cascade across dependent services and affect production workloads across regions.

Here's how to check Azure status in seconds and diagnose whether you're facing a platform issue or a configuration problem.

Step 1: Check Azure's Status Pages

Azure has two status interfaces — you need both:

1. Azure Status Page (all customers)

azure.status.microsoft — shows active incidents and a global heatmap by service and region. Updated continuously during incidents.

2. Azure Service Health (your specific resources)

Azure Portal → Service Health — shows incidents affecting your specific subscriptions, regions, and services. This is more accurate than the public status page for your workloads because not every incident affects every customer.

Set up Service Health alerts

# Create a Service Health alert via Azure CLI
az monitor activity-log alert create \
  --name "azure-service-health-alert" \
  --resource-group my-rg \
  --scope "/subscriptions/$(az account show --query id -o tsv)" \
  --condition category=ServiceHealth \
  --action-group my-action-group
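The alert above routes to an action group named my-action-group, which must exist first. A minimal sketch of creating one via the CLI (the names and email address are placeholders, not from the original guide):

```shell
# Create the action group the Service Health alert routes to.
# --short-name is required (12 characters max); names/email are placeholders.
az monitor action-group create \
  --name my-action-group \
  --resource-group my-rg \
  --short-name azhealth \
  --action email oncall oncall@example.com
```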

Azure Service Architecture: What Can Break

| Service | What it does | Common failure modes |
|---|---|---|
| App Service / Web Apps | Managed web hosting (Linux/Windows) | Deployment failures, SSL binding errors, CORS issues, custom domain validation |
| Azure Kubernetes Service (AKS) | Managed Kubernetes | Control plane API unavailable, node pool upgrade stuck, CNI plugin failures |
| Azure Functions | Serverless compute | Cold start timeouts, trigger binding errors, deployment package too large |
| Azure SQL Database | Managed SQL Server (PaaS) | Connection pool exhaustion, DTU/vCore throttling, failover to geo-replica |
| Cosmos DB | Multi-model NoSQL database | RU/s throughput exceeded (429), regional failover, consistency level issues |
| Event Hubs / Service Bus | Managed messaging | Throughput unit exhaustion, namespace failover, consumer group lock conflicts |
| Azure Blob Storage | Object storage | Storage account throttling, geo-redundancy failover, access tier transitions |
| Azure Active Directory / Entra ID | Identity and authentication | Token endpoint delays, MFA push failures, conditional access evaluation errors |
| Azure OpenAI Service | Hosted OpenAI models | Token rate limits (429), multi-region capacity exhaustion (see Q1 2026 incident) |
| Azure Container Registry (ACR) | Private container image registry | Pull failures during deployments, geo-replication lag |

Is It Azure or Your App? Diagnostic Table

| Symptom | Most likely cause | How to verify |
|---|---|---|
| App Service returns 503 | App crash, scaling limits, or App Service plan throttled | Check App Service → Diagnose and Solve Problems → Availability |
| AKS kubectl commands fail | Control plane API unavailable (Azure incident) OR RBAC/credentials issue | Run az aks get-credentials and retry; check Azure status for AKS in your region |
| Azure Functions not triggering | Storage account issue (Functions uses Storage internally), trigger binding error | Check Functions → Monitor for invocation errors; verify AzureWebJobsStorage connection |
| Azure SQL: "connection timeout" | DTU/vCore exhaustion, max connections reached, or geo-failover in progress | Check Azure SQL → Metrics → DTU percentage; look for high wait stats |
| Cosmos DB: 429 Too Many Requests | RU/s limit reached (not an outage — provisioned throughput exceeded) | Check Cosmos DB → Insights → Rate of requests throttled; increase RU/s or use autoscale |
| Event Hubs: consumers not receiving messages | Throughput unit limit, consumer group offset issue, or namespace geo-failover | Check Event Hubs → Metrics → Throttled Requests; verify consumer group checkpoint |
| Azure AD: login failures for users | Entra ID / Azure AD incident (can be global) | Check azure.status.microsoft for Azure Active Directory / Entra |
| Azure OpenAI: 503 errors | Regional capacity exhaustion or quota exceeded | Check your quota in Azure Portal → Azure OpenAI → Quotas; try a different region |
| Deployment fails with "ResourceNotFound" | ARM template issue, resource provider not registered, or quota limit | Check Azure Portal → Activity Log for deployment errors |
| Works in West Europe, fails in East US | Region-specific Azure incident | Check the status page for the specific region; deploy to a secondary region |
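As a rough triage aid, the decision logic in the table above can be sketched in a few lines of Python. The categories and messages below are illustrative, not an official Azure taxonomy:

```python
# Map an Azure HTTP status code to a coarse diagnosis: "your config/quota"
# vs. "possibly platform-side". Categories are illustrative only.

def triage(status_code):
    """Return a coarse diagnosis string for an Azure HTTP error code."""
    if status_code == 429:
        # Throttling: provisioned throughput (RU/s, DTU, throughput units)
        # exceeded — not an outage. Honor Retry-After and consider scaling.
        return "throttled: check quotas and RU/s, back off and retry"
    if status_code in (401, 403):
        return "auth/RBAC: check credentials and role assignments"
    if status_code == 404:
        return "config: resource name, region, or provider registration"
    if status_code in (500, 502, 503, 504):
        # Transient or platform-side: retry with backoff, then check
        # azure.status.microsoft and Service Health for your region.
        return "transient/platform: retry, then check Service Health"
    return "unclassified: inspect the response body and Activity Log"

print(triage(429))
print(triage(503))
```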

Q1 2026 Azure Incidents (Public Record)

Microsoft publishes detailed Post-Incident Reviews (PIRs) at azure.status.microsoft/en-us/status/history/. Notable Q1 2026 incidents from public reporting:

| Date | Service | Duration | Description | Source |
|---|---|---|---|---|
| March 9–10, 2026 | Azure OpenAI Service | ~20 hours | Multi-region degradation affecting Azure OpenAI 5.2 models across Australia East, Sweden Central, Central US, East US 2, Korea Central, Norway East, UK South. HTTP 400/429 errors. Root cause: unexpected resource exhaustion triggered by a configuration change. | azure.status.microsoft |
| February 2026 | Azure Virtual Machines | Multiple hours | Virtual machine outage with cascading effects on dependent Azure services. Reported by The Register. | azure.status.microsoft |

All data sourced from Microsoft's public status history and published incident reports. For the full list of Q1 2026 incidents with root cause analyses, check azure.status.microsoft/en-us/status/history/ directly.


Azure Resilience Patterns

1. Availability Zones vs. Regions

Most Azure regions offer 3 Availability Zones (physically separate datacenters within a region). Zone-redundant deployments survive a single AZ failure. However, they don't protect against:

  • Region-wide control plane issues (API management, ARM)
  • Global services like Azure AD / Entra ID
  • Configuration changes applied globally (like the March 2026 Azure OpenAI incident)

2. Azure Paired Regions

Azure pairs regions for geo-redundant replication (e.g., East US ↔ West US, North Europe ↔ West Europe). Critical services:

  • Azure SQL geo-replication uses paired regions by default
  • Geo-redundant storage (GRS) replicates to paired region
  • Azure Site Recovery uses paired regions for DR
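A failover runbook can encode these pairings as a simple lookup. The sketch below includes only the two pairs named above; the full pairing list lives in Azure's cross-region replication documentation:

```python
# Sample of Azure's documented region pairs (only the pairs named in this
# guide — consult Azure docs for the complete list).
REGION_PAIRS = {
    "eastus": "westus",
    "northeurope": "westeurope",
}
# Make the mapping symmetric so lookups work in either direction.
REGION_PAIRS.update({v: k for k, v in list(REGION_PAIRS.items())})

def failover_target(region):
    """Return the paired region to target for DR, or None if unknown."""
    return REGION_PAIRS.get(region.lower().replace(" ", ""))

print(failover_target("East US"))      # westus
print(failover_target("westeurope"))   # northeurope
```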

3. Retry logic for transient errors

import time
import random

def azure_retry(func, max_retries=5, base_delay=1.0):
    """
    Exponential backoff with jitter for Azure SDK calls.
    Handles 429 (throttle), 503 (unavailable), and transient 5xx errors.
    """
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as e:
            error_code = getattr(e, 'error_code', None) or str(e)

            # Don't retry on permanent errors
            if '404' in str(error_code) or '403' in str(error_code):
                raise

            if attempt == max_retries - 1:
                raise

            # Exponential backoff with jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)

            # Honor Retry-After header if present (Azure 429s include it)
            retry_after = getattr(e, 'retry_after', None)
            if retry_after:
                delay = max(delay, float(retry_after))

            time.sleep(delay)

    raise Exception("Max retries exceeded")
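To see the backoff behavior end to end, here is a self-contained demo of the same pattern against a simulated Azure call that throttles twice before succeeding. FakeThrottleError, the call counter, and the tiny delays are all illustrative:

```python
import random
import time

class FakeThrottleError(Exception):
    """Simulates an Azure SDK 429 carrying a Retry-After hint."""
    def __init__(self):
        super().__init__("429 Too Many Requests")
        self.retry_after = 0.01  # tiny value so the demo runs fast

calls = {"n": 0}

def flaky_cosmos_read():
    """Fails with a throttle error on the first two attempts."""
    calls["n"] += 1
    if calls["n"] <= 2:
        raise FakeThrottleError()
    return {"id": "item-1"}

# Same shape as azure_retry above, condensed for the demo.
def retry(func, max_retries=5, base_delay=0.01):
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as e:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.01)
            delay = max(delay, float(getattr(e, "retry_after", 0) or 0))
            time.sleep(delay)

result = retry(flaky_cosmos_read)
print(result, "after", calls["n"], "attempts")  # succeeds on attempt 3
```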

Azure CLI Quick Diagnostic Commands

# Check App Service health
az webapp show --name my-app --resource-group my-rg --query "state"

# Check AKS cluster state
az aks show --name my-cluster --resource-group my-rg --query "powerState"

# List recent AKS node pool events
az aks get-credentials --name my-cluster --resource-group my-rg
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -20

# Check Azure SQL database status
az sql db show --name my-db --server my-server --resource-group my-rg \
  --query "[status, currentServiceObjectiveName]"

# Check Azure Functions recent invocations
az functionapp list-functions --name my-func-app --resource-group my-rg

# View activity log (deployment errors)
# Note: -v-2H is BSD/macOS date syntax; on GNU/Linux use:
#   date -u -d '2 hours ago' +%Y-%m-%dT%H:%M:%SZ
az monitor activity-log list --resource-group my-rg \
  --start-time $(date -u -v-2H +%Y-%m-%dT%H:%M:%SZ) \
  --query "[?level=='Error'].{time:eventTimestamp, op:operationName.value, status:status.value}"
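The Activity Log command returns JSON, so a short script can surface just the failures for a dashboard or alert. The sample payload below is an illustrative stand-in for real `az monitor activity-log list -o json` output:

```python
import json

# Illustrative stand-in for: az monitor activity-log list ... -o json
sample = json.loads("""
[
  {"level": "Error", "eventTimestamp": "2026-03-09T14:02:11Z",
   "operationName": {"value": "Microsoft.Web/sites/write"},
   "status": {"value": "Failed"}},
  {"level": "Informational", "eventTimestamp": "2026-03-09T14:05:00Z",
   "operationName": {"value": "Microsoft.Web/sites/read"},
   "status": {"value": "Succeeded"}}
]
""")

# Keep only Error-level events, flattened to (timestamp, operation, status).
errors = [
    (e["eventTimestamp"], e["operationName"]["value"], e["status"]["value"])
    for e in sample
    if e["level"] == "Error"
]
for ts, op, status in errors:
    print(f"{ts}  {op}  {status}")
```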

External Monitoring for Azure

Azure Monitor's built-in availability tests run from within Azure infrastructure — they can miss failures that only appear when traffic routes through the global load balancer or comes from outside Azure's network. Use an external monitoring service like ezmon.com to check your Azure-hosted services from multiple independent geographic locations.

External monitoring catches:

  • Azure Front Door / Traffic Manager routing failures
  • DNS propagation issues from Azure DNS
  • SSL certificate problems (expired or mis-issued certs)
  • Geographic routing errors (Azure Traffic Manager weighted/priority policies)
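The value of multi-location checks is that a per-region failure pattern distinguishes a routing fault from a full outage. A minimal sketch of that classification logic (the function name, categories, and thresholds are illustrative, not from any monitoring product):

```python
def classify_check(status_code, latency_ms, regions_failing, regions_total,
                   slow_ms=2000):
    """Classify one external check result. Thresholds are illustrative."""
    if regions_failing == regions_total:
        return "down"        # unreachable from every location
    if regions_failing > 0:
        return "regional"    # e.g. a Traffic Manager routing fault
    if status_code >= 500:
        return "erroring"
    if latency_ms > slow_ms:
        return "degraded"
    return "up"

print(classify_check(200, 120, 0, 5))   # up
print(classify_check(200, 150, 2, 5))   # regional
```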

Bottom Line

When Azure shows symptoms:

  1. Check azure.status.microsoft AND Azure Portal → Service Health
  2. Service Health is more accurate — it filters to your specific subscriptions and regions
  3. Use az monitor activity-log list to see recent deployment/operation errors
  4. Check resource-specific metrics (DTU%, RU/s, throughput units) before assuming it's Azure
  5. If confirmed Azure incident: implement regional failover or wait for remediation

Monitor your Azure-hosted services from outside Azure at ezmon.com — multi-location uptime monitoring that catches what Azure Monitor misses.
