
Is Azure Down? Microsoft Azure Status Guide & Diagnostics 2026

Is Microsoft Azure Down Right Now?

Microsoft Azure is the second-largest cloud platform globally, powering enterprise workloads across App Service, AKS, Azure SQL, Event Hubs, Cosmos DB, and hundreds of other services. Azure outages can cascade across dependent services and affect production workloads across regions.

Here's how to check Azure status in seconds and diagnose whether you're facing a platform issue or a configuration problem.

Step 1: Check Azure's Status Pages

Azure has two status interfaces — you need both:

1. Azure Status Page (all customers)

azure.status.microsoft — shows active incidents and a global heatmap by service and region. Updated continuously during incidents.

2. Azure Service Health (your specific resources)

Azure Portal → Service Health — shows incidents affecting your specific subscriptions, regions, and services. This is more accurate than the public status page for your workloads because not every incident affects every customer.

Set up Service Health alerts

# Create a Service Health alert via Azure CLI
az monitor activity-log alert create \
  --name "azure-service-health-alert" \
  --resource-group my-rg \
  --scope "/subscriptions/$(az account show --query id -o tsv)" \
  --condition category=ServiceHealth \
  --action-group my-action-group
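The alert above routes to an action group named my-action-group, which must exist first. A minimal sketch of creating one via the CLI (the names and email address are placeholders, not from the original guide):

```shell
# Create the action group the Service Health alert routes to.
# --short-name is required (12 characters max); names/email are placeholders.
az monitor action-group create \
  --name my-action-group \
  --resource-group my-rg \
  --short-name azhealth \
  --action email oncall oncall@example.com
```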

Azure Service Architecture: What Can Break

| Service | What it does | Common failure modes |
|---|---|---|
| App Service / Web Apps | Managed web hosting (Linux/Windows) | Deployment failures, SSL binding errors, CORS issues, custom domain validation |
| Azure Kubernetes Service (AKS) | Managed Kubernetes | Control plane API unavailable, node pool upgrade stuck, CNI plugin failures |
| Azure Functions | Serverless compute | Cold start timeouts, trigger binding errors, deployment package too large |
| Azure SQL Database | Managed SQL Server (PaaS) | Connection pool exhaustion, DTU/vCore throttling, failover to geo-replica |
| Cosmos DB | Multi-model NoSQL database | RU/s throughput exceeded (429), regional failover, consistency level issues |
| Event Hubs / Service Bus | Managed messaging | Throughput unit exhaustion, namespace failover, consumer group lock conflicts |
| Azure Blob Storage | Object storage | Storage account throttling, geo-redundancy failover, access tier transitions |
| Azure Active Directory / Entra ID | Identity and authentication | Token endpoint delays, MFA push failures, conditional access evaluation errors |
| Azure OpenAI Service | Hosted OpenAI models | Token rate limits (429), multi-region capacity exhaustion (see Q1 2026 incident) |
| Azure Container Registry (ACR) | Private container image registry | Pull failures during deployments, geo-replication lag |

Is It Azure or Your App? Diagnostic Table

| Symptom | Most likely cause | How to verify |
|---|---|---|
| App Service returns 503 | App crash, scaling limits, or App Service plan throttled | Check App Service → Diagnose and Solve Problems → Availability |
| AKS kubectl commands fail | Control plane API unavailable (Azure incident) OR RBAC/credentials issue | Run az aks get-credentials and retry; check Azure status for AKS in your region |
| Azure Functions not triggering | Storage account issue (Functions uses Storage internally), trigger binding error | Check Functions → Monitor for invocation errors; verify AzureWebJobsStorage connection |
| Azure SQL: "connection timeout" | DTU/vCore exhaustion, max connections reached, or geo-failover in progress | Check Azure SQL → Metrics → DTU percentage; look for high wait stats |
| Cosmos DB: 429 Too Many Requests | RU/s limit reached (not an outage — provisioned throughput exceeded) | Check Cosmos DB → Insights → Rate of requests throttled; increase RU/s or use autoscale |
| Event Hubs: consumers not receiving messages | Throughput unit limit, consumer group offset issue, or namespace geo-failover | Check Event Hubs → Metrics → Throttled Requests; verify consumer group checkpoint |
| Azure AD: login failures for users | Entra ID / Azure AD incident (can be global) | Check azure.status.microsoft for Azure Active Directory / Entra |
| Azure OpenAI: 503 errors | Regional capacity exhaustion or quota exceeded | Check your quota in Azure Portal → Azure OpenAI → Quotas; try a different region |
| Deployment fails with "ResourceNotFound" | ARM template issue, resource provider not registered, or quota limit | Check Azure Portal → Activity Log for deployment errors |
| Works in West Europe, fails in East US | Region-specific Azure incident | Check the status page for the specific region; deploy to a secondary region |
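As a rough triage aid, the decision logic in the table above can be sketched in a few lines of Python. The categories and messages below are illustrative, not an official Azure taxonomy:

```python
# Map an Azure HTTP status code to a coarse diagnosis: "your config/quota"
# vs. "possibly platform-side". Categories are illustrative only.

def triage(status_code):
    """Return a coarse diagnosis string for an Azure HTTP error code."""
    if status_code == 429:
        # Throttling: provisioned throughput (RU/s, DTU, throughput units)
        # exceeded — not an outage. Honor Retry-After and consider scaling.
        return "throttled: check quotas and RU/s, back off and retry"
    if status_code in (401, 403):
        return "auth/RBAC: check credentials and role assignments"
    if status_code == 404:
        return "config: resource name, region, or provider registration"
    if status_code in (500, 502, 503, 504):
        # Transient or platform-side: retry with backoff, then check
        # azure.status.microsoft and Service Health for your region.
        return "transient/platform: retry, then check Service Health"
    return "unclassified: inspect the response body and Activity Log"

print(triage(429))
print(triage(503))
```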

Q1 2026 Azure Incidents (Public Record)

Microsoft publishes detailed Post-Incident Reviews (PIRs) at azure.status.microsoft/en-us/status/history/. Notable Q1 2026 incidents from public reporting:

| Date | Service | Duration | Description | Source |
|---|---|---|---|---|
| March 9–10, 2026 | Azure OpenAI Service | ~20 hours | Multi-region degradation affecting Azure OpenAI 5.2 models across Australia East, Sweden Central, Central US, East US 2, Korea Central, Norway East, UK South. HTTP 400/429 errors. Root cause: unexpected resource exhaustion triggered by a configuration change. | azure.status.microsoft |
| February 2026 | Azure Virtual Machines | Multiple hours | Virtual machine outage with cascading effects on dependent Azure services. Reported by The Register. | azure.status.microsoft |

All data sourced from Microsoft's public status history and published incident reports. For the full list of Q1 2026 incidents with root cause analyses, check azure.status.microsoft/en-us/status/history/ directly.


Azure Resilience Patterns

1. Availability Zones vs. Regions

Most Azure regions offer 3 Availability Zones (physically separate datacenters within a region). Zone-redundant deployments survive a single AZ failure. However, they don't protect against:

  • Region-wide control plane issues (API management, ARM)
  • Global services like Azure AD / Entra ID
  • Configuration changes applied globally (like the March 2026 Azure OpenAI incident)

2. Azure Paired Regions

Azure pairs regions for geo-redundant replication (e.g., East US ↔ West US, North Europe ↔ West Europe). Critical services:

  • Azure SQL geo-replication uses paired regions by default
  • Geo-redundant storage (GRS) replicates to paired region
  • Azure Site Recovery uses paired regions for DR
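A failover runbook can encode these pairings as a simple lookup. The sketch below includes only the two pairs named above; the full pairing list lives in Azure's cross-region replication documentation:

```python
# Sample of Azure's documented region pairs (only the pairs named in this
# guide — consult Azure docs for the complete list).
REGION_PAIRS = {
    "eastus": "westus",
    "northeurope": "westeurope",
}
# Make the mapping symmetric so lookups work in either direction.
REGION_PAIRS.update({v: k for k, v in list(REGION_PAIRS.items())})

def failover_target(region):
    """Return the paired region to target for DR, or None if unknown."""
    return REGION_PAIRS.get(region.lower().replace(" ", ""))

print(failover_target("East US"))      # westus
print(failover_target("westeurope"))   # northeurope
```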

3. Retry logic for transient errors

import time
import random

def azure_retry(func, max_retries=5, base_delay=1.0):
    """
    Exponential backoff with jitter for Azure SDK calls.
    Handles 429 (throttle), 503 (unavailable), and transient 5xx errors.
    """
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as e:
            error_code = getattr(e, 'error_code', None) or str(e)

            # Don't retry on permanent errors
            if '404' in str(error_code) or '403' in str(error_code):
                raise

            if attempt == max_retries - 1:
                raise

            # Exponential backoff with jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)

            # Honor Retry-After header if present (Azure 429s include it)
            retry_after = getattr(e, 'retry_after', None)
            if retry_after:
                delay = max(delay, float(retry_after))

            time.sleep(delay)

    raise Exception("Max retries exceeded")
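To see the backoff behavior end to end, here is a self-contained demo of the same pattern against a simulated Azure call that throttles twice before succeeding. FakeThrottleError, the call counter, and the tiny delays are all illustrative:

```python
import random
import time

class FakeThrottleError(Exception):
    """Simulates an Azure SDK 429 carrying a Retry-After hint."""
    def __init__(self):
        super().__init__("429 Too Many Requests")
        self.retry_after = 0.01  # tiny value so the demo runs fast

calls = {"n": 0}

def flaky_cosmos_read():
    """Fails with a throttle error on the first two attempts."""
    calls["n"] += 1
    if calls["n"] <= 2:
        raise FakeThrottleError()
    return {"id": "item-1"}

# Same shape as azure_retry above, condensed for the demo.
def retry(func, max_retries=5, base_delay=0.01):
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as e:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.01)
            delay = max(delay, float(getattr(e, "retry_after", 0) or 0))
            time.sleep(delay)

result = retry(flaky_cosmos_read)
print(result, "after", calls["n"], "attempts")  # succeeds on attempt 3
```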

Azure CLI Quick Diagnostic Commands

# Check App Service health
az webapp show --name my-app --resource-group my-rg --query "state"

# Check AKS cluster state
az aks show --name my-cluster --resource-group my-rg --query "powerState"

# List recent AKS node pool events
az aks get-credentials --name my-cluster --resource-group my-rg
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -20

# Check Azure SQL database status
az sql db show --name my-db --server my-server --resource-group my-rg \
  --query "[status, currentServiceObjectiveName]"

# Check Azure Functions recent invocations
az functionapp list-functions --name my-func-app --resource-group my-rg

# View activity log (deployment errors)
# Note: -v-2H is BSD/macOS date syntax; on GNU/Linux use:
#   date -u -d '2 hours ago' +%Y-%m-%dT%H:%M:%SZ
az monitor activity-log list --resource-group my-rg \
  --start-time $(date -u -v-2H +%Y-%m-%dT%H:%M:%SZ) \
  --query "[?level=='Error'].{time:eventTimestamp, op:operationName.value, status:status.value}"
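The Activity Log command returns JSON, so a short script can surface just the failures for a dashboard or alert. The sample payload below is an illustrative stand-in for real `az monitor activity-log list -o json` output:

```python
import json

# Illustrative stand-in for: az monitor activity-log list ... -o json
sample = json.loads("""
[
  {"level": "Error", "eventTimestamp": "2026-03-09T14:02:11Z",
   "operationName": {"value": "Microsoft.Web/sites/write"},
   "status": {"value": "Failed"}},
  {"level": "Informational", "eventTimestamp": "2026-03-09T14:05:00Z",
   "operationName": {"value": "Microsoft.Web/sites/read"},
   "status": {"value": "Succeeded"}}
]
""")

# Keep only Error-level events, flattened to (timestamp, operation, status).
errors = [
    (e["eventTimestamp"], e["operationName"]["value"], e["status"]["value"])
    for e in sample
    if e["level"] == "Error"
]
for ts, op, status in errors:
    print(f"{ts}  {op}  {status}")
```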

External Monitoring for Azure

Azure Monitor's built-in availability tests run from within Azure infrastructure — they can miss failures that only appear when traffic routes through the global load balancer or comes from outside Azure's network. Use an external monitoring service like ezmon.com to check your Azure-hosted services from multiple independent geographic locations.

External monitoring catches:

  • Azure Front Door / Traffic Manager routing failures
  • DNS propagation issues from Azure DNS
  • SSL certificate problems (expired or mis-issued certs)
  • Geographic routing errors (Azure Traffic Manager weighted/priority policies)
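The value of multi-location checks is that a per-region failure pattern distinguishes a routing fault from a full outage. A minimal sketch of that classification logic (the function name, categories, and thresholds are illustrative, not from any monitoring product):

```python
def classify_check(status_code, latency_ms, regions_failing, regions_total,
                   slow_ms=2000):
    """Classify one external check result. Thresholds are illustrative."""
    if regions_failing == regions_total:
        return "down"        # unreachable from every location
    if regions_failing > 0:
        return "regional"    # e.g. a Traffic Manager routing fault
    if status_code >= 500:
        return "erroring"
    if latency_ms > slow_ms:
        return "degraded"
    return "up"

print(classify_check(200, 120, 0, 5))   # up
print(classify_check(200, 150, 2, 5))   # regional
```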

Bottom Line

When Azure shows symptoms:

  1. Check azure.status.microsoft AND Azure Portal → Service Health
  2. Service Health is more accurate — it filters to your specific subscriptions and regions
  3. Use az monitor activity-log list to see recent deployment/operation errors
  4. Check resource-specific metrics (DTU%, RU/s, throughput units) before assuming it's Azure
  5. If confirmed Azure incident: implement regional failover or wait for remediation

Monitor your Azure-hosted services from outside Azure at ezmon.com — multi-location uptime monitoring that catches what Azure Monitor misses.
