These incidents have not yet been reviewed. Use the filter to find yours.
| Incident / DRI / Status | Questions & Answers | Feedback & Work Items | Follow-up |
|---|---|---|---|
| INC 51000001009580 Sev2 Mitigated Title: Intermittent 500 responses from APIM DRI: Maxim A Created: 2026-05-06 Mitigated: 2026-05-06 18:03 UTC TTM: 9h 37m Customer: myAIS (~10M subscribers) Service: apim-adl-connectivity-hub-jpe-prd (Premium, Japan East) | 1. Customer Complaint: Customer reported intermittent HTTP 500 GatewayFailure errors with "Unable to connect to the remote server" and transportErrorCode 10048 on their Premium SKU APIM service in Japan East. The errors affected API traffic forwarded from APIM to backend services via the Internal Load Balancer. Management endpoint remained accessible. Customer platform (myAIS) serves ~10 million subscribers for mobile billing/payments — high impact. 2. What customer saw: Requests routed through APIM intermittently failed with HTTP 500. Diagnostic logs: lastError_reason: GatewayFailure, lastError_message: "Unable to connect to the remote server", transportErrorCode: 10048 (WSAEADDRINUSE). Some requests succeeded while others failed. Direct requests to the backend without APIM did not consistently reproduce the failure. 3. CSS telemetry: CSS identified via ProxyRequest that all HTTP 500s with backendResponseCode=0 originated from a single VM (gwhost_14058). SRE Agent confirmed 3,781 GatewayFailure errors on gwhost_14058 vs zero on all other 23 VMs during the 03:00–08:00 UTC window. Faulty VM had 7x higher backend latency (73.8ms vs ~10ms peers) and a 21.3% connection failure rate to the customer ILB at 172.16.9.5. SNAT port exhaustion ruled out. 4. Diagnosis & Fix: VM-level networking degradation on gwhost_14058 causing elevated outbound connection latency, leading to TCP TIME_WAIT accumulation and ephemeral port exhaustion (error 10048/WSAEADDRINUSE). Gradual onset over ~2 hours consistent with port pool saturation. Two contributing factors: (1) delayed scaling from 2 to 24 VMs simultaneously caused traffic processing degradation, (2) single VM behaved abnormally after rapid scale-out. 5. Monitoring Miss? Yes. No LSI filed before CRI. VM-level degradation did not trigger health monitoring because all NodeHeartbeat and Management checks passed on gwhost_14058. Single-VM ephemeral port exhaustion with elevated backend latency is not currently detected by existing monitors. 6. Repairs: No formal repair items documented. SRE Agent noted backend-returned 500s from myais-be.cloud.ais.th (backendResponseCode=500) are a separate customer application issue — recommended clarifying the two distinct error populations to the customer. | Feedback: (placeholder) Work Items: None | |
| INC 51000001019637 Sev2 Resolved Customer Error — Sev2→3 Title: Received invalid status code: 500 DRI: Ondrej Oprala (ondrejoprala) EIM: Maxim Kim SIM: Martin Dechev Created: 2026-05-13 Resolved: 2026-05-14 16:15 UTC Service: T7PAPIMUKS01 (PremiumV2, UK South, VNET-injected) Blast Radius: 3 PremiumV2 services in UK South Duration: Crash loop May 10 → full outage May 12 19:55 UTC | 1. Customer Complaint: APIM V2 service unreachable. Application Gateway health probe receiving HTTP 500 from APIM backend, resulting in complete service outage for T7PAPIMUKS01 (PremiumV2, UK South, VNET-injected). 2. What customer saw: Application Gateway reported "Received invalid status code: 500 in the backend server's HTTP response." All traffic failed — APIM gateway process in persistent crash loop, returning default App Service error page (500) to health probes. 3. CSS telemetry: ProxyInfra: 107 consecutive GatewayStartFailed events since May 2. Exception: 4. Diagnosis & Fix: Customer Error — Fortigate NVA blocking outbound TCP to Key Vault/Storage. 5. Monitoring Miss? Partial. Gateway crash loop ran for 3 days (May 10–12) before complete outage. No alert for “all new worker startups failing” pattern. However, root cause is customer-controlled network configuration — APIM cannot monitor customer NVA behavior. 6. Repairs: • SRE Agent improvement: new known pattern “SKUv2 Gateway Crash Loop — NVA/Firewall Blocking Dependencies” added (PR #2298) to prevent misattributing to SAS expiration | Feedback: Good collaborative investigation (Ondrej, Nima, Richard Cao, Jarod Aerts, Oliviu). SRE Agent initially misattributed to SAS expiration (MEDIUM confidence noted). Agent also exceeded posting budget (4+ posts vs 2). Triage learning PR #2298 filed to prevent recurrence of misdiagnosis. Work Items: Done PR 2298 - New known CRI pattern for NVA/firewall blocking | |
| INC 51000001023424 Sev2 Mitigated Platform Bug — SKUv2 Orchestration Title: APIM not refreshing certificate for Traffic Manager custom domains DRI: Javier Borrego (javierbo) SIM: Alexander Zaslonov EIM: Maxim Kim Created: 2026-05-15 Mitigated: 2026-05-15 23:04 UTC Duration: 2h 34m Service: apim-cpgapi-prod-eus & apim-cpgapi-prod-wus (BasicV2) Customer: Internal Microsoft (CE-EA-LM-CPG) Risk: Cert expires June 12 — TLS will break if not permanently fixed | 1. Customer Complaint: APIM service not refreshing the certificate for the Traffic Manager profile under custom domains. Certificate for 2. What customer saw: Certificate refresh failing silently. No TLS disruption yet (existing cert still valid), but most recent App Service cert expiring June 12 — imminent risk. Issue persisted since July 2024 with 10 accumulated expired certs. 3. CSS telemetry: SRE Agent identified root cause with HIGH confidence from Orchestration telemetry: 4. Diagnosis & Fix: Platform Bug — SKUv2 orchestration cannot handle Traffic Manager hostname bindings. 5. Monitoring Miss? Yes. Issue existed since July 2024 (~10 months) with no alert. 6. Repairs: • URGENT Bug: Fix | Feedback: Long-standing platform bug (~10 months undetected). SRE Agent investigation excellent (HIGH confidence, 7 evidence blocks, cross-service correlation). Engineering fix needed urgently — next cert expires June 12. Manual rebinding is a temporary workaround only. Work Items: To Do Bug - Fix TM hostname cert refresh in ApiServiceOrchestrationBase.cs; To Do Alert on repeated UpdateSkuV2ServiceFailedDueToInvalidInput | |
| INC 800967470 Sev2 Active Emerging Issue Title: AI Foundry API import via the Azure portal is broken DRI: Alexander Zaslonov (alzaslon) Created: 2026-05-19 Mitigated: 2026-05-21 08:22 UTC Impact: AI Foundry portal integration | 1. Customer Complaint: AI Foundry API import functionality via the Azure portal is broken. Users unable to import APIs through the portal flow. 2. What customer saw: Azure portal API import flow for AI Foundry failing. Emerging issue impacting portal-based API onboarding. 3. CSS telemetry: Filed as emerging issue. Portal flow broken for AI Foundry API import. 4. Diagnosis & Fix: Emerging Issue. AI Foundry portal integration broken. Mitigated 2026-05-21. 5. Monitoring Miss? TBD. 6. Repairs: TBD. | Feedback: (placeholder) Work Items: TBD | |
| INC 51000001033443 Sev2 Mitigated Title: Consumption SKU APIM, app service platform down DRI: Gleb Feoktistov (glfeokti) / srajagrawal Created: 2026-05-22 Mitigated: 2026-05-22 17:36 UTC TTM: 5h 16m Service: czm140-cur-prd-inc-apim (Consumption, East Asia) | 1. Customer Complaint: Customer reported their Consumption SKU APIM service (czm140-cur-prd-inc-apim) was completely down. All API requests returning HTTP 503 "service is unavailable." Live production outage causing business disruption with SLA impact. 2. What customer saw: All API calls to public endpoint returned HTTP 503 "The service is unavailable" instead of normal responses. Browsing the base URL returned 503 instead of expected 404. 3. CSS telemetry: CSS confirmed zero gateway request logs since ~23:20 UTC on May 21 via Kusto on wawseas ApiGatewayRequest. AppLens showed the underlying web app was down since 23:20 UTC May 21, with availability at 93.07%. 503 pattern pointed to App Service platform issue rather than APIM gateway code. 4. Diagnosis & Fix: SAS URI invalidation after global storage key rotation. App Service web app entered crash-loop starting ~23:20 UTC May 21. DynamicCache logs showed repeated HTTP 403 from blob storage when downloading gateway package ZIP via WEBSITE_USE_ZIP. Root cause: global storage secrets rotated May 21, invalidating SAS URI for this pinned Consumption service. Could not update via normal RP channels due to known regression ("Unable to update/remove Consumption pinned version"). Fix: Geneva Action to unpin version + manually updated SAS URI. 5. Monitoring Miss? Yes. No LSI filed. Storage rotation invalidated SAS URI causing crash-loop for 13+ hours before customer reported. No monitor detected the failure. 6. Repairs: • Geneva Action to unpin Consumption gateway version (only remaining pinned service) | Feedback: (placeholder) Work Items: Fix Consumption pinned version regression; storage key rotation propagation | |
| INC 21000001036015 Sev2 Mitigated Sev A — Live Site Title: APIM Gateway Down – ps-prod-be-euw-apim-manageprotect2 (West Europe) DRI: Gleb Feoktistov (glfeokti) Created: 2026-05-23 Mitigated: 2026-05-23 22:23 UTC TTM: 1h 0m Service: ps-prod-be-euw-apim-manageprotect2 (Premium, Internal VNet, West Europe) Impact: Total outage ~12:47–22:23 UTC (~9.5h customer impact) | 1. Customer Complaint: Total outage of APIM service (ps-prod-be-euw-apim-manageprotect2, Premium SKU, Internal VNet, West Europe). Resource Health event: "Your API Management service is down due to an unknown reason." APIM inbound endpoint failing, all production workloads fully impacted. Customer confirmed no changes on their side. 2. What customer saw: APIM gateway completely unavailable. Inbound API endpoint stopped responding. Resource Health event in Azure portal with message "service is down due to an unknown reason." Management endpoint also inaccessible. 3. CSS telemetry: Platform Availability 34.71% over 12h window. First drop to 0% at ~12:45 UTC. At 20:01:59 UTC: VMExtensionProvisioningError — "ApimBootstrapperService timed out starting after 6 retries." VMSS instances 16 and 7 failed with DSCConfiguration errors. Rolling upgrade: "100% of instances unhealthy after upgrade." SRE Agent confirmed traffic dropped to zero after 12:30 UTC; all instances in HostStartFailed loops with SSL cert store errors ( 4. Diagnosis & Fix: Hostname orchestration / cert manifest sync issue. VMSS rolling upgrade at 10:27 UTC produced replacement VMs with missing SSL cert store entries. Root cause chain: (1) hostname update orchestration failed after manifest upload at "generate settings" phase, (2) cert manifest out of sync with service config, (3) hostname-to-cert binding failing during bootstrap. Fix: glfeokti removed out-of-sync cert manifest from storage via Geneva Action. Known pattern match: IcM 51000001033991 (North Europe cert store failure after OS update). 5. Monitoring Miss? Yes. Customer impact began ~12:47 UTC but CRI not filed until 21:23 UTC — an 8.5-hour gap. No LSI or automated alert filed for Premium service at 0% availability for extended periods. Discovered only via customer support request. 6. Repairs: • Root cause: Resource Provider (hostname orchestration / cert manifest sync) | Feedback: (placeholder) Work Items: Systemic fix for cert manifest sync; monitoring for 0% availability | |
| INC 51000001037978 Sev2 Mitigated Sev 1 Escalation — Publix Title: API Management primary instance unresponsive at control plane, unable to scale alternative region DRI: Srajagrawal Created: 2026-05-26 Mitigated: 2026-05-26 21:00 UTC TTM: 6h 4m Customer: Publix Service: cutpapmgdgtlsvcs02 (Premium, Internal VNet, East US 2) | 1. Customer Complaint: Publix reported their Premium APIM instance (cutpapmgdgtlsvcs02) with Internal VNET mode was unresponsive at the control plane in primary region (East US 2). Scale operations to secondary region (Central US) also failing. Problem started ~2026-05-26 14:00 UTC. 2. What customer saw: Unable to access management endpoint. APIM scale operations failing with errors. Update ApiService orchestration returned AzureRestCloudException during VMSS operations in eastus2. Unable to scale out to alternative region to recover. 3. CSS telemetry: CSS (v-vyarlagadd) found all VMs in East US 2 unhealthy. Update ApiService orchestration failing with AzureRestCloudException during VMSS polling. Kusto: 63,171 DatabaseNotReachable events (Mapi table) + 48,762 (ApiSvcHost table) in 24h. Engineer srajagrawal confirmed spike in HTTP 500 codes and increased gateway latency because primary region down. 4. Diagnosis & Fix: Key Vault secrets (SQL connection strings) out of sync with service settings. Widespread DatabaseNotReachable errors across all VMs in East US 2. DRI (srajagrawal) attempted VM replacement of gwhost_8416 as initial step. Fix: glfeokti ran Upgrade operation to bring service settings and KV secrets (SQL connection strings) back in sync. 5. Monitoring Miss? Yes (partial). Multiple Sev4 LoadBalancer Probe Unhealthy LSIs filed starting 2026-05-25 19:37 UTC (IcMs 805245267, 805245272, 805245305, etc.) — ~19 hours before CRI. However, only per-VM Sev4 alerts. No service-level Sev2 alert for control plane unresponsive or widespread DatabaseNotReachable. Discovered only via customer report. 6. Repairs: No repair items documented in incident record. | Feedback: (placeholder) Work Items: Service-level alert for DatabaseNotReachable; escalation from Sev4 to Sev2 when multiple VMs unhealthy | |
| INC 51000001039329 Sev2 Mitigated Title: APIM Endpoint with File Transfer Fails when Cached after service upgrade 0.50.x → 0.51.x DRI: Gleb Feoktistov (glfeokti) / Tom Kerkhove (tomkerkhove) Created: 2026-05-27 Mitigated: 2026-05-27 02:13 UTC TTM: ~1h Related: INC 807137164 (Emerging Issue) | 1. Customer Complaint: File transfer endpoints failing after APIM service upgrade from 0.50.x to 0.51.x. Cached responses returning truncated/corrupted data for file downloads. 2. What customer saw: File transfers through APIM returning incomplete/corrupted data when served from cache. Issue appeared after gateway upgrade to 0.51. 3. CSS telemetry: Cache truncation at ~2 MiB boundary for responses exceeding cache size limit. Related to BufferingStreamBase.cs bug in 0.51 release. 4. Diagnosis & Fix: 0.51 cache truncation regression. Same root cause as emerging issue INC 807137164. Partial content cached instead of skipping cache entirely when response exceeds 2 MB limit. Mitigated by service quarantine/rollback. 5. Monitoring Miss? Silent truncation — no observable failure signal. 6. Repairs: See INC 807137164. | Feedback: (placeholder) Work Items: See INC 807137164 | |
| INC 807137164 Sev2 Active Emerging Issue — 0.51 Release Title: Gateway behavior change — backend response >2 MiB with 0.51 Release DRI: Zhongren (zhonren) / Macko Treder (mackotreder) Created: 2026-05-28 Service: apim-uks-prod-shr-1001 (S500) Impact Start: 2026-05-06 Related: INC 21000001017622 | 1. Customer Complaint: Multiple customers reporting cached responses returning truncated data (~2 MB instead of full response) after gateway upgrade to 0.51. S500 customer impact. File transfers and large API responses corrupted. 2. What customer saw: Responses that should be >2 MB returned as ~2 MB when served from built-in cache. Silent truncation — no error codes. Only affects built-in cache path; external Redis unaffected. 3. CSS telemetry: Regression tied to 0.51 release. ~2 MB truncation boundary. Impact start 2026-05-06. Release halted. Rollback initiated for affected services. 4. Diagnosis & Fix: Bug in 5. Monitoring Miss? Yes. Silent truncation means no observable failure signal. No validation comparing cached vs actual response size. 6. Repairs: • Fix BufferingStreamBase.ReadInternalAsync() to clear cacheStream on hitLimit | Feedback: (placeholder) Work Items: Fix pending — release halted | |
| INC 51000001041852 Sev2 Mitigated Title: Scale Out Errors — XP Inc. DRI: Zhongren (zhonren) / Macko Treder (mackotreder) Created: 2026-05-28 Mitigated: 2026-05-29 01:21 UTC TTM: 9h 47m Customer: XP Inc. (XP Investimentos, ACE) Service: xpi-prd-apim (Premium, Brazil South) | 1. Customer Complaint: XP Inc. (XP Investimentos, ACE-level customer) reported multiple scale-out errors on Premium APIM service xpi-prd-apim in Brazil South. Azure Monitor alert fired: 2. What customer saw: Repeated scale-out operation failures with 3. CSS telemetry: Service container showed 4. Diagnosis & Fix: VMExtensionProvisioningError — DSC (ApimBootstrapperService) timed out on new VMSS instances after 6 retries, blocking RP UpdateRegionSkuOrchestration. Caused VMSS/service-container desync: Azure Autoscale scaled VMSS to 30 VMs (15 units) directly, but service container stayed at AllocatedSkuUnitCount: 3. Each reconciliation failed → oscillation between Failed/Updating. Transferred to Platform for DSC investigation. Incident self-recovered without manual intervention. 5. Monitoring Miss? Yes. No LSI filed for scale-out failures or orchestration loops for xpi-prd-apim within 72h before CRI. Scale-out failure loop ran 5+ hours before customer reported. No platform-side monitoring detected sustained orchestration failure loop. 6. Repairs: • DSC failure investigation requested (zhonren asked cojih to investigate bootstrapper timeout) | Feedback: (placeholder) Work Items: DSC investigation; RCA for customer; recurring pattern fix | |
| INC 51000001044102 Sev2 Mitigated Title: Policy update fails with “Policy size exceeds allowed limit of -1 KB” DRI: Zhongren (zhonren) / Macko Treder (mackotreder) Created: 2026-05-29 Mitigated: 2026-05-29 23:12 UTC TTM: 5h 25m Service: apim-jet-stg (Standard v2, Sweden Central) Customer: JetBank Albania / Backbase BVA Blast Radius: Multiple services on scaleunits 003 & 004 | 1. Customer Complaint: JetBank Albania / Backbase BVA completely unable to update any APIM policies on Standard v2 instance (apim-jet-stg) in Sweden Central. All policy updates via portal, az CLI, and REST API failed with HTTP 400: "Policy size exceeds allowed limit of -1 KB." Blocking imminent go-live for new banking platform, 10+ stakeholders blocked, estimated $1M+ financial impact. 2. What customer saw: HTTP 400 Bad Request with 3. CSS telemetry: HTTP 400 responses in ManagementKpi and HttpIncomingRequests showing the validation error across SMAPI scale units api-sec-prod-scaleunit-003 and 004. Scale-unit-wide configuration issue affecting all 6 active SMAPI instances. Multiple services impacted beyond customer's (apim-jet-stg, apim-dtapim-prd-1wkuu-pv2, apim-apimanager-dt-sc-01, others). Failures first appeared 2026-05-28. Engineer (sasolank) traced regression to DeployApp orchestration on scaleunit-003 ~2026-05-28T12:45 UTC. 4. Diagnosis & Fix: Integer overflow in entity limit custom settings. 5. Monitoring Miss? Yes. No LSI or alert filed before CRI. No monitoring for invalid (negative) policy size limit configuration or resulting HTTP 400 spike on policy PUT operations. Discovered only via customer support request. 6. Repairs: • Root cause: SMAPI — integer overflow in entity limit custom settings | Feedback: (placeholder) Work Items: ACIS-only SMAPI updates; overflow validation guard | |
| INC 51000001033777 Sev2 Resolved MCSAP/ACE Customer Title: Unplanned Schedule upgrade — CDW DRI: Gleb Feoktistov (glfeokti) Created: 2026-05-22 Mitigated: 2026-05-23 02:02 UTC TTM: 4h 31m Customer: CDW (MCSAP/ACE) Service: CDW-USNCZ-NPD-APIM (Premium, North Central US) | 1. Customer Complaint: CDW (MCSAP/ACE account) reported unplanned scheduled upgrade on CDW-USNCZ-NPD-APIM (Premium, North Central US) causing API call failures impacting multiple teams. Customer highly sensitive due to recent bad Azure support experiences. 2. What customer saw: API calls failing across multiple teams. Traffic collapsed from ~483K req/24hr to 60 requests. Service went from 50% capacity to zero when final pre-upgrade instance (gwhost_7) was replaced ~May 22 17:00 UTC. 3. CSS telemetry: Platform upgrade 0.49→0.50 triggered VMSS rolling upgrade. gwhost_7 succeeded but second VM slot consistently failed: 4. Diagnosis & Fix: Null-key handling bug in Api.TryUpdateRouting() (v0.50.27283.0). Platform upgrade regression caused gateway config sync failures. CSS followed SRE Agent recommendations. Service rolled back to 0.50 version that restored functionality. 5. Monitoring Miss? Yes. First ConfigInitialSyncFailed at May 21 11:00 UTC — 30 hours before customer reported. Service degraded 2→1→0 instances with 7 consecutive VM failures. No alert fired. 6. Repairs: • Fix null-key regression in Api.TryUpdateRouting() (Gateway.Model/Api.cs:674) | Feedback: (placeholder) Work Items: Null-key fix; DSC investigation; config sync alerting | |
| INC 21000001035735 Sev2 Mitigated Title: Cannot change policy rate limit in APIM DRI: Gleb Feoktistov (glfeokti) Created: 2026-05-23 Mitigated: 2026-05-23 11:29 UTC TTM: 3h 43m Service: apim-HubCommon-az-asse-prd-001 (VNet-injected) | 1. Customer Complaint: Customer unable to change rate limit policy on product scope. Modifying 2. What customer saw: Portal showed no error on save but value not persisted — reverted to 60. Customer had owner access. Event logs showed HTTP 500 on save operations. 3. CSS telemetry: SRE Agent found SMAPI could not authenticate to SQL: 600,000+ 4. Diagnosis & Fix: Managed Identity token refresh stopped working. Database connection string missing from service container. MI token no longer being refreshed, preventing SMAPI SQL auth. Fix: Restore database connection string in service container. 5. Monitoring Miss? Yes. SQL auth failure (600K+ errors/hr) ongoing 24+ hours before customer reported. No alert or LSI filed. Discovered only via customer support case. 6. Repairs: No repair items documented. | Feedback: (placeholder) Work Items: MI token refresh monitoring; DatabaseNotReachable alerting | |
| INC 51000001038159 Sev2 Resolved Title: API Management service down — Network connectivity DRI: Gleb Feoktistov (glfeokti) Created: 2026-05-26 Mitigated: 2026-05-26 20:40 UTC TTM: 4h 10m Service: apiRyderDev (Premium, East US) Duration: ~3 days (May 23–26) | 1. Customer Complaint: Complete connectivity loss to Premium APIM service (apiRyderDev, East US) starting ~May 23 05:00 UTC. Service endpoints not enabled for recommended services. Management plane unavailable causing application outage. 2. What customer saw: Management plane completely lost. Portal reported service endpoints not enabled. API traffic dropped to zero. "Apply Network Configuration" resolved display error but did not restore connectivity. Both instances unreachable. 3. CSS telemetry: Bootstrapper stuck in restart loop. Service upgraded May 21 causing massive 4. Diagnosis & Fix: Failed platform upgrade + failed automated rollback. Upgrade on May 21 broke gateway (ConfigInitialSyncFailed/null key). Rollback to 0.49 also failed. Both VMs unhealthy for ~3 days. Fix: Manually provisioned new VMs on 0.50 from Azure Portal. 5. Monitoring Miss? Likely Yes. Potentially related LSI (IcM 802390488) filed May 21 but unclear if it covered apiRyderDev specifically. Service impacted ~3 days before CRI filed. No service-specific alert fired. 6. Repairs: No repair items documented. Customer requested RCA. Root cause: Gateway (Managed). | Feedback: (placeholder) Work Items: RCA requested; per-service monitoring for prolonged outage | |
| INC 21000001040966 Sev2 Resolved Title: PremiumV2 APIM Unavailable in UK South DRI: Srajagrawal Created: 2026-05-27 Mitigated: 2026-05-27 15:36 UTC TTM: 5h 12m Customer: S500-level Service: dcw-apim-prod-integration-uks-01 (PremiumV2, UK South) | 1. Customer Complaint: S500 customer tried to create PremiumV2 APIM service (dcw-apim-prod-integration-uks-01) in UK South via Terraform. Error: SKU not available in region. Customer aware of documented temporary limitation, asked when it would be lifted. Blocking their deployment. 2. What customer saw: Terraform request rejected: 3. CSS telemetry: SRE Agent confirmed PremiumV2 infra IS available in UK South (93 active services, 4 I2v2 resource pools with 820–858 available units). However, PremiumV2 activation telemetry showed 99.4% failure rate (102 successes vs 16,161 failures/90d) — pre-provisioning pipeline constrained since ~Apr 29. Customer's attempt: 0 rows in Orchestration table — blocked at 4. Diagnosis & Fix: Subscription-level beta feature flag blocking creation. 5. Monitoring Miss? Partial. Related Sev3 LSIs for ActivateSkuV2 unhealthy orchestrations filed prior (IcMs 786658173, 787272459). But those tracked pre-provisioning failures, not per-subscription blocks. Customer-facing creation block not monitored. 6. Repairs: No repair items documented. | Feedback: (placeholder) Work Items: Monitor per-subscription creation blocks; capacity planning for UK South PremiumV2 | |
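
Several rows above record a monitoring miss where one VM failed while fleet-level health checks stayed green (for example INC 51000001009580, where gwhost_14058 logged 3,781 GatewayFailure errors against zero on the other 23 VMs). The following is a minimal sketch of a per-VM outlier check, not an existing APIM monitor: the function name, the thresholds `min_errors` and `ratio`, and the peer VM names are all hypothetical, and the input is assumed to be a pre-aggregated error count per VM.

```python
from statistics import median


def find_outlier_vms(errors_by_vm, min_errors=100, ratio=10.0):
    """Return VMs whose error count is far above the fleet median.

    errors_by_vm: mapping of VM name -> error count for one time window.
    min_errors and ratio are illustrative thresholds, not tuned values.
    """
    baseline = median(errors_by_vm.values())
    outliers = []
    for vm, count in sorted(errors_by_vm.items()):
        # max(baseline, 1) keeps the threshold above zero when most VMs
        # report no errors at all, as in the incident above.
        if count >= min_errors and count > ratio * max(baseline, 1):
            outliers.append(vm)
    return outliers


# Counts taken from INC 51000001009580; peer VM names are placeholders.
errors = {f"peer_vm_{i}": 0 for i in range(23)}
errors["gwhost_14058"] = 3781
print(find_outlier_vms(errors))  # ['gwhost_14058']
```

Comparing each VM against the fleet median rather than a fixed service-wide error rate is what lets a single bad instance stand out even when the aggregate failure rate looks tolerable.
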
These incidents have already been reviewed. Kept for reference.
| Incident / DRI / Status | Questions & Answers | Feedback & Work Items | Follow-up |
|---|---|---|---|
| May 12 – 19 · 12 incidents · ✓ REVIEWED (May 20) | | | |
| April 22 – May 4 · 6 incidents · ✓ REVIEWED (May 17) | | | |
| April 8 – 14 · 10 Sev2 CRIs · 2 RAs · ✓ REVIEWED | | | |
| April 15 – 21 · 3 Sev2 CRIs · 2 RAs · ✓ REVIEWED | | | |
| May 5 – 11 · 1 Sev2 CRI · ✓ REVIEWED | | | |
| Reviewed in Meeting — May 9 · 5 incidents · ✓ REVIEWED | | | |
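
For reference on the 0.51 cache truncation rows (INC 51000001039329 and INC 807137164), the intended behavior described there is to skip caching entirely once a response exceeds the cache size limit, rather than store a partial body. The sketch below illustrates that rule only; it is not the gateway's `BufferingStreamBase` code, and the function name, the chunked-iterable input, and the 2 MiB constant are assumptions made for illustration.

```python
import io

CACHE_LIMIT_BYTES = 2 * 1024 * 1024  # assumed 2 MiB limit, per the incident notes


def relay_and_cache(chunks, limit=CACHE_LIMIT_BYTES):
    """Forward every chunk; return (full_body, cacheable_body_or_None).

    The cache copy is dropped as soon as the limit would be exceeded, so a
    truncated body can never be stored and served later.
    """
    forwarded = io.BytesIO()
    cache_buf = io.BytesIO()
    for chunk in chunks:
        forwarded.write(chunk)  # the client always receives the full response
        if cache_buf is not None:
            if cache_buf.tell() + len(chunk) > limit:
                cache_buf = None  # discard the partial copy, stop caching
            else:
                cache_buf.write(chunk)
    cached = cache_buf.getvalue() if cache_buf is not None else None
    return forwarded.getvalue(), cached


full, cached = relay_and_cache([b"x" * (1024 * 1024)] * 3)
print(len(full), cached is None)  # 3145728 True
```

A size check of this kind also gives a natural place to emit a metric when a response is skipped, which would address the "no observable failure signal" gap noted in those rows.
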