APIM CRI/RA - Pending Reviews

These incidents have not yet been reviewed. Use the filter to find yours.

Incident / DRI / Status	Questions & Answers	Feedback & Work Items
INC 51000001009580 Sev2 Mitigated Title: Intermittent 500 responses from APIM DRI: Maxim A Created: 2026-05-06 Mitigated: 2026-05-06 18:03 UTC TTM: 9h 37m Customer: myAIS (~10M subscribers) Service: apim-adl-connectivity-hub-jpe-prd (Premium, Japan East)	1. Customer Complaint Customer reported intermittent HTTP 500 GatewayFailure errors with "Unable to connect to the remote server" and transportErrorCode 10048 on their Premium SKU APIM service in Japan East. Impact on API traffic forwarding from APIM to backend services via Internal Load Balancer. Management endpoint remained accessible. Customer platform (myAIS) serves ~10 million subscribers for mobile billing/payments — high impact. 2. What customer saw Requests routed through APIM intermittently failed with HTTP 500. Diagnostic logs: lastError_reason: GatewayFailure, lastError_message: "Unable to connect to the remote server", transportErrorCode: 10048 (WSAEADDRINUSE). Some requests succeeded while others failed. Direct requests to backend without APIM did not consistently reproduce. 3. CSS telemetry CSS identified via ProxyRequest that all HTTP 500s with backendResponseCode=0 originated from a single VM (gwhost_14058). SRE Agent confirmed 3,781 GatewayFailure errors on gwhost_14058 vs zero on all other 23 VMs during 03:00–08:00 UTC window. Faulty VM had 7x higher backend latency (73.8ms vs ~10ms peers) and 21.3% connection failure rate to customer ILB at 172.16.9.5. SNAT port exhaustion ruled out. 4. Diagnosis & Fix VM-level networking degradation on gwhost_14058 causing elevated outbound connection latency, leading to TCP TIME_WAIT accumulation and ephemeral port exhaustion (error 10048/WSAEADDRINUSE). Gradual onset over ~2 hours consistent with port pool saturation. Two contributing factors: (1) delayed scaling from 2 to 24 VMs simultaneously caused traffic processing degradation, (2) single VM behaved abnormally after rapid scale-out. 
The VMSS model was in an inconsistent state from the failed May 4 rolling upgrade — so new VMs scaled out on May 5 ~23:00 inherited the stale/broken model. Fix: VM replaced via Geneva Action (ManageCompute orchestration apim-adl-connectivity-hub-jpe-prd_ManageCompute_d78f01e7). Last error at 09:37 UTC, no further 500s post-replacement. 5. Monitoring Miss? Yes. No LSI filed before CRI. VM-level degradation did not trigger health monitoring because all NodeHeartbeat and Management checks passed on gwhost_14058. Single-VM ephemeral port exhaustion with elevated backend latency is not currently detected by existing monitors. 6. Repairs No formal repair items documented. SRE Agent noted backend-returned 500s from myais-be.cloud.ais.th (backendResponseCode=500) are a separate customer application issue — recommended clarifying the two distinct error populations to the customer.	Feedback: (placeholder) Work Items: None
INC 51000001019637 Sev2 Resolved Customer Error — Sev2→3 Title: Received invalid status code: 500 DRI: Ondrej Oprala (ondrejoprala) EIM: Maxim Kim SIM: Martin Dechev Created: 2026-05-13 Resolved: 2026-05-14 16:15 UTC Service: T7PAPIMUKS01 (PremiumV2, UK South, VNET-injected) Blast Radius: 3 PremiumV2 services in UK South Duration: Crash loop May 10 → full outage May 12 19:55 UTC	1. Customer Complaint APIM V2 service unreachable. Application Gateway health probe receiving HTTP 500 from APIM backend, resulting in complete service outage for T7PAPIMUKS01 (PremiumV2, UK South, VNET-injected). 2. What customer saw Application Gateway reported "Received invalid status code: 500 in the backend server's HTTP response." All traffic failed — APIM gateway process in persistent crash loop, returning default App Service error page (500) to health probes. 3. CSS telemetry ProxyInfra: 107 consecutive GatewayStartFailed events since May 2. Exception: `Autofac.Core.DependencyResolutionException → QuotaComponent → System.ArgumentException: Connection string parsing error at Azure.Data.Tables.TableConnectionString.Parse()`. Webapp setting `policy.qouta.sync.table.connection` uses KeyVault reference that failed resolution ("Reference was not able to be resolved"). Traffic dropped to zero at May 12 19:55 UTC when last healthy App Service worker recycled. 3 PremiumV2 services in UK South affected simultaneously (T7PAPIMUKS01, apim-core-p-uks, rlg-sbx-apimgmt-uks-apim-v2). 4. Diagnosis & Fix Customer Error — Fortigate NVA blocking outbound TCP to Key Vault/Storage. Root cause: Customer added a default route (0.0.0.0/0 → Fortigate NVA at 192.168.4.6) to the APIM subnet route table on April 20th (change 181503, approved by Nitesh Kumar). The Fortigate silently dropped packets to Key Vault public endpoints — SYN packets retransmitted with no response. 
This prevented App Service from resolving the `@Microsoft.KeyVault(...)` reference, leaving the gateway with an unparseable connection string on every startup attempt. Richard Cao (networking) confirmed: curl from APIM node (10.58.8.250) to Key Vault timed out. Process-tuples showed traffic correctly delivered to Fortigate VM but silently dropped. Customer’s NSG rules were also initially missing required allowances. Mitigation: Customer added additional routes to bypass Fortigate for Key Vault/Storage traffic → curl to Key Vault succeeded → webapp restarted → gateway recovered. 5. Monitoring Miss? Partial. Gateway crash loop ran for 3 days (May 10–12) before complete outage. No alert for “all new worker startups failing” pattern. However, root cause is customer-controlled network configuration — APIM cannot monitor customer NVA behavior. 6. Repairs • SRE Agent improvement: new known pattern “SKUv2 Gateway Crash Loop — NVA/Firewall Blocking Dependencies” added (PR #2298) to prevent misattributing to SAS expiration • Agent fix: ForceRuntimeUpgrade does not work for PremiumV2 (resolves SKUv1 package path) • No APIM platform repair needed — customer network misconfiguration	Feedback: Good collaborative investigation (Ondrej, Nima, Richard Cao, Jarod Aerts, Oliviu). SRE Agent initially misattributed to SAS expiration (MEDIUM confidence noted). Agent also exceeded posting budget (4+ posts vs 2). Triage learning PR #2298 filed to prevent recurrence of misdiagnosis. Work Items: Done PR 2298 - New known CRI pattern for NVA/firewall blocking
INC 51000001023424 Sev2 Mitigated Platform Bug — SKUv2 Orchestration Title: APIM not refreshing certificate for Traffic Manager custom domains DRI: Javier Borrego (javierbo) SIM: Alexander Zaslonov EIM: Maxim Kim Created: 2026-05-15 Mitigated: 2026-05-15 23:04 UTC Duration: 2h 34m Service: apim-cpgapi-prod-eus & apim-cpgapi-prod-wus (BasicV2) Customer: Internal Microsoft (CE-EA-LM-CPG) Risk: Cert expires June 12 — TLS will break if not permanently fixed	1. Customer Complaint APIM service not refreshing the certificate for the Traffic Manager profile under custom domains. Certificate for `contactpermissionsgatewayapi.trafficmanager.net` not being auto-renewed. Affects both East US and West US prod services. 2. What customer saw Certificate refresh failing silently. No TLS disruption yet (existing cert still valid), but most recent App Service cert expiring June 12 — imminent risk. Issue persisted since July 2024 with 10 accumulated expired certs. 3. CSS telemetry SRE Agent identified root cause with HIGH confidence from Orchestration telemetry: • `UpdateSkuV2ServiceFailedDueToInvalidInput` recurring every ~6h (WUS) and ~20h (EUS) since May 6 • StatusMessage: "The traffic manager domain can be removed only through the Traffic Manager" • 5 retries with exponential backoff (10s→160s), all permanently failing • 37 failures on WUS + 12 on EUS in 10-day window • 10 stale certs accumulated on App Service dating back to July 2024 • Valid KeyVault cert (thumbprint C84F343B, expiry 2026-08-25) loaded by RP but cannot be bound 4. Diagnosis & Fix Platform Bug — SKUv2 orchestration cannot handle Traffic Manager hostname bindings. The `UpdateSkuV2Service` orchestration calls `DeleteHostNameBindingAsync()` (ApiServiceOrchestrationBase.cs L4465-4475) to remove the old binding before re-creating with the updated cert. App Service rejects the DELETE because `.trafficmanager.net` bindings can only be managed through the Traffic Manager resource, not the web app hostNameBindings API. 
The RP treats this permanent constraint as transient (retries 5x), fails, and commits state with `RecentlyAbortedUpdateHostname=true`. Mitigation: Javier allowed the new cert to sync into App Service and manually rebound the binding without deleting it, bypassing the failing delete/recreate flow. Validated production endpoint serving updated G2 certificate on bridge. 5. Monitoring Miss? Yes. Issue existed since July 2024 (~10 months) with no alert. `UpdateSkuV2ServiceFailedDueToInvalidInput` events generated continuously but not monitored/alerted. Only surfaced when customer noticed cert approaching expiry. 6. Repairs • URGENT Bug: Fix `ApiServiceOrchestrationBase.cs` L4465-4475 to skip `DeleteHostNameBindingAsync()` for `*.trafficmanager.net` hostnames and update cert in-place (PUT with new thumbprint) • Alert on repeated `UpdateSkuV2ServiceFailedDueToInvalidInput` events • Clean up 9 expired App Service certs on both web apps • Deadline: June 12, 2026. If not permanently fixed, TLS breaks again	Feedback: Long-standing platform bug (~10 months undetected). SRE Agent investigation excellent (HIGH confidence, 7 evidence blocks, cross-service correlation). Engineering fix needed urgently: next cert expires June 12. Manual rebinding is a temporary workaround only. Work Items: To Do Bug - Fix TM hostname cert refresh in ApiServiceOrchestrationBase.cs; To Do Alert on repeated UpdateSkuV2ServiceFailedDueToInvalidInput
INC 800967470 Sev2 Active Emerging Issue Title: AI Foundry API import via the Azure portal is broken DRI: Alexander Zaslonov (alzaslon) Created: 2026-05-19 Mitigated: 2026-05-21 08:22 UTC Impact: AI Foundry portal integration	1. Customer Complaint AI Foundry API import functionality via the Azure portal is broken. Users unable to import APIs through the portal flow. 2. What customer saw Azure portal API import flow for AI Foundry failing. Emerging issue impacting portal-based API onboarding. 3. CSS telemetry Filed as emerging issue. Portal flow broken for AI Foundry API import. 4. Diagnosis & Fix Emerging Issue. AI Foundry portal integration broken. Mitigated 2026-05-21. 5. Monitoring Miss? TBD. 6. Repairs TBD.	Feedback: (placeholder) Work Items: TBD
INC 51000001033443 Sev2 Mitigated Title: Consumption SKU APIM, app service platform down DRI: Gleb Feoktistov (glfeokti) / srajagrawal Created: 2026-05-22 Mitigated: 2026-05-22 17:36 UTC TTM: 5h 16m Service: czm140-cur-prd-inc-apim (Consumption, East Asia)	1. Customer Complaint Customer reported their Consumption SKU APIM service (czm140-cur-prd-inc-apim) was completely down. All API requests returning HTTP 503 "service is unavailable." Live production outage causing business disruption with SLA impact. 2. What customer saw All API calls to public endpoint returned HTTP 503 "The service is unavailable" instead of normal responses. Browsing the base URL returned 503 instead of expected 404. 3. CSS telemetry CSS confirmed zero gateway request logs since ~23:20 UTC on May 21 via Kusto on wawseas ApiGatewayRequest. AppLens showed the underlying web app was down since 23:20 UTC May 21, with availability at 93.07%. 503 pattern pointed to App Service platform issue rather than APIM gateway code. 4. Diagnosis & Fix SAS URI invalidation after global storage key rotation. App Service web app entered crash-loop starting ~23:20 UTC May 21. DynamicCache logs showed repeated HTTP 403 from blob storage when downloading gateway package ZIP via WEBSITE_USE_ZIP. Root cause: global storage secrets rotated May 21, invalidating SAS URI for this pinned Consumption service. Could not update via normal RP channels due to known regression ("Unable to update/remove Consumption pinned version"). Fix: Geneva Action to unpin version + manually updated SAS URI. 5. Monitoring Miss? Yes. No LSI filed. Storage rotation invalidated SAS URI causing crash-loop for 13+ hours before customer reported. No monitor detected the failure. 6. 
Repairs • Geneva Action to unpin Consumption gateway version (only remaining pinned service) • Known regression: "Unable to update/remove Consumption pinned version" must be resolved • Systemic fix needed: storage key rotation must propagate to all active service packages	Feedback: (placeholder) Work Items: Fix Consumption pinned version regression; storage key rotation propagation
INC 21000001036015 Sev2 Mitigated Sev A — Live Site Title: APIM Gateway Down – ps-prod-be-euw-apim-manageprotect2 (West Europe) DRI: Gleb Feoktistov (glfeokti) Created: 2026-05-23 Mitigated: 2026-05-23 22:23 UTC TTM: 1h 0m Service: ps-prod-be-euw-apim-manageprotect2 (Premium, Internal VNet, West Europe) Impact: Total outage ~12:47–22:23 UTC (~9.5h customer impact)	1. Customer Complaint Total outage of APIM service (ps-prod-be-euw-apim-manageprotect2, Premium SKU, Internal VNet, West Europe). Resource Health event: "Your API Management service is down due to an unknown reason." APIM inbound endpoint failing, all production workloads fully impacted. Customer confirmed no changes on their side. 2. What customer saw APIM gateway completely unavailable. Inbound API endpoint stopped responding. Resource Health event in Azure portal with message "service is down due to an unknown reason." Management endpoint also inaccessible. 3. CSS telemetry Platform Availability 34.71% over 12h window. First drop to 0% at ~12:45 UTC. At 20:01:59 UTC: VMExtensionProvisioningError — "ApimBootstrapperService timed out starting after 6 retries." VMSS instances 16 and 7 failed with DSCConfiguration errors. Rolling upgrade: "100% of instances unhealthy after upgrade." SRE Agent confirmed traffic dropped to zero after 12:30 UTC; all instances in HostStartFailed loops with SSL cert store errors (`netsh http show sslcert ExitCode=1`). 4. Diagnosis & Fix Hostname orchestration / cert manifest sync issue. VMSS rolling upgrade at 10:27 UTC produced replacement VMs with missing SSL cert store entries. Root cause chain: (1) hostname update orchestration failed after manifest upload at "generate settings" phase, (2) cert manifest out of sync with service config, (3) hostname-to-cert binding failing during bootstrap. Fix: glfeokti removed out-of-sync cert manifest from storage via Geneva Action. Known pattern match: IcM 51000001033991 (North Europe cert store failure after OS update). 5. 
Monitoring Miss? Yes. Customer impact began ~12:47 UTC but the CRI was not filed until 21:23 UTC, an 8.5-hour gap. No LSI or automated alert filed for Premium service at 0% availability for extended periods. Discovered only via customer support request. 6. Repairs • Root cause: Resource Provider (hostname orchestration / cert manifest sync) • Known recurring pattern (IcM 51000001033991); needs systemic fix • No explicit repair items created in incident record	Feedback: (placeholder) Work Items: Systemic fix for cert manifest sync; monitoring for 0% availability
INC 51000001037978 Sev2 Mitigated Sev 1 Escalation — Publix Title: API Management primary instance unresponsive at control plane, unable to scale alternative region DRI: Srajagrawal Created: 2026-05-26 Mitigated: 2026-05-26 21:00 UTC TTM: 6h 4m Customer: Publix Service: cutpapmgdgtlsvcs02 (Premium, Internal VNet, East US 2)	1. Customer Complaint Publix reported their Premium APIM instance (cutpapmgdgtlsvcs02) with Internal VNET mode was unresponsive at the control plane in primary region (East US 2). Scale operations to secondary region (Central US) also failing. Problem started ~2026-05-26 14:00 UTC. 2. What customer saw Unable to access management endpoint. APIM scale operations failing with errors. Update ApiService orchestration returned AzureRestCloudException during VMSS operations in eastus2. Unable to scale out to alternative region to recover. 3. CSS telemetry CSS (v-vyarlagadd) found all VMs in East US 2 unhealthy. Update ApiService orchestration failing with AzureRestCloudException during VMSS polling. Kusto: 63,171 DatabaseNotReachable events (Mapi table) + 48,762 (ApiSvcHost table) in 24h. Engineer srajagrawal confirmed spike in HTTP 500 codes and increased gateway latency because primary region down. 4. Diagnosis & Fix Key Vault secrets (SQL connection strings) out of sync with service settings. Widespread DatabaseNotReachable errors across all VMs in East US 2. DRI (srajagrawal) attempted VM replacement of gwhost_8416 as initial step. Fix: glfeokti ran Upgrade operation to bring service settings and KV secrets (SQL connection strings) back in sync. 5. Monitoring Miss? Yes (partial). Multiple Sev4 LoadBalancer Probe Unhealthy LSIs filed starting 2026-05-25 19:37 UTC (IcMs 805245267, 805245272, 805245305, etc.) — ~19 hours before CRI. However, only per-VM Sev4 alerts. No service-level Sev2 alert for control plane unresponsive or widespread DatabaseNotReachable. Discovered only via customer report. 6. 
Repairs No repair items documented in incident record.	Feedback: (placeholder) Work Items: Service-level alert for DatabaseNotReachable; escalation from Sev4 to Sev2 when multiple VMs unhealthy
INC 51000001039329 Sev2 Mitigated Title: APIM Endpoint with File Transfer Fails when Cached after service upgrade 0.50.x → 0.51.x DRI: Gleb Feoktistov (glfeokti) / Tom Kerkhove (tomkerkhove) Created: 2026-05-27 Mitigated: 2026-05-27 02:13 UTC TTM: ~1h Related: INC 807137164 (Emerging Issue)	1. Customer Complaint File transfer endpoints failing after APIM service upgrade from 0.50.x to 0.51.x. Cached responses returning truncated/corrupted data for file downloads. 2. What customer saw File transfers through APIM returning incomplete/corrupted data when served from cache. Issue appeared after gateway upgrade to 0.51. 3. CSS telemetry Cache truncation at ~2 MiB boundary for responses exceeding cache size limit. Related to BufferingStreamBase.cs bug in 0.51 release. 4. Diagnosis & Fix 0.51 cache truncation regression. Same root cause as emerging issue INC 807137164. Partial content cached instead of skipping cache entirely when response exceeds 2 MB limit. Mitigated by service quarantine/rollback. 5. Monitoring Miss? Silent truncation — no observable failure signal. 6. Repairs See INC 807137164.	Feedback: (placeholder) Work Items: See INC 807137164
INC 807137164 Sev2 Active Emerging Issue — 0.51 Release Title: Gateway behavior change — backend response >2 MiB with 0.51 Release DRI: Zhongren (zhonren) / Macko Treder (mackotreder) Created: 2026-05-28 Service: apim-uks-prod-shr-1001 (S500) Impact Start: 2026-05-06 Related: INC 21000001017622	1. Customer Complaint Multiple customers reporting cached responses returning truncated data (~2 MB instead of full response) after gateway upgrade to 0.51. S500 customer impact. File transfers and large API responses corrupted. 2. What customer saw Responses that should be >2 MB returned as ~2 MB when served from built-in cache. Silent truncation — no error codes. Only affects built-in cache path; external Redis unaffected. 3. CSS telemetry Regression tied to 0.51 release. ~2 MB truncation boundary. Impact start 2026-05-06. Release halted. Rollback initiated for affected services. 4. Diagnosis & Fix Bug in `BufferingStreamBase.ReadInternalAsync()`: When response exceeds 2 MB cache limit, `hitLimit = true` is set but `cacheStream` is NOT cleared — retains partial bytes. `OnCompleted()` then caches the partial content. Subsequent requests get truncated cached response. Why only 0.51: v0.49 consumed response in large enough reads that first read exceeded limit (cacheStream stayed empty). v0.51 changed HTTP client streaming to smaller chunks, accumulating partial data before limit triggers. Fix: Null/dispose `cacheStream` when `hitLimit` set. Release halted, rollback ACIS issued for SKUv1 and SKUv2. 5. Monitoring Miss? Yes. Silent truncation means no observable failure signal. No validation comparing cached vs actual response size. 6. Repairs • Fix BufferingStreamBase.ReadInternalAsync() to clear cacheStream on hitLimit • Add cache integrity validation • Release gate for cache-size boundary testing	Feedback: (placeholder) Work Items: Fix pending — release halted
INC 51000001041852 Sev2 Mitigated Title: Scale Out Errors — XP Inc. DRI: Zhongren (zhonren) / Macko Treder (mackotreder) Created: 2026-05-28 Mitigated: 2026-05-29 01:21 UTC TTM: 9h 47m Customer: XP Inc. (XP Investimentos, ACE) Service: xpi-prd-apim (Premium, Brazil South)	1. Customer Complaint XP Inc. (XP Investimentos, ACE-level customer) reported multiple scale-out errors on Premium APIM service xpi-prd-apim in Brazil South. Azure Monitor alert fired: `Microsoft.ApiManagement/service/write` failed with "Unable to Update API service with vnet injection at this time" (ResourceOperationFailure). Autoscale stuck in failure loop. 3rd recent similar event with high executive visibility. 2. What customer saw Repeated scale-out operation failures with `ResourceOperationFailure`: "Unable to Update API service with vnet injection at this time." Number of Machines metric showed 30 machines allocated, yet service continued attempting and failing to scale out. Azure Monitor alert fired at 2026-05-28T11:51:05Z. 3. CSS telemetry Service container showed `AllocatedSkuUnitCount: 3` (frozen) despite 30 healthy VMs running. Orchestration table: continuous loop of ScaleVmScaleSetInRegionFailed, UpdateOrchestrationFailedToChangeDeployedSku, UpdateRegionSkuOrchestrationFailed. SRE Agent confirmed 32 scale failures across 16 correlation IDs over ~5.3h (10:00–15:21 UTC). Capacity: 15 units / 30 machines allocated and healthy, but orchestration could not reconcile. Data plane unaffected (~10–20M req/hr normal). 4. Diagnosis & Fix VMExtensionProvisioningError — DSC (ApimBootstrapperService) timed out on new VMSS instances after 6 retries, blocking RP UpdateRegionSkuOrchestration. Caused VMSS/service-container desync: Azure Autoscale scaled VMSS to 30 VMs (15 units) directly, but service container stayed at AllocatedSkuUnitCount: 3. Each reconciliation failed → oscillation between Failed/Updating. Transferred to Platform for DSC investigation. 
Incident self-recovered without manual intervention. 5. Monitoring Miss? Yes. No LSI filed for scale-out failures or orchestration loops for xpi-prd-apim within 72h before CRI. Scale-out failure loop ran 5+ hours before customer reported. No platform-side monitoring detected sustained orchestration failure loop. 6. Repairs • DSC failure investigation requested (zhonren asked cojih to investigate bootstrapper timeout) • RCA requested by customer (3rd recurrence, high exec visibility) • Prior similar: IcM 51000000995053 • Root cause: Service/VM Issue	Feedback: (placeholder) Work Items: DSC investigation; RCA for customer; recurring pattern fix
INC 51000001044102 Sev2 Mitigated Title: Policy update fails with “Policy size exceeds allowed limit of -1 KB” DRI: Zhongren (zhonren) / Macko Treder (mackotreder) Created: 2026-05-29 Mitigated: 2026-05-29 23:12 UTC TTM: 5h 25m Service: apim-jet-stg (Standard v2, Sweden Central) Customer: JetBank Albania / Backbase BVA Blast Radius: Multiple services on scaleunits 003 & 004	1. Customer Complaint JetBank Albania / Backbase BVA completely unable to update any APIM policies on Standard v2 instance (apim-jet-stg) in Sweden Central. All policy updates via portal, az CLI, and REST API failed with HTTP 400: "Policy size exceeds allowed limit of -1 KB." Blocking imminent go-live for new banking platform, 10+ stakeholders blocked, estimated $1M+ financial impact. 2. What customer saw HTTP 400 Bad Request with `ValidationError`: "Policy size exceeds allowed limit of -1 KB" on all policy updates (product-level, API-level, global) — even minimal single-header changes. All update methods blocked (portal, CLI, REST API). Policy fragment updates continued working (HTTP 200 OK). 3. CSS telemetry HTTP 400 responses in ManagementKpi and HttpIncomingRequests showing the validation error across SMAPI scale units api-sec-prod-scaleunit-003 and 004. Scale-unit-wide configuration issue affecting all 6 active SMAPI instances. Multiple services impacted beyond customer's (apim-jet-stg, apim-dtapim-prd-1wkuu-pv2, apim-apimanager-dt-sc-01, others). Failures first appeared 2026-05-28. Engineer (sasolank) traced regression to DeployApp orchestration on scaleunit-003 ~2026-05-28T12:45 UTC. 4. Diagnosis & Fix Integer overflow in entity limit custom settings. `GetMaxPolicySize()` reads `LimitsMaxPolicySizeKb` and multiplies by 1024. A deployment set value to 2147483645, which × 1024 caused integer overflow wrapping policy size limit to -1 KB. 
Fix: Updated custom settings across all affected scale units (001–004) to set `Microsoft.WindowsAzure.ApiManagement.Mapi.Limits.Entities.Policies.SizeKb` to "2000000" (no overflow), then restarted webapps. 5. Monitoring Miss? Yes. No LSI or alert filed before CRI. No monitoring for invalid (negative) policy size limit configuration or resulting HTTP 400 spike on policy PUT operations. Discovered only via customer support request. 6. Repairs • Root cause: SMAPI — integer overflow in entity limit custom settings • Process improvement: "Use ACIS action to update SMAPI instead of portal" to avoid overflow-prone values • No formal repair items documented beyond immediate fix	Feedback: (placeholder) Work Items: ACIS-only SMAPI updates; overflow validation guard
INC 51000001033777 Sev2 Resolved MCSAP/ACE Customer Title: Unplanned Schedule upgrade — CDW DRI: Gleb Feoktistov (glfeokti) Created: 2026-05-22 Mitigated: 2026-05-23 02:02 UTC TTM: 4h 31m Customer: CDW (MCSAP/ACE) Service: CDW-USNCZ-NPD-APIM (Premium, North Central US)	1. Customer Complaint CDW (MCSAP/ACE account) reported unplanned scheduled upgrade on CDW-USNCZ-NPD-APIM (Premium, North Central US) causing API call failures impacting multiple teams. Customer highly sensitive due to recent bad Azure support experiences. 2. What customer saw API calls failing across multiple teams. Traffic collapsed from ~483K req/24hr to 60 requests. Service went from 50% capacity to zero when final pre-upgrade instance (gwhost_7) was replaced ~May 22 17:00 UTC. 3. CSS telemetry Platform upgrade 0.49→0.50 triggered VMSS rolling upgrade. gwhost_7 succeeded but second VM slot consistently failed: `VMExtensionProvisioningTimeout` on DSCConfiguration. 7 replacement VMs over 30h, each failing identically. Gateway Agent found `ConfigInitialSyncFailed` with `ArgumentNullException` in `Api.TryUpdateRouting()` — null routing key from API revision (function-app-mcp-server;rev=1). 16,000+ sync failures. 4. Diagnosis & Fix Null-key handling bug in Api.TryUpdateRouting() (v0.50.27283.0). Platform upgrade regression caused gateway config sync failures. CSS followed SRE Agent recommendations. Service rolled back to 0.50 version that restored functionality. 5. Monitoring Miss? Yes. First ConfigInitialSyncFailed at May 21 11:00 UTC — 30 hours before customer reported. Service degraded 2→1→0 instances with 7 consecutive VM failures. No alert fired. 6. 
Repairs • Fix null-key regression in Api.TryUpdateRouting() (Gateway.Model/Api.cs:674) • Fix function-app-mcp-server;rev=1 API null Method/route • Investigate VNET NSG/firewall blocking DSC extension • Add alerting for persistent ConfigInitialSyncFailed and single-instance degradation	Feedback: (placeholder) Work Items: Null-key fix; DSC investigation; config sync alerting
INC 21000001035735 Sev2 Mitigated Title: Cannot change policy rate limit in APIM DRI: Gleb Feoktistov (glfeokti) Created: 2026-05-23 Mitigated: 2026-05-23 11:29 UTC TTM: 3h 43m Service: apim-HubCommon-az-asse-prd-001 (VNet-injected)	1. Customer Complaint Customer unable to change rate limit policy on product scope. Modifying `rate-limit-by-key` from 60 to 80 calls appeared to succeed but silently reverted to 60 upon re-opening. 2. What customer saw Portal showed no error on save but value not persisted — reverted to 60. Customer had owner access. Event logs showed HTTP 500 on save operations. 3. CSS telemetry SRE Agent found SMAPI could not authenticate to SQL: 600,000+ `DatabaseNotReachable` events/hour with `SqlException: Login failed for user ''` (empty username, SQL Error 18456). Failure ongoing 24+ hours (since ~May 22 08:00 UTC). Gateway data plane unaffected — all 16 instances serving 2–3M req/hr normally. 4. Diagnosis & Fix Managed Identity token refresh stopped working. Database connection string missing from service container. MI token no longer being refreshed, preventing SMAPI SQL auth. Fix: Restore database connection string in service container. 5. Monitoring Miss? Yes. SQL auth failure (600K+ errors/hr) ongoing 24+ hours before customer reported. No alert or LSI filed. Discovered only via customer support case. 6. Repairs No repair items documented.	Feedback: (placeholder) Work Items: MI token refresh monitoring; DatabaseNotReachable alerting
INC 51000001038159 Sev2 Resolved Title: API Management service down — Network connectivity DRI: Gleb Feoktistov (glfeokti) Created: 2026-05-26 Mitigated: 2026-05-26 20:40 UTC TTM: 4h 10m Service: apiRyderDev (Premium, East US) Duration: ~3 days (May 23–26)	1. Customer Complaint Complete connectivity loss to Premium APIM service (apiRyderDev, East US) starting ~May 23 05:00 UTC. Service endpoints not enabled for recommended services. Management plane unavailable causing application outage. 2. What customer saw Management plane completely lost. Portal reported service endpoints not enabled. API traffic dropped to zero. "Apply Network Configuration" resolved display error but did not restore connectivity. Both instances unreachable. 3. CSS telemetry Bootstrapper stuck in restart loop. Service upgraded May 21 causing massive `ConfigInitialSyncFailed` (0→~1,992/day) with `ArgumentNullException: key at Dictionary.FindEntry → Api.TryUpdateRouting()`. Automated rollback to 0.49 also failed (UpgradeOrchestrationFailedToRollBack). Both VMs unhealthy, DSC extension timed out. 4. Diagnosis & Fix Failed platform upgrade + failed automated rollback. Upgrade on May 21 broke gateway (ConfigInitialSyncFailed/null key). Rollback to 0.49 also failed. Both VMs unhealthy for ~3 days. Fix: Manually provisioned new VMs on 0.50 from Azure Portal. 5. Monitoring Miss? Likely Yes. Potentially related LSI (IcM 802390488) filed May 21 but unclear if it covered apiRyderDev specifically. Service impacted ~3 days before CRI filed. No service-specific alert fired. 6. Repairs No repair items documented. Customer requested RCA. Root cause: Gateway (Managed).	Feedback: (placeholder) Work Items: RCA requested; per-service monitoring for prolonged outage
INC 21000001040966 Sev2 Resolved Title: PremiumV2 APIM Unavailable in UK South DRI: Srajagrawal Created: 2026-05-27 Mitigated: 2026-05-27 15:36 UTC TTM: 5h 12m Customer: S500-level Service: dcw-apim-prod-integration-uks-01 (PremiumV2, UK South)	1. Customer Complaint S500 customer tried to create PremiumV2 APIM service (dcw-apim-prod-integration-uks-01) in UK South via Terraform. Error: SKU not available in region. Customer aware of documented temporary limitation, asked when it would be lifted. Blocking their deployment. 2. What customer saw Terraform request rejected: `ApiServiceCreationDisabledForSubscription` — "Creation of new PremiumV2 API Management services in UK South is not available at the moment." Request never reached RP orchestration — blocked at ARM validation. 3. CSS telemetry SRE Agent confirmed PremiumV2 infra IS available in UK South (93 active services, 4 I2v2 resource pools with 820–858 available units). However, PremiumV2 activation telemetry showed 99.4% failure rate (102 successes vs 16,161 failures/90d) — pre-provisioning pipeline constrained since ~Apr 29. Customer's attempt: 0 rows in Orchestration table — blocked at `MaxApimServicesCountPerSkuPerSubscription` beta feature flag (PremiumV2 = 0). 4. Diagnosis & Fix Subscription-level beta feature flag blocking creation. `MaxApimServicesCountPerSkuPerSubscription` had PremiumV2 set to 0 for customer's subscription. Fix: srajagrawal whitelisted 3 customer subscriptions to allow 1 PremiumV2 each. Customer confirmed successful deploy. 5. Monitoring Miss? Partial. Related Sev3 LSIs for ActivateSkuV2 unhealthy orchestrations filed prior (IcMs 786658173, 787272459). But those tracked pre-provisioning failures, not per-subscription blocks. Customer-facing creation block not monitored. 6. Repairs No repair items documented.	Feedback: (placeholder) Work Items: Monitor per-subscription creation blocks; capacity planning for UK South PremiumV2
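
The cache truncation diagnosed in INC 807137164 above can be illustrated with a minimal Python sketch. This is not the gateway's C# code: `buffer_response`, `CACHE_LIMIT`, and the chunk sizes are hypothetical stand-ins that only mirror the behavior the record describes for `BufferingStreamBase.ReadInternalAsync()`, where partial bytes stay in `cacheStream` after `hitLimit` is set. Treating an empty buffer as "nothing to cache" is also an assumption.

```python
import io

CACHE_LIMIT = 2 * 1024 * 1024  # assumed 2 MiB cache limit, per the incident record


def buffer_response(chunks, clear_on_limit):
    """Stream chunks to the caller while trying to buffer them for the cache.

    Returns (bytes_delivered, bytes_cached), where bytes_cached is None
    when nothing is written to the cache.
    clear_on_limit=False models the behavior described for 0.51;
    clear_on_limit=True models the proposed fix.
    """
    cache_stream = io.BytesIO()
    hit_limit = False
    delivered = 0
    for chunk in chunks:
        delivered += len(chunk)  # the caller always receives the full body
        if hit_limit:
            continue  # caching already abandoned for this response
        if cache_stream.tell() + len(chunk) > CACHE_LIMIT:
            hit_limit = True
            if clear_on_limit:
                cache_stream = None  # proposed fix: drop the partial bytes
            continue
        cache_stream.write(chunk)
    # Completion step: cache whatever the buffer still holds.
    # Assumption: an empty buffer means there is nothing to cache.
    if cache_stream is None or cache_stream.tell() == 0:
        return delivered, None
    return delivered, cache_stream.tell()


# A 3 MiB response read in 512 KiB chunks (small reads, as described for 0.51).
small_reads = [b"x" * (512 * 1024)] * 6
print(buffer_response(small_reads, clear_on_limit=False))  # 2 MiB of a 3 MiB body gets cached
print(buffer_response(small_reads, clear_on_limit=True))   # nothing is cached
```

With one large 3 MiB read, the first chunk already exceeds the limit and the buffer stays empty, which matches the record's explanation of why earlier versions did not cache partial content.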

These incidents have already been reviewed. Kept for reference.

Incident / DRI / Status	Questions & Answers	Feedback & Work Items
May 12 – 19 · 12 incidents · ✓ REVIEWED (May 20)
INC 796496317 Sev2 Mitigated By Design Title: Capacity Exception Request UK South PremiumV2 (DVSA) DRI: Shubham (shubhash) Created: 2026-05-12 Mitigated: 2026-05-12 17:06 UTC TTM: 1h 55m	1. Customer Complaint Driver and Vehicle Standards Agency requested PremiumV2 whitelisting in UK South for 5 subscriptions (dev/sit/uat/preprod/prod). 2. What customer saw Unable to deploy PremiumV2 in UK South without capacity exception. 3. CSS telemetry Standard capacity exception request via CSS template. 4. Diagnosis & Fix By Design. Customer directed to use proper capacity request template. Mitigated by shubhash. 5. Monitoring Miss? No — process issue, not product issue. 6. Repairs N/A.	Feedback: (placeholder) Work Items: None
INC 51000000991822 Sev2 Mitigated Title: MCP tools schema changed DRI: Omar / Ajinkya / Nicholas Created: 2026-04-21 Mitigated: 2026-04-24	1. Customer Complaint MCP tools/list response schema changed, breaking customer integration. Nested objects in schema being flattened — e.g., expected `{"location": {"type": "string"}}` but received `{"location_type": "string"}`. 2. What customer saw Different payload structure post-upgrade. MCP tool definitions returned flattened properties instead of nested objects. 3. CSS telemetry Schema mismatch confirmed. CSS engineer (tehnoonr) independently identified as "a regression in MCP server implementation with the latest build — nested objects in the schema are being flattened." Upgrade telemetry confirmed version change on 2026-04-17 at 00:58:35 UTC. 4. Diagnosis & Fix Breaking change in v0.51.3757.0 — two bugs: (1) Intentional schema unwrapping in `OperationExtensions.GetInputSchema()` (2) Body reconstruction bug in `InvokeToolHandler.ExecuteMethodAsync()` where newPayload is overwritten per iteration. Rollback to v0.50.3674.0 + service quarantine. Ashendre mitigated Apr 24. 5. Monitoring Miss? Yes — no API contract tests to detect breaking schema changes in MCP tool definitions before release. 6. Repairs • Schema versioning for MCP tool definitions • Contract tests for MCP tools/list response structure to prevent future regressions	Feedback: (placeholder) Work Items: To Do Schema versioning for MCP tool definitions; To Do Contract tests for MCP tools/list
INC 51000001002629 Sev2 Mitigated Title: UNIFIED STRATEGIC \| Intermittent connection failure DRI: Ethan Created: 2026-04-30 Mitigated: 2026-04-30 02:43 UTC TTM: 1h 38m Customer: Strategic/Unified Service: enterprise-int-apim-prod (Premium, Central US)	1. Customer Complaint Customer reported intermittent connection failures affecting ~20% of requests from AKS cluster to APIM service "enterprise-int-apim-prod" (Premium SKUv1, Central US). Failing with ECONNREFUSED 10.12.2.228:443 when calling Salesforce API through APIM. Issue began after OS upgrade on 2026-04-29 ~12:00 UTC. 2. What customer saw GraphQL operation failures in AKS app (msol-content-bff): "connect ECONNREFUSED 10.12.2.228:443." IP is the APIM Internal Load Balancer VIP. Path: AKS (separate VNET/sub) → Hub VNET with Palo Alto firewall → APIM Internal VNET. TCP-level RST packets prevented requests from reaching gateway. 3. CSS telemetry CSS queried ProxyRequest for the failing URL and found only HTTP 200/204 — failing requests never reached APIM gateway. CRP telemetry confirmed VirtualMachineScaleSets.AutoOSUpgrade.POST upgrading to Windows Server 2022 image 20348.5020.260413. Network traces showed bidirectional RST packets — firewall saw resets from LB IP 10.12.2.228, APIM nodes saw resets from AKS pod IP 10.17.107.7. 4. Diagnosis & Fix Faulty VM (gwhost_02) after OS upgrade. SRE Agent confirmed all 4 VMs healthy post-upgrade with continuous heartbeats. DRI (ethanlao) broke down ClientConnectionFailure by RoleInstance revealing almost all failures from gwhost_02. ReplaceVM initiated via Geneva Action (enterprise-int-apim-prod_ManageCompute_ead74155). After replacement, ClientConnectionFailures ceased. Mitigated. 5. Monitoring Miss? Yes. No LSI filed before CRI. Per-VM ClientConnectionFailure pattern on gwhost_02 not detected by existing monitoring. DRI feedback: "SRE Agent did not check telemetry by role instance, which clearly shows faulty node." 
Per-VM error distribution monitoring could have caught this earlier. 6. Repairs Transferred to Platform for RCA. Root cause: "Service/VM Issue." DRI improvement: SRE Agent should check telemetry by RoleInstance to identify faulty nodes. No additional repair items documented.	Review Note: Requires review — faulty VM, moved to platform for RCA Work Items: To Do Per-VM error distribution monitoring
INC 51000001005278 Sev2 Mitigated Title: Requests from APPGW to APIM - 504 Timeout DRI: Ethan Created: 2026-05-01 Mitigated: 2026-05-02 01:08 UTC TTM: 1h 58m Service: apim-prod-eu-01 (Premium, Internal VNET, WEU)	1. Customer Complaint Customer reported intermittent 504 timeout errors from their Application Gateway (agw-apim-prod-eu-01-pri-int) when making requests to APIM service (apim-prod-eu-01) in West Europe. Issue began ~2026-04-28T23:18 UTC. Failing calls never reached APIM. 2. What customer saw Intermittent HTTP 504 (Gateway Timeout) from Application Gateway routing to Premium multi-region internal VNet APIM. Successful requests worked normally. Failed requests returned 504 with no entries in APIM GatewayLogs — requests did not reach APIM. Issue specific to West Europe region. 3. CSS telemetry CSS confirmed no ProxyRequest entries for failed requests (only 200/204 for successful ones). P97 latency under 400ms (no backend slowness). Error reasons >99% empty. However, identified increase in ClientConnectionFailure errors correlating with problem start and unhealthy patterns on APIM Internal Load Balancer (ILB) beginning when customer issue appeared. 4. Diagnosis & Fix Faulty VM (gwhost_33). Customer-reported issue start time (2026-04-28 23:15 UTC) correlated with increase in ConnectionIdle errors in HttpSys and BackendConnectionFailures specifically on gwhost_33. ReplaceVM operation initiated (apim-prod-eu-01_ManageCompute_63b54556). After VM replacement completed, incident mitigated. 5. Monitoring Miss? Yes. No LSI filed before CRI. gwhost_33 had ConnectionIdle and BackendConnectionFailure errors since 2026-04-28 but no automated alert fired. Issue discovered only via customer support case filed 2026-05-01 — approximately 3 days after issue began. 6. Repairs No repair items documented.	Review Note: Requires review — faulty VM, moved to platform for RCA Work Items: To Do Alert on per-VM ConnectionIdle/BackendConnectionFailure
INC 21000001017622 Sev2 Active Active → Sev3 Title: Inconsistent Responses Observed for Azure APIM Cache DRI: Nima Kamoosi (nimakamoosi) Created: 2026-05-11 Build: 0.51.27763.0 Scope: Built-in cache only (not external Redis)	1. Customer Complaint Customer reports cached responses returning truncated data (~2 MB instead of expected ~7 MB). Silent truncation with no error codes or indicators. Impact on data integrity — customers unknowingly serving incomplete payloads downstream. 2. What customer saw Responses that should be ~7 MB returned as ~2 MB when served from built-in cache. No error codes, no truncation headers — completely silent data loss. Only affects built-in cache; external Redis cache returns full responses correctly. 3. CSS telemetry Regression tied to build 0.51.27763.0. ~2 MB truncation boundary suggests hardcoded buffer size or misconfigured limit introduced in that build. External Redis unaffected confirms issue is in the built-in (in-memory) cache path, not the caching policy logic itself. 4. Diagnosis & Fix Cache truncation regression in build 0.51.27763.0. ~2 MB silent truncation boundary for built-in cache responses. Root cause likely in Proxy/Gateway.Policies cache provider or in-memory cache buffer sizing. Customer workaround: skip caching for responses >2 MB. Fix pending — offending commit not yet identified publicly. 5. Monitoring Miss? Yes. Silent truncation means no observable failure signal. No validation exists to compare cached response size vs original response size. No error/warning emitted when response is truncated. 6. Repairs • Fail-loud behavior: emit error/warning when cached response is truncated (never silently serve partial data) • Identify and fix the ~2 MB buffer limit introduced in 0.51.27763.0 • Add cache integrity validation (compare stored size vs expected Content-Length) • Proactive notification to customers who may be affected but haven’t reported	Feedback: Silent truncation is a serious data integrity issue. 
Consider Sev2 retention given customers cannot programmatically detect corruption. Need repair item for fail-loud on truncation. Work Items: Pending — awaiting fix identification
INC 797275109 Sev2 Resolved Sev 2→3: incomplete info Title: Bleu - API Management Portal - Acceptance Testing Failed DRI: Javier (javierbo) Created: 2026-05-13 Resolved: 2026-05-14	1. Customer Complaint Bleu cloud acceptance testing — APIM Standard instance portal returning HTTP 503. 2. What customer saw https://{instance}.portal.azure-api.sovcloud-api.fr returned "HTTP Error 503. The service is unavailable." 3. CSS telemetry Incomplete information provided. Could not reproduce. 4. Diagnosis & Fix Unable to Reproduce. Information provided was incomplete, preventing investigation. Mitigated by javierbo. 5. Monitoring Miss? N/A. 6. Repairs N/A.	Feedback: (placeholder) Work Items: None
INC 21000001021441 Sev2 Resolved Sev2→3: Customer capacity Title: CCF and High capacity \| NTT DRI: Javier Borrego (javierbo) Created: 2026-05-13 Resolved: 2026-05-14 16:46 UTC Duration: 23h 18m Service: samurai-mdr-prod-northeurope-ff7fqjl6 (Basic SKU, North Europe) Customer: NTT	1. Customer Complaint High capacity and increased Client Connection Failures (CCFs) and timeouts across all APIs, severely impacting production workloads on a Basic SKU APIM service in North Europe. 2. What customer saw Connection closed events, timeouts across all APIs. Azure Front Door in front of APIM showed timeouts on ping tests. Issue started ~10 AM Eastern May 13. Scaling out at 16:20 UTC did not immediately resolve CCFs. 3. CSS telemetry AppLens detected capacity above 75%. SRE Agent analysis: 8x traffic surge over baseline (58K → 490K req/hr), CCF rates 22–27% uniform across ALL VMs (systemic, not per-VM fault). “events” API responsible for >90% of errors. Backend p95 latency 90–140s causing gateway to hold connections and saturate CPU. BackendConnectionFailure secondary to gateway overload. No platform deployment or VNET issues found. 4. Diagnosis & Fix Customer capacity issue — Basic SKU saturated by 8x traffic surge. Root cause: Basic SKU (1 unit, 2 VMs) overwhelmed by sustained traffic escalation over 3 days, peaking at ~490K req/hr on May 13. Backend services responding at p95 latency of 90–140s, causing gateway VMs to hold connections until CPU exhaustion. Clients then closed connections (CCF). Secondary BCF errors to backend 20.105.12.67:443 as gateway lost ability to maintain outbound connections. Mitigation: Scale-out added VMs (gwhost_6 through gwhost_16), reducing 5xx from 10% to 0.23%. Customer upgraded backend service to higher SKU to resolve latency. Javier downgraded Sev2→Sev3 and resolved. 5. Monitoring Miss? No — AppLens correctly detected capacity >75%. This is a customer workload/SKU sizing issue, not a platform failure. 6. 
Repairs N/A — customer needs to upgrade from Basic SKU and optimize backend latency. Recommendations provided: evaluate Standard/Premium SKU, investigate “events” API traffic source, implement exponential backoff on client retries.	Feedback: Well-handled. SRE Agent triage rated "Effective" — correct root cause, good Kusto analysis, HIGH confidence hypothesis aligned with engineer conclusion. No improvements needed. Work Items: None — customer action required
INC 21000001014164 Sev2 Active Active → Sev3 Title: Tool not visible when converting existing API \| MCP conversion DRI: Nicholas (nbarreca) Created: 2026-05-08 Related: ICM 21000000966326	1. Customer Complaint Converting OAS API (~150 resources) with deep nested allOf/anyOf/oneOf/$ref into MCP server — only 3–4 tools visible instead of full set. 100% failure rate on MCP tools/list endpoint. Customer platform blocked. 2. What customer saw "No tools available" or MCP error -32001 (Request timed out). ClientConnectionFailure: connection unexpectedly closed. Conversion completed without error but silently dropped ~146 operations. 3. CSS telemetry NullReferenceException at OperationExtensions.GetObjectDefinition():198 — definition.Properties is null when OpenAPI uses allOf/$ref composition. Stack trace confirms pre-fix binary deployed (method signature lacks HashSet<string> visited parameter). 46 errors on this service, 1,333+ errors across 18 services globally in 7 days. 4. Diagnosis & Fix Known product defect — fix merged but NOT deployed. NRE in MCP gateway schema parser when allOf/anyOf/oneOf not resolved. Fix (Task 37471067 / ResolveCompositeDefinition()) authored by ondrejoprala Apr 13, merged to main ~late April. SKUv2 rollback to v0.50 means fix was lost. Workaround: flatten OpenAPI spec (replace allOf/anyOf with inline properties). 5. Monitoring Miss? No monitoring miss per se — functional limitation. But no validation exists to verify all operations successfully converted, and no deployment tracking caught the fix regression. 6. Repairs Task 37471067 (merged, awaiting deployment). Recommended: (1) Deploy fix urgently, (2) Add conversion validation comparing source op count vs tool count, (3) Surface warnings when operations silently dropped.	Review Note: Scope: Not isolated — 18 services globally, 1,333+ errors in 7 days. 100% failure rate on MCP operations for affected specs. 
Timeline gap: Issue started Mar 13, original ICM 21000000966326 filed Mar 30, fix PR authored Apr 13, merged ~late April, but SKUv2 rollback to v0.50 means fix was lost. This CRI filed May 8 as regression. DRI: Nicholas (nbarreca) Work Items: Active Task 37471067 - Fix MCP tools/list allOf/anyOf/oneOf (merged, not deployed)
INC 21000001014384 Sev2 Active Active → Sev3 Title: APIM failed to upgrade and left in unhealthy state DRI: Nima Kamoosi (nimakamoosi) Created: 2026-05-08 Service: apiportaltst (WEU, Premium, External VNet) Customer: NS.NL (Netherlands Railways) Duration: 9+ days stuck	1. Customer Complaint APIM service failed during platform upgrade, left in unhealthy state for 9+ days. Hundreds of internal users blocked from dev/testing. Premium service, management endpoint unreachable. Customer escalated multiple times. 2. What customer saw Service stuck in "Updating" state. Management endpoint unreachable. "Apply Network Settings" also fails. Portal shows unhealthy. No admin operations possible. 3. CSS telemetry Platform-initiated rollback (0.51.27763.0 → 0.49.25546.0) triggered by Sev1 INC 788655236 (DevPortal). Bootstrapper on target version cannot find machine certificate 916F5EED... — 8 failure cycles (30-min timeout each) over 5 hours. VMSS rolling upgrade failed: 100% unhealthy instances. Orchestration locked permanently. 4. Diagnosis & Fix Failed platform rollback — certificate incompatibility between versions. Target version 0.49.x requires cert not available on reimaged VMs (provisioned by 0.51.x only). Nima did ForceRuntimeUpgrade to 0.51.28257.0 (retake) on May 12 — partially recovered (1 instance up). Also found 3 tooling bugs: ACIS JSON deserialization error in State=4, NotEligible without reason, wrong version when Release Channel ≠ All. 5. Monitoring Miss? Yes. (1) Stuck orchestration not detected — no alert for services in failed upgrade state. (2) No auto-recovery mechanism. (3) Collateral damage from Sev1 rollback batch not validated per-service before execution. 6. Repairs No repair items linked. Recommended: (1) Alert on services stuck in Upgrading >2h, (2) Auto-recovery for failed rollbacks, (3) Validate cert compatibility before cross-version rollback, (4) Fix 3 infra/tooling issues found by Nima.	
Review Note: Collateral damage from Sev1 (788655236) rollback. Service stuck 9+ days. Preview release channel auto-included in rollback batch without per-service validation. Nima found 3 tooling issues during mitigation: ACIS JSON deserialization bug, NotEligible without explanation, wrong version without Release Channel=All. Key question: How do we prevent rollback batches from breaking services with cert incompatibility? DRI: Nima Kamoosi Work Items: None — repair items recommended
RFA – INC 793018989 Sev2 RA Title: Power BI "Publish to Web" URLs fail to open Assisted by: Shubham (v-snashikkar)	Not an APIM issue. Per bridge call, platform was healthy with no regression in APIM. Loading screen issue on Power BI side. No follow-up actions needed.	Feedback: (placeholder) None — not APIM issue
RFA – INC 795262147 Sev2 RA Title: ML Workload Reliability dip <95% for GetOpenAICompletionsResponseAsync Assisted by: Nima (deanward bridge)	Action Taken: Requested disabling gateway.policy.cache.enable-background-refresh on all Cognitive Service APIM instances. Background cache refresh contributing to latency/reliability dips for OpenAI completion endpoints. Follow-up: Verify disabling resolves dip. Assess if cache policy needs broader config change.	Feedback: (placeholder) Follow-up: verify cache fix
RFA – INC 796152710 Sev2 RA Mitigated Title: GatewayOverhead latency — Sweden Central (AOAI Hub) Assisted by: Martin Dechev, Shubham (tomkerkhove) Root Cause: OS update machine restarts	1. Customer Complaint GatewayOverhead latency monitor triggered for Sweden Central AOAI Hub. P99 latency exceeded thresholds. 4. Diagnosis & Fix OS updates triggered machine restarts during business hours. Reduced capacity caused request queuing. Auto-mitigated after restarts completed (~1 hour). 5. Monitoring Miss? Monitoring detected correctly. Gap: OS updates scheduled during business hours without service windows. 6. Repairs PBI #37928771 - Platform should not allow OS updates during business hours (New, Tom Kerkhove). PBI #37928586 - Configure service windows on AOAI Hub (New, Tom Kerkhove).	Feedback: (placeholder) Work Items: New PBI 37928771 - Block OS updates during business hours; New PBI 37928586 - Service windows for AOAI Hub
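Two entries in this group (gwhost_02 on enterprise-int-apim-prod and gwhost_33 on apim-prod-eu-01) came down to a single VM producing nearly all failures, and the DRI feedback asks for telemetry to be broken down by RoleInstance. A rough sketch of that breakdown in Python follows. The sample counts, the 5x-median rule, and the minimum-error floor are made-up illustration values, not figures from these incidents or from an existing alert:

```python
from statistics import median


def find_outlier_instances(
    errors_by_instance: dict[str, int],
    ratio: float = 5.0,
    min_errors: int = 50,
) -> list[str]:
    """Return role instances whose error count dwarfs the fleet median.

    An instance is flagged when its count is at least `min_errors` and
    greater than `ratio` times the median count across all instances.
    """
    if not errors_by_instance:
        return []
    fleet_median = median(errors_by_instance.values())
    return sorted(
        name
        for name, count in errors_by_instance.items()
        if count >= min_errors and count > ratio * fleet_median
    )


# Hypothetical ClientConnectionFailure counts grouped by RoleInstance.
sample = {"gwhost_00": 3, "gwhost_01": 5, "gwhost_02": 412, "gwhost_03": 4}
print(find_outlier_instances(sample))
```

Comparing against the median rather than a fixed count keeps the rule quiet when failures are uniform across all VMs, as in the NTT capacity case, where the problem was systemic rather than per-VM.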
April 22 – May 4 · 6 incidents · ✓ REVIEWED (May 17)
INC 51000000976163 Sev2 Mitigated Title: TCP connection failed between on-prem to APIM service DRI: Maxim Agapov / Tuan Nguyen Customer: DTE Electric Company (ACE)	1. Customer Complaint DTE Electric Company reported timeout/connection failures from on-prem to APIM (Premium, Internal VNet, Central US). 50% of API calls failing, causing payment failures for hundreds of customers per minute. 2. What customer saw Intermittent TLS connection failures — TCP handshake succeeded but TLS handshake never completed. gwhost_1 did not respond to TLS ClientHello. 3. CSS telemetry PCAPs: GW Host 1 resetting TCP sessions. DNS/CRL/AIA endpoints unreachable on gwhost_1 only. Infrastructure-layer per-VM networking degradation. 4. Diagnosis & Fix Per-VM networking degradation on gwhost_1. Couldn't reach cert validation servers. Auto-healing replaced VM at 03:48 UTC Apr 9. Impact: ~16.5h. 5. Monitoring Miss? Yes — health checks don't probe per-VM TLS handshake. Auto-healing took ~16h. 6. Repairs Enhanced LB health probes with TLS validation. Faster auto-healing. CRL soft-fail resilience. Notes Central US datacenter infrastructure stress. Related IcMs for same region connectivity issues: 775500895, 775342041, 776075200, 776236264	Feedback: (placeholder) Work Items: New Bug 37466473 - Self-service gateway host mitigation; New Bug 37466169 - Telemetry to detect gateway host failures
INC 21000000991208 Sev2 Active Sev 2->3 Title: Self-Hosted Gateway hangs ~10 min DRI: Mahsa Sadi Created: 2026-04-20	1. Customer Complaint API calls via SHG hang until client timeout (~10 min). 2. What customer saw Requests hang indefinitely. Path: Client-AWS ELB-SHGW-Netskope-Backend. 3. CSS telemetry GatewayV2 timing out on large headers. Backend CSP header 7.7KB exceeds HTTP/2 HPACK MAX_HEADER_LIST_SIZE default (8192). 4. Diagnosis & Fix Code Bug - HTTP/2 HPACK limit 8192 in TcpChannelInitializer vs 65536 for HTTP/1.1. Fix ETA: 2 months. 5. Monitoring Miss? No - product bug. 6. Repairs Override Http2Settings to 65536. Expose net.client.http2.max-header-list-size.	Feedback: (placeholder) Work Items: Fix HPACK limit
INC 51000000996010 Sev2 Active Sev 2->3: customer issue Title: HM Electronics SSL CERT_VERIFY_FAILED DRI: Macko Treder Created: 2026-04-23	1. Customer Complaint HM Electronics (Sev A, Premium) - intermittent SSL cert verification failures. 2. What customer saw Intermittent SSL CERT_VERIFY_FAILED on backend calls. 3. CSS telemetry Classified as customer issue. 4. Diagnosis & Fix Customer issue. Downgraded to Sev3. 5. Monitoring Miss? N/A. 6. Repairs N/A.	Feedback: (placeholder) Work Items: None
INC 51000000996953 Sev2 Active Sev 2->3: reporting issue Title: APIM does not scale out DRI: Martin Dechev Created: 2026-04-24	1. Customer Complaint Customer reported unable to scale out. 2. What customer saw Scale-out appeared broken. 3. CSS telemetry Customer actually has 60 instances. Problem is orchestration logs/container size reporting. 4. Diagnosis & Fix Not a scaling issue - reporting/logging problem. Martin investigating. 5. Monitoring Miss? No (reporting issue). 6. Repairs Fix orchestration log reporting.	Feedback: (placeholder) Work Items: None
INC 21000000998761 Sev2 Resolved Title: APIM service is down DRI: Macko Treder / Gleb Feoktistov Created: 2026-04-25	1. Customer Complaint Customer reported APIM service completely down. 2. What customer saw Service unreachable, all traffic returning 500s. 3. CSS telemetry Auto OS rolling upgrade triggered destructive VMSS model update from scale-out. All VMs lost Redis. Health monitor deadlocked. 4. Diagnosis & Fix Scale-out triggered destructive VMSS upgrade. ~7h outage. Macko mitigated with Reboot Apr 25. 5. Monitoring Miss? Yes - cascading failure not caught until customer reported. 6. Repairs Rolling upgrade guardrails. Redis resilience. Reboot as preferred first mitigation.	Feedback: (placeholder) Work Items: None
INC 51000001003804 Sev2 Mitigated Title: APIM Scale out failure DRI: Maxim A / Ethan Lao Created: 2026-04-30 Service: apim-ads-cus-entbusops-prd-001 (Premium, Central US) Impact: ~5,000 employees affected Activations: 4 distinct activations	Activation 1 (Apr 30 16:39 – 18:40 UTC) 1. Customer Complaint Scale-out operation for Premium APIM service in Central US failing consistently. Previously working fine. ~5,000 customer employees affected by inability to scale. Service running at capacity above 75%. 2. What customer saw Service failed to scale. Service running at capacity above 75%. Spike of 5xx. 3. CSS telemetry Failed scale orchestration. 4. Diagnosis & Fix Scale operation failed. Error message: VM 'gwhost_1431' has not reported status for VM agent or extensions. No log in ApiSvcHost. Replaced node 1431. Scale completed. Activation 2 (Apr 30 20:21 – 22:50 UTC) 1. Customer Complaint Service functional issues; new spike of 5xx. 2. What customer saw New spike of 5xx. 4. Diagnosis & Fix Node constantly reported "Timer_ConnectionIdle". Replaced node 1450. Scale completed. gwhost_1453 had more client connection failures than other nodes and was also replaced. Activation 3 (May 1 07:55 – 10:51 UTC) 1. Customer Complaint Service functional issues; new spike of 400/429. 2. What customer saw New spike of 400/429, failed requests. 4. Diagnosis & Fix Traffic burst to "authenticationhelperapi" combined with a rate-limiting policy blocked ~25% of the traffic, returning 429 and 400. Increased the limit to unblock, then restored it. Scaled manually. Activation 4 (May 1 12:09 – 13:54 UTC) 1. Customer Complaint Service functional issues; new spike of 5xx. 2. What customer saw New spike of 5xx. 4. Diagnosis & Fix All nodes returned 5xx. Not an APIM issue.	Feedback: (placeholder) Work Items: None
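For the Self-Hosted Gateway hang in this group (INC 21000000991208), it may help to see how a header list is measured against MAX_HEADER_LIST_SIZE. RFC 9113 defines the size as the sum, over all fields, of the uncompressed name length plus value length plus 32 octets of overhead per field. The sketch below applies that rule to a made-up response whose CSP value is about 7.7 KB. The other headers are hypothetical, and the gateway's actual accounting is assumed to follow the RFC definition:

```python
HTTP2_FIELD_OVERHEAD = 32  # per-field overhead defined by RFC 9113, section 6.5.2


def header_list_size(headers: list[tuple[str, str]]) -> int:
    """Uncompressed header list size as defined for SETTINGS_MAX_HEADER_LIST_SIZE."""
    return sum(
        len(name.encode()) + len(value.encode()) + HTTP2_FIELD_OVERHEAD
        for name, value in headers
    )


def exceeds_limit(headers: list[tuple[str, str]], limit: int = 8192) -> bool:
    """True when the header list is larger than the advertised limit."""
    return header_list_size(headers) > limit


# Hypothetical backend response: a ~7.7 KB CSP value plus a few ordinary headers.
response_headers = [
    ("content-security-policy", "x" * 7885),
    ("content-type", "application/json"),
    ("date", "Mon, 20 Apr 2026 10:00:00 GMT"),
    ("set-cookie", "s" * 120),
]

print(header_list_size(response_headers), exceeds_limit(response_headers))
print(exceeds_limit(response_headers, limit=65536))
```

Under this accounting the CSP field alone stays below 8192, but the full list does not, and the same list fits once the limit is raised to 65536 as the repair item proposes.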
April 8 – 14 · 10 Sev2 CRIs · 2 RAs · ✓ REVIEWED
INC 51000000976518 Sev2 Mitigated Title: 408 request time out errors DRI: Macko	1. Customer Complaint Intermittent 408 request timeout errors on Premium SKU v1 service. 2. What customer saw Clients received 408 timeouts. Requests never appeared in ProxyRequests logs. 3. CSS telemetry PCAPs: client reusing source ports too quickly. SYN packets with reused ephemeral ports in TIME_WAIT state. 4. Diagnosis & Fix Customer Error. Client reusing source ports aggressively with improper connection pooling. Customer advised to fix. Macko mitigated Apr 9. 5. Monitoring Miss? No — client-side issue. 6. Repairs N/A.	Feedback: (placeholder) Work Items: None
INC 51000000976727 Sev2 Mitigated Title: Issue in connecting to APIM workspace gateway DRI: Rafal Customer: Healthcare (S500)	1. Customer Complaint Healthcare S500 — workspace gateway cert expired, 5AM data sync failed. Mission-critical outage. 2. What customer saw SSL/TLS trust relationship failure on workspace gateway endpoint. Another workspace on same service working fine. 3. CSS telemetry WorkspaceGatewayWebsiteSslCertificateItemDetails showed expired cert. No ProxyRequest logs (requests never reached gateway). 4. Diagnosis & Fix Gateway (Workspace) — managed SSL cert expired, not auto-renewed. Tuan renewed cert + shallow update to rotate at-risk certs. Mitigated Apr 9. 5. Monitoring Miss? Yes — no alert for workspace gateway cert expiration. 6. Repairs Cert expiration alerting for workspace gateways.	Feedback: (placeholder) Work Items: None
INC 51000000976845 Sev2 Mitigated Title: CX DTE Electric Company - host dropping traffic DRI: Maxim A Customer: DTE Electric Company	1. Customer Complaint Second DTE service (dte-cu-prod-azure-apps-apim-prod) also dropping traffic from degraded gwhost_4. 2. What customer saw Same pattern as 976163 — traffic drops after TCP handshake, before TLS. 3. CSS telemetry Same infrastructure-level VM networking degradation. Both DTE services affected by same regional issue. 4. Diagnosis & Fix VM replacement. Tuan replaced degraded VM Apr 10. Customer RCA delivered. 5. Monitoring Miss? Yes — same gap as 976163. 6. Repairs See INC 976163.	Feedback: (placeholder) Work Items: See INC 976163
INC 776736824 Sev2 Mitigated ACE Declared Outage Title: <ACE Declared Outage> DTE Energy host dropping traffic DRI: Maxim A	1. Customer Complaint ACE Declared Outage wrapper for DTE Energy issue (SR 2604090040004283). 2-4. Same as INC 976845. Mitigated by Tuan Apr 10 at 16:20 UTC. 5. Monitoring Miss? Yes — same. 6. Repairs See INC 976163.	Feedback: (placeholder) Work Items: See INC 976163
INC 777722043 Sev2 Mitigated Outage Declared Title: Huge number of 500s (ExpressionValueValidationFailure on cache-value) DRI: Macko Customer: AOAI / Cognitive Services	1. Customer Complaint AOAI team — HTTP 500 from ExpressionValueValidationFailure on cache-value policy. Impacted cognitivewcusprod (3,590 errors/3h). 2. What customer saw HTTP 500 for management/update actions. cache-value refresh-after evaluated outside valid range [1, 2147483647]. 3. CSS telemetry Unintended upgrade to unsupported build via misconfigured orchestration + ForceUpgrade feature flag bypassing SDP. 4. Diagnosis & Fix Service - Configuration. ForceUpgrade deployed unsupported version. Rollback + unlock stuck services + disable ForceUpgrade + quarantine. Rapopescu mitigated Apr 11. 5. Monitoring Miss? Partially — ForceUpgrade bypassed release safeguards. 6. Repairs Disable/restrict ForceUpgrade. Validate cache-value at compilation time.	Feedback: (placeholder) Work Items: New PBI 37511380 - ACIS quarantine AOAI Hub; Active PBI 37511341 - Dedicated release channel
INC 778045793 Sev2 Mitigated Title: Content safety timeout/failure DRI: Tuan	1. Customer Complaint Content Safety / AOAI API requests failing consistently after build rollback. 2. What customer saw HTTP 408 timeouts. Systematic failures after onset. 3. CSS telemetry APIM → RP (checkAccess) → connection failure → 408. Build rolled back but background refresh config remained → invalid combination. 4. Diagnosis & Fix Invalid build + config combination after rollback. Disabled background refresh on affected services. 5. Monitoring Miss? Validate config compatibility on rollback. 6. Repairs See INC 777722043.	Feedback: (placeholder) Work Items: See INC 777722043
INC 21000000983047 Sev2 Mitigated Title: Custom domain cert update failed - AzureFirstPartyServiceTag DRI: Tom Service: shared-apim-eas-prd-01 (East Asia)	1. Customer Complaint Custom cert update failed: "Unable to Update API service with vnet injection." Cert expiring Apr 17. 2. What customer saw IPTagsCannotBeModifiedForExistingStaticPublicIPAddresses — RP tried to add IP tags to existing static PIP. 3. CSS telemetry BetaFeature for IPTags applied globally as wildcard, modifying existing PIPs (immutable). 4. Diagnosis & Fix RP Regression. Tom removed AzureFirstPartyServiceTag config Apr 14. Resolved Apr 16. 5. Monitoring Miss? Yes — global wildcard rule not caught in deployment review. 6. Repairs Fix BetaFeature to not apply IPTags to existing static PIPs.	Feedback: (placeholder) Work Items: Done PBI 37527374 - Log all code paths; Done PBI 37527339 - Disallow * scope; Resolved Bug 37526292 - Block on 3P services
INC 51000000982831 Sev2 Mitigated Title: APIM stuck on updating state for over 6 hours DRI: Macko Service: core-live-we-0f3d-apim (Premium, WEU)	1. Customer Complaint APIM stuck in "Updating" for 6+ hours. All critical public-facing apps down due to expired cert. 2. What customer saw Cert update triggered ~200min process that got stuck. VMSS deployment failed (Conflict). Only 2/6 VMs serving. 3. CSS telemetry Initial VMSS failure at 09:51 UTC (3h15 after start). Retry stuck 2+ hours with no logs. 4. Diagnosis & Fix RP — VMSS update stuck. Retry mechanism stuck with no logging. Macko mitigated Apr 15 with ad-hoc steps. 5. Monitoring Miss? Yes — no alert for long-running stuck updates. 6. Repairs Alert on operations exceeding expected duration with no progress.	Feedback: (placeholder) Work Items: None
INC 21000000984572 Sev2 Mitigated Title: APIM default gateway cert expired – not auto-renewed DRI: Tuan	1. Customer Complaint Default gateway cert expired Apr 9. Microsoft-managed TLS cert not auto-renewed. Production down. 2. What customer saw NET::ERR_CERT_DATE_INVALID. All HTTPS traffic blocked on default hostname. 3. CSS telemetry Cert expired, entirely Microsoft-managed. Customer cannot renew/replace. 4. Diagnosis & Fix Gateway (Managed) — auto-renewal failed. Tuan ran ACIS to renew Apr 15. Rafal resolved Apr 21. 5. Monitoring Miss? Yes — no alert for managed cert approaching expiry. 6. Repairs Cert expiration alerting for managed certs.	Feedback: (placeholder) Work Items: None
INC 779770640 Sev2 Mitigated Title: Emerging Issue - Unable to deploy PremiumV2 in East US2 DRI: Tuan Impact: ~March 26 (~3 weeks)	1. Customer Complaint Multiple customers unable to deploy PremiumV2 in East US 2. Misleading error about activation limits. 2. What customer saw "PremiumV2 SKU activation limit reached." Actual cause: ApiServicePrepoolExhaustedException (HTTP 503). Issue ~3 weeks old. 3. CSS telemetry Kusto: HTTP 503 with ApiServicePrepoolExhaustedException in East US 2. 4. Diagnosis & Fix Resource Pool Exhaustion. Dan closed East US 2 for PremiumV2 activations Apr 21. Some capacity for exceptions. 5. Monitoring Miss? Yes — no alert for prepool exhaustion. Persisted ~3 weeks. 6. Repairs Fix misleading error message. Proactive prepool exhaustion alerting per SKU/region. Capacity planning.	Feedback: (placeholder) Work Items: None
RA – INC 776091817 Sev2 RA Title: Picasso Activity failures (westus2) Date: 2026-04-09 Requester: Picasso / MAI Platform Bridge: Tuan	Customer migrated APIM to AOAI but saw SSL handshake error. Rolled back but requests not reaching. DNS cache refresh too long. Self-recovered after cache expiry.	Feedback: (placeholder) None
RA – INC 780416322 Sev2 RA Title: Logic Apps WEU SLI drop Date: 2026-04-16 Requester: Logic Apps Bridge: Raluca/Omar/Dan	Redis cache failures on logic-apim-westeurope. No obvious Redis problems. Asked LA to mitigate. If recurring: increase local cache via UpgradeV2 Custom Settings.	Feedback: (placeholder) None
April 15 – 21 · 3 Sev2 CRIs · 2 RAs · ✓ REVIEWED
INC 780345525 Sev2 Mitigated Title: APIM not sending Email notifications DRI: Omar Created: 2026-04-15 Impact: 2026-04-15 01:30	1. Customer Complaint Multiple customers - APIM not sending any emails. Dev Portal password resets, invitations, admin notifications all broken. 2. What customer saw No emails from apimgmt-noreply@mail.windowsazure.com. Flows completed in UI but no email arrived. 3. CSS telemetry No failures in APIM email pipeline. No active SMTP outages. ~6 days before declaration. 4. Diagnosis & Fix DDoS email flood - Free Trial subscriptions flooded emails during DDoS, SMTP flagged APIM address as spam. Dan fixed with hotfix Apr 21. 5. Monitoring Miss? Yes - no email delivery success rate monitoring. ~6 days elapsed. 6. Repairs Rate-limiting. Email delivery metrics. Drop-off alerting. Block FreeTrial emails.	Feedback: (placeholder) Work Items: Done Task 37580153 - Disable EMAIL Free Trial; Done Task 37594340 - Throttling on Quota; Done Task 37631504 - Block FreeTrial emails
INC 780717094 Sev2 Resolved Title: Australia East activations fail - DNS quota DRI: Tom Created: 2026-04-16 Customer: AOAI Hub	1. Customer Complaint AOAI Hub - Workspace/Hub gateway activations failing in Australia East. 2. What customer saw Activation failures. New gateways not provisioned. 3. CSS telemetry DNS record quota exhausted (10k limit). 4. Diagnosis & Fix Capacity - DNS team increased 10k to 30k Apr 16. Resolved Apr 23. 5. Monitoring Miss? Yes - no DNS quota alerting. 6. Repairs DNS quota monitoring at 80%. Cleanup deprovisioned gateways. Dedicated DNS zones for AOAI Hub.	Feedback: (placeholder) Work Items: Done PBI 37567190 - Alert DNS records; Done PBI 37565539 - DNS in resource pools; Active PBI 37566932 - Dedicated DNS zones; New PBI 37660220 - DNS in RCM
INC 21000000989825 Sev2 Mitigated Title: Workspace gateway not working DRI: Omar (with Kriti) Created: 2026-04-17 Customer: S500 Healthcare	1. Customer Complaint S500 healthcare - workspace gateway unreachable, production impacted. 2. What customer saw Gateway connectivity failure. Events not processed. 3. CSS telemetry QueryEventsFailed: 403 AuthenticationFailed. Internal SAS token expired. Recurred on each upgrade (expired credential re-applied). 4. Diagnosis & Fix Software defect - SAS URL expired, not auto-renewed. Upgrade re-deployed stale credential. Omar and Kriti worked together to resolve — Kriti refreshed SAS Apr 17. 5. Monitoring Miss? Yes - no credential expiration monitoring. 6. Repairs Code fix for credential renewal. Health checks for auth validity.	Review Note: Monitor missing — reviewed Work Items: Resolved Bug 37646918 - Event Table conn string
RA - INC 780416322 Sev2 RA Title: Logic Apps WEU SLI drop Date: 2026-04-16 Requester: Logic Apps Bridge: Raluca/Omar/Dan	Redis cache failures on logic-apim-westeurope. Known issue, fixed as part of AOAI work. Asked LA to mitigate. If recurring: increase local cache via UpgradeV2 Custom Settings.	Feedback: (placeholder) Known issue — fixed (AOAI)
RA - INC 782941242 Sev2 RA Title: Dataflow Refresh SLA Brazil Date: 2026-04-20 Requester: Power Query (PPDF Dataflows) Bridge: Kriti	Owned by PPDF Dataflows team, not APIM. Kriti was RA'd in the middle of the night. The SLA drop in Brazil South recovered on its own, and the incident was mitigated once the SLA was restored. Our investigation confirmed APIM was healthy. No action required from APIM.	Feedback: (placeholder) None (not an APIM issue)
May 5 – 11 · 1 Sev2 CRI · ✓ REVIEWED
INC 51000001009573 Sev2 Mitigated Title: Unwanted log entries in APIM Application Insights DRI: Maxim A Created: 2026-05-06 Mitigated: 2026-05-06 07:22 UTC TTM: 1h 3m Customer: 4 Premium services (Australia) Related: ICM 21000001008018	1. Customer Complaint Starting April 28, unwanted log entries for send-one-way-request policy and /ext_cap/v1/req_res_cap path appeared in Application Insights for every API call. Entries unrelated to business processes, causing confusion during debugging. Customer also requested quarantine of their 4 Premium tier services in Australia. 2. What customer saw Unwanted dependency log entries in App Insights Failures blade for every API call. Entries showed send-one-way-request fire-and-forget calls logged as failed dependencies (ResponseCode=0, Success=false). Initially observed on Developer tier only; Premium tier services had not yet been upgraded to affected gateway version (0.51.x). 3. CSS telemetry CSS confirmed unwanted log entries. SRE Agent verified via GatewayHeartbeat that all 4 Premium services were on version 0.50.27283.0 (pre-regression), while 0.51.x rollout was 50–70% complete across Australia regions. Root cause: gateway 0.51.x introduced new `TrackOutgoingRequestDependencies()` method in ApplicationInsightsLogPublisher.cs that tracks all policy-initiated outgoing requests as App Insights dependencies, including fire-and-forget calls. 4. Diagnosis & Fix 0.51 regression in Application Insights dependency tracking. Root cause established in related ICM 21000001008018. Gateway 0.51.x introduced per-policy outgoing request dependency tracking that logged send-one-way-request as failed dependencies (ResponseCode=0). Tom Kerkhove directed Maxim A to apply feature flag `"logs.applicationinsights.dependency.legacy": "true"` as custom settings on affected services. Settings applied, incident mitigated. 5. Monitoring Miss? Yes. No LSI filed before CRI. App Insights dependency tracking regression in 0.51.x discovered only through customer reports. False-positive failed dependencies in customer App Insights not covered by platform monitoring. 6. Repairs • Product fix needed: Exclude send-one-way-request from DependencyTrackingSources, or handle ResponseCode=0 as neutral (not failed) for fire-and-forget policies • Continued tracking in ICM 21000001008018 • Quarantine impact: Feature flag blocks all 0.51 features. Gateway team to evaluate targeted fix backport to 0.50.x or fast-track in 0.52.x	Review Note: Does not need review Work Items: To Do Fix send-one-way-request dependency tracking in 0.52; To Do Evaluate quarantine duration for Premium services
Reviewed in Meeting – May 9 · 5 incidents · ✓ REVIEWED
INC 21000000995987 Sev2 Active Downgraded Sev3 Title: SSL CERTIFICATE_VERIFY_FAILED DRI: Martin Created: 2026-04-23 Customer: Network Rail (Mission Critical)	1. Customer Complaint SSL certificate verification failures on backend calls. 2. What customer saw SSL CERTIFICATE_VERIFY_FAILED intermittently on backend connections. 3. CSS telemetry Transferred to Platform for deeper RCA. 4. Diagnosis & Fix Result of migrating VMs to 1P image. Intermediate certificates (and CRLs) not immediately available because customer NSG blocks outbound port 80 (not configured as documented). Not expected to repeat unless VM replace/reimage without proper NSG configuration. Team actively rebooting VMs for services where bootstrapper complains about the chain. Unconfirmed but lack of intermediates likely temporary — AzSecPack script probably installs them in background. 5. Monitoring Miss? TBD — related to broader certificate chain emerging issue pattern. 6. Repairs Cannot suggest repair items as this is part of 1P migration. Active reboots for affected services. Customer must configure NSG properly (outbound port 80) to prevent recurrence on future reimage.	Review Note: Review during emerging issue since related to it Work Items: Part of 1P migration — no separate repair items
INC 787095677 Sev2 Mitigated Outage Declared Title: Grandfathered Limits not applied after 0.51 DRI: Macko Created: 2026-04-27	See full incident details in ICM.	⚠ PIR Note: PIR Required on Thursday
INC 788655236 Sev2 Mitigated Outage Declared Title: Dev Portal Registration Fails - "User registration is not supported." DRI: Maxim A Created: 2026-04-30	See full incident details in ICM.	⚠ PIR Note: PIR required — next Thursday? Roman / Ondrej / Rafal
INC 789605450 Sev2 Active Title: Intermittent 500s DotNetty.EncoderException (AOAI) DRI: Ethan Created: 2026-05-01	See full incident details in ICM.	⚠ PIR Note: Involve Tom for monitoring aspects
INC 791563622 Sev2 Mitigated Emerging Issue Title: Incomplete certificate chain for Gateway Endpoint After Upgrade DRI: glfeokti / tehnoonr Created: 2026-05-04 Mitigated: 2026-05-04 22:38 UTC TTM: 1m 37s (tracking LSI) Impact: Multiple customers (VNET-injected V1 SKU)	1. Customer Complaint Multiple customers reported that after OS Upgrade maintenance, their APIM gateway endpoint presented an incomplete certificate chain. Filed as emerging issue (CSS-sourced) affecting classic/V1 SKU VNET-injected services where outbound TCP port 80 is blocked. Escalated Sev3→Sev2 due to multiple prior Sev2 CRIs. 2. What customer saw Clients connecting directly saw incomplete cert chain and SSL handshake failures. Customers with AFD or AppGW in front of APIM received HTTP 502 errors. Manifested specifically after OS upgrade events on VNET-injected V1 SKU services. 3. CSS telemetry AppLens detected capacity above 75% during impact. No additional telemetry findings beyond automated analysis. 4. Diagnosis & Fix External/Customer Issue - VNET. Occurs on VNET-injected V1 SKU services where outbound TCP port 80 is blocked. Customer fix: (1) allow outbound TCP port 80 to Internet, (2) click "Apply Network Settings" from Portal Network blade. Note: "Apply Network Settings" alone only mitigates until the next OS upgrade, which suggests Bootstrapper performs different cert chain assembly during reboot vs re-image. Mitigated as emerging issue tracking LSI since fix is documented customer config change. 5. Monitoring Miss? Yes. No LSI or proactive alert before CRI. Multiple Sev2 CRIs filed by customers; the issue was consistently discovered through customer reports. No existing monitor detects incomplete cert chains on gateway endpoints after OS upgrades. 6. Repairs • Investigate why Bootstrapper handles cert chain differently during re-image (OS Upgrade) vs reboot • Create awareness of emerging issue pattern • No formal repair items attached	Review Note: PIR ?? Work Items: To Do Investigate Bootstrapper reimage vs reboot cert handling; To Do Cert chain completeness monitor
INC 783036395 Sev2 Active Emerging Issue Title: Suspended Services do not recover after subscription renewal DRI: Omar Macias / Kriti Majumdar Created: 2026-04-20 Impact: Multiple customers stuck for days	1. Customer Complaint Customer services were stuck and not recovering: subscriptions were re-enabled but services did not come back. 2. What customer saw Service shown as Deleted/Suspended in Portal for days after the subscription was unsuspended. No actual error or banner. 3. CSS telemetry Services not being created, no telemetry at all for 3 different CRIs. CSS opened an emerging issue. 4. Diagnosis & Fix Undelete queue blocked by poison pills. With SRE Agent help, identified that `SubscriptionLifecycleOrchestration` runs 5 steps: (1) WarnSuspendedContainers, (2) SuspendWarnedContainers, (3) SuspendAndWarnActiveContainers, (4) TerminateContainersWhoseSubscriptionIsUnRegisteredOrDeleted, (5) ActivateContainersForSubscriptionWhichGotRegistered. Orchestration is hardcoded to max 20 services per iteration. ~14 services were persistently failing to undelete (SKUv2 RG quota, P1v3 pool exhaustion). With 12-16 failing per cycle plus previous steps filling the bucket, there was no opportunity to process the undelete queue. Fix: Kriti unblocked 7 services failing with P1v3. Team unblocked SQL size and SKUv2 RG quota failures to free space. Nina made a hotfix to skip poison pills so they no longer block the queue completely. 5. Monitoring Miss? Yes. No SLA or completion monitoring for undelete. No check on how long a service has been waiting to be undeleted. Undelete is buried inside SubscriptionLifecycleOrchestration as the last possible step. Also found SKUv2 preprovisioning RGs with quota exhaustion (MSI 800/800, ServerFarms 100/100): they are created and left behind regardless of undelete success/failure. 6. Repairs • Done: Skip hotfix rolled out behind beta feature (poison pill bypass) • Needed: Cleanup of quota-exhausted SKUv2 preprovisioning RGs • Needed: Decouple undelete from SubscriptionLifecycleOrchestration • Needed: SLA metric / Alert on undelete consistently failing	Feedback: (placeholder) Work Items: Done Skip hotfix (poison pill bypass); To Do Cleanup quota-exhausted SKUv2 RGs; To Do Decouple undelete from SubscriptionLifecycle; To Do SLA metric / Alert on undelete failing
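
Illustration for INC 783036395: a minimal Python sketch of how a fixed per-iteration cap plus persistently failing items can starve an undelete queue, and how skipping poison pills unblocks it. This is not the orchestration's actual code. The function names, the retry-at-front ordering, and the figure of 6 budget slots consumed by earlier steps are assumptions chosen for illustration; only the cap of 20 and the ~14 failing services come from the incident notes.

```python
from collections import deque

MAX_PER_ITERATION = 20  # per-iteration cap mentioned in the incident


def run_iteration(earlier_work, undelete_queue, poison, skip_poison):
    """Run one simulated pass; return the services undeleted in this pass.

    earlier_work:   budget slots already used by earlier steps (assumed value)
    undelete_queue: deque of service names waiting for undelete (mutated)
    poison:         set of services whose undelete always fails
    skip_poison:    if True, known-failing services do not consume budget
    """
    budget = max(0, MAX_PER_ITERATION - earlier_work)
    recovered, retry = [], []
    while undelete_queue and budget > 0:
        service = undelete_queue.popleft()
        if service in poison:
            retry.append(service)
            if not skip_poison:
                budget -= 1  # a failed attempt still uses up a slot
            continue
        budget -= 1
        recovered.append(service)
    # Assumption: failed services return to the front and are retried first.
    undelete_queue.extendleft(reversed(retry))
    return recovered


def simulate(skip_poison, cycles=3, earlier_work=6):
    """Return every service recovered across several simulated cycles."""
    poison = {f"p{i}" for i in range(14)}  # ~14 persistently failing
    queue = deque([f"p{i}" for i in range(14)] + [f"h{i}" for i in range(5)])
    recovered = []
    for _ in range(cycles):
        recovered += run_iteration(earlier_work, queue, poison, skip_poison)
    return recovered


if __name__ == "__main__":
    print(simulate(skip_poison=False))  # [] : healthy services never reached
    print(simulate(skip_poison=True))   # ['h0', 'h1', 'h2', 'h3', 'h4']
```

In this model the healthy services are never reached without the skip, however many cycles run, because the same failing services use up the remaining budget each time. A "time waiting for undelete" metric of the kind listed under Repairs would expose this starvation directly.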
