⚠ FOR REVIEW 15 incidents — Sorted oldest first (May 6 – May 29)

These incidents have not yet been reviewed.

Incident / DRI / Status | Questions & Answers | Feedback & Work Items | Follow-up
INC 51000001009580
Sev2 Mitigated

Title: Intermittent 500 responses from APIM
DRI: Maxim A
Created: 2026-05-06
Mitigated: 2026-05-06 18:03 UTC
TTM: 9h 37m
Customer: myAIS (~10M subscribers)
Service: apim-adl-connectivity-hub-jpe-prd (Premium, Japan East)

1. Customer Complaint

Customer reported intermittent HTTP 500 GatewayFailure errors with "Unable to connect to the remote server" and transportErrorCode 10048 on their Premium SKU APIM service in Japan East. Impact on API traffic forwarding from APIM to backend services via Internal Load Balancer. Management endpoint remained accessible. Customer platform (myAIS) serves ~10 million subscribers for mobile billing/payments — high impact.

2. What customer saw

Requests routed through APIM intermittently failed with HTTP 500. Diagnostic logs: lastError_reason: GatewayFailure, lastError_message: "Unable to connect to the remote server", transportErrorCode: 10048 (WSAEADDRINUSE). Some requests succeeded while others failed. Direct requests to backend without APIM did not consistently reproduce.

3. CSS telemetry

CSS identified via ProxyRequest that all HTTP 500s with backendResponseCode=0 originated from a single VM (gwhost_14058). SRE Agent confirmed 3,781 GatewayFailure errors on gwhost_14058 vs zero on all other 23 VMs during 03:00–08:00 UTC window. Faulty VM had 7x higher backend latency (73.8ms vs ~10ms peers) and 21.3% connection failure rate to customer ILB at 172.16.9.5. SNAT port exhaustion ruled out.

4. Diagnosis & Fix

VM-level networking degradation on gwhost_14058 caused elevated outbound connection latency, leading to TCP TIME_WAIT accumulation and ephemeral port exhaustion (error 10048/WSAEADDRINUSE). Gradual onset over ~2 hours is consistent with port pool saturation. Two contributing factors: (1) a delayed scale-out from 2 to 24 VMs all at once degraded traffic processing, (2) a single VM behaved abnormally after the rapid scale-out.

The VMSS model was in an inconsistent state from the failed May 4 rolling upgrade — so new VMs scaled out on May 5 ~23:00 inherited the stale/broken model.

Fix: VM replaced via Geneva Action (ManageCompute orchestration apim-adl-connectivity-hub-jpe-prd_ManageCompute_d78f01e7). Last error at 09:37 UTC, no further 500s post-replacement.

5. Monitoring Miss?

Yes. No LSI filed before CRI. VM-level degradation did not trigger health monitoring because all NodeHeartbeat and Management checks passed on gwhost_14058. Single-VM ephemeral port exhaustion with elevated backend latency is not currently detected by existing monitors.
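
The gap described above (one VM failing while fleet-level checks pass) could be covered by a per-VM outlier check. The following is a minimal sketch, not an existing monitor: the function name, thresholds, and peer VM names are assumptions, and only the 3,781-vs-zero distribution comes from the incident.

```python
from statistics import median


def find_outlier_vms(failures_by_vm, min_failures=100, ratio=10.0):
    """Return VMs whose failure count dwarfs the fleet median.

    failures_by_vm maps a VM name to its GatewayFailure count in one window.
    A VM is flagged when it has at least `min_failures` failures and more than
    `ratio` times the fleet median (a median below 1 is treated as 1).
    """
    if not failures_by_vm:
        return []
    baseline = max(median(failures_by_vm.values()), 1)
    return sorted(
        vm
        for vm, count in failures_by_vm.items()
        if count >= min_failures and count > ratio * baseline
    )


# Shape of the window described above: one VM with 3,781 failures, 23 peers
# with none. Peer names are placeholders.
window = {f"gwhost_peer_{i}": 0 for i in range(23)}
window["gwhost_14058"] = 3781
print(find_outlier_vms(window))
```

Comparing each VM against the fleet median rather than a fixed threshold keeps the check quiet during fleet-wide backend problems, which need a different alert.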

6. Repairs

No formal repair items documented. SRE Agent noted backend-returned 500s from myais-be.cloud.ais.th (backendResponseCode=500) are a separate customer application issue — recommended clarifying the two distinct error populations to the customer.

Feedback: (placeholder)
Work Items:
None
INC 51000001019637
Sev2 Resolved
Customer Error — Sev2→3

Title: Received invalid status code: 500
DRI: Ondrej Oprala (ondrejoprala)
EIM: Maxim Kim
SIM: Martin Dechev
Created: 2026-05-13
Resolved: 2026-05-14 16:15 UTC
Service: T7PAPIMUKS01 (PremiumV2, UK South, VNET-injected)
Blast Radius: 3 PremiumV2 services in UK South
Duration: Crash loop May 10 → full outage May 12 19:55 UTC

1. Customer Complaint

APIM V2 service unreachable. Application Gateway health probe receiving HTTP 500 from APIM backend, resulting in complete service outage for T7PAPIMUKS01 (PremiumV2, UK South, VNET-injected).

2. What customer saw

Application Gateway reported "Received invalid status code: 500 in the backend server's HTTP response." All traffic failed — APIM gateway process in persistent crash loop, returning default App Service error page (500) to health probes.

3. CSS telemetry

ProxyInfra: 107 consecutive GatewayStartFailed events since May 2. Exception: Autofac.Core.DependencyResolutionException → QuotaComponent → System.ArgumentException: Connection string parsing error at Azure.Data.Tables.TableConnectionString.Parse(). Webapp setting policy.qouta.sync.table.connection uses KeyVault reference that failed resolution ("Reference was not able to be resolved"). Traffic dropped to zero at May 12 19:55 UTC when last healthy App Service worker recycled. 3 PremiumV2 services in UK South affected simultaneously (T7PAPIMUKS01, apim-core-p-uks, rlg-sbx-apimgmt-uks-apim-v2).

4. Diagnosis & Fix

Customer Error — Fortigate NVA blocking outbound TCP to Key Vault/Storage.

Root cause: Customer added a default route (0.0.0.0/0 → Fortigate NVA at 192.168.4.6) to the APIM subnet route table on April 20th (change 181503, approved by Nitesh Kumar). The Fortigate silently dropped packets to Key Vault public endpoints — SYN packets retransmitted with no response. This prevented App Service from resolving the @Microsoft.KeyVault(...) reference, leaving the gateway with an unparseable connection string on every startup attempt.

Richard Cao (networking) confirmed: curl from APIM node (10.58.8.250) to Key Vault timed out. Process-tuples showed traffic correctly delivered to Fortigate VM but silently dropped. Customer’s NSG rules were also initially missing required allowances.

Mitigation: Customer added additional routes to bypass Fortigate for Key Vault/Storage traffic → curl to Key Vault succeeded → webapp restarted → gateway recovered.
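
The mitigation works because Azure picks the route with the longest matching prefix, so a more specific route overrides the 0.0.0.0/0 default. A minimal sketch of that selection follows. The incident record does not list the routes the customer actually added, so the 203.0.113.0/24 prefix (a documentation range) and the "Internet" next hop are illustrative assumptions.

```python
import ipaddress


def pick_route(routes, destination):
    """Return the (prefix, next_hop) pair with the longest prefix containing destination."""
    dest = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(prefix), next_hop)
        for prefix, next_hop in routes
        if dest in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None
    network, next_hop = max(matches, key=lambda m: m[0].prefixlen)
    return str(network), next_hop


# 192.168.4.6 is the NVA from the incident; 203.0.113.0/24 stands in for a
# Key Vault prefix (hypothetical).
before = [("0.0.0.0/0", "192.168.4.6")]
after = before + [("203.0.113.0/24", "Internet")]

print(pick_route(before, "203.0.113.10"))  # default route: traffic goes to the NVA
print(pick_route(after, "203.0.113.10"))   # specific route: traffic bypasses the NVA
```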

5. Monitoring Miss?

Partial. Gateway crash loop ran for 3 days (May 10–12) before complete outage. No alert for “all new worker startups failing” pattern. However, root cause is customer-controlled network configuration — APIM cannot monitor customer NVA behavior.

6. Repairs

• SRE Agent improvement: new known pattern “SKUv2 Gateway Crash Loop — NVA/Firewall Blocking Dependencies” added (PR #2298) to prevent misattributing to SAS expiration
• Agent fix: ForceRuntimeUpgrade does not work for PremiumV2 (resolves SKUv1 package path)
• No APIM platform repair needed — customer network misconfiguration

Feedback: Good collaborative investigation (Ondrej, Nima, Richard Cao, Jarod Aerts, Oliviu). SRE Agent initially misattributed to SAS expiration (MEDIUM confidence noted). Agent also exceeded posting budget (4+ posts vs 2). Triage learning PR #2298 filed to prevent recurrence of misdiagnosis.
Work Items:
Done: PR 2298 - New known CRI pattern for NVA/firewall blocking
INC 51000001023424
Sev2 Mitigated
Platform Bug — SKUv2 Orchestration

Title: APIM not refreshing certificate for Traffic Manager custom domains
DRI: Javier Borrego (javierbo)
SIM: Alexander Zaslonov
EIM: Maxim Kim
Created: 2026-05-15
Mitigated: 2026-05-15 23:04 UTC
Duration: 2h 34m
Service: apim-cpgapi-prod-eus & apim-cpgapi-prod-wus (BasicV2)
Customer: Internal Microsoft (CE-EA-LM-CPG)
Risk: Cert expires June 12 — TLS will break if not permanently fixed

1. Customer Complaint

APIM service not refreshing the certificate for the Traffic Manager profile under custom domains. Certificate for contactpermissionsgatewayapi.trafficmanager.net not being auto-renewed. Affects both East US and West US prod services.

2. What customer saw

Certificate refresh failing silently. No TLS disruption yet (existing cert still valid), but most recent App Service cert expiring June 12 — imminent risk. Issue persisted since July 2024 with 10 accumulated expired certs.

3. CSS telemetry

SRE Agent identified root cause with HIGH confidence from Orchestration telemetry:
• UpdateSkuV2ServiceFailedDueToInvalidInput recurring every ~6h (WUS) and ~20h (EUS) since May 6
• StatusMessage: "The traffic manager domain can be removed only through the Traffic Manager"
• 5 retries with exponential backoff (10s→160s), all permanently failing
• 37 failures on WUS + 12 on EUS in 10-day window
• 10 stale certs accumulated on App Service dating back to July 2024
• Valid KeyVault cert (thumbprint C84F343B, expiry 2026-08-25) loaded by RP but cannot be bound

4. Diagnosis & Fix

Platform Bug — SKUv2 orchestration cannot handle Traffic Manager hostname bindings.

The UpdateSkuV2Service orchestration calls DeleteHostNameBindingAsync() (ApiServiceOrchestrationBase.cs L4465-4475) to remove the old binding before re-creating with the updated cert. App Service rejects the DELETE because *.trafficmanager.net bindings can only be managed through the Traffic Manager resource, not the web app hostNameBindings API. The RP treats this permanent constraint as transient (retries 5x), fails, and commits state with RecentlyAbortedUpdateHostname=true.
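
The core of the bug is a permanent rejection being retried as if it were transient. A minimal sketch of the distinction follows. The exception classes and function names are hypothetical and do not reflect the RP's actual code; only the 10s→160s backoff schedule and the error message come from the telemetry above.

```python
class PermanentError(Exception):
    """Failure that cannot succeed on retry (hypothetical classification)."""


class TransientError(Exception):
    """Failure that may succeed on retry (hypothetical classification)."""


def backoff_delays(retries=5, first=10, factor=2):
    """Seconds to wait before each retry: 10, 20, 40, 80, 160 by default."""
    return [first * factor**i for i in range(retries)]


def run_with_retries(operation, sleep, retries=5):
    """Call operation, retrying transient failures only.

    Returns (result, attempts). PermanentError propagates on first sight.
    """
    attempts = 0
    for delay in [0] + backoff_delays(retries):
        if delay:
            sleep(delay)
        attempts += 1
        try:
            return operation(), attempts
        except TransientError:
            continue
    raise TransientError(f"gave up after {attempts} attempts")


def delete_tm_binding():
    # Stand-in for the rejected DELETE; the message is the one quoted above.
    raise PermanentError(
        "The traffic manager domain can be removed only through the Traffic Manager"
    )


waits = []
try:
    run_with_retries(delete_tm_binding, waits.append)
except PermanentError as err:
    print("failed without retrying:", err)
print("seconds spent backing off:", sum(waits))
```

Passing `sleep` as a parameter keeps the sketch testable without real delays. With the permanent classification in place, the orchestration could fall through to an in-place update instead of spending 310 seconds on retries that cannot succeed.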

Mitigation: Javier allowed new cert to sync into App Service and manually rebound the binding without deleting it, bypassing the failing delete/recreate flow. Validated production endpoint serving updated G2 certificate on bridge.

5. Monitoring Miss?

Yes. Issue existed since July 2024 (~10 months) with no alert. UpdateSkuV2ServiceFailedDueToInvalidInput events generated continuously but not monitored/alerted. Only surfaced when customer noticed cert approaching expiry.

6. Repairs

• URGENT Bug: Fix ApiServiceOrchestrationBase.cs L4465-4475 to skip DeleteHostNameBindingAsync() for *.trafficmanager.net hostnames and update cert in-place (PUT with new thumbprint)
• Alert on repeated UpdateSkuV2ServiceFailedDueToInvalidInput events
• Cleanup 9 expired App Service certs on both web apps
• Deadline: June 12, 2026. If not permanently fixed by then, TLS breaks when the current cert expires

Feedback: Long-standing platform bug (~10 months undetected). SRE Agent investigation excellent (HIGH confidence, 7 evidence blocks, cross-service correlation). Engineering fix needed urgently — next cert expires June 12. Manual rebinding is a temporary workaround only.
Work Items:
To Do: Bug - Fix TM hostname cert refresh in ApiServiceOrchestrationBase.cs
To Do: Alert on repeated UpdateSkuV2ServiceFailedDueToInvalidInput
INC 800967470
Sev2 Active
Emerging Issue

Title: AI Foundry API import via the Azure portal is broken
DRI: Alexander Zaslonov (alzaslon)
Created: 2026-05-19
Mitigated: 2026-05-21 08:22 UTC
Impact: AI Foundry portal integration

1. Customer Complaint

AI Foundry API import functionality via the Azure portal is broken. Users unable to import APIs through the portal flow.

2. What customer saw

Azure portal API import flow for AI Foundry failing. Emerging issue impacting portal-based API onboarding.

3. CSS telemetry

Filed as emerging issue. Portal flow broken for AI Foundry API import.

4. Diagnosis & Fix

Emerging Issue. AI Foundry portal integration broken. Mitigated 2026-05-21.

5. Monitoring Miss?

TBD.

6. Repairs

TBD.

Feedback: (placeholder)
Work Items:
TBD
INC 51000001033443
Sev2 Mitigated

Title: Consumption SKU APIM, app service platform down
DRI: Gleb Feoktistov (glfeokti) / srajagrawal
Created: 2026-05-22
Mitigated: 2026-05-22 17:36 UTC
TTM: 5h 16m
Service: czm140-cur-prd-inc-apim (Consumption, East Asia)

1. Customer Complaint

Customer reported their Consumption SKU APIM service (czm140-cur-prd-inc-apim) was completely down. All API requests returning HTTP 503 "service is unavailable." Live production outage causing business disruption with SLA impact.

2. What customer saw

All API calls to public endpoint returned HTTP 503 "The service is unavailable" instead of normal responses. Browsing the base URL returned 503 instead of expected 404.

3. CSS telemetry

CSS confirmed zero gateway request logs since ~23:20 UTC on May 21 via Kusto on wawseas ApiGatewayRequest. AppLens showed the underlying web app was down since 23:20 UTC May 21, with availability at 93.07%. 503 pattern pointed to App Service platform issue rather than APIM gateway code.

4. Diagnosis & Fix

SAS URI invalidation after global storage key rotation. App Service web app entered crash-loop starting ~23:20 UTC May 21. DynamicCache logs showed repeated HTTP 403 from blob storage when downloading gateway package ZIP via WEBSITE_USE_ZIP. Root cause: global storage secrets rotated May 21, invalidating SAS URI for this pinned Consumption service. Could not update via normal RP channels due to known regression ("Unable to update/remove Consumption pinned version"). Fix: Geneva Action to unpin version + manually updated SAS URI.
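
A SAS signature is an HMAC-SHA256 computed with the storage account key, which is why a URI signed before rotation starts returning 403 afterwards. The sketch below is a simplified illustration: the keys, the resource path, and the shortened string-to-sign are hypothetical, and the real SAS string-to-sign has more fields.

```python
import base64
import hashlib
import hmac


def sign(string_to_sign, account_key):
    """HMAC-SHA256 signature, base64-encoded: a simplified stand-in for a SAS signature."""
    digest = hmac.new(account_key, string_to_sign.encode("utf-8"), hashlib.sha256).digest()
    return base64.b64encode(digest).decode("ascii")


def is_valid(string_to_sign, signature, current_key):
    """Server-side check: recompute the signature with the key currently in force."""
    return hmac.compare_digest(sign(string_to_sign, current_key), signature)


# Hypothetical keys and resource path, for illustration only.
old_key = b"key-before-rotation"
new_key = b"key-after-rotation"
resource = "r\n2026-06-30\n/blob/account/packages/gateway.zip"

pinned_sas = sign(resource, old_key)  # stored once for the pinned service, never refreshed

print(is_valid(resource, pinned_sas, old_key))  # before rotation
print(is_valid(resource, pinned_sas, new_key))  # after rotation
```

Any service that stores a signed URI rather than re-deriving it needs the URI regenerated as part of the rotation, which is the propagation gap called out in the repairs.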

5. Monitoring Miss?

Yes. No LSI filed. Storage rotation invalidated SAS URI causing crash-loop for 13+ hours before customer reported. No monitor detected the failure.

6. Repairs

• Geneva Action to unpin Consumption gateway version (only remaining pinned service)
• Known regression: "Unable to update/remove Consumption pinned version" must be resolved
• Systemic fix needed: storage key rotation must propagate to all active service packages

Feedback: (placeholder)
Work Items:
Fix Consumption pinned version regression; storage key rotation propagation
INC 21000001036015
Sev2 Mitigated
Sev A — Live Site

Title: APIM Gateway Down – ps-prod-be-euw-apim-manageprotect2 (West Europe)
DRI: Gleb Feoktistov (glfeokti)
Created: 2026-05-23
Mitigated: 2026-05-23 22:23 UTC
TTM: 1h 0m
Service: ps-prod-be-euw-apim-manageprotect2 (Premium, Internal VNet, West Europe)
Impact: Total outage ~12:47–22:23 UTC (~9.5h customer impact)

1. Customer Complaint

Total outage of APIM service (ps-prod-be-euw-apim-manageprotect2, Premium SKU, Internal VNet, West Europe). Resource Health event: "Your API Management service is down due to an unknown reason." APIM inbound endpoint failing, all production workloads fully impacted. Customer confirmed no changes on their side.

2. What customer saw

APIM gateway completely unavailable. Inbound API endpoint stopped responding. Resource Health event in Azure portal with message "service is down due to an unknown reason." Management endpoint also inaccessible.

3. CSS telemetry

Platform Availability 34.71% over 12h window. First drop to 0% at ~12:45 UTC. At 20:01:59 UTC: VMExtensionProvisioningError — "ApimBootstrapperService timed out starting after 6 retries." VMSS instances 16 and 7 failed with DSCConfiguration errors. Rolling upgrade: "100% of instances unhealthy after upgrade." SRE Agent confirmed traffic dropped to zero after 12:30 UTC; all instances in HostStartFailed loops with SSL cert store errors (netsh http show sslcert ExitCode=1).

4. Diagnosis & Fix

Hostname orchestration / cert manifest sync issue. VMSS rolling upgrade at 10:27 UTC produced replacement VMs with missing SSL cert store entries. Root cause chain: (1) hostname update orchestration failed after manifest upload at "generate settings" phase, (2) cert manifest out of sync with service config, (3) hostname-to-cert binding failing during bootstrap. Fix: glfeokti removed out-of-sync cert manifest from storage via Geneva Action. Known pattern match: IcM 51000001033991 (North Europe cert store failure after OS update).

5. Monitoring Miss?

Yes. Customer impact began ~12:47 UTC but CRI not filed until 21:23 UTC, a gap of ~8.5 hours. No LSI or automated alert filed for Premium service at 0% availability for extended periods. Discovered only via customer support request.

6. Repairs

• Root cause: Resource Provider (hostname orchestration / cert manifest sync)
• Known recurring pattern (IcM 51000001033991) — needs systemic fix
• No explicit repair items created in incident record

Feedback: (placeholder)
Work Items:
Systemic fix for cert manifest sync; monitoring for 0% availability
INC 51000001037978
Sev2 Mitigated
Sev 1 Escalation — Publix

Title: API Management primary instance unresponsive at control plane, unable to scale alternative region
DRI: Srajagrawal
Created: 2026-05-26
Mitigated: 2026-05-26 21:00 UTC
TTM: 6h 4m
Customer: Publix
Service: cutpapmgdgtlsvcs02 (Premium, Internal VNet, East US 2)

1. Customer Complaint

Publix reported their Premium APIM instance (cutpapmgdgtlsvcs02) with Internal VNET mode was unresponsive at the control plane in primary region (East US 2). Scale operations to secondary region (Central US) also failing. Problem started ~2026-05-26 14:00 UTC.

2. What customer saw

Unable to access management endpoint. APIM scale operations failing with errors. Update ApiService orchestration returned AzureRestCloudException during VMSS operations in eastus2. Unable to scale out to alternative region to recover.

3. CSS telemetry

CSS (v-vyarlagadd) found all VMs in East US 2 unhealthy. Update ApiService orchestration failing with AzureRestCloudException during VMSS polling. Kusto: 63,171 DatabaseNotReachable events (Mapi table) + 48,762 (ApiSvcHost table) in 24h. Engineer srajagrawal confirmed a spike in HTTP 500 responses and increased gateway latency because the primary region was down.

4. Diagnosis & Fix

Key Vault secrets (SQL connection strings) out of sync with service settings. Widespread DatabaseNotReachable errors across all VMs in East US 2. DRI (srajagrawal) attempted VM replacement of gwhost_8416 as initial step. Fix: glfeokti ran Upgrade operation to bring service settings and KV secrets (SQL connection strings) back in sync.

5. Monitoring Miss?

Yes (partial). Multiple Sev4 LoadBalancer Probe Unhealthy LSIs filed starting 2026-05-25 19:37 UTC (IcMs 805245267, 805245272, 805245305, etc.) — ~19 hours before CRI. However, only per-VM Sev4 alerts. No service-level Sev2 alert for control plane unresponsive or widespread DatabaseNotReachable. Discovered only via customer report.

6. Repairs

No repair items documented in incident record.

Feedback: (placeholder)
Work Items:
Service-level alert for DatabaseNotReachable; escalation from Sev4 to Sev2 when multiple VMs unhealthy
INC 51000001039329
Sev2 Mitigated

Title: APIM Endpoint with File Transfer Fails when Cached after service upgrade 0.50.x → 0.51.x
DRI: Gleb Feoktistov (glfeokti) / Tom Kerkhove (tomkerkhove)
Created: 2026-05-27
Mitigated: 2026-05-27 02:13 UTC
TTM: ~1h
Related: INC 807137164 (Emerging Issue)

1. Customer Complaint

File transfer endpoints failing after APIM service upgrade from 0.50.x to 0.51.x. Cached responses returning truncated/corrupted data for file downloads.

2. What customer saw

File transfers through APIM returning incomplete/corrupted data when served from cache. Issue appeared after gateway upgrade to 0.51.

3. CSS telemetry

Cache truncation at ~2 MiB boundary for responses exceeding cache size limit. Related to BufferingStreamBase.cs bug in 0.51 release.

4. Diagnosis & Fix

0.51 cache truncation regression. Same root cause as emerging issue INC 807137164. Partial content cached instead of skipping cache entirely when response exceeds 2 MB limit. Mitigated by service quarantine/rollback.

5. Monitoring Miss?

Silent truncation — no observable failure signal.

6. Repairs

See INC 807137164.

Feedback: (placeholder)
Work Items:
See INC 807137164
INC 807137164
Sev2 Active
Emerging Issue — 0.51 Release

Title: Gateway behavior change — backend response >2 MiB with 0.51 Release
DRI: Zhongren (zhonren) / Macko Treder (mackotreder)
Created: 2026-05-28
Service: apim-uks-prod-shr-1001 (S500)
Impact Start: 2026-05-06
Related: INC 21000001017622

1. Customer Complaint

Multiple customers reporting cached responses returning truncated data (~2 MB instead of full response) after gateway upgrade to 0.51. S500 customer impact. File transfers and large API responses corrupted.

2. What customer saw

Responses that should be >2 MB returned as ~2 MB when served from built-in cache. Silent truncation — no error codes. Only affects built-in cache path; external Redis unaffected.

3. CSS telemetry

Regression tied to 0.51 release. ~2 MB truncation boundary. Impact start 2026-05-06. Release halted. Rollback initiated for affected services.

4. Diagnosis & Fix

Bug in BufferingStreamBase.ReadInternalAsync(): When response exceeds 2 MB cache limit, hitLimit = true is set but cacheStream is NOT cleared — retains partial bytes. OnCompleted() then caches the partial content. Subsequent requests get truncated cached response.

Why only 0.51: v0.49 consumed response in large enough reads that first read exceeded limit (cacheStream stayed empty). v0.51 changed HTTP client streaming to smaller chunks, accumulating partial data before limit triggers.

Fix: Null/dispose cacheStream when hitLimit set. Release halted, rollback ACIS issued for SKUv1 and SKUv2.
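
The interaction between chunk size and the missing clear can be reproduced in a few lines. This is an illustrative model, not the code in BufferingStreamBase.cs: the function name, the `clear_on_limit` switch, and the 64 KiB chunk size are assumptions, and only the 2 MiB limit and the hitLimit/cacheStream behavior come from the description above.

```python
import io

CACHE_LIMIT = 2 * 1024 * 1024  # 2 MiB, per the limit described above


def buffer_for_cache(chunks, limit=CACHE_LIMIT, clear_on_limit=True):
    """Copy response chunks into a cache buffer, giving up once the limit is exceeded.

    Returns the bytes to cache, or None when the response must not be cached.
    clear_on_limit=False reproduces the described bug: the partial buffer survives.
    """
    cache_stream = io.BytesIO()
    hit_limit = False
    for chunk in chunks:
        if hit_limit:
            continue  # the response still flows to the client; caching is already off
        if cache_stream.tell() + len(chunk) > limit:
            hit_limit = True
            if clear_on_limit:
                cache_stream = None  # the fix: drop the partial bytes
            continue
        cache_stream.write(chunk)
    if cache_stream is None:
        return None
    return cache_stream.getvalue()


body = b"x" * (3 * 1024 * 1024)  # 3 MiB response
small_chunks = [body[i:i + 65536] for i in range(0, len(body), 65536)]

# Small reads without the fix: a truncated body is left in the buffer.
print(len(buffer_for_cache(small_chunks, clear_on_limit=False)))
# Small reads with the fix: nothing is cached.
print(buffer_for_cache(small_chunks, clear_on_limit=True))
# One large read without the fix: the limit trips before anything is buffered.
print(len(buffer_for_cache([body], clear_on_limit=False)))
```

The third call shows why the defect could stay hidden before the streaming change: a single oversized read trips the limit with an empty buffer, so there is no partial content to cache.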

5. Monitoring Miss?

Yes. Silent truncation means no observable failure signal. No validation comparing cached vs actual response size.

6. Repairs

• Fix BufferingStreamBase.ReadInternalAsync() to clear cacheStream on hitLimit
• Add cache integrity validation
• Release gate for cache-size boundary testing

Feedback: (placeholder)
Work Items:
Fix pending — release halted
INC 51000001041852
Sev2 Mitigated

Title: Scale Out Errors — XP Inc.
DRI: Zhongren (zhonren) / Macko Treder (mackotreder)
Created: 2026-05-28
Mitigated: 2026-05-29 01:21 UTC
TTM: 9h 47m
Customer: XP Inc. (XP Investimentos, ACE)
Service: xpi-prd-apim (Premium, Brazil South)

1. Customer Complaint

XP Inc. (XP Investimentos, ACE-level customer) reported multiple scale-out errors on Premium APIM service xpi-prd-apim in Brazil South. Azure Monitor alert fired: Microsoft.ApiManagement/service/write failed with "Unable to Update API service with vnet injection at this time" (ResourceOperationFailure). Autoscale stuck in failure loop. 3rd recent similar event with high executive visibility.

2. What customer saw

Repeated scale-out operation failures with ResourceOperationFailure: "Unable to Update API service with vnet injection at this time." Number of Machines metric showed 30 machines allocated, yet service continued attempting and failing to scale out. Azure Monitor alert fired at 2026-05-28T11:51:05Z.

3. CSS telemetry

Service container showed AllocatedSkuUnitCount: 3 (frozen) despite 30 healthy VMs running. Orchestration table: continuous loop of ScaleVmScaleSetInRegionFailed, UpdateOrchestrationFailedToChangeDeployedSku, UpdateRegionSkuOrchestrationFailed. SRE Agent confirmed 32 scale failures across 16 correlation IDs over ~5.3h (10:00–15:21 UTC). Capacity: 15 units / 30 machines allocated and healthy, but orchestration could not reconcile. Data plane unaffected (~10–20M req/hr normal).

4. Diagnosis & Fix

VMExtensionProvisioningError — DSC (ApimBootstrapperService) timed out on new VMSS instances after 6 retries, blocking RP UpdateRegionSkuOrchestration. Caused VMSS/service-container desync: Azure Autoscale scaled VMSS to 30 VMs (15 units) directly, but service container stayed at AllocatedSkuUnitCount: 3. Each reconciliation failed → oscillation between Failed/Updating. Transferred to Platform for DSC investigation. Incident self-recovered without manual intervention.

5. Monitoring Miss?

Yes. No LSI filed for scale-out failures or orchestration loops for xpi-prd-apim within 72h before CRI. Scale-out failure loop ran 5+ hours before customer reported. No platform-side monitoring detected sustained orchestration failure loop.

6. Repairs

• DSC failure investigation requested (zhonren asked cojih to investigate bootstrapper timeout)
• RCA requested by customer (3rd recurrence, high exec visibility)
• Prior similar: IcM 51000000995053
• Root cause: Service/VM Issue

Feedback: (placeholder)
Work Items:
DSC investigation; RCA for customer; recurring pattern fix
INC 51000001044102
Sev2 Mitigated

Title: Policy update fails with “Policy size exceeds allowed limit of -1 KB”
DRI: Zhongren (zhonren) / Macko Treder (mackotreder)
Created: 2026-05-29
Mitigated: 2026-05-29 23:12 UTC
TTM: 5h 25m
Service: apim-jet-stg (Standard v2, Sweden Central)
Customer: JetBank Albania / Backbase BVA
Blast Radius: Multiple services on scaleunits 003 & 004

1. Customer Complaint

JetBank Albania / Backbase BVA completely unable to update any APIM policies on Standard v2 instance (apim-jet-stg) in Sweden Central. All policy updates via portal, az CLI, and REST API failed with HTTP 400: "Policy size exceeds allowed limit of -1 KB." Blocking imminent go-live for new banking platform, 10+ stakeholders blocked, estimated $1M+ financial impact.

2. What customer saw

HTTP 400 Bad Request with ValidationError: "Policy size exceeds allowed limit of -1 KB" on all policy updates (product-level, API-level, global) — even minimal single-header changes. All update methods blocked (portal, CLI, REST API). Policy fragment updates continued working (HTTP 200 OK).

3. CSS telemetry

HTTP 400 responses in ManagementKpi and HttpIncomingRequests showing the validation error across SMAPI scale units api-sec-prod-scaleunit-003 and 004. Scale-unit-wide configuration issue affecting all 6 active SMAPI instances. Multiple services impacted beyond customer's (apim-jet-stg, apim-dtapim-prd-1wkuu-pv2, apim-apimanager-dt-sc-01, others). Failures first appeared 2026-05-28. Engineer (sasolank) traced regression to DeployApp orchestration on scaleunit-003 ~2026-05-28T12:45 UTC.

4. Diagnosis & Fix

Integer overflow in entity limit custom settings. GetMaxPolicySize() reads LimitsMaxPolicySizeKb and multiplies by 1024. A deployment set value to 2147483645, which × 1024 caused integer overflow wrapping policy size limit to -1 KB. Fix: Updated custom settings across all affected scale units (001–004) to set Microsoft.WindowsAzure.ApiManagement.Mapi.Limits.Entities.Policies.SizeKb to "2000000" (no overflow), then restarted webapps.
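
The wrap-around can be checked by hand. The sketch below assumes GetMaxPolicySize() multiplies in signed 32-bit arithmetic, so `max_policy_size_bytes` is a hypothetical stand-in, not the real method. Under that assumption the recorded value 2147483645 wraps to -3072 bytes, while 2147483647 wraps to exactly -1024 bytes (-1 KB), so the actual code path or message rounding may differ in detail. The replacement value 2000000 stays in range.

```python
def to_int32(value):
    """Wrap an arbitrary integer to a signed 32-bit value (two's complement)."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def max_policy_size_bytes(limit_kb):
    """Hypothetical stand-in for GetMaxPolicySize(): KB limit times 1024 in 32-bit arithmetic."""
    return to_int32(limit_kb * 1024)


for limit_kb in (2_000_000, 2_147_483_645, 2_147_483_647):
    size = max_policy_size_bytes(limit_kb)
    print(limit_kb, size, size // 1024)
```

A guard that rejects any configured limit above 2097151 KB (the largest value whose product with 1024 still fits in a signed 32-bit integer) would catch this class of setting before deployment.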

5. Monitoring Miss?

Yes. No LSI or alert filed before CRI. No monitoring for invalid (negative) policy size limit configuration or resulting HTTP 400 spike on policy PUT operations. Discovered only via customer support request.

6. Repairs

• Root cause: SMAPI — integer overflow in entity limit custom settings
• Process improvement: "Use ACIS action to update SMAPI instead of portal" to avoid overflow-prone values
• No formal repair items documented beyond immediate fix

Feedback: (placeholder)
Work Items:
ACIS-only SMAPI updates; overflow validation guard
INC 51000001033777
Sev2 Resolved
MCSAP/ACE Customer

Title: Unplanned Schedule upgrade — CDW
DRI: Gleb Feoktistov (glfeokti)
Created: 2026-05-22
Mitigated: 2026-05-23 02:02 UTC
TTM: 4h 31m
Customer: CDW (MCSAP/ACE)
Service: CDW-USNCZ-NPD-APIM (Premium, North Central US)

1. Customer Complaint

CDW (MCSAP/ACE account) reported unplanned scheduled upgrade on CDW-USNCZ-NPD-APIM (Premium, North Central US) causing API call failures impacting multiple teams. Customer highly sensitive due to recent bad Azure support experiences.

2. What customer saw

API calls failing across multiple teams. Traffic collapsed from ~483K req/24hr to 60 requests. Service went from 50% capacity to zero when final pre-upgrade instance (gwhost_7) was replaced ~May 22 17:00 UTC.

3. CSS telemetry

Platform upgrade 0.49→0.50 triggered VMSS rolling upgrade. gwhost_7 succeeded but second VM slot consistently failed: VMExtensionProvisioningTimeout on DSCConfiguration. 7 replacement VMs over 30h, each failing identically. Gateway Agent found ConfigInitialSyncFailed with ArgumentNullException in Api.TryUpdateRouting() — null routing key from API revision (function-app-mcp-server;rev=1). 16,000+ sync failures.

4. Diagnosis & Fix

Null-key handling bug in Api.TryUpdateRouting() (v0.50.27283.0). Platform upgrade regression caused gateway config sync failures. CSS followed SRE Agent recommendations. Service was rolled back to a 0.50 version, which restored functionality.

5. Monitoring Miss?

Yes. First ConfigInitialSyncFailed at May 21 11:00 UTC — 30 hours before customer reported. Service degraded 2→1→0 instances with 7 consecutive VM failures. No alert fired.

6. Repairs

• Fix null-key regression in Api.TryUpdateRouting() (Gateway.Model/Api.cs:674)
• Fix function-app-mcp-server;rev=1 API null Method/route
• Investigate VNET NSG/firewall blocking DSC extension
• Add alerting for persistent ConfigInitialSyncFailed and single-instance degradation
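
The first repair item amounts to tolerating one bad entry instead of failing the whole sync. In .NET, a Dictionary lookup with a null key throws ArgumentNullException. Python dicts accept None as a key, so the sketch below makes the guard explicit. The function, field names, and the healthy entry are hypothetical; only the revision name comes from the incident.

```python
def build_routing_table(operations):
    """Index operations by routing key, skipping entries whose key is missing.

    Returns (table, skipped) so the caller can log bad entries instead of
    failing the whole configuration sync.
    """
    table, skipped = {}, []
    for op in operations:
        key = op.get("routing_key")
        if key is None:
            skipped.append(op.get("api", "<unknown>"))
            continue
        table[key] = op
    return table, skipped


operations = [
    {"api": "orders;rev=2", "routing_key": "GET /orders"},
    {"api": "function-app-mcp-server;rev=1", "routing_key": None},
]
table, skipped = build_routing_table(operations)
print(sorted(table))
print(skipped)
```

Returning the skipped entries keeps the failure visible for alerting while letting the remaining APIs load.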

Feedback: (placeholder)
Work Items:
Null-key fix; DSC investigation; config sync alerting
INC 21000001035735
Sev2 Mitigated

Title: Cannot change policy rate limit in APIM
DRI: Gleb Feoktistov (glfeokti)
Created: 2026-05-23
Mitigated: 2026-05-23 11:29 UTC
TTM: 3h 43m
Service: apim-HubCommon-az-asse-prd-001 (VNet-injected)

1. Customer Complaint

Customer unable to change rate limit policy on product scope. Modifying rate-limit-by-key from 60 to 80 calls appeared to succeed but silently reverted to 60 upon re-opening.

2. What customer saw

Portal showed no error on save but value not persisted — reverted to 60. Customer had owner access. Event logs showed HTTP 500 on save operations.

3. CSS telemetry

SRE Agent found SMAPI could not authenticate to SQL: 600,000+ DatabaseNotReachable events/hour with SqlException: Login failed for user '' (empty username, SQL Error 18456). Failure ongoing 24+ hours (since ~May 22 08:00 UTC). Gateway data plane unaffected — all 16 instances serving 2–3M req/hr normally.

4. Diagnosis & Fix

Managed Identity token refresh stopped working. Database connection string missing from service container. MI token no longer being refreshed, preventing SMAPI SQL auth. Fix: Restore database connection string in service container.

5. Monitoring Miss?

Yes. SQL auth failure (600K+ errors/hr) ongoing 24+ hours before customer reported. No alert or LSI filed. Discovered only via customer support case.

6. Repairs

No repair items documented.

Feedback: (placeholder)
Work Items:
MI token refresh monitoring; DatabaseNotReachable alerting
INC 51000001038159
Sev2 Resolved

Title: API Management service down — Network connectivity
DRI: Gleb Feoktistov (glfeokti)
Created: 2026-05-26
Mitigated: 2026-05-26 20:40 UTC
TTM: 4h 10m
Service: apiRyderDev (Premium, East US)
Duration: ~3 days (May 23–26)

1. Customer Complaint

Complete connectivity loss to Premium APIM service (apiRyderDev, East US) starting ~May 23 05:00 UTC. Service endpoints not enabled for recommended services. Management plane unavailable causing application outage.

2. What customer saw

Management plane completely lost. Portal reported service endpoints not enabled. API traffic dropped to zero. "Apply Network Configuration" resolved display error but did not restore connectivity. Both instances unreachable.

3. CSS telemetry

Bootstrapper stuck in restart loop. Service upgraded May 21 causing massive ConfigInitialSyncFailed (0→~1,992/day) with ArgumentNullException: key at Dictionary.FindEntry → Api.TryUpdateRouting(). Automated rollback to 0.49 also failed (UpgradeOrchestrationFailedToRollBack). Both VMs unhealthy, DSC extension timed out.

4. Diagnosis & Fix

Failed platform upgrade + failed automated rollback. Upgrade on May 21 broke gateway (ConfigInitialSyncFailed/null key). Rollback to 0.49 also failed. Both VMs unhealthy for ~3 days. Fix: Manually provisioned new VMs on 0.50 from Azure Portal.

5. Monitoring Miss?

Likely Yes. Potentially related LSI (IcM 802390488) filed May 21 but unclear if it covered apiRyderDev specifically. Service impacted ~3 days before CRI filed. No service-specific alert fired.

6. Repairs

No repair items documented. Customer requested RCA. Root cause: Gateway (Managed).

Feedback: (placeholder)
Work Items:
RCA requested; per-service monitoring for prolonged outage
INC 21000001040966
Sev2 Resolved

Title: PremiumV2 APIM Unavailable in UK South
DRI: Srajagrawal
Created: 2026-05-27
Mitigated: 2026-05-27 15:36 UTC
TTM: 5h 12m
Customer: S500-level
Service: dcw-apim-prod-integration-uks-01 (PremiumV2, UK South)

1. Customer Complaint

S500 customer tried to create PremiumV2 APIM service (dcw-apim-prod-integration-uks-01) in UK South via Terraform. Error: SKU not available in region. Customer aware of documented temporary limitation, asked when it would be lifted. Blocking their deployment.

2. What customer saw

Terraform request rejected: ApiServiceCreationDisabledForSubscription — "Creation of new PremiumV2 API Management services in UK South is not available at the moment." Request never reached RP orchestration — blocked at ARM validation.

3. CSS telemetry

SRE Agent confirmed PremiumV2 infra IS available in UK South (93 active services, 4 I2v2 resource pools with 820–858 available units). However, PremiumV2 activation telemetry showed 99.4% failure rate (102 successes vs 16,161 failures/90d) — pre-provisioning pipeline constrained since ~Apr 29. Customer's attempt: 0 rows in Orchestration table — blocked at MaxApimServicesCountPerSkuPerSubscription beta feature flag (PremiumV2 = 0).

4. Diagnosis & Fix

Subscription-level beta feature flag blocking creation. MaxApimServicesCountPerSkuPerSubscription had PremiumV2 set to 0 for customer's subscription. Fix: srajagrawal whitelisted 3 customer subscriptions to allow 1 PremiumV2 each. Customer confirmed successful deploy.

5. Monitoring Miss?

Partial. Related Sev3 LSIs for ActivateSkuV2 unhealthy orchestrations filed prior (IcMs 786658173, 787272459). But those tracked pre-provisioning failures, not per-subscription blocks. Customer-facing creation block not monitored.

6. Repairs

No repair items documented.

Feedback: (placeholder)
Work Items:
Monitor per-subscription creation blocks; capacity planning for UK South PremiumV2
✓ ALREADY REVIEWED 39 incidents

These incidents have already been reviewed. Kept for reference.

Incident / DRI / Status | Questions & Answers | Feedback & Work Items | Follow-up
INC 796496317
Sev2 Mitigated
By Design

Title: Capacity Exception Request UK South PremiumV2 (DVSA)
DRI: Shubham (shubhash)
Created: 2026-05-12
Mitigated: 2026-05-12 17:06 UTC
TTM: 1h 55m

1. Customer Complaint

Driver and Vehicle Standards Agency requested PremiumV2 whitelisting in UK South for 5 subscriptions (dev/sit/uat/preprod/prod).

2. What customer saw

Unable to deploy PremiumV2 in UK South without capacity exception.

3. CSS telemetry

Standard capacity exception request via CSS template.

4. Diagnosis & Fix

By Design. Customer directed to use proper capacity request template. Mitigated by shubhash.

5. Monitoring Miss?

No — process issue, not product issue.

6. Repairs

N/A.

Feedback: (placeholder)
Work Items:
None
INC 51000000991822
Sev2 Mitigated

Title: MCP tools schema changed
DRI: Omar / Ajinkya / Nicholas
Created: 2026-04-21
Mitigated: 2026-04-24

1. Customer Complaint

MCP tools/list response schema changed, breaking customer integration. Nested objects in schema being flattened — e.g., expected {"location": {"type": "string"}} but received {"location_type": "string"}.

2. What customer saw

Different payload structure post-upgrade. MCP tool definitions returned flattened properties instead of nested objects.

3. CSS telemetry

Schema mismatch confirmed. CSS engineer (tehnoonr) independently identified as "a regression in MCP server implementation with the latest build — nested objects in the schema are being flattened." Upgrade telemetry confirmed version change on 2026-04-17 at 00:58:35 UTC.

4. Diagnosis & Fix

Breaking change in v0.51.3757.0 — two bugs:
(1) Intentional schema unwrapping in OperationExtensions.GetInputSchema()
(2) Body reconstruction bug in InvokeToolHandler.ExecuteMethodAsync() where newPayload is overwritten per iteration.

Rollback to v0.50.3674.0 + service quarantine. Ashendre mitigated Apr 24.

5. Monitoring Miss?

Yes — no API contract tests to detect breaking schema changes in MCP tool definitions before release.

6. Repairs

• Schema versioning for MCP tool definitions
• Contract tests for MCP tools/list response structure to prevent future regressions
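A contract test of the kind proposed here could pin the nested key paths of a known-good schema and fail when a new build drops any of them. The sketch below is an assumption about how such a check might look, not the team's test suite; the function names are hypothetical and the two sample shapes are the ones from the customer report.

```python
def schema_shape(node, prefix=""):
    """Return the set of dotted key paths found in a nested dict."""
    paths = set()
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= schema_shape(value, path)
    return paths


def contract_violations(baseline, candidate):
    """Key paths present in the pinned baseline but missing from the candidate."""
    return sorted(schema_shape(baseline) - schema_shape(candidate))


# Shapes taken from the customer report: nested (expected) vs flattened (received).
expected = {"location": {"type": "string"}}
flattened = {"location_type": "string"}

print(contract_violations(expected, expected))   # no missing paths
print(contract_violations(expected, flattened))  # nested paths that disappeared
```

Comparing key paths rather than whole payloads keeps the test tolerant of additive changes while still catching flattening.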

Feedback: (placeholder)
Work Items:
To Do Schema versioning for MCP tool definitions
To Do Contract tests for MCP tools/list
INC 51000001002629
Sev2 Mitigated

Title: UNIFIED STRATEGIC | Intermittent connection failure
DRI: Ethan
Created: 2026-04-30
Mitigated: 2026-04-30 02:43 UTC
TTM: 1h 38m
Customer: Strategic/Unified
Service: enterprise-int-apim-prod (Premium, Central US)

1. Customer Complaint

Customer reported intermittent connection failures affecting ~20% of requests from AKS cluster to APIM service "enterprise-int-apim-prod" (Premium SKUv1, Central US). Failing with ECONNREFUSED 10.12.2.228:443 when calling Salesforce API through APIM. Issue began after OS upgrade on 2026-04-29 ~12:00 UTC.

2. What customer saw

GraphQL operation failures in AKS app (msol-content-bff): "connect ECONNREFUSED 10.12.2.228:443." IP is the APIM Internal Load Balancer VIP. Path: AKS (separate VNET/sub) → Hub VNET with Palo Alto firewall → APIM Internal VNET. TCP-level RST packets prevented requests from reaching gateway.

3. CSS telemetry

CSS queried ProxyRequest for the failing URL and found only HTTP 200/204 — failing requests never reached APIM gateway. CRP telemetry confirmed VirtualMachineScaleSets.AutoOSUpgrade.POST upgrading to Windows Server 2022 image 20348.5020.260413. Network traces showed bidirectional RST packets — firewall saw resets from LB IP 10.12.2.228, APIM nodes saw resets from AKS pod IP 10.17.107.7.

4. Diagnosis & Fix

Faulty VM (gwhost_02) after OS upgrade. SRE Agent confirmed all 4 VMs healthy post-upgrade with continuous heartbeats. DRI (ethanlao) broke down ClientConnectionFailure by RoleInstance revealing almost all failures from gwhost_02. ReplaceVM initiated via Geneva Action (enterprise-int-apim-prod_ManageCompute_ead74155). After replacement, ClientConnectionFailures ceased. Mitigated.

5. Monitoring Miss?

Yes. No LSI filed before CRI. Per-VM ClientConnectionFailure pattern on gwhost_02 not detected by existing monitoring. DRI feedback: "SRE Agent did not check telemetry by role instance, which clearly shows faulty node." Per-VM error distribution monitoring could have caught this earlier.

6. Repairs

Transferred to Platform for RCA. Root cause: "Service/VM Issue."
DRI improvement: SRE Agent should check telemetry by RoleInstance to identify faulty nodes.
No additional repair items documented.
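The RoleInstance breakdown the DRI describes can be approximated by comparing each instance's error count with the fleet median. This is a sketch under assumptions: the threshold values, function name, and sample counts are hypothetical and are not taken from incident telemetry.

```python
from statistics import median


def outlier_instances(errors_by_instance, ratio=5.0, min_errors=50):
    """Flag instances whose error count is far above the fleet median.

    ratio and min_errors are illustrative defaults, not tuned values.
    """
    if not errors_by_instance:
        return []
    baseline = median(errors_by_instance.values())
    floor = max(baseline * ratio, min_errors)
    return sorted(
        name for name, count in errors_by_instance.items() if count > floor
    )


# Hypothetical ClientConnectionFailure counts for a 4-VM service.
counts = {"gwhost_00": 3, "gwhost_01": 5, "gwhost_02": 412, "gwhost_03": 4}
print(outlier_instances(counts))
```

Using the median keeps one bad node from inflating the baseline it is compared against; the absolute floor avoids alerts when every count is tiny.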

Review Note: Requires review — faulty VM, moved to platform for RCA
Work Items:
To Do Per-VM error distribution monitoring
INC 51000001005278
Sev2 Mitigated

Title: Requests from APPGW to APIM - 504 Timeout
DRI: Ethan
Created: 2026-05-01
Mitigated: 2026-05-02 01:08 UTC
TTM: 1h 58m
Service: apim-prod-eu-01 (Premium, Internal VNET, WEU)

1. Customer Complaint

Customer reported intermittent 504 timeout errors from their Application Gateway (agw-apim-prod-eu-01-pri-int) when making requests to APIM service (apim-prod-eu-01) in West Europe. Issue began ~2026-04-28T23:18 UTC. Failing calls never reached APIM.

2. What customer saw

Intermittent HTTP 504 (Gateway Timeout) from Application Gateway routing to Premium multi-region internal VNet APIM. Successful requests worked normally. Failed requests returned 504 with no entries in APIM GatewayLogs — requests did not reach APIM. Issue specific to West Europe region.

3. CSS telemetry

CSS confirmed no ProxyRequest entries for failed requests (only 200/204 for successful ones). P97 latency under 400ms (no backend slowness). Error reasons >99% empty. However, identified increase in ClientConnectionFailure errors correlating with problem start and unhealthy patterns on APIM Internal Load Balancer (ILB) beginning when customer issue appeared.

4. Diagnosis & Fix

Faulty VM (gwhost_33). Customer-reported issue start time (2026-04-28 23:15 UTC) correlated with increase in ConnectionIdle errors in HttpSys and BackendConnectionFailures specifically on gwhost_33. ReplaceVM operation initiated (apim-prod-eu-01_ManageCompute_63b54556). After VM replacement completed, incident mitigated.

5. Monitoring Miss?

Yes. No LSI filed before CRI. gwhost_33 had ConnectionIdle and BackendConnectionFailure errors since 2026-04-28 but no automated alert fired. Issue discovered only via customer support case filed 2026-05-01 — approximately 3 days after issue began.

6. Repairs

No repair items documented.

Review Note: Requires review — faulty VM, moved to platform for RCA
Work Items:
To Do Alert on per-VM ConnectionIdle/BackendConnectionFailure
INC 21000001017622
Sev2 Active
Active → Sev3

Title: Inconsistent Responses Observed for Azure APIM Cache
DRI: Nima Kamoosi (nimakamoosi)
Created: 2026-05-11
Build: 0.51.27763.0
Scope: Built-in cache only (not external Redis)

1. Customer Complaint

Customer reports cached responses returning truncated data (~2 MB instead of expected ~7 MB). Silent truncation with no error codes or indicators. Impact on data integrity — customers unknowingly serving incomplete payloads downstream.

2. What customer saw

Responses that should be ~7 MB returned as ~2 MB when served from built-in cache. No error codes, no truncation headers — completely silent data loss. Only affects built-in cache; external Redis cache returns full responses correctly.

3. CSS telemetry

Regression tied to build 0.51.27763.0. ~2 MB truncation boundary suggests hardcoded buffer size or misconfigured limit introduced in that build. External Redis unaffected confirms issue is in the built-in (in-memory) cache path, not the caching policy logic itself.

4. Diagnosis & Fix

Cache truncation regression in build 0.51.27763.0. ~2 MB silent truncation boundary for built-in cache responses. Root cause likely in Proxy/Gateway.Policies cache provider or in-memory cache buffer sizing. Customer workaround: skip caching for responses >2 MB. Fix pending — offending commit not yet identified publicly.

5. Monitoring Miss?

Yes. Silent truncation means no observable failure signal. No validation exists to compare cached response size vs original response size. No error/warning emitted when response is truncated.

6. Repairs

• Fail-loud behavior: emit error/warning when cached response is truncated (never silently serve partial data)
• Identify and fix the ~2 MB buffer limit introduced in 0.51.27763.0
• Add cache integrity validation (compare stored size vs expected Content-Length)
• Proactive notification to customers who may be affected but haven’t reported
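The fail-loud and integrity-validation items could be combined into a size check on the cache read path. The sketch below assumes the origin response carried a Content-Length that was stored with the entry (chunked responses would need a separately recorded size); all names are hypothetical.

```python
class TruncatedCacheEntry(Exception):
    """Raised when a cached body is not the size the origin declared."""


def checked_cache_body(body, expected_length):
    """Return the cached body only if its size matches the recorded length."""
    if len(body) != expected_length:
        raise TruncatedCacheEntry(
            f"cached body is {len(body)} bytes, origin declared {expected_length}"
        )
    return body


def serve(cached_body, expected_length, fetch_origin):
    """Serve from cache when intact; otherwise warn and fall back to origin."""
    try:
        return checked_cache_body(cached_body, expected_length)
    except TruncatedCacheEntry as err:
        print(f"warning: {err}; bypassing cache")
        return fetch_origin()


# Small stand-ins for the ~7 MB original and the ~2 MB truncated copy.
original = b"7 bytes"
truncated = original[:2]
print(serve(truncated, len(original), lambda: original))
```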

Feedback: Silent truncation is a serious data integrity issue. Consider Sev2 retention given customers cannot programmatically detect corruption. Need repair item for fail-loud on truncation.
Work Items:
Pending — awaiting fix identification
INC 797275109
Sev2 Resolved
Sev 2→3: incomplete info

Title: Bleu - API Management Portal - Acceptance Testing Failed
DRI: Javier (javierbo)
Created: 2026-05-13
Resolved: 2026-05-14

1. Customer Complaint

Bleu cloud acceptance testing — APIM Standard instance portal returning HTTP 503.

2. What customer saw

https://{instance}.portal.azure-api.sovcloud-api.fr returned "HTTP Error 503. The service is unavailable."

3. CSS telemetry

Incomplete information provided. Could not reproduce.

4. Diagnosis & Fix

Unable to Reproduce. Information provided was incomplete, preventing investigation. Mitigated by javierbo.

5. Monitoring Miss?

N/A.

6. Repairs

N/A.

Feedback: (placeholder)
Work Items:
None
INC 21000001021441
Sev2 Resolved
Sev2→3: Customer capacity

Title: CCF and High capacity | NTT
DRI: Javier Borrego (javierbo)
Created: 2026-05-13
Resolved: 2026-05-14 16:46 UTC
Duration: 23h 18m
Service: samurai-mdr-prod-northeurope-ff7fqjl6 (Basic SKU, North Europe)
Customer: NTT

1. Customer Complaint

High capacity and increased Client Connection Failures (CCFs) and timeouts across all APIs, severely impacting production workloads on a Basic SKU APIM service in North Europe.

2. What customer saw

Connection closed events, timeouts across all APIs. Azure Front Door in front of APIM showed timeouts on ping tests. Issue started ~10 AM Eastern May 13. Scaling out at 16:20 UTC did not immediately resolve CCFs.

3. CSS telemetry

AppLens detected capacity above 75%. SRE Agent analysis: 8x traffic surge over baseline (58K → 490K req/hr), CCF rates 22–27% uniform across ALL VMs (systemic, not per-VM fault). “events” API responsible for >90% of errors. Backend p95 latency 90–140s causing gateway to hold connections and saturate CPU. BackendConnectionFailure secondary to gateway overload. No platform deployment or VNET issues found.

4. Diagnosis & Fix

Customer capacity issue — Basic SKU saturated by 8x traffic surge.

Root cause: Basic SKU (1 unit, 2 VMs) overwhelmed by sustained traffic escalation over 3 days, peaking at ~490K req/hr on May 13. Backend services responding at p95 latency of 90–140s, causing gateway VMs to hold connections until CPU exhaustion. Clients then closed connections (CCF). Secondary BCF errors to backend 20.105.12.67:443 as gateway lost ability to maintain outbound connections.

Mitigation: Scale-out added VMs (gwhost_6 through gwhost_16), reducing 5xx from 10% to 0.23%. Customer upgraded backend service to higher SKU to resolve latency. Javier downgraded Sev2→Sev3 and resolved.

5. Monitoring Miss?

No — AppLens correctly detected capacity >75%. This is a customer workload/SKU sizing issue, not a platform failure.

6. Repairs

N/A — customer needs to upgrade from Basic SKU and optimize backend latency. Recommendations provided: evaluate Standard/Premium SKU, investigate “events” API traffic source, implement exponential backoff on client retries.
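For the client-retry recommendation, a minimal sketch of exponential backoff with full jitter follows. The base delay, cap, attempt count, and the choice to retry on ConnectionError are assumptions for illustration, not values from the incident.

```python
import random
import time


def backoff_delays(retries, base=0.5, cap=30.0, rng=random.random):
    """Full-jitter delays: each is uniform in [0, min(cap, base * 2**n))."""
    return [rng() * min(cap, base * 2 ** n) for n in range(retries)]


def call_with_retries(operation, attempts=5, sleep=time.sleep):
    """Call operation(), retrying on ConnectionError with backoff between tries."""
    delays = backoff_delays(attempts - 1)
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the last error
            sleep(delays[attempt])


# Demo with a hypothetical flaky call that fails twice, then succeeds.
outcomes = iter([ConnectionError(), ConnectionError(), "ok"])


def flaky():
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


slept = []  # record delays instead of actually sleeping
result = call_with_retries(flaky, sleep=slept.append)
print(result, len(slept))
```

Jitter matters here because synchronized retries from many clients would otherwise re-create the surge that saturated the gateway.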

Feedback: Well-handled. SRE Agent triage rated "Effective" — correct root cause, good Kusto analysis, HIGH confidence hypothesis aligned with engineer conclusion. No improvements needed.
Work Items:
None — customer action required
INC 21000001014164
Sev2 Active
Active → Sev3

Title: Tool not visible when converting existing API | MCP conversion
DRI: Nicholas (nbarreca)
Created: 2026-05-08
Related: ICM 21000000966326

1. Customer Complaint

Converting OAS API (~150 resources) with deep nested allOf/anyOf/oneOf/$ref into MCP server — only 3–4 tools visible instead of full set. 100% failure rate on MCP tools/list endpoint. Customer platform blocked.

2. What customer saw

"No tools available" or MCP error -32001 (Request timed out). ClientConnectionFailure: connection unexpectedly closed. Conversion completed without error but silently dropped ~146 operations.

3. CSS telemetry

NullReferenceException at OperationExtensions.GetObjectDefinition():198 — definition.Properties is null when OpenAPI uses allOf/$ref composition. Stack trace confirms pre-fix binary deployed (method signature lacks HashSet<string> visited parameter). 46 errors on this service, 1,333+ errors across 18 services globally in 7 days.

4. Diagnosis & Fix

Known product defect — fix merged but NOT deployed. NRE in MCP gateway schema parser when allOf/anyOf/oneOf not resolved. Fix (Task 37471067 / ResolveCompositeDefinition()) authored by ondrejoprala Apr 13, merged to main ~late April. SKUv2 rollback to v0.50 means fix was lost. Workaround: flatten OpenAPI spec (replace allOf/anyOf with inline properties).

5. Monitoring Miss?

No monitoring miss per se — functional limitation. But no validation exists to verify all operations successfully converted, and no deployment tracking caught the fix regression.

6. Repairs

Task 37471067 (merged, awaiting deployment). Recommended: (1) Deploy fix urgently, (2) Add conversion validation comparing source op count vs tool count, (3) Surface warnings when operations silently dropped.
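Recommendation (2) could be as simple as counting operations in the source spec and comparing against the number of tools returned. The sketch assumes one tool per operation, which is an assumption about the conversion rather than a documented rule; function names and the sample spec are hypothetical.

```python
# HTTP method keys allowed on an OpenAPI Path Item Object.
HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}


def count_operations(spec):
    """Count operations declared under `paths` in an OpenAPI document (as a dict)."""
    return sum(
        1
        for path_item in spec.get("paths", {}).values()
        for key in path_item
        if key in HTTP_METHODS
    )


def dropped_operations(spec, tools):
    """Operations in the spec that have no corresponding tool (assumes 1:1 mapping)."""
    return count_operations(spec) - len(tools)


# Hypothetical three-operation spec and a tools list with one entry.
spec = {
    "paths": {
        "/orders": {"get": {}, "post": {}, "parameters": []},
        "/orders/{id}": {"delete": {}},
    }
}
tools = [{"name": "listOrders"}]

missing = dropped_operations(spec, tools)
if missing > 0:
    print(f"warning: {missing} operation(s) were not converted to tools")
```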

Review Note: Scope: Not isolated — 18 services globally, 1,333+ errors in 7 days. 100% failure rate on MCP operations for affected specs.

Timeline gap: Issue started Mar 13, original ICM 21000000966326 filed Mar 30, fix PR authored Apr 13, merged ~late April, but SKUv2 rollback to v0.50 means fix was lost. This CRI filed May 8 as regression.

DRI: Nicholas (nbarreca)
Work Items:
Task 37471067 - Fix MCP tools/list allOf/anyOf/oneOf (merged, not deployed)
INC 21000001014384
Sev2 Active
Active → Sev3

Title: APIM failed to upgrade and left in unhealthy state
DRI: Nima Kamoosi (nimakamoosi)
Created: 2026-05-08
Service: apiportaltst (WEU, Premium, External VNet)
Customer: NS.NL (Netherlands Railways)
Duration: 9+ days stuck

1. Customer Complaint

APIM service failed during platform upgrade, left in unhealthy state for 9+ days. Hundreds of internal users blocked from dev/testing. Premium service, management endpoint unreachable. Customer escalated multiple times.

2. What customer saw

Service stuck in "Updating" state. Management endpoint unreachable. "Apply Network Settings" also fails. Portal shows unhealthy. No admin operations possible.

3. CSS telemetry

Platform-initiated rollback (0.51.27763.0 → 0.49.25546.0) triggered by Sev1 INC 788655236 (DevPortal). Bootstrapper on target version cannot find machine certificate 916F5EED... — 8 failure cycles (30-min timeout each) over 5 hours. VMSS rolling upgrade failed: 100% unhealthy instances. Orchestration locked permanently.

4. Diagnosis & Fix

Failed platform rollback — certificate incompatibility between versions. Target version 0.49.x requires cert not available on reimaged VMs (provisioned by 0.51.x only). Nima did ForceRuntimeUpgrade to 0.51.28257.0 (retake) on May 12 — partially recovered (1 instance up). Also found 3 tooling bugs: ACIS JSON deserialization error in State=4, NotEligible without reason, wrong version when Release Channel ≠ All.

5. Monitoring Miss?

Yes. (1) Stuck orchestration not detected — no alert for services in failed upgrade state. (2) No auto-recovery mechanism. (3) Collateral damage from Sev1 rollback batch not validated per-service before execution.

6. Repairs

No repair items linked. Recommended: (1) Alert on services stuck in Upgrading >2h, (2) Auto-recovery for failed rollbacks, (3) Validate cert compatibility before cross-version rollback, (4) Fix 3 infra/tooling issues found by Nima.
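Recommendation (1) amounts to a time-in-state check. A minimal sketch, assuming a feed of (name, state, state_since) tuples; the tuple layout and state strings are hypothetical, and only the 2-hour threshold comes from the recommendation.

```python
from datetime import datetime, timedelta, timezone


def stuck_services(services, now, max_age=timedelta(hours=2)):
    """Return names of services that have been in an in-progress state too long."""
    in_progress = {"Updating", "Upgrading"}  # hypothetical state names
    return [
        name
        for name, state, since in services
        if state in in_progress and now - since > max_age
    ]


now = datetime(2026, 5, 8, 12, 0, tzinfo=timezone.utc)
fleet = [
    ("svc-a", "Updating", now - timedelta(hours=5)),     # stuck
    ("svc-b", "Updating", now - timedelta(minutes=20)),  # still within budget
    ("svc-c", "Succeeded", now - timedelta(days=3)),     # not in progress
]
print(stuck_services(fleet, now))
```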

Review Note: Collateral damage from Sev1 (788655236) rollback. Service stuck 9+ days. Preview release channel auto-included in rollback batch without per-service validation.

Nima found 3 tooling issues during mitigation: ACIS JSON deserialization bug, NotEligible without explanation, wrong version without Release Channel=All.

Key question: How do we prevent rollback batches from breaking services with cert incompatibility?

DRI: Nima Kamoosi
Work Items:
None — repair items recommended
INC 51000000976163
Sev2 Mitigated

Title: TCP connection failed between on-prem to APIM service
DRI: Maxim Agapov / Tuan Nguyen
Customer: DTE Electric Company (ACE)

1. Customer Complaint

DTE Electric Company reported timeout/connection failures from on-prem to APIM (Premium, Internal VNet, Central US). 50% of API calls failing, causing payment failures for hundreds of customers per minute.

2. What customer saw

Intermittent TLS connection failures — TCP handshake succeeded but TLS handshake never completed. gwhost_1 did not respond to TLS ClientHello.

3. CSS telemetry

PCAPs: GW Host 1 resetting TCP sessions. DNS/CRL/AIA endpoints unreachable on gwhost_1 only. Infrastructure-layer per-VM networking degradation.

4. Diagnosis & Fix

Per-VM networking degradation on gwhost_1. Couldn't reach cert validation servers. Auto-healing replaced VM at 03:48 UTC Apr 9. Impact: ~16.5h.

5. Monitoring Miss?

Yes — health checks don't probe per-VM TLS handshake. Auto-healing took ~16h.

6. Repairs

Enhanced LB health probes with TLS validation. Faster auto-healing. CRL soft-fail resilience.

Notes

Central US datacenter infrastructure stress. Related IcMs for same region connectivity issues: 775500895, 775342041, 776075200, 776236264

Feedback: (placeholder)
Work Items:
New Bug 37466473 - Self-service gateway host mitigation
New Bug 37466169 - Telemetry to detect gateway host failures
INC 21000000991208
Sev2 Active
Sev 2→3

Title: Self-Hosted Gateway hangs ~10 min
DRI: Mahsa Sadi
Created: 2026-04-20

1. Customer Complaint

API calls via SHG hang until client timeout (~10 min).

2. What customer saw

Requests hang indefinitely. Path: Client → AWS ELB → SHGW → Netskope → Backend.

3. CSS telemetry

GatewayV2 timing out on large headers. Backend CSP header 7.7KB exceeds HTTP/2 HPACK MAX_HEADER_LIST_SIZE default (8192).

4. Diagnosis & Fix

Code Bug - HTTP/2 HPACK limit 8192 in TcpChannelInitializer vs 65536 for HTTP/1.1. Fix ETA: 2 months.

5. Monitoring Miss?

No - product bug.

6. Repairs

Override Http2Settings to 65536. Expose net.client.http2.max-header-list-size.
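The HTTP/2 specification defines the header list size as the sum, over all header fields, of name length plus value length plus 32 octets of overhead, so the limit applies to the whole list and not to one header alone. The sketch below applies that definition; the header values are hypothetical placeholders sized to resemble the case (a 7,700-byte CSP value plus two other headers), not captured traffic.

```python
def header_list_size(headers):
    """Uncompressed size: name octets + value octets + 32 per header field."""
    return sum(len(name.encode()) + len(value.encode()) + 32 for name, value in headers)


def fits(headers, limit):
    """True if the header list is within the advertised maximum size."""
    return header_list_size(headers) <= limit


# Hypothetical response headers; only the two limits come from the incident.
headers = [
    ("content-security-policy", "x" * 7700),
    ("set-cookie", "s" * 400),
    ("content-type", "application/json"),
]

print(header_list_size(headers))
print(fits(headers, 8192), fits(headers, 65536))
```

Under these assumed values the list overflows the 8192 limit but sits well inside 65536, which is why raising the setting is the proposed repair.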

Feedback: (placeholder)
Work Items:
Fix HPACK limit
INC 51000000996010
Sev2 Active
Sev 2→3: customer issue

Title: HM Electronics SSL CERT_VERIFY_FAILED
DRI: Macko Treder
Created: 2026-04-23

1. Customer Complaint

HM Electronics (Sev A, Premium) - intermittent SSL cert verification failures.

2. What customer saw

Intermittent SSL CERT_VERIFY_FAILED on backend calls.

3. CSS telemetry

Classified as customer issue.

4. Diagnosis & Fix

Customer issue. Downgraded to Sev3.

5. Monitoring Miss?

N/A.

6. Repairs

N/A.

Feedback: (placeholder)
Work Items:
None
INC 51000000996953
Sev2 Active
Sev 2→3: reporting issue

Title: APIM does not scale out
DRI: Martin Dechev
Created: 2026-04-24

1. Customer Complaint

Customer reported being unable to scale out.

2. What customer saw

Scale-out appeared broken.

3. CSS telemetry

Customer actually has 60 instances. Problem is orchestration logs/container size reporting.

4. Diagnosis & Fix

Not a scaling issue - reporting/logging problem. Martin investigating.

5. Monitoring Miss?

No (reporting issue).

6. Repairs

Fix orchestration log reporting.

Feedback: (placeholder)
Work Items:
None
INC 21000000998761
Sev2 Resolved

Title: APIM service is down
DRI: Macko Treder / Gleb Feoktistov
Created: 2026-04-25

1. Customer Complaint

Customer reported APIM service completely down.

2. What customer saw

Service unreachable, all traffic returning 500s.

3. CSS telemetry

Auto OS rolling upgrade triggered destructive VMSS model update from scale-out. All VMs lost Redis. Health monitor deadlocked.

4. Diagnosis & Fix

Scale-out triggered destructive VMSS upgrade. ~7h outage. Macko mitigated with Reboot Apr 25.

5. Monitoring Miss?

Yes - cascading failure not caught until customer reported.

6. Repairs

Rolling upgrade guardrails. Redis resilience. Reboot as preferred first mitigation.

Feedback: (placeholder)
Work Items:
None
INC 51000001003804
Sev2 Mitigated

Title: APIM Scale out failure
DRI: Maxim A / Ethan Lao
Created: 2026-04-30
Service: apim-ads-cus-entbusops-prd-001 (Premium, Central US)
Impact: ~5,000 employees affected
Activations: 4 distinct activations

Activation 1 (Apr 30 16:39 – 18:40 UTC)

1. Customer Complaint

Scale-out operation for Premium APIM service in Central US failing consistently. Previously working fine. ~5,000 customer employees affected by inability to scale. Service running at capacity above 75%.

2. What customer saw

Service failed to scale. Service running at capacity above 75%. Spike of 5xx.

3. CSS telemetry

Failed scale orchestration.

4. Diagnosis & Fix

Scale operation failed with error: VM 'gwhost_1431' has not reported status for VM agent or extensions. No logs in ApiSvcHost. Replaced node 1431; scale completed.


Activation 2 (Apr 30 20:21 – 22:50 UTC)

1. Customer Complaint

Service functionality issues: new spike of 5xx.

2. What customer saw

New spike of 5xx on the service.

4. Diagnosis & Fix

A node constantly reported "Timer_ConnectionIdle"; node 1450 was replaced and the scale completed. gwhost_1453 showed more client connection failures than other nodes and was also replaced.


Activation 3 (May 1 07:55 – 10:51 UTC)

1. Customer Complaint

Service functionality issues: new spike of 400/429.

2. What customer saw

New spike of 400/429 on the service; failed requests.

4. Diagnosis & Fix

Traffic burst to "authenticationhelperapi"; the rate-limiting policy blocked ~25% of the traffic, returning 429 and 400. Limit temporarily increased to unblock, then restored. Service scaled manually.


Activation 4 (May 1 12:09 – 13:54 UTC)

1. Customer Complaint

Service functionality issues: new spike of 5xx.

2. What customer saw

New spike of 5xx on the service.

4. Diagnosis & Fix

All nodes returned 5xx. Not an APIM issue.

Feedback: (placeholder)
Work Items:
None
INC 51000000976518
Sev2 Mitigated

Title: 408 request time out errors
DRI: Macko

1. Customer Complaint

Intermittent 408 request timeout errors on Premium SKU v1 service.

2. What customer saw

Clients received 408 timeouts. Requests never appeared in ProxyRequests logs.

3. CSS telemetry

PCAPs: client reusing source ports too quickly. SYN packets with reused ephemeral ports in TIME_WAIT state.

4. Diagnosis & Fix

Customer Error. Client reusing source ports aggressively with improper connection pooling. Customer advised to fix. Macko mitigated Apr 9.

5. Monitoring Miss?

No — client-side issue.

6. Repairs

N/A.

Feedback: (placeholder)
Work Items:
None
INC 51000000976727
Sev2 Mitigated

Title: Issue in connecting to APIM workspace gateway
DRI: Rafal
Customer: Healthcare (S500)

1. Customer Complaint

Healthcare S500 — workspace gateway cert expired, 5AM data sync failed. Mission-critical outage.

2. What customer saw

SSL/TLS trust relationship failure on workspace gateway endpoint. Another workspace on same service working fine.

3. CSS telemetry

WorkspaceGatewayWebsiteSslCertificateItemDetails showed expired cert. No ProxyRequest logs (requests never reached gateway).

4. Diagnosis & Fix

Gateway (Workspace) — managed SSL cert expired, not auto-renewed. Tuan renewed cert + shallow update to rotate at-risk certs. Mitigated Apr 9.

5. Monitoring Miss?

Yes — no alert for workspace gateway cert expiration.

6. Repairs

Cert expiration alerting for workspace gateways.
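Expiration alerting reduces to comparing a certificate's notAfter time with the current time. A minimal sketch, assuming the notAfter value is already available as a timezone-aware datetime; the 30-day warning window and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def days_until_expiry(not_after, now):
    """Days remaining before the certificate expires (negative if already expired)."""
    return (not_after - now).total_seconds() / 86400


def expiry_status(not_after, now, warn_days=30):
    """Classify a certificate as 'expired', 'expiring', or 'ok'."""
    remaining = days_until_expiry(not_after, now)
    if remaining <= 0:
        return "expired"
    if remaining <= warn_days:
        return "expiring"
    return "ok"


now = datetime(2026, 4, 1, tzinfo=timezone.utc)
print(expiry_status(now + timedelta(days=8), now))    # inside the warning window
print(expiry_status(now + timedelta(days=200), now))  # far from expiry
print(expiry_status(now - timedelta(hours=1), now))   # already expired
```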

Feedback: (placeholder)
Work Items:
None
INC 51000000976845
Sev2 Mitigated

Title: CX DTE Electric Company - host dropping traffic
DRI: Maxim A
Customer: DTE Electric Company

1. Customer Complaint

Second DTE service (dte-cu-prod-azure-apps-apim-prod) also dropping traffic from degraded gwhost_4.

2. What customer saw

Same pattern as 976163 — traffic drops after TCP handshake, before TLS.

3. CSS telemetry

Same infrastructure-level VM networking degradation. Both DTE services affected by same regional issue.

4. Diagnosis & Fix

VM replacement. Tuan replaced degraded VM Apr 10. Customer RCA delivered.

5. Monitoring Miss?

Yes — same gap as 976163.

6. Repairs

See INC 976163.

Feedback: (placeholder)
Work Items:
See INC 976163
INC 776736824
Sev2 Mitigated
ACE Declared Outage

Title: <ACE Declared Outage> DTE Energy host dropping traffic
DRI: Maxim A

1. Customer Complaint

ACE Declared Outage wrapper for DTE Energy issue (SR 2604090040004283).

2-4.

Same as INC 976845. Mitigated by Tuan Apr 10 at 16:20 UTC.

5. Monitoring Miss?

Yes — same.

6. Repairs

See INC 976163.

Feedback: (placeholder)
Work Items:
See INC 976163
INC 777722043
Sev2 Mitigated
Outage Declared

Title: Huge number of 500s (ExpressionValueValidationFailure on cache-value)
DRI: Macko
Customer: AOAI / Cognitive Services

1. Customer Complaint

AOAI team — HTTP 500 from ExpressionValueValidationFailure on cache-value policy. Impacted cognitivewcusprod (3,590 errors/3h).

2. What customer saw

HTTP 500 for management/update actions. cache-value refresh-after evaluated outside valid range [1, 2147483647].

3. CSS telemetry

Unintended upgrade to unsupported build via misconfigured orchestration + ForceUpgrade feature flag bypassing SDP.

4. Diagnosis & Fix

Service - Configuration. ForceUpgrade deployed unsupported version. Rollback + unlock stuck services + disable ForceUpgrade + quarantine. Rapopescu mitigated Apr 11.

5. Monitoring Miss?

Partially — ForceUpgrade bypassed release safeguards.

6. Repairs

Disable/restrict ForceUpgrade. Validate cache-value at compilation time.
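For literal refresh-after values, the range quoted in the error, [1, 2147483647], can be enforced when the policy is saved rather than at request time. This is a sketch of such a check with hypothetical names; values produced by policy expressions would still need a runtime guard.

```python
REFRESH_AFTER_MIN = 1
REFRESH_AFTER_MAX = 2_147_483_647  # 2**31 - 1, upper bound quoted in the error


def validate_refresh_after(value):
    """Return value if it is an integer within the valid range, else raise."""
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError(f"refresh-after must be an integer, got {type(value).__name__}")
    if not REFRESH_AFTER_MIN <= value <= REFRESH_AFTER_MAX:
        raise ValueError(
            f"refresh-after {value} is outside "
            f"[{REFRESH_AFTER_MIN}, {REFRESH_AFTER_MAX}]"
        )
    return value


print(validate_refresh_after(60))
```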

Feedback: (placeholder)
Work Items:
New PBI 37511380 - ACIS quarantine AOAI Hub
PBI 37511341 - Dedicated release channel
INC 778045793
Sev2 Mitigated

Title: Content safety timeout/failure
DRI: Tuan

1. Customer Complaint

Content Safety / AOAI API requests failing consistently after build rollback.

2. What customer saw

HTTP 408 timeouts. Systematic failures after onset.

3. CSS telemetry

APIM → RP (checkAccess) → connection failure → 408. Build rolled back but background refresh config remained → invalid combination.

4. Diagnosis & Fix

Invalid build + config combination after rollback. Disabled background refresh on affected services.

5. Monitoring Miss?

Validate config compatibility on rollback.

6. Repairs

See INC 777722043.

Feedback: (placeholder)
Work Items:
See INC 777722043
INC 21000000983047
Sev2 Mitigated

Title: Custom domain cert update failed - AzureFirstPartyServiceTag
DRI: Tom
Service: shared-apim-eas-prd-01 (East Asia)

1. Customer Complaint

Custom cert update failed: "Unable to Update API service with vnet injection." Cert expiring Apr 17.

2. What customer saw

IPTagsCannotBeModifiedForExistingStaticPublicIPAddresses — RP tried to add IP tags to existing static PIP.

3. CSS telemetry

BetaFeature for IPTags applied globally as wildcard, modifying existing PIPs (immutable).

4. Diagnosis & Fix

RP Regression. Tom removed AzureFirstPartyServiceTag config Apr 14. Resolved Apr 16.

5. Monitoring Miss?

Yes — global wildcard rule not caught in deployment review.

6. Repairs

Fix BetaFeature to not apply IPTags to existing static PIPs.

Feedback: (placeholder)
Work Items:
Done PBI 37527374 - Log all code paths
Done PBI 37527339 - Disallow * scope
Resolved Bug 37526292 - Block on 3P services
INC 51000000982831
Sev2 Mitigated

Title: APIM stuck on updating state for over 6 hours
DRI: Macko
Service: core-live-we-0f3d-apim (Premium, WEU)

1. Customer Complaint

APIM stuck in "Updating" for 6+ hours. All critical public-facing apps down due to expired cert.

2. What customer saw

Cert update triggered ~200min process that got stuck. VMSS deployment failed (Conflict). Only 2/6 VMs serving.

3. CSS telemetry

Initial VMSS failure at 09:51 UTC (3h15 after start). Retry stuck 2+ hours with no logs.

4. Diagnosis & Fix

RP — VMSS update stuck. Retry mechanism stuck with no logging. Macko mitigated Apr 15 with ad-hoc steps.

5. Monitoring Miss?

Yes — no alert for long-running stuck updates.

6. Repairs

Alert on operations exceeding expected duration with no progress.

Feedback: (placeholder)
Work Items:
None
INC 21000000984572
Sev2 Mitigated

Title: APIM default gateway cert expired – not auto-renewed
DRI: Tuan

1. Customer Complaint

Default gateway cert expired Apr 9. Microsoft-managed TLS cert not auto-renewed. Production down.

2. What customer saw

NET::ERR_CERT_DATE_INVALID. All HTTPS traffic blocked on default hostname.

3. CSS telemetry

Cert expired, entirely Microsoft-managed. Customer cannot renew/replace.

4. Diagnosis & Fix

Gateway (Managed) — auto-renewal failed. Tuan ran ACIS to renew Apr 15. Rafal resolved Apr 21.

5. Monitoring Miss?

Yes — no alert for managed cert approaching expiry.

6. Repairs

Cert expiration alerting for managed certs.

Feedback: (placeholder)
Work Items:
None
INC 779770640
Sev2 Mitigated

Title: Emerging Issue - Unable to deploy PremiumV2 in East US2
DRI: Tuan
Impact: since ~March 26 (~3 weeks)

1. Customer Complaint

Multiple customers unable to deploy PremiumV2 in East US 2. Misleading error about activation limits.

2. What customer saw

"PremiumV2 SKU activation limit reached." Actual cause: ApiServicePrepoolExhaustedException (HTTP 503). Issue ~3 weeks old.

3. CSS telemetry

Kusto: HTTP 503 with ApiServicePrepoolExhaustedException in East US 2.

4. Diagnosis & Fix

Resource Pool Exhaustion. Dan closed East US 2 for PremiumV2 activations Apr 21. Some capacity for exceptions.

5. Monitoring Miss?

Yes — no alert for prepool exhaustion. Persisted ~3 weeks.

6. Repairs

Fix misleading error message. Proactive prepool exhaustion alerting per SKU/region. Capacity planning.

Feedback: (placeholder)
Work Items:
None
INC 780345525
Sev2 Mitigated

Title: APIM not sending Email notifications
DRI: Omar
Created: 2026-04-15
Impact: 2026-04-15 01:30

1. Customer Complaint

Multiple customers - APIM not sending any emails. Dev Portal password resets, invitations, admin notifications all broken.

2. What customer saw

No emails from apimgmt-noreply@mail.windowsazure.com. Flows completed in UI but no email arrived.

3. CSS telemetry

No failures in APIM email pipeline. No active SMTP outages. Issue ran ~6 days before the incident was declared.

4. Diagnosis & Fix

DDoS email flood - Free Trial subscriptions flooded emails during DDoS, SMTP flagged APIM address as spam. Dan fixed with hotfix Apr 21.

5. Monitoring Miss?

Yes - no email delivery success rate monitoring. ~6 days elapsed.

6. Repairs

Rate-limiting. Email delivery metrics. Drop-off alerting. Block FreeTrial emails.

Feedback: (placeholder)
Work Items:
Done Task 37580153 - Disable EMAIL Free Trial
Done Task 37594340 - Throttling on Quota
Done Task 37631504 - Block FreeTrial emails
INC 780717094
Sev2 Resolved

Title: Australia East activations fail - DNS quota
DRI: Tom
Created: 2026-04-16
Customer: AOAI Hub

1. Customer Complaint

AOAI Hub - Workspace/Hub gateway activations failing in Australia East.

2. What customer saw

Activation failures. New gateways not provisioned.

3. CSS telemetry

DNS record quota exhausted (10k limit).

4. Diagnosis & Fix

Capacity - DNS team increased 10k to 30k Apr 16. Resolved Apr 23.

5. Monitoring Miss?

Yes - no DNS quota alerting.

6. Repairs

DNS quota monitoring at 80%. Cleanup deprovisioned gateways. Dedicated DNS zones for AOAI Hub.
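
As a rough illustration of the 80% quota monitor, the check below classifies a zone by its record count. The function name and status labels are hypothetical; the 10k and 30k limits are the ones from this incident.

```python
def dns_quota_status(record_count, quota, warn_ratio=0.8):
    """Classify DNS record usage against the zone quota."""
    if record_count >= quota:
        return "exhausted"
    if record_count / quota >= warn_ratio:
        return "warning"
    return "ok"

# At the old 10k limit the zone was exhausted; the same count is fine at 30k.
print(dns_quota_status(10_000, 10_000))
print(dns_quota_status(10_000, 30_000))
```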

Feedback: (placeholder)
Work Items:
Done PBI 37567190 - Alert DNS records
Done PBI 37565539 - DNS in resource pools
PBI 37566932 - Dedicated DNS zones
New PBI 37660220 - DNS in RCM
INC 21000000989825
Sev2 Mitigated

Title: Workspace gateway not working
DRI: Omar (with Kriti)
Created: 2026-04-17
Customer: S500 Healthcare

1. Customer Complaint

S500 healthcare - workspace gateway unreachable, production impacted.

2. What customer saw

Gateway connectivity failure. Events not processed.

3. CSS telemetry

QueryEventsFailed: 403 AuthenticationFailed. Internal SAS token expired. Recurred on each upgrade (expired credential re-applied).

4. Diagnosis & Fix

Software defect - SAS URL expired, not auto-renewed. Upgrade re-deployed stale credential. Omar and Kriti worked together to resolve — Kriti refreshed SAS Apr 17.

5. Monitoring Miss?

Yes - no credential expiration monitoring.

6. Repairs

Code fix for credential renewal. Health checks for auth validity.
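
A health check for auth validity might start from the SAS URL itself, since a SAS carries its expiry in the se query parameter. The sketch below is an assumption-laden example, not the service's code: the function names and the 14-day margin are hypothetical, and a URL without se is treated as needing attention.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlparse, parse_qs

def sas_expiry(sas_url):
    """Return the SAS expiry (se parameter) as an aware UTC datetime, or None."""
    values = parse_qs(urlparse(sas_url).query).get("se")
    if not values:
        return None
    # fromisoformat() only accepts a trailing "Z" on Python 3.11+, so normalize it.
    expiry = datetime.fromisoformat(values[0].replace("Z", "+00:00"))
    if expiry.tzinfo is None:
        expiry = expiry.replace(tzinfo=timezone.utc)  # date-only values are UTC
    return expiry

def sas_needs_renewal(sas_url, now, margin=timedelta(days=14)):
    """True when the SAS has no readable expiry or expires within `margin`."""
    expiry = sas_expiry(sas_url)
    return expiry is None or expiry - now <= margin
```

Running the same check in the upgrade path would also catch the case seen here, where each upgrade re-applied an already expired credential.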

Review Note: Monitor missing — reviewed
Work Items:
Resolved Bug 37646918 - Event Table conn string
INC 51000001009573
Sev2 Mitigated

Title: Unwanted log entries in APIM Application Insights
DRI: Maxim A
Created: 2026-05-06
Mitigated: 2026-05-06 07:22 UTC
TTM: 1h 3m
Customer: 4 Premium services (Australia)
Related: ICM 21000001008018

1. Customer Complaint

Starting April 28, unwanted log entries for send-one-way-request policy and /ext_cap/v1/req_res_cap path appeared in Application Insights for every API call. Entries unrelated to business processes, causing confusion during debugging. Customer also requested quarantine of their 4 Premium tier services in Australia.

2. What customer saw

Unwanted dependency log entries in App Insights Failures blade for every API call. Entries showed send-one-way-request fire-and-forget calls logged as failed dependencies (ResponseCode=0, Success=false). Initially observed on Developer tier only; Premium tier services had not yet been upgraded to affected gateway version (0.51.x).

3. CSS telemetry

CSS confirmed unwanted log entries. SRE Agent verified via GatewayHeartbeat that all 4 Premium services were on version 0.50.27283.0 (pre-regression), while 0.51.x rollout was 50–70% complete across Australia regions. Root cause: gateway 0.51.x introduced new TrackOutgoingRequestDependencies() method in ApplicationInsightsLogPublisher.cs that tracks all policy-initiated outgoing requests as App Insights dependencies, including fire-and-forget calls.

4. Diagnosis & Fix

0.51 regression in Application Insights dependency tracking. Root cause established in related ICM 21000001008018. Gateway 0.51.x introduced per-policy outgoing request dependency tracking that logged send-one-way-request as failed dependencies (ResponseCode=0). Tom Kerkhove directed Maxim A to apply feature flag "logs.applicationinsights.dependency.legacy": "true" as custom settings on affected services. Settings applied, incident mitigated.

5. Monitoring Miss?

Yes. No LSI filed before CRI. App Insights dependency tracking regression in 0.51.x discovered only through customer reports. False-positive failed dependencies in customer App Insights not covered by platform monitoring.

6. Repairs

• Product fix needed: Exclude send-one-way-request from DependencyTrackingSources, or handle ResponseCode=0 as neutral (not failed) for fire-and-forget policies
• Continued tracking in ICM 21000001008018
• Quarantine impact: Feature flag blocks all 0.51 features. Gateway team to evaluate targeted fix backport to 0.50.x or fast-track in 0.52.x
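
The second product-fix option (treating ResponseCode=0 as neutral for fire-and-forget policies) can be illustrated with a small classifier. This is a sketch of the idea only: the function, the outcome labels and the success range are hypothetical and do not reflect the gateway's ApplicationInsightsLogPublisher code.

```python
# Policies that do not wait for a response; only send-one-way-request is named in this incident.
FIRE_AND_FORGET_POLICIES = {"send-one-way-request"}

def dependency_outcome(policy_name, response_code):
    """Classify an outgoing policy request for dependency logging."""
    if response_code == 0:
        # No response was recorded. For fire-and-forget policies that is expected.
        if policy_name in FIRE_AND_FORGET_POLICIES:
            return "neutral"
        return "failed"
    return "success" if 200 <= response_code < 400 else "failed"
```

With this rule, send-one-way-request calls would no longer appear as failed dependencies, while a send-request call that never received a response would still be reported.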

Review Note: Does not need review
Work Items:
To Do Fix send-one-way-request dependency tracking in 0.52
To Do Evaluate quarantine duration for Premium services
INC 21000000995987
Sev2 Active
Downgraded Sev3

Title: SSL CERTIFICATE_VERIFY_FAILED
DRI: Martin
Created: 2026-04-23
Customer: Network Rail (Mission Critical)

1. Customer Complaint

SSL certificate verification failures on backend calls.

2. What customer saw

SSL CERTIFICATE_VERIFY_FAILED intermittently on backend connections.

3. CSS telemetry

Transferred to Platform for deeper RCA.

4. Diagnosis & Fix

Result of migrating VMs to 1P image. Intermediate certificates (and CRLs) not immediately available because customer NSG blocks outbound port 80 (not configured as documented). Not expected to repeat unless VM replace/reimage without proper NSG configuration. Team actively rebooting VMs for services where bootstrapper complains about the chain. Unconfirmed but lack of intermediates likely temporary — AzSecPack script probably installs them in background.

5. Monitoring Miss?

TBD — related to broader certificate chain emerging issue pattern.

6. Repairs

Cannot suggest repair items as this is part of 1P migration. Active reboots for affected services. Customer must configure NSG properly (outbound port 80) to prevent recurrence on future reimage.

Review Note: Review during emerging issue since related to it
Work Items:
Part of 1P migration — no separate repair items
INC 787095677
Sev2 Mitigated
Outage Declared

Title: Grandfathered Limits not applied after 0.51
DRI: Macko
Created: 2026-04-27

See full incident details in ICM.

⚠ PIR Note: PIR Required on Thursday
INC 788655236
Sev2 Mitigated
Outage Declared

Title: Dev Portal Registration Fails - "User registration is not supported."
DRI: Maxim A
Created: 2026-04-30

See full incident details in ICM.

⚠ PIR Note: PIR required — next Thursday? Roman / Ondrej / Rafal
INC 789605450
Sev2 Active

Title: Intermittent 500s DotNetty.EncoderException (AOAI)
DRI: Ethan
Created: 2026-05-01

See full incident details in ICM.

⚠ PIR Note: Involve Tom for monitoring aspects
INC 791563622
Sev2 Mitigated
Emerging Issue

Title: Incomplete certificate chain for Gateway Endpoint After Upgrade
DRI: glfeokti / tehnoonr
Created: 2026-05-04
Mitigated: 2026-05-04 22:38 UTC
TTM: 1m 37s (tracking LSI)
Impact: Multiple customers (VNET-injected V1 SKU)

1. Customer Complaint

Multiple customers reported that after OS Upgrade maintenance, their APIM gateway endpoint presented an incomplete certificate chain. Filed as emerging issue (CSS-sourced) affecting classic/V1 SKU VNET-injected services where outbound TCP port 80 is blocked. Escalated Sev3→Sev2 due to multiple prior Sev2 CRIs.

2. What customer saw

Clients connecting directly saw incomplete cert chain and SSL handshake failures. Customers with AFD or AppGW in front of APIM received HTTP 502 errors. Manifested specifically after OS upgrade events on VNET-injected V1 SKU services.

3. CSS telemetry

AppLens detected capacity above 75% during impact. No additional telemetry findings beyond automated analysis.

4. Diagnosis & Fix

External/Customer Issue - VNET. Occurs on VNET-injected V1 SKU services where outbound TCP port 80 blocked. Customer fix: (1) allow outbound TCP port 80 to Internet, (2) click "Apply Network Settings" from Portal Network blade.

Note: "Apply Network Settings" alone temporarily mitigates until next OS upgrade — suggests Bootstrapper performs different cert chain assembly during reboot vs re-image. Mitigated as emerging issue tracking LSI since fix is documented customer config change.

5. Monitoring Miss?

Yes. No LSI or proactive alert before CRI. Multiple Sev2 CRIs filed by customers — consistently discovered through customer reports. No existing monitor detects incomplete cert chains on gateway endpoints after OS upgrades.

6. Repairs

• Investigate why Bootstrapper handles cert chain differently during re-image (OS Upgrade) vs reboot
• Create awareness of emerging issue pattern
• No formal repair items attached

Review Note: PIR ??
Work Items:
To Do Investigate Bootstrapper reimage vs reboot cert handling
To Do Cert chain completeness monitor
INC 783036395
Sev2 Active
Emerging Issue

Title: Suspended Services do not recover after subscription renewal
DRI: Omar Macias / Kriti Majumdar
Created: 2026-04-20
Impact: Multiple customers stuck for days

1. Customer Complaint

Customer services were stuck and not recovering. Subscriptions had been re-enabled, but the services did not come back.

2. What customer saw

Service shown as Deleted/Suspended in Portal for days after the customer's subscription was unsuspended. No actual error or banner.

3. CSS telemetry

Services were not being re-created, and there was no telemetry at all for 3 different CRIs. CSS opened an emerging issue.

4. Diagnosis & Fix

Undelete queue blocked by poison pills. With SRE Agent help identified that SubscriptionLifecycleOrchestration runs 5 steps: (1) WarnSuspendedContainers, (2) SuspendWarnedContainers, (3) SuspendAndWarnActiveContainers, (4) TerminateContainersWhoseSubscriptionIsUnRegisteredOrDeleted, (5) ActivateContainersForSubscriptionWhichGotRegistered.

Orchestration hardcoded to max 20 services per iteration. ~14 services persistently failing to undelete (SKUv2 RG quota, P1v3 pool exhaustion). With 12-16 failing per cycle + previous steps filling the bucket, no opportunity to process undelete queue.

Fix: Kriti unblocked 7 services failing with P1v3. Team unblocked SQL size and SKUv2 RG quota failures to free space. Nina made hotfix to skip poison pills so they no longer block queue completely.
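
The starvation described above can be reproduced with a small simulation. It is a simplified model under stated assumptions: the 20-per-iteration cap and ~14 persistently failing services come from this incident, while the 6 slots consumed by earlier steps, the queue ordering (failing services at the head) and all names are illustrative. The skip_poison flag mimics the idea of the hotfix, not its actual implementation.

```python
def run_iterations(queue, poison, budget=20, earlier_steps=6,
                   iterations=5, skip_poison=False):
    """Return the services recovered after `iterations` orchestration runs.

    queue: ordered service ids awaiting undelete; poison: ids that always fail.
    """
    pending = list(queue)
    recovered = []
    known_bad = set()
    for _ in range(iterations):
        slots = budget - earlier_steps  # what is left for the undelete step
        still_pending = []
        for service in pending:
            if skip_poison and service in known_bad:
                still_pending.append(service)  # skipped, costs no slot
                continue
            if slots == 0:
                still_pending.append(service)  # bucket is full, wait for next run
                continue
            slots -= 1
            if service in poison:
                known_bad.add(service)
                still_pending.append(service)  # failed, stays in the queue
            else:
                recovered.append(service)
        pending = still_pending
    return recovered

poison = [f"bad-{i}" for i in range(14)]
healthy = [f"ok-{i}" for i in range(10)]
print(len(run_iterations(poison + healthy, set(poison))))                    # 0
print(len(run_iterations(poison + healthy, set(poison), skip_poison=True)))  # 10
```

In this model the 14 failing services consume every remaining slot on every run, so the healthy services behind them are never attempted until the failures are skipped.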

5. Monitoring Miss?

Yes. No SLA or completion monitoring for undelete. No check on how long a service has been waiting to be undeleted. Undelete is buried inside SubscriptionLifecycleOrchestration as the last possible step. Also found SKUv2 preprovisioning RGs with quota exhaustion (MSI 800/800, ServerFarms 100/100): preprovisioning creates these RGs and leaves them behind regardless of undelete success/failure.

6. Repairs

Done: Skip hotfix rolled out behind beta feature (poison pill bypass)
Needed: Cleanup of quota-exhausted SKUv2 preprovisioning RGs
Needed: Decouple undelete from SubscriptionLifecycleOrchestration
Needed: SLA metric / Alert on undelete consistently failing

Feedback: (placeholder)
Work Items:
Done Skip hotfix (poison pill bypass)
To Do Cleanup quota-exhausted SKUv2 RGs
To Do Decouple undelete from SubscriptionLifecycle
To Do SLA metric / Alert on undelete failing