On-Call Rotation (Sloop)
APIM Incident Manager
▶ INCOMING
Tom Kerkhove (Primary)
Maxim Kim (Primary)
SaiKiran Vukyam (Primary)
Samir Solanki (Primary)
◀ OUTGOING
Tom Kerkhove (Primary)
Rafal Mielowski (Primary)
Shilpa Mani (Primary)
US Sloop
▶ INCOMING
Zhongyuan Ren (Primary)
Bruce Moe (Backup)
◀ OUTGOING
Gleb Feoktistov (Primary)
Zhongyuan Ren (Backup)
EU Sloop
▶ INCOMING
Macko Treder (Primary)
Kriti Majumdar (Backup)
◀ OUTGOING
Srajan Agrawal (Primary)
Maxim Agapov (Backup)
Active Incidents
3 active
AI Summary
Azure Reliability Red Flag requiring APIM to ensure all service calls to dSTS/dSMS use first-party IPs covered by service-specific Service Tags. APIM has untagged IPs calling these sensitive endpoints. Not customer-impacting but mandatory compliance. ETA tag set to 2026-11-30 with 90-day SDP duration. Must submit Service Tag capacity requests by May 28.
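As a rough illustration of the coverage check behind this requirement, the sketch below tests a list of caller IPs against a set of Service Tag prefixes and returns the ones no prefix covers. The function name, the prefixes, and the IPs are all hypothetical (documentation-only address ranges). The real inputs would come from the inventory query and the published Service Tag ranges, neither of which is reproduced here.

```python
import ipaddress


def find_untagged(caller_ips, tag_prefixes):
    """Return the caller IPs that fall outside every Service Tag prefix."""
    networks = [ipaddress.ip_network(prefix) for prefix in tag_prefixes]
    untagged = []
    for raw in caller_ips:
        ip = ipaddress.ip_address(raw)
        # An IP counts as tagged if at least one prefix contains it.
        if not any(ip in net for net in networks):
            untagged.append(raw)
    return untagged


# Hypothetical sample data, not real APIM or dSTS/dSMS addresses.
tag_prefixes = ["203.0.113.0/28", "198.51.100.0/24"]
caller_ips = ["203.0.113.5", "203.0.113.40", "198.51.100.77", "192.0.2.10"]

untagged = find_untagged(caller_ips, tag_prefixes)
print(untagged)
```

Any IP this kind of check returns would be a candidate for a Service Tag capacity request.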
Required Actions
1. Identify untagged IPs via Kusto query on cdocinv.westus2 cluster.
2. Submit Service Tag requests by May 28.
3. Add AzRF.SDPInProgress tag by Jun 4.
Must remain Sev2 — non-tracking escalates to EVP.
● MEDIUM — Compliance deadline
AI Summary
AzureSecurityPack blocked zlib1.dll (Git mingw64) from executing due to Code Integrity policy violation. Unlike previous audit-only detections, this binary was BLOCKED. Remains active — requires binary update or policy exception. Related to 802334013 (mitigated with TSG).
● MEDIUM — Binary blocked, needs update
AI Summary
SKUv2 customer activation success rate dropped below 95% SLA threshold in West Europe with 3+ unique service/subscription failures in the last 3 hours. Potential customer impact for new activations. Requires investigation into activation pipeline failures on api-am2-prod-01-rp.
● HIGH — Customer activations impacted
Emerging Issues & CRIs
3 tracked
Sev 2 · Mitigated · CRI
AI Summary
Customer-reported APIM service completely down due to unknown network connectivity issue. Required manual intervention with ad-hoc steps to restore. Root cause still under investigation — potential network policy or infrastructure issue affecting service reachability.
Sev 2 · Mitigated · CRI
AI Summary
Large enterprise customer (Publix) experienced primary APIM instance becoming completely unresponsive at the control plane level. Unable to scale to alternative region. Escalated to Sev 1 by customer. Mitigated with manual intervention. Critical scenario — highlights need for control plane resilience during regional failures.
Sev 2 · Mitigated · CRI
AI Summary
Customer APIM gateway ps-prod-be-euw-apim-manageprotect2 went completely down in West Europe. Root cause: ApimBootstrapperService timed out with VMExtensionProvisioningError. Gateway could not provision VM extensions needed for the service. Mitigated with ad-hoc steps (likely VM reimage/restart).
Other Mitigated & Resolved Sev2 Incidents
9 closed
Sev 2 · Mitigated · CRI
AI Summary
Customer unable to modify rate limit policy configuration in their APIM instance. Blocking policy management operations. Mitigated with targeted ad-hoc steps. May indicate a serialization or validation issue in policy save path.
Sev 2 · Mitigated
AI Summary
An unplanned platform schedule upgrade affected a customer environment. Required manual intervention to stabilize. Classified as ACE customer engagement.
Sev 2 · Mitigated · CRI
AI Summary
Consumption SKU APIM service experienced underlying App Service platform going down. Customer-impacting — Consumption tier relies entirely on platform health. Mitigated with ad-hoc intervention on the platform side.
AzSysLock — Code Integrity Violations (3x this week)
Sev 2 · Mitigated · 3x / 7 days
AI Summary
Three AzSysLock Code Integrity violations detected for Git-related binaries (zlib1.dll, git.exe). All mitigated with TSG. Same pattern as previous weeks — Git binaries on APIM VMs do not have proper code signing.
● MEDIUM recurrence — Will recur until binaries updated across fleet.
ASM SLAM — Malicious Domain Communication (2x this week)
Sev 2 · Mitigated · 2x / 7 days
AI Summary
Azure Security Monitoring (ASM SLAM) detected communication to domains classified as malicious from APIM infrastructure. Both confirmed as False Alarm. Likely DNS resolution to shared infrastructure IPs that happen to be flagged. No actual malicious activity.
AI Summary
Single Premium SKU NonVNET service became 100% unreachable in Central US with 2+ probing attempts. Resolved using standard TSG. Likely transient VM health issue.
AI Summary
Capacity exception request for Premium tier in UK South for Centrica. Resolved as By Design — standard capacity management process.
Ownership Summary
| Owner | Count | Key Themes |
|---|---|---|
| glfeokti | 7 | Network connectivity CRI, Publix Sev1, Gateway Down WEU, Consumption SKU, unplanned upgrade, ASM SLAM, AzRel Red Flag |
| v-nbudati | 3 | AzSysLock Code Integrity (zlib1.dll, git.exe), ASM SLAM |
| srajagrawal | 1 | Rate limit policy CRI |
| v-kambhavana | 1 | Gateway unreachable Central US |
| shubhash | 1 | Capacity Exception Request (By Design) |
Alert Trends — Top Firing Monitors (7 Days)
Monitors that fired 5+ times

| Monitor / Alert Pattern | Sev | Count | Noise Assessment |
|---|---|---|---|
| ActivateSkuV2 Failed Orchestration (across 4 RP clusters) | SEV3 | 28 | ● HIGH — 7x per cluster (dwc, bn1, am2, aea). Needs auto-heal. |
| [RCM] Region running out of capacity | SEV3 | 17 | ● MEDIUM — Capacity planning alerts. RCM team tracking. |
| [RCM] High compute pool usage impacting scale-outs | SEV3 | 16 | ● MEDIUM — Scale-out contention. Related to capacity alerts above. |
| Orchestrations failing due to insufficient compute quota | SEV3 | 15 | ● MEDIUM — Directly downstream of capacity constraints. |
| SMAPI: Event Delay to Gateway >10 min | SEV3 | 14 | ● HIGH — Multiple instances (US, EU). Continued from prior weeks. |
| SMAPI 0 success rate for last hour | SEV3 | 7 | ● HIGH — Zero success rate = potential customer impact. |
| AppServiceCertificateExpiringSoon | SEV3 | 7 | ● Actionable — Cert renewal needed (EUAP clusters). |
| ApiMgmtSynthetic-Prod Unhealthy | SEV3 | 7 | ● LOW — Distributed synthetic probes. Mostly transient. |
| Cognitive Services LogToEventHub timeout ≥ 1% | SEV3 | 6 | ● LOW — AI Gateway logging path timeouts. |
| [VMSS] Roles down — ApimBootstrapper Package corrupted | SEV3 | 6 | ● MEDIUM — Bootstrapper corruption. Related to Gateway Down CRI. |
| GarbageCollection Orchestration stalled | SEV3 | 6 | ● LOW — GC orchestrations timing out. Auto-restart needed. |
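The table's counts can be cross-checked with a few lines of arithmetic. The sketch below copies four counts from the table, sums the three capacity-related monitors, and divides the ActivateSkuV2 count across the four RP clusters listed. Treating those three monitors as one "capacity" theme is an assumption made for illustration, as are the variable names.

```python
# Counts copied from the table above (7-day window).
counts = {
    "ActivateSkuV2 Failed Orchestration": 28,
    "[RCM] Region running out of capacity": 17,
    "[RCM] High compute pool usage impacting scale-outs": 16,
    "Orchestrations failing due to insufficient compute quota": 15,
}

# Assumed grouping: the three monitors tied to capacity constraints.
capacity_monitors = [
    "[RCM] Region running out of capacity",
    "[RCM] High compute pool usage impacting scale-outs",
    "Orchestrations failing due to insufficient compute quota",
]
capacity_total = sum(counts[name] for name in capacity_monitors)

# RP clusters named in the ActivateSkuV2 noise assessment.
clusters = ["dwc", "bn1", "am2", "aea"]
per_cluster = counts["ActivateSkuV2 Failed Orchestration"] / len(clusters)

print(capacity_total)  # 48
print(per_cluster)     # 7.0
```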
Key Takeaways & Action Items
1. MountainPass SR15 deadline approaching: AzRel Red Flag (803306220) requires Service Tag capacity requests by May 28 and SDPInProgress tag by Jun 4. Non-compliance escalates to EVP. Team must act immediately.
2. SKUv2 Activation failures in West Europe (Active): Customer-impacting activation rate below 95% on api-am2-prod-01. Requires urgent investigation — unlike SKUv1 noise, this is production customer activations failing.
3. High-severity CRIs this week: Publix Sev1 escalation, APIM network connectivity down, Gateway Down WEU, and Consumption platform failure. 4 distinct customer-impacting scenarios in 7 days — above average. Control plane resilience and bootstrapper reliability are recurring themes.
4. Capacity pressure increasing: 48 capacity-related Sev3 alerts (RCM region + pool + quota combined). Orchestrations failing due to insufficient compute is creating CRI risk. RCM team needs to address capacity in affected regions.
5. AzSysLock continues as noise: 4 Code Integrity incidents this week (3 mitigated, 1 still active/blocked). Git binaries remain unsigned across fleet. Systematic binary update or policy exception needed to stop recurring toil.
6. DRI load heavily concentrated: glfeokti handled 7/15 incidents (47%) as US Sloop primary. V-nbudati handled AzSysLock triage (3 incidents). Overall manageable week (15 Sev2+ vs 24 last week, no outages).
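The load figures in the last item can be reproduced from the numbers stated there. This is a minimal sketch using only those three figures. The variable names are illustrative, and the week-over-week percentage is derived here rather than quoted from the report.

```python
# Figures as stated in the takeaway above.
glfeokti_incidents = 7
total_this_week = 15
total_last_week = 24

# Share of this week's Sev2+ incidents handled by one DRI, as a whole percent.
share_pct = round(100 * glfeokti_incidents / total_this_week)

# Week-over-week change in Sev2+ volume (negative means fewer incidents).
change_pct = 100 * (total_this_week - total_last_week) / total_last_week

print(share_pct)   # 47
print(change_pct)  # -37.5
```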