On-Call Rotation (Sloop)
APIM Incident Manager
▶ INCOMING
Rafal Mielowski (Primary)
Shilpa Mani (Primary)
Tom Kerkhove (Primary)
◀ OUTGOING
Alexander Zaslonov (Primary)
Roman Kolesnikov (Primary)
US Sloop
▶ INCOMING
Gleb Feoktistov (Primary)
Zhongyuan Ren (Backup)
◀ OUTGOING
Javier Borrego (Primary)
Gleb Feoktistov (Backup)
EU Sloop
▶ INCOMING
Srajan Agrawal (Primary)
Maxim Agapov (Backup)
◀ OUTGOING
Shubham Sharma (Primary)
Ondrej Oprala (Primary)
Srajan Agrawal (Backup)
Active Incidents & Outages
1 active outage
Sev 2 · Active · Outage · CRI
AI Summary
The Azure Portal APIM blade's "Microsoft Foundry" API import wizard is broken. The wizard hardcodes a lookup for a role definition named "Azure AI User". This role was removed from or renamed in the RBAC catalog between May 8 and May 13. When the lookup returns empty, the Portal code passes null as the role definition ID to the role assignment creation call, causing HTTP 400 errors. The outage is customer-impacting (S500 level), affects multiple services, and has been ongoing since May 13. Responsible team: AI Foundry (ServiceId 24271).
Customer Mitigation
Customers can use the REST API to perform the import with the correct role identifier. A Portal fix is pending from the AI Foundry team. Javier Borrego will post updates by 5/20 10am PST.
● HIGH - Active, customer-impacting
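The failure mode described above (an empty lookup result passed on as a null role definition ID) can be illustrated with a minimal Python sketch. This is not the Portal's actual code. The catalog dictionary, the function names, and the placeholder IDs are all hypothetical, and the request body shape is an assumption modeled on the general form of an Azure role assignment request. The point is only that validating the lookup result before building the request turns a downstream HTTP 400 into an explicit, early error.

```python
from typing import Dict, Optional


def find_role_definition_id(catalog: Dict[str, str], role_name: str) -> Optional[str]:
    """Return the role definition ID for role_name, or None if the catalog has no such role.

    `catalog` is a hypothetical name -> ID mapping standing in for the RBAC catalog.
    """
    return catalog.get(role_name)


def build_role_assignment_body(role_definition_id: Optional[str], principal_id: str) -> dict:
    """Build a role assignment request body, refusing to continue without a role definition ID."""
    if not role_definition_id:
        # Fail here with a clear message instead of sending null and receiving an HTTP 400.
        raise ValueError("role definition lookup returned no result; cannot create role assignment")
    return {
        "properties": {
            "roleDefinitionId": role_definition_id,
            "principalId": principal_id,
        }
    }


# Hypothetical catalog in which the hardcoded role name is no longer present.
catalog = {"Some Other Role": "placeholder-role-definition-id"}
role_id = find_role_definition_id(catalog, "Azure AI User")

try:
    build_role_assignment_body(role_id, "placeholder-principal-id")
except ValueError as err:
    print(f"import blocked early: {err}")
```

Looking the role up by a stable identifier rather than by display name would avoid the dependency on the name altogether, but which identifier is correct here is for the AI Foundry team to confirm.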
Emerging Issues
1 tracked
AI Summary
Customer-reported issue where APIM is not refreshing the certificate for a Traffic Manager profile under a custom domain. Mitigated with manual intervention. Tracked as emerging issue since certificate refresh failures may affect other customers with similar Traffic Manager + custom domain configurations.
Other Mitigated & Resolved Sev2 Incidents
22 closed
Sev 2 · Mitigated · LSI
AI Summary
BRAIN detected anomalous Gateway success rate drop in Central US EUAP (impact start 13:43 UTC). Customer-impacting flag set. Self-recovered after ~5 hours. RCA still needed. Classified as transient by DRI.
AzurePortalWAWSAlert — Blades Failed to Load (5x this week)
Sev 2 · Mitigated · 5x / 7 days
AI Summary
Portal blade load failures fired 5 times this week (all transient, self-resolved). Same pattern as previous weeks: 5+ users see ServiceBlade failures barely exceeding threshold. Backend healthy each time.
● Fires ~daily (noise) - Action needed: Raise threshold or add sustained-failure criteria.
SKUv1 Activation SuccessRate Below 95% (4x this week)
Sev 2 · Mitigated · 4x / 7 days
AI Summary
SKUv1 activation SLA breached 4 times this week, all on EUAP or regions with disabled customer activations. Zero customer impact: all failures came from internal test/RnR services. Recurring weekly noise.
● Weekly noise - Action: Filter RnR services from SLA or lower severity for internal-only.
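The proposed action (filtering RnR services out of the SLA calculation) amounts to excluding internal activations from both the numerator and the denominator of the success rate. A minimal Python sketch follows. The `Activation` record, its `is_internal` flag, and the sample numbers are hypothetical illustrations, not the monitor's real schema or this week's data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Activation:
    service_name: str
    is_internal: bool  # hypothetical flag marking internal test/RnR services
    succeeded: bool


def success_rate(activations: List[Activation], include_internal: bool = True) -> Optional[float]:
    """Return the activation success rate as a fraction, or None if nothing is counted."""
    counted = [a for a in activations if include_internal or not a.is_internal]
    if not counted:
        return None
    return sum(a.succeeded for a in counted) / len(counted)


# Made-up sample: 18 successful customer activations and 2 failed internal ones.
sample = [Activation(f"customer-{i}", False, True) for i in range(18)]
sample += [Activation(f"rnr-{i}", True, False) for i in range(2)]

print(success_rate(sample))                          # 0.9, below a 95% target
print(success_rate(sample, include_internal=False))  # 1.0, customer traffic only
```

Returning `None` when no customer activations remain avoids reporting a misleading 0% or 100% for regions where customer activations are disabled.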
AzSysLock — Code Integrity Violations (3x this week)
Sev 2 · Mitigated · 3x / 7 days
AI Summary
AzureSecurityPack detected 3 separate binaries violating Code Integrity policy (audit-only mode): a GatewayV2 binary, Redis dbghelp.dll, and node.exe. Non-blocking, no service impact. Fixed with TSG / allowlisting.
● MEDIUM recurrence - Will recur until binaries are updated on affected VMs.
Sev 2 · Mitigated · LSI
AI Summary
Single Premium SKU NonVNET service became 100% unreachable with 2+ probing attempts. Mitigated. Likely transient due to platform update or VM health issue.
Sev 2 · Mitigated · LSI
AI Summary
Second BRAIN SLI Gateway anomaly in Central US EUAP this week. Same pattern as 801002423. Transient. EUAP continues to be noisy for SLI detection.
Other Resolved/Mitigated Incidents (6 additional)
Sev 2 · Mitigated/Resolved
Summary
- 800901140: SMAPI Canada Central success rate drop (Transient).
- 800873667/21000001029432: Capacity Exception Requests UK South (By Design).
- 800207434: AOAI Hub Scale Group Low Quota (False Alarm).
- 798502935: ASM SLAM Malicious Communication alert (Customer Error).
- 797980084: Unhealthy ActivateConsumptionService orchestration (Transient).
- 797427647: RegionalRpHealthMonitorJob stalled in Korea (Ad-Hoc steps).
Ownership Summary
| Owner | Count | Key Themes |
|---|---|---|
| javierbo | 11 | BRAIN SLI (2x), Portal Blades (3x), AzSysLock (2x), SKUv1 activation, Gateway unreachable, CRI cert refresh, orchestration |
| ondrejoprala | 8 | SMAPI success rate, AzSysLock (2x), SKUv1 activation (3x), Portal Blades, ASM SLAM |
| shubhash | 2 | Capacity Exception Requests (By Design) |
| alzaslon | 1 | AI Foundry API import (Active Outage) |
| tomkerkhove | 1 | AOAI Hub Scale Group Quota (False Alarm) |
| tuanguye | 1 | Portal Blades (Transient) |
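The per-owner counts in the table above can be turned into load shares with a few lines of Python. The counts are copied from the table; the function name is just an illustrative choice.

```python
# Incident counts per owner, copied from the ownership table.
incidents_by_owner = {
    "javierbo": 11,
    "ondrejoprala": 8,
    "shubhash": 2,
    "alzaslon": 1,
    "tomkerkhove": 1,
    "tuanguye": 1,
}


def load_shares(counts):
    """Return each owner's fraction of the total incident count."""
    total = sum(counts.values())
    return {owner: n / total for owner, n in counts.items()}


shares = load_shares(incidents_by_owner)
top_two = sorted(shares.values(), reverse=True)[:2]

print(f"total incidents: {sum(incidents_by_owner.values())}")  # 24
for owner, share in shares.items():
    print(f"{owner}: {share:.0%}")
print(f"top two owners combined: {sum(top_two):.0%}")  # 79%
```

With these counts, javierbo carries about 46% and ondrejoprala about 33% of the 24 incidents.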
Alert Trends — Top Firing Monitors (7 Days)
Monitors that fired 2+ times
| Monitor / Alert Pattern | Sev | Count | Noise Assessment |
|---|---|---|---|
| SMAPI: Event Delay to Gateway >10 min | SEV3 | 194 | ● Very High Volume — Highest volume alert. All Sev3. |
| ActivateSkuV2 Failed Orchestration | SEV3 | 45 | ● HIGH — Fires across many RP clusters. Needs auto-heal. |
| GarbageCollection Eternal Orchestration | SEV3 | 24 | ● MEDIUM — Stalled GC orchestrations. Auto-restart needed. |
| SMAPI Success Rate <99.95% (Single Customer) | SEV3 | 24 | ● MEDIUM — Per-customer drops. May indicate real impact. |
| ApiMgmtSynthetic-Prod Unhealthy | SEV3 | 23 | ● MEDIUM — Distributed synthetic probe failures. |
| NewCertificateVersionCreationOperationFailed | SEV3 | 19 | ● MEDIUM — Cert version creation issues. Investigate. |
| AppServiceCertificateExpiringSoon | SEV3 | 14 | ● Actionable — Renew before expiry. |
| SKUv2 Activation Duration >5 min | SEV3 | 14 | ● LOW — Performance concern, not failure. |
| Orchestrations Failing Insufficient Compute Quota | SEV3 | 13 | ● MEDIUM — Capacity constraint. RCM action needed. |
| [RCM] High Compute Pool Usage | SEV3 | 13 | ● LOW — Proactive capacity alerting. |
| BRAIN ControlPlane Success Rate SLI | SEV3 | 12 | ● MEDIUM — Regional SLI anomalies. |
| Dav4 Capacity Low | SEV3 | 10 | ● LOW — Capacity planning alerts. |
Key Takeaways & Action Items
1. AI Foundry API Import Outage (Active): Portal import wizard broken since May 13 due to the removed "Azure AI User" RBAC role. Customer-impacting, S500 level. Workaround via REST API. Fix pending from the AI Foundry team. Track daily until resolved.
2. Portal Blades noise escalating: 5 Sev2 incidents this week (up from 2 last week). All transient false alarms. Urgently needs a threshold adjustment to stop DRI toil.
3. SKUv1 Activation continues as noise: 4 Sev2 incidents this week, all internal/RnR, with no customer impact. Filtering these services from the SLA calculation is overdue.
4. SMAPI event delay flood persists: 194 Sev3 incidents (up from 132 last week). Highest-volume monitor. Root cause investigation or threshold tuning is critical.
5. Certificate hygiene: 19 NewCertificateVersionCreationOperationFailed + 14 AppServiceCertificateExpiringSoon, for a combined 33 cert-related alerts. Proactive cert rotation needed.
6. DRI load concentrated: javierbo handled 11/24 incidents (46%) and ondrejoprala handled 8/24 (33%). Two DRIs covered 79% of all Sev2 work this week.
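Takeaway 2 asks for a threshold adjustment on the Portal Blades alert, and the incident summary suggests sustained-failure criteria as one option. The sketch below shows what such a criterion could look like in Python. It assumes the monitor sees a count of affected users per evaluation window. The threshold of 5 users and the requirement of 3 consecutive windows are hypothetical values chosen for illustration, not the monitor's real configuration.

```python
from typing import Sequence


def should_alert(failing_user_counts: Sequence[int],
                 threshold: int = 5,
                 sustained_windows: int = 3) -> bool:
    """Return True only if each of the last `sustained_windows` windows meets the threshold.

    failing_user_counts: affected users per evaluation window, oldest first.
    threshold and sustained_windows are illustrative defaults (assumptions).
    """
    if len(failing_user_counts) < sustained_windows:
        return False  # not enough history to call the failure sustained
    recent = failing_user_counts[-sustained_windows:]
    return all(count >= threshold for count in recent)


print(should_alert([1, 6, 2]))  # False: a single window barely over the threshold
print(should_alert([6, 7, 9]))  # True: three consecutive windows at or above it
```

A single-window spike that self-resolves, which matches the pattern reported this week, would no longer page the DRI, while a genuine sustained outage would still fire after a short delay.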