Weekly Operations Report

APIM Incident Report

API Management — May 13 – May 20, 2026

Team: APIM / ServicingLoop
Period: Last 7 days
Generated: May 20, 2026 13:55 UTC
24
Total Sev2+
1
Active
20
Mitigated
3
Resolved
1
Outages
637
Sev3
661
Total 7d

On-Call Rotation (Sloop)

APIM Incident Manager
▶ INCOMING
Rafal Mielowski (Primary)
Shilpa Mani (Primary)
Tom Kerkhove (Primary)
◀ OUTGOING
Alexander Zaslonov (Primary)
Roman Kolesnikov (Primary)
US Sloop
▶ INCOMING
Gleb Feoktistov (Primary)
Zhongyuan Ren (Backup)
◀ OUTGOING
Javier Borrego (Primary)
Gleb Feoktistov (Backup)
EU Sloop
▶ INCOMING
Srajan Agrawal (Primary)
Maxim Agapov (Backup)
◀ OUTGOING
Shubham Sharma (Primary)
Ondrej Oprala (Primary)
Srajan Agrawal (Backup)

Active Incidents & Outages

1 active outage
IcM: 800967470Owner: alzaslonTeam: ServicingLoopCreated: May 19, 17:38 UTCOutage Declared: May 19, 21:46 UTCImpact Start: May 13
AI Summary
The Azure Portal APIM blade's "Microsoft Foundry" API import wizard is broken. The wizard hardcodes a lookup for a role definition named Azure AI User. This role was removed/renamed from the RBAC catalog between May 8-13. When the lookup returns empty, the Portal code passes null as the role definition ID to the role assignment creation call, causing HTTP 400 errors. Customer-impacting (S500 level), multiple services affected, ongoing since May 13. Responsible team: AI Foundry (ServiceId 24271).
Customer Mitigation
Customers can use REST API to perform the import with the correct role identifier. Portal fix pending from the AI Foundry team. Javier Borrego will post updates by 5/20 10am PST. ● HIGH — Active, customer-impacting

Emerging Issues

1 tracked
IcM: 51000001023424Owner: javierboTeam: Backend
AI Summary
Customer-reported issue where APIM is not refreshing the certificate for a Traffic Manager profile under a custom domain. Mitigated with manual intervention. Tracked as emerging issue since certificate refresh failures may affect other customers with similar Traffic Manager + custom domain configurations.

Other Mitigated & Resolved Sev2 Incidents

22 closed
IcM: 801002423Owner: javierboCreated: May 19HowFixed: Transient
AI Summary
BRAIN detected anomalous Gateway success rate drop in Central US EUAP (impact start 13:43 UTC). Customer-impacting flag set. Self-recovered after ~5 hours. RCA still needed. Classified as transient by DRI.
AzurePortalWAWSAlert — Blades Failed to Load (5x this week)
Sev 2Mitigated5x / 7 days
IcMs: 800784758, 800180778, 800074452, 798641314, 798534595Owners: ondrejoprala, javierbo, tuanguye
AI Summary
Portal blade load failures fired 5 times this week (all transient, self-resolved). Same pattern as previous weeks: 5+ users see ServiceBlade failures barely exceeding threshold. Backend healthy each time. ● Fires ~daily (noise)Action needed: Raise threshold or add sustained-failure criteria.
SKUv1 Activation SuccessRate Below 95% (4x this week)
Sev 2Mitigated4x / 7 days
IcMs: 800795488, 798585977, 798339340, 797320933Owners: ondrejoprala, javierboRegions: EUAP, sec-prod
AI Summary
SKUv1 activation SLA breached 4 times this week, all on EUAP or regions with disabled customer activations. Zero customer impact — all failures from internal test/RnR services. Recurring weekly noise. ● Weekly noiseAction: Filter RnR services from SLA or lower severity for internal-only.
AzSysLock — Code Integrity Violations (3x this week)
Sev 2Mitigated3x / 7 days
IcMs: 800964408, 800898349, 800898163Owners: javierbo, ondrejopralaBinaries: GatewayV2, Redis dbghelp.dll, node.exe
AI Summary
AzureSecurityPack detected 3 separate binaries violating Code Integrity policy (audit-only mode). Non-blocking, no service impact. GatewayV2 binary, Redis dbghelp.dll, and node.exe. Fixed with TSG / allowlisting. ● MEDIUM recurrence — Will recur until binaries updated on affected VMs.
IcM: 800936420Owner: javierboCreated: May 19
AI Summary
Single Premium SKU NonVNET service became 100% unreachable with 2+ probing attempts. Mitigated. Likely transient due to platform update or VM health issue.
IcM: 800431915Owner: javierboCreated: May 17HowFixed: Transient
AI Summary
Second BRAIN SLI Gateway anomaly in Central US EUAP this week. Same pattern as 801002423. Transient. EUAP continues to be noisy for SLI detection.
Other Resolved/Mitigated Incidents (6 additional)
Sev 2Mitigated/Resolved
Includes: Capacity requests (By Design), SMAPI Canada Central, AOAI Hub Quota (False Alarm), ASM SLAM Security Alert, ActivateConsumptionService, RegionalRpHealthMonitorJob
Summary
800901140: SMAPI Canada Central success rate drop (Transient). 800873667/21000001029432: Capacity Exception Requests UK South (By Design). 800207434: AOAI Hub Scale Group Low Quota (False Alarm). 798502935: ASM SLAM Malicious Communication alert (Customer Error). 797980084: Unhealthy ActivateConsumptionService orchestration (Transient). 797427647: RegionalRpHealthMonitorJob stalled in Korea (Ad-Hoc steps).

Ownership Summary

OwnerCountKey Themes
javierbo11BRAIN SLI (2x), Portal Blades (3x), AzSysLock (2x), SKUv1 activation, Gateway unreachable, CRI cert refresh, orchestration
ondrejoprala8SMAPI success rate, AzSysLock (2x), SKUv1 activation (3x), Portal Blades, ASM SLAM
shubhash2Capacity Exception Requests (By Design)
alzaslon1AI Foundry API import (Active Outage)
tomkerkhove1AOAI Hub Scale Group Quota (False Alarm)
tuanguye1Portal Blades (Transient)

Alert Trends — Top Firing Monitors (7 Days)

Monitors that fired 2+ times

Key Takeaways & Action Items