Weekly Operations Report

APIM Incident Report

API Management — May 7 – May 13, 2026

Team: APIM / ServicingLoop
Period: Last 7 days
Generated: May 13, 2026 15:32 UTC
14
Total Sev2+
1
Active
11
Mitigated
2
Resolved
2
Outages
36
Sev3
50
Total 7d

On-Call Rotation (Sloop)

APIM Incident Manager
▶ INCOMING
Alexander Zaslonov (Primary)
Roman Kolesnikov (Primary)
◀ OUTGOING
Vitalii Kurokhtin (Primary)
Martin Dechev (Primary)
US Sloop
▶ INCOMING
Javier Borrego (Primary)
Gleb Feoktistov (Backup)
◀ OUTGOING
Nima Kamoosi (Primary)
Javier Borrego (Backup)
EU Sloop
▶ INCOMING
Ondrej Oprala (Primary)
Srajan Agrawal (Backup)
◀ OUTGOING
Shubham Sharma (Primary)
Nima Kamoosi (Primary)
Ondrej Oprala (Primary)
Javier Borrego (Backup)

Active Incidents

1 open
IcM: 51000001019637Owner: ondrejopralaTeam: BackendCreated: May 13, 11:05 UTC
AI Summary
Customer-reported incident (CRI) where the customer is receiving HTTP 500 errors from their APIM instance. Opened today and currently under active investigation by the EU Sloop on-call (ondrejoprala). This is a newly opened incident with investigation still in progress. No root cause identified yet.

Outages

2 declared
IcM: 795519059Owner: glfeoktiTeam: PlatformCreated: May 11, 05:07 UTCOutage Declared: May 11, 07:12 UTC
AI Summary
A Service Bus platform deployment introduced a regression that broke HTTP management API connectivity on port 443 across 4 Azure regions (South India, Australia Central 2, Norway West, Australia Southeast). APIM Resource Providers could not connect to their Service Bus namespaces, stalling DurableTask orchestrations and effectively disabling the APIM control plane (ARM PUT/PATCH/DELETE) in those regions. Customer-impacting, with PIR assigned (ID: 1429216).
Mitigation
Service Bus team rolled back their deployment, auto-recovering all affected APIM Resource Providers. TTM: ~5 hours. ● MEDIUM recurrence — Dependency on Service Bus deployment hygiene.
IcM: 793601657Owner: nehagupTeam: PlatformCreated: May 7, 23:23 UTCOutage Declared: May 7, 23:23 UTC
AI Summary
BRAIN anomaly detection flagged unusual error budget consumption in the APIM ControlPlane Success Rate SLI in France Central. Root cause: storage account key rotation caused SMAPI authentication failures, specifically affecting the export API. Customer-impacting — impacted resources attached to the incident.
Mitigation
SMAPI upgraded on the same version to update settings with new keys. RP clusters recycled. Mitigated by vitaliik in ~5.3 hours. Root cause category: Service - Authentication/Credentials. ● LOW recurrence

Emerging Issues

2 tracked
IcM: 793537192Owner: sranandaTeam: PlatformCreated: May 7
AI Summary
Administrative service update operations are starting to fail with VMSS (Virtual Machine Scale Set) exceptions. This pattern has been observed multiple times and is being tracked as an emerging issue. The failures affect internal service management operations. Mitigated but warrants monitoring for recurrence as the underlying VMSS behavior may trigger again during platform updates.
IcM: 792769102Owner: glfeoktiTeam: PlatformCreated: Before reporting period (carryover)
AI Summary
A Unified customer in Azure Government (Fairfax) is unable to update their certificate due to orchestration failures in the APIM Resource Provider. This is a customer-reported incident affecting government cloud specifically. Mitigated but tracked as an emerging issue due to the Fairfax-specific nature of the orchestration failure which may affect other Gov customers with similar configurations.

Other Mitigated & Resolved Sev2 Incidents

9 closed
IcM: 797018274Owner: ondrejopralaCreated: May 13, 08:41 UTCHowFixed: False Alarm
AI Summary
Gateway availability dropped to 0% for 3 services in East US 2 EUAP (SKUv2 BasicOrStandard tier). Investigation confirmed all 3 impacted services were internal preprov test services that had been terminated. No customer impact. Classified as false alarm. TTM: 15 minutes.
IcM: 796937199Owner: ondrejopralaCreated: May 13, 05:43 UTCHowFixed: Ad-Hoc Steps
AI Summary
The RegionalRpHealthMonitorJob eternal orchestration in East US 2 EUAP had not completed successfully in 4+ hours. DRI unlocked 161 stuck services via ACIS "Unlock and Update Service Stuck in Transient State" action, then terminated all 161 internal SKUv2 services. No customer impact (EUAP only). This is a recurring pattern — the same Eternal Orchestration monitor fires frequently in EUAP. ● HIGH recurrence
IcM: 796729139Owner: nimakamoosiCreated: May 12, 22:32 UTCHowFixed: Transient
AI Summary
Premium SKUv1 service (sm-uks-mod-prod-moz-apim-001) in UK South went unreachable during VMSS rolling upgrade. Roles entered unhealthy state. VM replacement attempted but blocked by ongoing Platform OS Start operation. Issue self-healed after upgrade completed, proxy availability returned to 100%. Transient, no lasting customer impact. TTM: 36 minutes.
IcM: 796496317Owner: shubhashCreated: May 12, 15:11 UTCHowFixed: By Design
AI Summary
Unified customer (Driver and Vehicle Standards Agency - UK Government) requested Premium V2 capacity whitelisting for 5 subscriptions in UK South. This is a capacity exception request, not a service health incident. Customer directed to use the standard capacity request template. Closed as "By Design." TTM: ~2 hours.
IcM: 796170003Owner: nehagupTeam: PlatformCreated: May 12, 04:01 UTCHowFixed: External
AI Summary
Developer Portal activation failure rate exceeded 5% threshold on scale unit api-euapdm1-prod-scaleunit-002 in Central US EUAP. Affected internal BVT test services only. Self-healed after App Service platform recovered. Root cause: External/Customer Issue. No real customer impact (EUAP + internal test services). TTM: ~2 hours.
IcM: 796152710Owner: tomkerkhoveTeam: GatewayCreated: May 12, 03:23 UTCHowFixed: Auto-mitigated
AI Summary
APIM Gateway overhead latency spiked for AOAI Hub instance apim-aoai-hub-prod-swedencentral-shared-03. The monitor reported unhealthy for ~1 hour before auto-recovering (watchdog reported healthy 10+ times in 45 minutes). Cognitive Services origin. Internal Microsoft impact only. TTM: ~3.3 hours (auto-mitigated by health monitor).
IcM: 796150739Owner: maximagapovCreated: May 12
AI Summary
AzureSecurityPack detected git.exe violating Code Integrity policy on APIM VMs (audit-only mode). Non-blocking, no service impact. This is a known recurring pattern that fires until Git binaries are updated or allowlisted. ● MEDIUM recurrence
IcM: 795673422Owner: ondrejopralaCreated: May 11
AI Summary
SMAPI success rate dropped below 99.95% for a consumption regional deployment in UK South. Likely transient and correlated with the Service Bus connectivity issue on the same day. Mitigated.
IcM: 795668716Owner: glfeoktiTeam: PlatformCreated: May 11, 10:36 UTCHowFixed: By Design
AI Summary
SKUv1 activation success rate dropped below 95% SLA on api-cbr2-prod-01-rp (Australia Central 2). Investigation confirmed this is a known issue: customer activations are already disabled for this region and all failures are from internal devPortalRnr-* test services. Zero customer services in last 60 days. DRI notes: "Will review why this is even part of our alerts." ● Recurring (noise)

Ownership Summary

OwnerCountKey Themes
ondrejoprala4CRI HTTP 500, Gateway EUAP false alarm, RP orchestration stall, SMAPI UK South
glfeokti3Service Bus outage (4 regions), SKUv1 activation noise, Fairfax orchestration (Gov)
nehagup2BRAIN SLI France Central outage, DevPortal activation EUAP
nimakamoosi1Gateway UK South transient (VMSS rolling upgrade)
shubhash1Capacity exception request (UK Gov customer)
tomkerkhove1AOAI Gateway overhead latency Sweden Central
maximagapov1AzSysLock Git binary code sign (recurring)
srananda1VMSS exception emerging issue

Alert Trends — Top Firing Monitors (7 Days)

Sev2+ monitors that fired 2+ times

Key Takeaways & Action Items