APIM Incident Report — May 13, 2026

⇄

On-Call Rotation (Sloop)

APIM Incident Manager

▶ INCOMING

Alexander Zaslonov (Primary)
Roman Kolesnikov (Primary)

◀ OUTGOING

Vitalii Kurokhtin (Primary)
Martin Dechev (Primary)

US Sloop

▶ INCOMING

Javier Borrego (Primary)
Gleb Feoktistov (Backup)

◀ OUTGOING

Nima Kamoosi (Primary)
Javier Borrego (Backup)

EU Sloop

▶ INCOMING

Ondrej Oprala (Primary)
Srajan Agrawal (Backup)

◀ OUTGOING

Shubham Sharma (Primary)
Nima Kamoosi (Primary)
Ondrej Oprala (Primary)
Javier Borrego (Backup)

⚠

Active Incidents

1 open

Received invalid status code: 500

Sev 2 | Active | CRI

IcM: 51000001019637 | Owner: ondrejoprala | Team: Backend | Created: May 13, 11:05 UTC

AI Summary

Customer-reported incident (CRI): the customer is receiving HTTP 500 errors from their APIM instance. Opened May 13 and under active investigation by the EU Sloop on-call (ondrejoprala). No root cause identified yet.

⚡

Outages

2 declared

Service Bus Connectivity Blocked on Port 443 — South India (api-ma1-prod-01)

Sev 2 | Mitigated | Outage | LSI

IcM: 795519059 | Owner: glfeokti | Team: Platform | Created: May 11, 05:07 UTC | Outage Declared: May 11, 07:12 UTC

AI Summary

A Service Bus platform deployment introduced a regression that broke HTTP management API connectivity on port 443 across 4 Azure regions (South India, Australia Central 2, Norway West, Australia Southeast). APIM Resource Providers could not connect to their Service Bus namespaces, stalling DurableTask orchestrations and effectively disabling the APIM control plane (ARM PUT/PATCH/DELETE) in those regions. Customer-impacting, with PIR assigned (ID: 1429216).

Mitigation

Service Bus team rolled back their deployment, auto-recovering all affected APIM Resource Providers. TTM: ~5 hours. ● MEDIUM recurrence — Dependency on Service Bus deployment hygiene.

BRAIN: Unusual Trend in Success Rate SLI — France Central

Sev 2 | Mitigated | Outage | LSI

IcM: 793601657 | Owner: nehagup | Team: Platform | Created: May 7, 23:23 UTC | Outage Declared: May 7, 23:23 UTC

AI Summary

BRAIN anomaly detection flagged unusual error budget consumption in the APIM ControlPlane Success Rate SLI in France Central. Root cause: a storage account key rotation caused SMAPI authentication failures, specifically affecting the export API. Customer-impacting; the impacted resources are attached to the incident.

Mitigation

A same-version SMAPI upgrade was run so that its settings picked up the new keys, and the RP clusters were recycled. Mitigated by vitaliik in ~5.3 hours. Root cause category: Service - Authentication/Credentials. ● LOW recurrence

★

Emerging Issues

2 tracked

Emerging Issue: Administrative Service Update Operations Failing with VMSS Exception

Sev 2 | Mitigated | Recurring

IcM: 793537192 | Owner: srananda | Team: Platform | Created: May 7

AI Summary

Administrative service update operations are starting to fail with VMSS (Virtual Machine Scale Set) exceptions. This pattern has been observed multiple times and is being tracked as an emerging issue. The failures affect internal service management operations. Mitigated but warrants monitoring for recurrence as the underlying VMSS behavior may trigger again during platform updates.

Emerging Issue: APIM RP Fairfax Orchestration Failure — Azure Gov Customer Unable to Update Cert

Sev 2 | Mitigated | CRI

IcM: 792769102 | Owner: glfeokti | Team: Platform | Created: Before reporting period (carryover)

AI Summary

A Unified customer in Azure Government (Fairfax) is unable to update their certificate due to orchestration failures in the APIM Resource Provider. This is a customer-reported incident affecting government cloud specifically. Mitigated but tracked as an emerging issue due to the Fairfax-specific nature of the orchestration failure which may affect other Gov customers with similar configurations.

✓

Other Mitigated & Resolved Sev2 Incidents

9 closed

Gateway NonVNET 100% Not Reachable — East US 2 EUAP (SKUv2 BasicOrStandard)

Sev 2 | Mitigated | LSI

IcM: 797018274 | Owner: ondrejoprala | Created: May 13, 08:41 UTC | HowFixed: False Alarm

AI Summary

Gateway availability dropped to 0% for 3 services in East US 2 EUAP (SKUv2 BasicOrStandard tier). Investigation confirmed all 3 impacted services were internal preprov test services that had been terminated. No customer impact. Classified as false alarm. TTM: 15 minutes.

RP Orchestration: RegionalRpHealthMonitorJob Stalled — East US 2 EUAP

Sev 2 | Mitigated | LSI | Recurring

IcM: 796937199 | Owner: ondrejoprala | Created: May 13, 05:43 UTC | HowFixed: Ad-Hoc Steps

AI Summary

The RegionalRpHealthMonitorJob eternal orchestration in East US 2 EUAP had not completed successfully in 4+ hours. DRI unlocked 161 stuck services via ACIS "Unlock and Update Service Stuck in Transient State" action, then terminated all 161 internal SKUv2 services. No customer impact (EUAP only). This is a recurring pattern — the same Eternal Orchestration monitor fires frequently in EUAP. ● HIGH recurrence

Gateway NonVNET 100% Not Reachable — UK South (SKUv1 Premium)

Sev 2 | Mitigated | LSI

IcM: 796729139 | Owner: nimakamoosi | Created: May 12, 22:32 UTC | HowFixed: Transient

AI Summary

Premium SKUv1 service (sm-uks-mod-prod-moz-apim-001) in UK South became unreachable during a VMSS rolling upgrade, when its roles entered an unhealthy state. A VM replacement was attempted but was blocked by an ongoing Platform OS Start operation. The issue self-healed once the upgrade completed, and proxy availability returned to 100%. Transient, no lasting customer impact. TTM: 36 minutes.

Capacity Exception Request — UK South Premium V2 (Driver & Vehicle Standards Agency)

Sev 2 | Mitigated | CRI

IcM: 796496317 | Owner: shubhash | Created: May 12, 15:11 UTC | HowFixed: By Design

AI Summary

Unified customer (Driver and Vehicle Standards Agency - UK Government) requested Premium V2 capacity whitelisting for 5 subscriptions in UK South. This is a capacity exception request, not a service health incident. Customer directed to use the standard capacity request template. Closed as "By Design." TTM: ~2 hours.

SKUv2 Developer Portal Activation Failure Rate >5% — Central US EUAP

Sev 2 | Mitigated | LSI

IcM: 796170003 | Owner: nehagup | Team: Platform | Created: May 12, 04:01 UTC | HowFixed: External

AI Summary

Developer Portal activation failure rate exceeded the 5% threshold on scale unit api-euapdm1-prod-scaleunit-002 in Central US EUAP. Only internal BVT test services were affected. Self-healed after the App Service platform recovered. Root cause category: External/Customer Issue. No real customer impact (EUAP + internal test services). TTM: ~2 hours.

AOAI GatewayOverhead Latency — Sweden Central

Sev 2 | Mitigated | LSI

IcM: 796152710 | Owner: tomkerkhove | Team: Gateway | Created: May 12, 03:23 UTC | HowFixed: Auto-mitigated

AI Summary

APIM Gateway overhead latency spiked for AOAI Hub instance apim-aoai-hub-prod-swedencentral-shared-03. The monitor reported unhealthy for ~1 hour before auto-recovering (watchdog reported healthy 10+ times in 45 minutes). Cognitive Services origin. Internal Microsoft impact only. TTM: ~3.3 hours (auto-mitigated by health monitor).

AzSysLock: Git Binary Code Sign Policy Violation (CI)

Sev 2 | Mitigated | Recurring

IcM: 796150739 | Owner: maximagapov | Created: May 12

AI Summary

AzureSecurityPack detected git.exe violating Code Integrity policy on APIM VMs (audit-only mode). Non-blocking, no service impact. This is a known recurring pattern that fires until Git binaries are updated or allowlisted. ● MEDIUM recurrence

SMAPI Success Rate Below 99.95% — UK South (Consumption)

Sev 2 | Mitigated | LSI

IcM: 795673422 | Owner: ondrejoprala | Created: May 11

AI Summary

SMAPI success rate dropped below 99.95% for a Consumption regional deployment in UK South. Likely transient and correlated with the Service Bus connectivity outage on the same day (IcM 795519059). Mitigated.

SKUv1 Activation SuccessRate Below 95% — Australia Central 2

Sev 2 | Mitigated | LSI | Recurring

IcM: 795668716 | Owner: glfeokti | Team: Platform | Created: May 11, 10:36 UTC | HowFixed: By Design

AI Summary

SKUv1 activation success rate dropped below 95% SLA on api-cbr2-prod-01-rp (Australia Central 2). Investigation confirmed this is a known issue: customer activations are already disabled for this region and all failures are from internal devPortalRnr-* test services. Zero customer services in last 60 days. DRI notes: "Will review why this is even part of our alerts." ● Recurring (noise)
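
One way to keep internal test traffic out of this signal is to drop test services before the rate is computed. The following is a minimal Python sketch, not the actual monitor logic. It assumes activation results arrive as (service name, succeeded) pairs and that test services can be recognized by the devPortalRnr- name prefix quoted above. The function name, the input shape, and the sample service names are hypothetical.

```python
from typing import Iterable, Optional, Tuple

# Assumption: internal test services are identifiable by a name prefix.
TEST_PREFIXES = ("devPortalRnr-",)


def activation_success_rate(
    results: Iterable[Tuple[str, bool]],
    test_prefixes: Tuple[str, ...] = TEST_PREFIXES,
) -> Optional[float]:
    """Return the success rate over non-test services, or None if there are none."""
    customer = [ok for name, ok in results if not name.startswith(test_prefixes)]
    if not customer:
        # No customer activations in the window, so there is nothing to alert on.
        return None
    return sum(customer) / len(customer)


# Hypothetical sample: two failing test services and one healthy customer service.
sample = [
    ("devPortalRnr-001", False),
    ("devPortalRnr-002", False),
    ("contoso-apim", True),
]
rate = activation_success_rate(sample)
only_tests = activation_success_rate(sample[:2])
```

Returning None rather than 0% when only test services are present lets an alert rule skip regions with no customer activations instead of firing on them.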

●

Ownership Summary

Owner	Count	Key Themes
ondrejoprala	4	CRI HTTP 500, Gateway EUAP false alarm, RP orchestration stall, SMAPI UK South
glfeokti	3	Service Bus outage (4 regions), SKUv1 activation noise, Fairfax orchestration (Gov)
nehagup	2	BRAIN SLI France Central outage, DevPortal activation EUAP
nimakamoosi	1	Gateway UK South transient (VMSS rolling upgrade)
shubhash	1	Capacity exception request (UK Gov customer)
tomkerkhove	1	AOAI Gateway overhead latency Sweden Central
maximagapov	1	AzSysLock Git binary code sign (recurring)
srananda	1	VMSS exception emerging issue

↻

Alert Trends — Top Firing Monitors (7 Days)

Sev 2 and Sev 3 monitors that fired 2+ times

Monitor / Alert Pattern	Sev	Count	Sample Title	Noise Assessment
SMAPI: Event Delay to Gateway >10 min	SEV3	132	SMAPI event delay to Gateway over 10 minutes — various services	● Very High Volume — All 132 incidents still ACTIVE (Sev3). Potential threshold tuning needed.
ActivateSkuV2 Failed Orchestration	SEV3	39	Orchestration unhealthy across multiple RP tenants	● HIGH — All ACTIVE. Fires across many RP clusters. Consider severity/auto-heal.
ApiMgmtSynthetic-Prod Unhealthy	SEV3	26	ApimWatmJob for various regions is unhealthy	● MEDIUM — Distributed across regions. Likely transient synthetic probe failures.
[RCM] SKUv1 Region Running Out of Capacity	SEV3	25	Various regions — StorageAccount/NotEnoughRunway	● MEDIUM — Capacity planning alerts. Review runway thresholds.
Miscellaneous (No MonitorId)	SEV2-3	21	Mixed: CreateBlade prevalidation, capacity requests, portal alerts, MSRC	Mix of CRIs, portal alerts, and manually-filed incidents with no monitor correlation.
SMAPI Success Rate <99.95% (Single Customer)	SEV3	17	Per-customer SMAPI success rate drops across regions	● MEDIUM — All ACTIVE. May indicate real customer impact or noisy threshold.
[RCM] SKUv2 Subscription Capacity Below Threshold	SEV3	15	Various regions — KeyVaultV2 capacity concerns	● LOW — Capacity alerts. Proactive, not impacting.
GarbageCollection Eternal Orchestration Stalled	SEV3	12	GarbageCollection has not completed in last 2 hours	● MEDIUM — All ACTIVE. Needs auto-restart or self-healing.
[RCM] High Compute Pool Usage	SEV3	12	Scale-outs impacted in various regions	● MEDIUM — Capacity warning. Monitor for escalation.
SKUv2 Activation Duration >5 min	SEV3	10	Various regions — slow activations	● LOW — Performance concern, not failure.
AppServiceCertificateExpiringSoon	SEV3	10	Certificate expiry warnings across RP tenants	● MEDIUM — Actionable: renew certificates before expiry.
SKUv2 Antares VM/AZ Quota >70%	SEV3	9	North Europe — quota utilization warning	● LOW — Proactive capacity alerting.
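
The counts in the table above can be totaled to show how much of the volume each monitor contributes. This is a small Python sketch that uses only the monitor names and counts copied from the table. The helper name `summarize` is hypothetical, and the totals cover only the monitors listed here, not every incident in the period.

```python
# (monitor, count) pairs copied from the table above.
ROWS = [
    ("SMAPI: Event Delay to Gateway >10 min", 132),
    ("ActivateSkuV2 Failed Orchestration", 39),
    ("ApiMgmtSynthetic-Prod Unhealthy", 26),
    ("[RCM] SKUv1 Region Running Out of Capacity", 25),
    ("Miscellaneous (No MonitorId)", 21),
    ("SMAPI Success Rate <99.95% (Single Customer)", 17),
    ("[RCM] SKUv2 Subscription Capacity Below Threshold", 15),
    ("GarbageCollection Eternal Orchestration Stalled", 12),
    ("[RCM] High Compute Pool Usage", 12),
    ("SKUv2 Activation Duration >5 min", 10),
    ("AppServiceCertificateExpiringSoon", 10),
    ("SKUv2 Antares VM/AZ Quota >70%", 9),
]


def summarize(rows):
    """Total the counts, find the top monitor, and sum the [RCM] capacity monitors."""
    total = sum(count for _, count in rows)
    top_name, top_count = max(rows, key=lambda row: row[1])
    rcm = sum(count for name, count in rows if name.startswith("[RCM]"))
    return {"total": total, "top": top_name, "top_share": top_count / total, "rcm": rcm}


summary = summarize(ROWS)
print(f"{summary['top']}: {summary['top_share']:.0%} of {summary['total']} listed incidents")
# prints "SMAPI: Event Delay to Gateway >10 min: 40% of 328 listed incidents"
```

On these figures the three [RCM] monitors account for 52 of the 328 listed incidents.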

⚡

Key Takeaways & Action Items

1. Service Bus dependency outage: Service Bus deployment broke APIM control plane in 4 regions (IcM 795519059). PIR assigned. Consider adding Service Bus deployment change notifications to Sloop monitoring.
2. Storage key rotation outage: BRAIN detected SMAPI auth failures in France Central due to storage key rotation (IcM 793601657). Ensure automated key rotation includes SMAPI settings update in the same operation.
3. EUAP orchestration noise: Eternal Orchestration (HealthMonitor, GarbageCollection, ActivateSkuV2) continues to fire frequently in EUAP. 39+ ActivateSkuV2 failures, 12 GC stalls this week. Auto-restart logic urgently needed.
4. SMAPI event delay flood: 132 Sev3 incidents for SMAPI event delay to Gateway >10 min. All still ACTIVE. This is the highest-volume monitor. Investigate root cause or adjust threshold.
5. SKUv1 activation noise: Australia Central 2 has customer activations disabled but still fires activation SLA alerts from RnR test services. Filter test services from SLA calculation.
6. Capacity alerts healthy: RCM capacity monitors (SKUv1 region, SKUv2 subscription, compute pool) are firing as designed. 25+ capacity warnings suggest proactive runway management is working.
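
The auto-restart idea in item 3 starts with a staleness check: an orchestration whose last successful completion is older than its allowed window is flagged as a restart candidate. The Python sketch below shows only that detection step. The window values are illustrative assumptions loosely based on the durations quoted in this report (4+ hours for RegionalRpHealthMonitorJob, 2 hours for GarbageCollection). The function name and input shape are hypothetical, and the actual restart mechanism is not shown.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

# Assumption: per-orchestration staleness windows (illustrative values only).
STALL_WINDOWS: Dict[str, timedelta] = {
    "RegionalRpHealthMonitorJob": timedelta(hours=4),
    "GarbageCollection": timedelta(hours=2),
}


def find_stalled(last_completed: Dict[str, datetime], now: datetime) -> List[str]:
    """Return names of orchestrations whose last completion is older than their window."""
    return sorted(
        name
        for name, finished in last_completed.items()
        if name in STALL_WINDOWS and now - finished > STALL_WINDOWS[name]
    )


# Hypothetical timestamps for illustration.
now = datetime(2026, 5, 13, 6, 0, tzinfo=timezone.utc)
last = {
    "RegionalRpHealthMonitorJob": now - timedelta(hours=5),
    "GarbageCollection": now - timedelta(minutes=30),
}
stalled = find_stalled(last, now)  # ["RegionalRpHealthMonitorJob"]
```

Keeping detection separate from the restart action makes it possible to run the check in report-only mode first and compare its output with the existing monitor before any automatic restart is enabled.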