On-Call Rotation (Sloop)
APIM Incident Manager
▶ INCOMING
Alexander Zaslonov (Primary)
Roman Kolesnikov (Primary)
◀ OUTGOING
Vitalii Kurokhtin (Primary)
Martin Dechev (Primary)
US Sloop
▶ INCOMING
Javier Borrego (Primary)
Gleb Feoktistov (Backup)
◀ OUTGOING
Nima Kamoosi (Primary)
Javier Borrego (Backup)
EU Sloop
▶ INCOMING
Ondrej Oprala (Primary)
Srajan Agrawal (Backup)
◀ OUTGOING
Shubham Sharma (Primary)
Nima Kamoosi (Primary)
Ondrej Oprala (Primary)
Javier Borrego (Backup)
Active Incidents
1 open
Sev 2 · Active · CRI
AI Summary
Customer-reported incident (CRI) where the customer is receiving HTTP 500 errors from their APIM instance. Opened today and currently under active investigation by the EU Sloop on-call (ondrejoprala). This is a newly opened incident with investigation still in progress. No root cause identified yet.
Outages
2 declared
Sev 2 · Mitigated · Outage · LSI
AI Summary
A Service Bus platform deployment introduced a regression that broke HTTP management API connectivity on port 443 across 4 Azure regions (South India, Australia Central 2, Norway West, Australia Southeast). APIM Resource Providers could not connect to their Service Bus namespaces, stalling DurableTask orchestrations and effectively disabling the APIM control plane (ARM PUT/PATCH/DELETE) in those regions. Customer-impacting, with PIR assigned (ID: 1429216).
Mitigation
Service Bus team rolled back their deployment, auto-recovering all affected APIM Resource Providers. TTM: ~5 hours. ● MEDIUM recurrence — Dependency on Service Bus deployment hygiene.
Sev 2 · Mitigated · Outage · LSI
AI Summary
BRAIN anomaly detection flagged unusual error budget consumption in the APIM ControlPlane Success Rate SLI in France Central. Root cause: storage account key rotation caused SMAPI authentication failures, specifically affecting the export API. Customer-impacting — impacted resources attached to the incident.
Mitigation
A same-version SMAPI upgrade was performed to refresh settings with the new keys, and RP clusters were recycled. Mitigated by vitaliik in ~5.3 hours. Root cause category: Service - Authentication/Credentials. ● LOW recurrence
Emerging Issues
2 tracked
Sev 2 · Mitigated · Recurring
AI Summary
Administrative service update operations are starting to fail with VMSS (Virtual Machine Scale Set) exceptions. This pattern has been observed multiple times and is being tracked as an emerging issue. The failures affect internal service management operations. Mitigated but warrants monitoring for recurrence as the underlying VMSS behavior may trigger again during platform updates.
AI Summary
A Unified customer in Azure Government (Fairfax) is unable to update their certificate due to orchestration failures in the APIM Resource Provider. This is a customer-reported incident affecting government cloud specifically. Mitigated but tracked as an emerging issue due to the Fairfax-specific nature of the orchestration failure which may affect other Gov customers with similar configurations.
Other Mitigated & Resolved Sev2 Incidents
9 closed
AI Summary
Gateway availability dropped to 0% for 3 services in East US 2 EUAP (SKUv2 BasicOrStandard tier). Investigation confirmed all 3 impacted services were internal preprov test services that had been terminated. No customer impact. Classified as false alarm. TTM: 15 minutes.
Sev 2 · Mitigated · LSI · Recurring
AI Summary
The RegionalRpHealthMonitorJob eternal orchestration in East US 2 EUAP had not completed successfully in 4+ hours. DRI unlocked 161 stuck services via ACIS "Unlock and Update Service Stuck in Transient State" action, then terminated all 161 internal SKUv2 services. No customer impact (EUAP only). This is a recurring pattern — the same Eternal Orchestration monitor fires frequently in EUAP. ● HIGH recurrence
Sev 2 · Mitigated · LSI
AI Summary
Premium SKUv1 service (sm-uks-mod-prod-moz-apim-001) in UK South went unreachable during VMSS rolling upgrade. Roles entered unhealthy state. VM replacement attempted but blocked by ongoing Platform OS Start operation. Issue self-healed after upgrade completed, proxy availability returned to 100%. Transient, no lasting customer impact. TTM: 36 minutes.
AI Summary
Unified customer (Driver and Vehicle Standards Agency - UK Government) requested Premium V2 capacity whitelisting for 5 subscriptions in UK South. This is a capacity exception request, not a service health incident. Customer directed to use the standard capacity request template. Closed as "By Design." TTM: ~2 hours.
Sev 2 · Mitigated · LSI
AI Summary
Developer Portal activation failure rate exceeded 5% threshold on scale unit api-euapdm1-prod-scaleunit-002 in Central US EUAP. Affected internal BVT test services only. Self-healed after App Service platform recovered. Root cause: External/Customer Issue. No real customer impact (EUAP + internal test services). TTM: ~2 hours.
Sev 2 · Mitigated · LSI
AI Summary
APIM Gateway overhead latency spiked for AOAI Hub instance apim-aoai-hub-prod-swedencentral-shared-03. The monitor reported unhealthy for ~1 hour before auto-recovering (watchdog reported healthy 10+ times in 45 minutes). Cognitive Services origin. Internal Microsoft impact only. TTM: ~3.3 hours (auto-mitigated by health monitor).
Sev 2 · Mitigated · Recurring
AI Summary
AzureSecurityPack detected git.exe violating Code Integrity policy on APIM VMs (audit-only mode). Non-blocking, no service impact. This is a known recurring pattern that fires until Git binaries are updated or allowlisted. ● MEDIUM recurrence
Sev 2 · Mitigated · LSI
AI Summary
SMAPI success rate dropped below 99.95% for a consumption regional deployment in UK South. Likely transient and correlated with the Service Bus connectivity issue on the same day. Mitigated.
Sev 2 · Mitigated · LSI · Recurring
AI Summary
SKUv1 activation success rate dropped below 95% SLA on api-cbr2-prod-01-rp (Australia Central 2). Investigation confirmed this is a known issue: customer activations are already disabled for this region and all failures are from internal devPortalRnr-* test services. Zero customer services in last 60 days. DRI notes: "Will review why this is even part of our alerts." ● Recurring (noise)
Ownership Summary
| Owner | Count | Key Themes |
|---|---|---|
| ondrejoprala | 4 | CRI HTTP 500, Gateway EUAP false alarm, RP orchestration stall, SMAPI UK South |
| glfeokti | 3 | Service Bus outage (4 regions), SKUv1 activation noise, Fairfax orchestration (Gov) |
| nehagup | 2 | BRAIN SLI France Central outage, DevPortal activation EUAP |
| nimakamoosi | 1 | Gateway UK South transient (VMSS rolling upgrade) |
| shubhash | 1 | Capacity exception request (UK Gov customer) |
| tomkerkhove | 1 | AOAI Gateway overhead latency Sweden Central |
| maximagapov | 1 | AzSysLock Git binary code sign (recurring) |
| srananda | 1 | VMSS exception emerging issue |
Alert Trends — Top Firing Monitors (7 Days)
Sev2+ monitors that fired 2+ times
| Monitor / Alert Pattern | Sev | Count | Sample Title | Noise Assessment |
|---|---|---|---|---|
| SMAPI: Event Delay to Gateway >10 min | SEV3 | 132 | SMAPI event delay to Gateway over 10 minutes — various services | ● Very High Volume — All 132 incidents still ACTIVE (Sev3). Potential threshold tuning needed. |
| ActivateSkuV2 Failed Orchestration | SEV3 | 39 | Orchestration unhealthy across multiple RP tenants | ● HIGH — All ACTIVE. Fires across many RP clusters. Consider severity/auto-heal. |
| ApiMgmtSynthetic-Prod Unhealthy | SEV3 | 26 | ApimWatmJob for various regions is unhealthy | ● MEDIUM — Distributed across regions. Likely transient synthetic probe failures. |
| [RCM] SKUv1 Region Running Out of Capacity | SEV3 | 25 | Various regions — StorageAccount/NotEnoughRunway | ● MEDIUM — Capacity planning alerts. Review runway thresholds. |
| Miscellaneous (No MonitorId) | SEV2-3 | 21 | Mixed: CreateBlade prevalidation, capacity requests, portal alerts, MSRC | Mix of CRIs, portal alerts, and manually-filed incidents with no monitor correlation. |
| SMAPI Success Rate <99.95% (Single Customer) | SEV3 | 17 | Per-customer SMAPI success rate drops across regions | ● MEDIUM — All ACTIVE. May indicate real customer impact or noisy threshold. |
| [RCM] SKUv2 Subscription Capacity Below Threshold | SEV3 | 15 | Various regions — KeyVaultV2 capacity concerns | ● LOW — Capacity alerts. Proactive, not impacting. |
| GarbageCollection Eternal Orchestration Stalled | SEV3 | 12 | GarbageCollection has not completed in last 2 hours | ● MEDIUM — All ACTIVE. Needs auto-restart or self-healing. |
| [RCM] High Compute Pool Usage | SEV3 | 12 | Scale-outs impacted in various regions | ● MEDIUM — Capacity warning. Monitor for escalation. |
| SKUv2 Activation Duration >5 min | SEV3 | 10 | Various regions — slow activations | ● LOW — Performance concern, not failure. |
| AppServiceCertificateExpiringSoon | SEV3 | 10 | Certificate expiry warnings across RP tenants | ● MEDIUM — Actionable: renew certificates before expiry. |
| SKUv2 Antares VM/AZ Quota >70% | SEV3 | 9 | North Europe — quota utilization warning | ● LOW — Proactive capacity alerting. |
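To put the table's volumes in proportion, the counts can be ranked and converted to a share of the listed total. The following is a minimal sketch, not part of the dashboard tooling: the counts are copied from the table above, and `ALERT_COUNTS` and `volume_shares` are hypothetical names introduced for illustration.

```python
# Counts copied from the 7-day table above; everything else is illustrative.
ALERT_COUNTS = {
    "SMAPI: Event Delay to Gateway >10 min": 132,
    "ActivateSkuV2 Failed Orchestration": 39,
    "ApiMgmtSynthetic-Prod Unhealthy": 26,
    "[RCM] SKUv1 Region Running Out of Capacity": 25,
    "Miscellaneous (No MonitorId)": 21,
    "SMAPI Success Rate <99.95% (Single Customer)": 17,
    "[RCM] SKUv2 Subscription Capacity Below Threshold": 15,
    "GarbageCollection Eternal Orchestration Stalled": 12,
    "[RCM] High Compute Pool Usage": 12,
    "SKUv2 Activation Duration >5 min": 10,
    "AppServiceCertificateExpiringSoon": 10,
    "SKUv2 Antares VM/AZ Quota >70%": 9,
}


def volume_shares(counts):
    """Return (monitor, count, share_of_total) tuples, highest count first."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, n, n / total) for name, n in ranked]


if __name__ == "__main__":
    for name, n, share in volume_shares(ALERT_COUNTS):
        print(f"{n:>4}  {share:6.1%}  {name}")
```

Across the 328 incidents listed, the SMAPI event delay monitor alone accounts for about 40% of the volume.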
Key Takeaways & Action Items
1. Service Bus dependency outage: Service Bus deployment broke APIM control plane in 4 regions (IcM 795519059). PIR assigned. Consider adding Service Bus deployment change notifications to Sloop monitoring.
2. Storage key rotation outage: BRAIN detected SMAPI auth failures in France Central due to storage key rotation (IcM 793601657). Ensure automated key rotation includes SMAPI settings update in the same operation.
3. EUAP orchestration noise: Eternal Orchestration (HealthMonitor, GarbageCollection, ActivateSkuV2) continues to fire frequently in EUAP. 39+ ActivateSkuV2 failures, 12 GC stalls this week. Auto-restart logic urgently needed.
4. SMAPI event delay flood: 132 Sev3 incidents for SMAPI event delay to Gateway >10 min. All still ACTIVE. This is the highest-volume monitor. Investigate root cause or adjust threshold.
5. SKUv1 activation noise: Australia Central 2 has customer activations disabled but still fires activation SLA alerts from RnR test services. Filter test services from SLA calculation.
6. Capacity alerts healthy: RCM capacity monitors (SKUv1 region, SKUv2 subscription, compute pool) are firing as designed. 25+ capacity warnings suggest proactive runway management is working.
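The SKUv1 activation noise item suggests filtering test services out of the SLA calculation. One possible shape for that filter is sketched below, assuming test services can be recognized by name prefix. The `devPortalRnr-` prefix comes from the incident notes. The `Activation` record, the function name, and the decision to skip evaluation when no customer activations exist are assumptions for illustration, not the monitor's actual implementation.

```python
from dataclasses import dataclass

# Assumption: internal test services are identifiable by name prefix.
# "devPortalRnr-" is taken from the incident notes; add others as needed.
TEST_SERVICE_PREFIXES = ("devPortalRnr-",)


@dataclass
class Activation:
    """Hypothetical record of one activation attempt."""
    service_name: str
    succeeded: bool


def customer_success_rate(activations, test_prefixes=TEST_SERVICE_PREFIXES):
    """Success rate over non-test services only.

    Returns None when there are no customer activations, so a caller can
    skip the SLA check instead of alerting on test traffic alone.
    """
    customer = [
        a for a in activations
        if not a.service_name.startswith(test_prefixes)
    ]
    if not customer:
        return None
    return sum(a.succeeded for a in customer) / len(customer)


if __name__ == "__main__":
    # A region with only failing test services yields no SLA signal.
    only_tests = [
        Activation("devPortalRnr-001", False),
        Activation("devPortalRnr-002", False),
    ]
    print(customer_success_rate(only_tests))
```

Returning `None` rather than 0% or 100% keeps "no customer traffic" distinct from a real success rate, which matches the Australia Central 2 case where customer activations are disabled.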