Weekly Operations Report

APIM Incident Report

API Management — Apr 30 – May 6, 2026

Team: APIM / ServicingLoop
Period: Last 7 days
Generated: May 6, 2026 16:03 UTC
Total Sev2+: 21
Active: 3
Mitigated: 15
Resolved: 3
Emerging: 3
Repeated: 5
Outages: 0

On-Call Rotation (Sloop)

Incident Manager
▶ INCOMING
Martin Dechev (Primary)
Vitalii Kurokhtin (Primary)
◀ OUTGOING
Neha Gupta (Primary)
Joaquin Vano Newman (Primary)
Rafal Mielowski (Primary)
US Sloop
▶ INCOMING
Nima Kamoosi (Primary)
Gleb Feoktistov (Backup)
◀ OUTGOING
Ethan Lao (Primary)
Nima Kamoosi (Backup)
EU Sloop
▶ INCOMING
Shubham Sharma (Primary)
Ondrej Oprala (Backup)
◀ OUTGOING
Maxim Agapov (Primary)
Macko Treder (Backup)

Active Incidents

3 open
IcM: 51000001009580 | Owner: maximagapov | Team: Backend | Created: May 6, 08:26 UTC
Summary
Customer-reported intermittent HTTP 500 errors. Opened today, under active investigation by Backend on-call. No further details available yet (newly opened).
IcM: 791543403 | Owner: ethanlao | Team: Backend | Created: May 4
Root Cause
The SCUBA MCPBurndown2 campaign flagged an unauthenticated MCP endpoint under APIM's ServiceTree ID. This is a security compliance finding, not a service health issue.
Mitigation
DRI assessed: "These are customer endpoints, no action APIM can take." Needs AIM.Assess tagging per MCP TSG. ● MEDIUM recurrence — more burndown alerts expected.
IcM: 791523122 | Owner: alzaslon | Team: ServicingLoop | Created: May 4
Root Cause
Misrouted to APIM. Belongs to Azure API Center (ServiceId 574680a6). Two endpoints are the intentionally-public mcp.azure.com registry. Others belong to ACA/App Service.
Mitigation
Tagged AIM.Exception:Public for registry endpoints. Requesting re-routing of non-API-Center endpoints to respective owners. ● LOW recurrence

Emerging Issues

3 tracked
IcM: 788655236 | Owner: maximagapov | Team: Backend | Created: Apr 30 | Escalated Sev2→Sev1 (auto-escalation at 20+ cases, May 6)
Root Cause
Developer Portal user registration returning "User registration is not supported" after platform change. Multiple customers across regions affected. New support cases still being linked daily (latest May 6: apim-ynv-ai-lz-uat). Customers requesting rollbacks.
Mitigation
Marked mitigated (workaround documented), but new cases still arriving. RCA requested by multiple support engineers. Some customers need explicit service rollback.
Recurrence
● HIGH — Still accumulating. Root fix unclear. Highest customer impact this week.
Impact
20+ unique support cases. Multiple regions. Auto-escalated to Sev1. RCA timeline being demanded by CSS.
IcM: 791563622 | Owner: glfeokti | Team: Platform | Filed by: tehnoonr (escalated from Sev3)
Root Cause
VNET-injected SKUv1 services lose intermediate SSL certs during OS reimage. Outbound TCP port 80 blocked in NSG prevents AIA cert download. Gateway serves incomplete chain → 502s from AFD/AppGW. Reboots use cached certs (fine); reimages need fresh download (breaks).
Mitigation
Customer fix: open outbound TCP 80 + Apply Network Settings. Transferred to Platform for permanent fix. ● HIGH recurrence — will continue with each reimage cycle for affected NSG configs.
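The customer-side condition described above (outbound TCP 80 blocked, so the AIA certificate download fails after a reimage) could in principle be screened for ahead of a reimage cycle. The following is a minimal sketch only, assuming NSG rules have been exported into simple records. The `Rule` type, its fields, and the `outbound_tcp_allowed` helper are hypothetical and do not mirror any Azure SDK; real NSG evaluation also involves port ranges, address prefixes, and service tags, which are ignored here. The `default_allow=True` fallback models the platform default rule that permits outbound Internet traffic when no custom rule matches.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Rule:
    """Hypothetical, simplified view of one NSG security rule."""
    priority: int   # lower number is evaluated first
    direction: str  # "Inbound" or "Outbound"
    protocol: str   # "Tcp", "Udp", or "*"
    port: str       # a single port such as "80", or "*"
    access: str     # "Allow" or "Deny"


def outbound_tcp_allowed(rules: Iterable[Rule], port: int,
                         default_allow: bool = True) -> bool:
    """Return the decision of the first matching outbound rule, by priority.

    Falls back to default_allow when no custom rule matches (assumption:
    the default outbound-to-Internet allow rule is still in effect).
    """
    outbound = sorted(
        (r for r in rules if r.direction == "Outbound"),
        key=lambda r: r.priority,
    )
    for rule in outbound:
        protocol_matches = rule.protocol in ("Tcp", "*")
        port_matches = rule.port in (str(port), "*")
        if protocol_matches and port_matches:
            return rule.access == "Allow"
    return default_allow


# Illustrative configuration: HTTPS is allowed, everything else is denied.
example_rules = [
    Rule(100, "Outbound", "Tcp", "443", "Allow"),
    Rule(200, "Outbound", "*", "*", "Deny"),
]
at_risk = not outbound_tcp_allowed(example_rules, 80)
```

A service whose NSG evaluates as `at_risk` would be a candidate for the documented customer fix before its next reimage.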
IcM: 789293664 | Owner: tehnoonr | Team: ServicingLoop → Platform | Created: Apr 30
Root Cause
GatewayV2 + HTTP.SYS breaks the MAPI endpoint. URL prefix registration excludes control plane hostnames by design (ControlPlaneForwarderFilter.cs, lines 50–54), and management hosting failed to bind them. Code gap confirmed with HIGH confidence (5 independent evidence blocks).
Mitigation
GatewayV2 rolled back for all impacted via ACIS feature flag. Fix in ProxyHostSettingsBuilder.ProcessSpecialCases shipping in next release. ● LOW recurrence — rollout halted.

Repeated Sev2 Alerts

5 monitors
Azure Portal Blades Failed to Load (AzurePortalWAWSAlert)
Sev 2 | Resolved | 4x / 30 days
Incidents: 792621746, 792153403 | Owner: maximagapov, javierbo
Root Cause
Transient external connectivity issues cause 5–7 users to see ServiceBlade load failures, barely exceeding 5-user threshold. Backend healthy. 95% of 5xx from test service apim-gfs-api-tst-centralus. DRI confirmed via Grafana: external connectivity, not regression.
Action Needed
● Fires ~weekly (noise)
All 4 recent occurrences were transient false alarms. Raise the threshold above 5 users or add a sustained-failure criterion to stop the Sev2 noise.
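One way to express the sustained-failure criterion suggested for this monitor is sketched below. This is an illustration under stated assumptions, not the monitor's actual configuration: the `should_fire` function, the per-window user counts, and the default of 3 consecutive windows are all hypothetical. Only the 5-user threshold comes from this report.

```python
from typing import Sequence


def should_fire(affected_users: Sequence[int],
                threshold: int = 5,
                sustained_windows: int = 3) -> bool:
    """Fire only if the most recent windows ALL exceed the threshold.

    affected_users holds one count per evaluation window, oldest first.
    A single spike above the threshold is not enough to raise the alert.
    """
    if len(affected_users) < sustained_windows:
        return False  # not enough history to call the failure sustained
    recent = affected_users[-sustained_windows:]
    return all(count > threshold for count in recent)


# A one-window blip of 7 users (the transient pattern) stays quiet,
# while three consecutive windows above the threshold raise the alert.
transient = should_fire([2, 7, 1])
sustained = should_fire([6, 7, 9])
```

Requiring consecutive breaches trades a slower alert for fewer false positives, which matches the pattern here: every recent occurrence self-resolved.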
SKUv1 Activation SuccessRate Below 95% SLA
Sev 2 | Mitigated | ~Weekly
Incidents: 789198187 (Chile Central), 789035878 (East US) | Owner: savukyam | Zero customer impact
Root Cause
Azure SQL ConflictingServerOperation — concurrent DB creation from internal RnR test services. All failures are devPortalRnr-*/rpRnr-*. Zero customer services affected. Self-recovered (transient).
Action Needed
● ~1/week across regions
Always internal-only. Filter RnR services from SLA calculation or lower severity for internal-only failures.
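The proposed filtering of RnR services from the SLA calculation could look roughly like the sketch below. It assumes activation results are available as (service name, succeeded) pairs. The `success_rate` helper and the sample service names are hypothetical; the two prefixes are taken from the failure pattern noted above (devPortalRnr-*, rpRnr-*).

```python
from typing import Iterable, Optional, Tuple

# Prefixes of internal RnR test services, as named in this report.
INTERNAL_PREFIXES = ("devPortalRnr-", "rpRnr-")


def success_rate(activations: Iterable[Tuple[str, bool]],
                 exclude_internal: bool = True) -> Optional[float]:
    """Return the activation success rate in percent.

    Each item is (service_name, succeeded). When exclude_internal is True,
    services whose names start with an internal prefix are left out.
    Returns None if no activations remain after filtering.
    """
    considered = [
        succeeded
        for name, succeeded in activations
        if not (exclude_internal and name.startswith(INTERNAL_PREFIXES))
    ]
    if not considered:
        return None
    return 100.0 * sum(considered) / len(considered)


# Hypothetical sample: both failures come from internal test services.
sample = [
    ("contoso-prod", True),
    ("devPortalRnr-01", False),
    ("rpRnr-02", False),
    ("fabrikam-prod", True),
]
raw_rate = success_rate(sample, exclude_internal=False)  # 50.0
customer_rate = success_rate(sample)                     # 100.0
```

With this sample the unfiltered rate would breach a 95% SLA, while the customer-only rate would not, which is the distinction the action item is after.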
RP Orchestration: Health Monitor Stalled (EUAP Only)
Sev 2 | Mitigated | 8x / 30 days
Incidents: 790136306, 790136326 | Owner: savukyam | Region: EUAP only | No customer impact
Root Cause
The HealthMonitorCheckApiServicesHealth SKUv1 track stalls on an SF partition. The gradual decline (2000/hr → 0) suggests resource exhaustion following a prior spike. The SKUv2 track is unaffected. The exact mechanism cannot be diagnosed yet because the required SF partition metrics are not available in Kusto.
Action Needed
Mitigated via Geneva Action "Control Health Monitoring on Platform" (stop+start). ● Weekly in EUAP
Needs: SF partition telemetry + auto-restart logic. Root cause still undiagnosed.
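The detection half of the auto-restart logic requested above could be approximated from the hourly check counts alone. The sketch below is an assumption-laden illustration: `is_stalled`, its `zero_hours` parameter, and the sample series are hypothetical, and the restart step itself (the stop+start Geneva Action) is deliberately left out because its invocation details are not covered in this report.

```python
from typing import Sequence


def is_stalled(hourly_counts: Sequence[int], zero_hours: int = 2) -> bool:
    """Detect a stall: recent hours are all zero after earlier activity.

    hourly_counts holds health checks completed per hour, oldest first.
    Requiring earlier non-zero activity avoids flagging a track that
    was never running in the observed period.
    """
    if len(hourly_counts) <= zero_hours:
        return False  # not enough history before the zero tail
    tail = hourly_counts[-zero_hours:]
    head = hourly_counts[:-zero_hours]
    return all(count == 0 for count in tail) and any(count > 0 for count in head)


# Hypothetical series shaped like the observed decline (2000/hr down to 0).
declining = [2000, 1200, 400, 0, 0]
restart_needed = is_stalled(declining)
```

A detector of this kind only shortens time to mitigation; it does not replace the SF partition telemetry needed to find the root cause.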
AzureSecurityIR: Exposed MCP Endpoints (MCPBurndown2)
Sev 2 | Active
Incidents: 791543403, 791523122 | Owners: ethanlao, alzaslon
Summary
Part of the Microsoft-wide MCP Burndown campaign. One incident concerns customer endpoints (no action APIM can take). The other is a misroute from API Center that includes the intentionally-public registry. Action: tag AIM.Assess, close per TSG, and create a dedicated TSG for future alerts.
AzSysLock: Git Binary Code Sign Policy Violations (Audited Only)
Sev 2 | Mitigated
Incidents: 791852951, 790622701 | Owner: maximagapov | Binaries: libssl-3-x64.dll, git.exe
Summary
AzureSecurityPack detected Git binaries violating code sign policy in audit mode. Non-blocking, no service impact. ● MEDIUM — will recur until Git binaries updated or allowlisted on affected VMs.

Other Mitigated & Resolved Incidents

10 closed
Owner: glfeokti | Created: May 5
Root Cause
App Service P1v3 capacity exhaustion in the MWH2 datacenter. HTTP 409: "No available instances." Success rate dropped to 56.25%. Failures started May 1 but the alert only fired May 5 (5-day gap).
Mitigation
App Service (Splinter Twin) team increased capacity. 91% of failures were internal pool replenishment. Only 4 customer services were affected. The alerting gap needs a repair item.
Owner: brucemoe | Team: Gateway | Created: May 1
Root Cause
GatewayV2 race condition in ContinueWith callback for incomplete payload transfer. Bug 25298069. Fix exists in PR #15280162 but not yet rolled out to AOAI services.
Mitigation
Rolled back to Gateway V1 for affected services. Code fix pending deployment to AOAI fleet.
Owner: maximagapov | Created: May 2
Summary
Customer CRI linked to cert chain emerging issue (791563622). Incomplete cert on default domain post-upgrade + traffic imbalance on instance _23. Related to VNET/NSG/reimage pattern.
Owner: savukyam | Team: Platform | Created: May 1
Summary
Customer-reported 504 timeout errors from Application Gateway to APIM backend. Investigated and mitigated by Platform on-call.
Owner: ethanlao | Team: Backend | Created: May 1
Summary
MSRC report: PATCH endpoints (authorizationServers, openidConnectProviders, subscriptions) return cleartext secrets in response to users with RBAC. Resolved by Backend.
Sev 2 | Mitigated | CRI
Owner: maximagapov | Team: Backend | Created: Apr 30
Summary
Customer-reported scale-out operation failure. Investigated and mitigated by Backend on-call.
Owner: savukyam | Team: Platform | Created: Apr 30
Summary
Strategic customer reporting intermittent connection failures. Investigated and mitigated by Platform on-call.
Owner: maximagapov | Team: Backend | Created: May 6
Summary
Customer reported unexpected log entries appearing in their Application Insights instance connected to APIM. Resolved same day.
Owner: maximagapov | Created: May 6, 12:04 UTC | Resolved: May 6, 14:49 UTC (TTM: 2h43m)
Summary
7 users experienced ServiceBlade load failure. Confirmed transient via Grafana — connection issue to external service, not regression. Self-resolved. (See Repeated Alerts section above.)
Owner: javierbo | Team: AzurePortal | Created: May 5
Summary
Same pattern as above — transient blade failures, self-healed. Mitigated within 39 minutes.

Ownership Summary

Owner | Count | Key Themes
maximagapov | 7 | Developer Portal Sev1 (highest impact), Portal blades, AzSysLock, CRI cert chain, scale-out, App Insights
savukyam | 5 | SKUv1 activations, RP orchestration stalls, APPGW timeout, connection failures (all Platform/internal)
ethanlao | 2 | Security: MCP endpoint, MSRC secrets leak
glfeokti | 2 | Certificate chain emerging issue, SKUv2 activation (App Service capacity)
alzaslon | 1 | MCP endpoint (API Center misroute)
brucemoe | 1 | GatewayV2 DotNetty race condition (AOAI)
tehnoonr | 1 | GatewayV2 MAPI emerging issue
javierbo | 1 | Portal blade transient (AzurePortal team)

Key Takeaways & Action Items