APIM Incident Report — May 6, 2026

On-Call Rotation (Sloop)

Incident Manager

▶ INCOMING

Martin Dechev (Primary)
Vitalii Kurokhtin (Primary)

◀ OUTGOING

Neha Gupta (Primary)
Joaquin Vano Newman (Primary)
Rafal Mielowski (Primary)

US Sloop

▶ INCOMING

Nima Kamoosi (Primary)
Gleb Feoktistov (Backup)

◀ OUTGOING

Ethan Lao (Primary)
Nima Kamoosi (Backup)

EU Sloop

▶ INCOMING

Shubham Sharma (Primary)
Ondrej Oprala (Backup)

◀ OUTGOING

Maxim Agapov (Primary)
Macko Treder (Backup)

Active Incidents

3 open

Intermittent 500 responses from APIM

Sev 2 | Active | CRI

IcM: 51000001009580 | Owner: maximagapov | Team: Backend | Created: May 6, 08:26 UTC

Summary

Customer-reported intermittent HTTP 500 errors. Opened today, under active investigation by Backend on-call. No further details available yet (newly opened).

[Publisher-Prod] AzureSecurityIR — Exposed MCP Endpoint | API Management

Sev 2 | Active | LSI

IcM: 791543403 | Owner: ethanlao | Team: Backend | Created: May 4

Root Cause

SCUBA MCPBurndown2 campaign flagged unauthenticated MCP endpoint on APIM's ServiceTree ID. Security compliance finding, not a service health issue.

Mitigation

DRI assessed: "These are customer endpoints, no action APIM can take." Needs AIM.Assess tagging per MCP TSG. ● MEDIUM recurrence — more burndown alerts expected.

[Publisher-Prod] AzureSecurityIR — Exposed MCP Endpoint | Azure API Center

Sev 2 | Active | LSI

IcM: 791523122 | Owner: alzaslon | Team: ServicingLoop | Created: May 4

Root Cause

Misrouted to APIM. Belongs to Azure API Center (ServiceId 574680a6). Two endpoints are the intentionally-public mcp.azure.com registry. Others belong to ACA/App Service.

Mitigation

Tagged AIM.Exception:Public for registry endpoints. Requesting re-routing of non-API-Center endpoints to respective owners. ● LOW recurrence

Emerging Issues

3 tracked

Emerging Issue: Developer Portal User Registration Fails — "User registration is not supported."

Sev 1 | Mitigated | 20+ Cases

IcM: 788655236 | Owner: maximagapov | Team: Backend | Created: Apr 30 | Escalated Sev2→Sev1 (20+ cases auto-escalation, May 6)

Root Cause

Developer Portal user registration returning "User registration is not supported" after platform change. Multiple customers across regions affected. New support cases still being linked daily (latest May 6: apim-ynv-ai-lz-uat). Customers requesting rollbacks.

Mitigation

Marked mitigated (workaround documented), but new cases still arriving. RCA requested by multiple support engineers. Some customers need explicit service rollback.

Recurrence

● HIGH — Still accumulating. Root fix unclear. Highest customer impact this week.

Impact

20+ unique support cases. Multiple regions. Auto-escalated to Sev1. RCA timeline being demanded by CSS.

Emerging Issue: Incomplete Certificate Chain for Gateway Endpoint After Upgrade

Sev 2 | Mitigated | Recurring CRIs

IcM: 791563622 | Owner: glfeokti | Team: Platform | Filed by: tehnoonr (escalated from Sev3)

Root Cause

VNET-injected SKUv1 services lose intermediate SSL certs during OS reimage. Outbound TCP port 80 blocked in NSG prevents AIA cert download. Gateway serves incomplete chain → 502s from AFD/AppGW. Reboots use cached certs (fine); reimages need fresh download (breaks).

Mitigation

Customer fix: open outbound TCP 80 + Apply Network Settings. Transferred to Platform for permanent fix. ● HIGH recurrence — will continue with each reimage cycle for affected NSG configs.
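
One way to spot the symptom from the outside is to handshake with a client that does not fetch missing intermediates on its own. The sketch below is illustrative only: it uses Python's standard `ssl` module, and the function names and result labels are hypothetical. OpenSSL verify codes 20 and 21 are consistent with a missing intermediate, but they also appear when the root is simply untrusted, and results depend on what the local trust store already contains.

```python
import socket
import ssl

# OpenSSL verify codes that commonly appear when an issuer cannot be found.
# They are consistent with a missing intermediate but do not prove it.
MISSING_ISSUER_CODES = {
    20: "unable to get local issuer certificate",
    21: "unable to verify the first certificate",
}


def classify_verify_code(code: int) -> str:
    """Map an OpenSSL verify code to a coarse, hypothetical label."""
    if code in MISSING_ISSUER_CODES:
        return "possible-incomplete-chain"
    return "other-verification-failure"


def probe_chain(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Attempt a verified TLS handshake and summarize the outcome."""
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return "chain-verified"
    except ssl.SSLCertVerificationError as err:
        # Must be caught before OSError, which it inherits from.
        return classify_verify_code(err.verify_code)
    except OSError as err:
        return f"connection-error: {err}"
```

Calling `probe_chain("<gateway-hostname>")` against a gateway before and after a reimage would show whether the result changes. A `possible-incomplete-chain` result is only a prompt to inspect the served chain, not a diagnosis.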

Emerging Issue: 503 from MAPI Endpoint After GatewayV2 + HTTP.SYS Upgrade

Sev 3 | Mitigated

IcM: 789293664 | Owner: tehnoonr | Team: ServicingLoop → Platform | Created: Apr 30

Root Cause

GatewayV2 + HTTP.SYS breaks the MAPI endpoint. URL prefix registration excludes control plane hostnames by design (ControlPlaneForwarderFilter.cs, lines 50–54), and management hosting failed to bind them. Code gap confirmed with HIGH confidence (5 independent evidence blocks).

Mitigation

GatewayV2 rolled back for all impacted via ACIS feature flag. Fix in ProxyHostSettingsBuilder.ProcessSpecialCases shipping in next release. ● LOW recurrence — rollout halted.

Repeated Sev2 Alerts

5 monitors

Azure Portal Blades Failed to Load (AzurePortalWAWSAlert)

Sev 2 | Resolved | 4x / 30 days

Incidents: 792621746, 792153403 | Owners: maximagapov, javierbo

Root Cause

Transient external connectivity issues cause 5–7 users to see ServiceBlade load failures, barely exceeding 5-user threshold. Backend healthy. 95% of 5xx from test service apim-gfs-api-tst-centralus. DRI confirmed via Grafana: external connectivity, not regression.

Action Needed

● Fires ~weekly (noise)
All 4 recent occurrences = transient/false alarm. Raise threshold from 5 users or add sustained-failure criteria to stop Sev2 noise.
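
The "sustained-failure criteria" option could look roughly like the sketch below. It is a minimal illustration, not the monitor's actual logic: the function name, the per-window user counts, and the choice of three consecutive windows are all assumptions. Only the 5-user threshold comes from the report.

```python
from typing import Sequence


def should_fire(affected_users: Sequence[int],
                threshold: int = 5,
                sustain: int = 3) -> bool:
    """Fire only if the last `sustain` windows all exceed `threshold`.

    `affected_users` holds one count per evaluation window, oldest first.
    The window length and `sustain` value are hypothetical tuning knobs.
    """
    if sustain <= 0 or len(affected_users) < sustain:
        return False
    return all(count > threshold for count in affected_users[-sustain:])


# A single transient blip of 7 users does not fire.
print(should_fire([0, 1, 7, 2, 0]))  # False
# Three consecutive windows above the threshold do.
print(should_fire([2, 6, 7, 8]))     # True
```

The trade-off is detection delay: a real regression fires `sustain` windows later than it would today, so the window length should be chosen with that in mind.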

SKUv1 Activation SuccessRate Below 95% SLA

Sev 2 | Mitigated | ~Weekly

Incidents: 789198187 (Chile Central), 789035878 (East US) | Owner: savukyam | Zero customer impact

Root Cause

Azure SQL ConflictingServerOperation — concurrent DB creation from internal RnR test services. All failures are devPortalRnr-*/rpRnr-*. Zero customer services affected. Self-recovered (transient).

Action Needed

● ~1/week across regions
Always internal-only. Filter RnR services from SLA calculation or lower severity for internal-only failures.
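
Filtering RnR services out of the SLA calculation could be sketched as follows. This is an illustration under assumptions: the input shape, a list of (service name, succeeded) pairs, and the function name are hypothetical, and the sample service names are invented. The two prefixes are the ones named in the root cause above.

```python
from typing import Iterable, Optional, Tuple

# Prefixes of internal RnR test services, as named in the report.
INTERNAL_PREFIXES = ("devPortalRnr-", "rpRnr-")


def success_rate(activations: Iterable[Tuple[str, bool]],
                 exclude_internal: bool = True) -> Optional[float]:
    """Return the activation success rate, or None if nothing was counted."""
    considered = [
        ok for name, ok in activations
        if not (exclude_internal and name.startswith(INTERNAL_PREFIXES))
    ]
    if not considered:
        return None
    return sum(considered) / len(considered)


# Hypothetical sample: two customer successes, two internal RnR failures.
sample = [
    ("contoso-prod", True),
    ("fabrikam-dev", True),
    ("devPortalRnr-001", False),
    ("rpRnr-042", False),
]
print(success_rate(sample, exclude_internal=False))  # 0.5
print(success_rate(sample))                          # 1.0
```

Returning `None` when every activation is internal keeps an empty customer sample from reading as either 0% or 100%.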

RP Orchestration: Health Monitor Stalled (EUAP Only)

Sev 2 | Mitigated | 8x / 30 days

Incidents: 790136306, 790136326 | Owner: savukyam | Region: EUAP only | No customer impact

Root Cause

HealthMonitorCheckApiServicesHealth SKUv1 track stalls on SF partition. Gradual decline (2000/hr → 0) suggests resource exhaustion from prior spike. SKUv2 track unaffected. Exact mechanism undiagnosable (need SF partition metrics not in Kusto).

Action Needed

Mitigated via Geneva Action "Control Health Monitoring on Platform" (stop+start). ● Weekly in EUAP
Needs: SF partition telemetry + auto-restart logic. Root cause still undiagnosed.
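
As a starting point for the auto-restart logic, a stall check over hourly health-check counts might look like the sketch below. It is a hedged illustration: the function name, the 5% floor, and the two-hour confirmation window are assumptions. Only the 2000/hr baseline comes from the report, and the restart itself (the Geneva Action stop+start) is deliberately left out.

```python
from typing import Sequence


def is_stalled(hourly_checks: Sequence[int],
               baseline: int = 2000,
               floor_ratio: float = 0.05,
               hours: int = 2) -> bool:
    """True if the last `hours` samples are all below floor_ratio * baseline.

    `hourly_checks` holds health checks completed per hour, oldest first.
    """
    if hours <= 0 or len(hourly_checks) < hours:
        return False
    floor = baseline * floor_ratio
    return all(value < floor for value in hourly_checks[-hours:])


# Gradual decline toward zero, as described in the root cause.
print(is_stalled([2000, 1400, 600, 50, 0]))  # True
# Normal fluctuation around the baseline.
print(is_stalled([2000, 1900, 2100]))        # False
```

Requiring more than one low sample avoids restarting on a single slow hour. It does not explain the stall, so the SF partition telemetry is still needed for a root cause.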

AzureSecurityIR: Exposed MCP Endpoints (MCPBurndown2)

Sev 2 | Active

Incidents: 791543403, 791523122 | Owners: ethanlao, alzaslon

Summary

Microsoft-wide MCP Burndown campaign. One incident covers customer endpoints (no action APIM can take). The other was misrouted to APIM and belongs to Azure API Center, whose registry endpoints are intentionally public. Action: tag AIM.Assess, close per TSG, create dedicated TSG for future alerts.

AzSysLock: Git Binary Code Sign Policy Violations (Audited Only)

Sev 2 | Mitigated

Incidents: 791852951, 790622701 | Owner: maximagapov | Binaries: libssl-3-x64.dll, git.exe

Summary

AzureSecurityPack detected Git binaries violating code sign policy in audit mode. Non-blocking, no service impact. ● MEDIUM — will recur until Git binaries updated or allowlisted on affected VMs.

Other Mitigated & Resolved Incidents

10 closed

SKUv2 Customer Activation SuccessRate Below 95% — West US 2

Sev 2 | Mitigated | LSI

Owner: glfeokti | Created: May 5

Root Cause

App Service P1v3 capacity exhaustion in MWH2 datacenter. HTTP 409: "No available instances." Rate dropped to 56.25%. Failures started May 1 but alert only fired May 5 (5-day gap).

Mitigation

App Service (Splinter Twin) team increased capacity. 91% failures were internal pool replenishment. Only 4 customer services affected. Alerting gap needs repair item.
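
The report does not say why the alert fired late. One plausible cause, assumed here purely for illustration, is that short evaluation windows held too few activations to be judged. Under that assumption, a repair item could evaluate the same SLA over a longer lookback as well. The bucket shape, function names, and minimum-sample value below are hypothetical. The sample data is chosen so that the aggregate is 9 of 16, matching the reported 56.25%.

```python
from typing import Optional, Sequence, Tuple

Bucket = Tuple[int, int]  # (successes, attempts) for one time bucket


def window_rate(buckets: Sequence[Bucket], lookback: int,
                min_attempts: int) -> Optional[float]:
    """Success rate over the last `lookback` buckets, or None if too sparse."""
    recent = buckets[-lookback:]
    successes = sum(s for s, _ in recent)
    attempts = sum(a for _, a in recent)
    if attempts < min_attempts:
        return None  # not enough traffic to judge this window
    return successes / attempts


def breaches_sla(buckets: Sequence[Bucket], lookbacks: Sequence[int],
                 min_attempts: int = 10, sla: float = 0.95) -> bool:
    """True if any lookback window has enough data and is below the SLA."""
    for lookback in lookbacks:
        rate = window_rate(buckets, lookback, min_attempts)
        if rate is not None and rate < sla:
            return True
    return False


# Hypothetical low-volume region: 4 attempts per bucket, 9 of 16 succeed.
buckets = [(2, 4), (3, 4), (2, 4), (2, 4)]
print(breaches_sla(buckets, lookbacks=[1]))     # False: 4 attempts is too few
print(breaches_sla(buckets, lookbacks=[1, 4]))  # True: 9/16 = 0.5625 < 0.95
```

If the real cause turns out to be something else, for example the 91% internal pool-replenishment traffic being excluded from the monitor, this sketch would not apply as written.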

Intermittent 500 errors with DotNetty.Codecs.EncoderException (AOAI deployments)

Sev 2 | Mitigated | CRI

Owner: brucemoe | Team: Gateway | Created: May 1

Root Cause

GatewayV2 race condition in ContinueWith callback for incomplete payload transfer. Bug 25298069. Fix exists in PR #15280162 but not yet rolled out to AOAI services.

Mitigation

Rolled back to Gateway V1 for affected services. Code fix pending deployment to AOAI fleet.

[CRI] LEGRAND FRANCE — Incomplete cert chain + unbalanced traffic after May 1 upgrade

Sev 2 | Mitigated | CRI

Owner: maximagapov | Created: May 2

Summary

Customer CRI linked to cert chain emerging issue (791563622). Incomplete cert on default domain post-upgrade + traffic imbalance on instance _23. Related to VNET/NSG/reimage pattern.

Requests from APPGW to APIM — 504 Timeout

Sev 2 | Mitigated | CRI

Owner: savukyam | Team: Platform | Created: May 1

Summary

Customer-reported 504 timeout errors from Application Gateway to APIM backend. Investigated and mitigated by Platform on-call.

[MSRC] PATCH response leaks cleartext secrets via RBAC

Sev 2 | Resolved | MSRC

Owner: ethanlao | Team: Backend | Created: May 1

Summary

MSRC report: PATCH endpoints (authorizationServers, openidConnectProviders, subscriptions) return cleartext secrets in response to users with RBAC. Resolved by Backend.

APIM Scale Out Failure

Sev 2 | Mitigated | CRI

Owner: maximagapov | Team: Backend | Created: Apr 30

Summary

Customer-reported scale-out operation failure. Investigated and mitigated by Backend on-call.

UNIFIED STRATEGIC | Intermittent Connection Failure

Sev 2 | Mitigated | CRI

Owner: savukyam | Team: Platform | Created: Apr 30

Summary

Strategic customer reporting intermittent connection failures. Investigated and mitigated by Platform on-call.

Unwanted log entries in API Management Application Insights

Sev 2 | Resolved | CRI

Owner: maximagapov | Team: Backend | Created: May 6

Summary

Customer reported unexpected log entries appearing in their Application Insights instance connected to APIM. Resolved same day.

[Public] AzurePortalWAWSAlert — Blades failed to load (May 6)

Sev 2 | Resolved | LSI

Owner: maximagapov | Created: May 6, 12:04 UTC | Resolved: May 6, 14:49 UTC (TTM: 2h43m)

Summary

7 users experienced ServiceBlade load failure. Confirmed transient via Grafana — connection issue to external service, not regression. Self-resolved. (See Repeated Alerts section above.)

[Public] AzurePortalWAWSAlert — Blades failed to load (May 5)

Sev 2 | Mitigated | LSI

Owner: javierbo | Team: AzurePortal | Created: May 5

Summary

Same pattern as above — transient blade failures, self-healed. Mitigated within 39 minutes.

Ownership Summary

Owner	Count	Key Themes
maximagapov	7	Developer Portal Sev1 (highest impact), Portal blades, AzSysLock, CRI cert chain, scale-out, App Insights
savukyam	5	SKUv1 activations, RP orchestration stalls, APPGW timeout, connection failures — all Platform/internal
ethanlao	2	Security: MCP endpoint, MSRC secrets leak
glfeokti	2	Certificate chain emerging issue, SKUv2 activation (App Service capacity)
alzaslon	1	MCP endpoint (API Center misroute)
brucemoe	1	GatewayV2 DotNetty race condition (AOAI)
tehnoonr	1	GatewayV2 MAPI emerging issue
javierbo	1	Portal blade transient (AzurePortal team)

Key Takeaways & Action Items

1. Highest customer impact: Developer Portal Registration failure (Sev1, 20+ cases, still accumulating). RCA and permanent fix urgently needed.
2. GatewayV2 stability: Two separate GatewayV2 bugs (HTTP.SYS MAPI 503 + DotNetty EncoderException). Both mitigated via V1 rollback. Fixes exist but rollout needs caution.
3. Certificate chain pattern: Incomplete cert after reimage generates CRIs. Platform should pre-cache certs or validate post-reimage.
4. Reduce toil: Portal Blades (~weekly, transient), SKUv1 Activation (~weekly, internal-only), EUAP orchestration stalls (~weekly). Adjust thresholds/severity.
5. SKUv2 alerting gap: WUS2 failures persisted 5 days before Sev2 fired. Need longer-lookback alert for earlier detection.
6. Security campaign: MCP Burndown alerts will continue. Dedicated TSG recommended. Current incidents are customer endpoints or misrouted API Center endpoints; no APIM service risk.