On-Call Rotation (Sloop)
Incident Manager
▶ INCOMING
Martin Dechev (Primary)
Vitalii Kurokhtin (Primary)
◀ OUTGOING
Neha Gupta (Primary)
Joaquin Vano Newman (Primary)
Rafal Mielowski (Primary)
US Sloop
▶ INCOMING
Nima Kamoosi (Primary)
Gleb Feoktistov (Backup)
◀ OUTGOING
Ethan Lao (Primary)
Nima Kamoosi (Backup)
EU Sloop
▶ INCOMING
Shubham Sharma (Primary)
Ondrej Oprala (Backup)
◀ OUTGOING
Maxim Agapov (Primary)
Macko Treder (Backup)
Active Incidents
3 open
Sev 2 · Active · CRI
Summary
Customer-reported intermittent HTTP 500 errors. Opened today, under active investigation by Backend on-call. No further details available yet (newly opened).
Root Cause
SCUBA MCPBurndown2 campaign flagged unauthenticated MCP endpoint on APIM's ServiceTree ID. Security compliance finding, not a service health issue.
Mitigation
DRI assessed: "These are customer endpoints, no action APIM can take." Needs AIM.Assess tagging per MCP TSG. ● MEDIUM recurrence: more burndown alerts expected.
Root Cause
Misrouted to APIM. Belongs to Azure API Center (ServiceId 574680a6). Two endpoints are the intentionally public mcp.azure.com registry. Others belong to ACA/App Service.
Mitigation
Tagged AIM.Exception:Public for registry endpoints. Requesting re-routing of non-API-Center endpoints to their respective owners. ● LOW recurrence
Emerging Issues
3 tracked
Sev 1 · Mitigated · 20+ Cases
Root Cause
Developer Portal user registration returning "User registration is not supported" after platform change. Multiple customers across regions affected. New support cases still being linked daily (latest May 6: apim-ynv-ai-lz-uat). Customers requesting rollbacks.
Mitigation
Marked mitigated (workaround documented), but new cases still arriving. RCA requested by multiple support engineers. Some customers need explicit service rollback.
Recurrence
● HIGH — Still accumulating. Root fix unclear. Highest customer impact this week.
Impact
20+ unique support cases. Multiple regions. Auto-escalated to Sev1. RCA timeline being demanded by CSS.
Sev 2 · Mitigated · Recurring CRIs
Root Cause
VNET-injected SKUv1 services lose intermediate SSL certs during OS reimage. Outbound TCP port 80 blocked in NSG prevents AIA cert download. Gateway serves incomplete chain → 502s from AFD/AppGW. Reboots use cached certs (fine); reimages need fresh download (breaks).
Mitigation
Customer fix: open outbound TCP 80 + Apply Network Settings. Transferred to Platform for permanent fix. ● HIGH recurrence — will continue with each reimage cycle for affected NSG configs.
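The port-80 dependency described above can be checked ahead of a reimage with a simple connectivity probe. The following is a minimal sketch, not the team's tooling: the host names are hypothetical placeholders, and the real AIA hosts would have to be read from the Authority Information Access extension of the certificates in use.

```python
import socket
from typing import Callable, Dict, Iterable, List


def can_reach(host: str, port: int = 80, timeout: float = 3.0) -> bool:
    """Return True if an outbound TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def probe_hosts(
    hosts: Iterable[str],
    port: int = 80,
    timeout: float = 3.0,
    probe: Callable[[str, int, float], bool] = can_reach,
) -> Dict[str, bool]:
    """Map each host to whether it is reachable on the given port."""
    return {host: probe(host, port, timeout) for host in hosts}


def blocked_hosts(results: Dict[str, bool]) -> List[str]:
    """Return the hosts that could not be reached, sorted by name."""
    return sorted(host for host, ok in results.items() if not ok)


if __name__ == "__main__":
    # Hypothetical AIA hosts; replace with the hosts from the actual cert chain.
    results = probe_hosts(["aia-one.example.com", "aia-two.example.com"])
    print("Blocked on TCP 80:", blocked_hosts(results))
```

Run from inside the affected VNET, a non-empty blocked list would suggest that a reimage could leave the gateway without its intermediate certificates.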
Root Cause
GatewayV2 + HTTP.SYS breaks MAPI endpoint. URL prefix registration excludes control plane hostnames by design (ControlPlaneForwarderFilter.cs lines 50–54), and management hosting failed to bind them. Code gap confirmed with HIGH confidence (5 independent evidence blocks).
Mitigation
GatewayV2 rolled back for all impacted services via ACIS feature flag. Fix in ProxyHostSettingsBuilder.ProcessSpecialCases shipping in next release. ● LOW recurrence: rollout halted.
Repeated Sev2 Alerts
5 monitors
Azure Portal Blades Failed to Load (AzurePortalWAWSAlert)
Sev 2 · Resolved · 4x / 30 days
Root Cause
Transient external connectivity issues cause 5–7 users to see ServiceBlade load failures, barely exceeding the 5-user threshold. Backend healthy. 95% of 5xx from test service apim-gfs-api-tst-centralus. DRI confirmed via Grafana: external connectivity, not regression.
Action Needed
● Fires ~weekly (noise)
All 4 recent occurrences = transient/false alarm. Raise threshold from 5 users or add sustained-failure criteria to stop Sev2 noise.
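The sustained-failure criterion suggested above could look roughly like the sketch below. It is illustrative only: the window count and the per-window input format are assumptions, and the real monitor's evaluation model is not described in this report.

```python
from typing import Sequence


def should_alert(
    affected_users_per_window: Sequence[int],
    threshold: int = 5,
    sustained_windows: int = 3,
) -> bool:
    """Fire only when the most recent windows all exceed the user threshold.

    affected_users_per_window: affected-user counts per evaluation window,
    oldest first. A single spike above the threshold does not fire.
    """
    if len(affected_users_per_window) < sustained_windows:
        return False
    recent = affected_users_per_window[-sustained_windows:]
    return all(count > threshold for count in recent)


if __name__ == "__main__":
    print(should_alert([0, 7, 0, 6]))  # isolated spikes
    print(should_alert([6, 7, 9]))     # sustained failures
```

With these assumed defaults, a transient blip of 7 users in one window stays quiet, while three consecutive windows above 5 users still page.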
SKUv1 Activation SuccessRate Below 95% SLA
Sev 2 · Mitigated · ~Weekly
Root Cause
Azure SQL ConflictingServerOperation: concurrent DB creation from internal RnR test services. All failures are devPortalRnr-*/rpRnr-*. Zero customer services affected. Self-recovered (transient).
Action Needed
● ~1/week across regions
Always internal-only. Filter RnR services from SLA calculation or lower severity for internal-only failures.
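Filtering RnR services out of the SLA calculation can be sketched as follows. The two name prefixes come from the root cause above; the record shape (service name plus success flag) and the function names are assumptions made for illustration, not the monitor's actual schema.

```python
from typing import Iterable, Optional, Tuple

# Prefixes of internal RnR test services, as named in the root cause above.
INTERNAL_PREFIXES = ("devPortalRnr-", "rpRnr-")


def is_internal(service_name: str) -> bool:
    """Return True for internal RnR test services."""
    return service_name.startswith(INTERNAL_PREFIXES)


def success_rate(
    activations: Iterable[Tuple[str, bool]],
    exclude_internal: bool = True,
) -> Optional[float]:
    """Compute the activation success rate, optionally skipping internal services.

    activations: (service_name, succeeded) pairs.
    Returns None when no activations remain after filtering.
    """
    considered = [
        ok
        for name, ok in activations
        if not (exclude_internal and is_internal(name))
    ]
    if not considered:
        return None
    return sum(considered) / len(considered)


if __name__ == "__main__":
    sample = [
        ("contoso-prod", True),
        ("devPortalRnr-01", False),
        ("rpRnr-eastus", False),
        ("fabrikam-api", True),
    ]
    print(success_rate(sample))                          # customer-only view
    print(success_rate(sample, exclude_internal=False))  # unfiltered view
```

Returning None for an empty sample keeps an internal-only hour from being reported as either 0% or 100%.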
RP Orchestration: Health Monitor Stalled (EUAP Only)
Sev 2 · Mitigated · 8x / 30 days
Root Cause
HealthMonitorCheckApiServicesHealth SKUv1 track stalls on SF partition. Gradual decline (2000/hr → 0) suggests resource exhaustion from prior spike. SKUv2 track unaffected. Exact mechanism undiagnosable (need SF partition metrics not in Kusto).
Action Needed
Mitigated via Geneva Action "Control Health Monitoring on Platform" (stop+start). ● Weekly in EUAP
Needs: SF partition telemetry + auto-restart logic. Root cause still undiagnosed.
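One possible shape for the auto-restart logic is sketched below. It only detects the "activity drops to zero" pattern described in the root cause and calls a caller-supplied restart hook. The hook is hypothetical and stands in for the manual stop+start Geneva Action; the hourly-count input and the two-hour default are likewise assumptions.

```python
from typing import Callable, Sequence


def is_stalled(hourly_counts: Sequence[int], zero_hours: int = 2) -> bool:
    """Return True when recent hours are all zero after earlier activity.

    hourly_counts: health checks executed per hour, oldest first.
    """
    if len(hourly_counts) <= zero_hours:
        return False
    tail = hourly_counts[-zero_hours:]
    head = hourly_counts[:-zero_hours]
    return all(c == 0 for c in tail) and any(c > 0 for c in head)


def maybe_restart(
    hourly_counts: Sequence[int],
    restart: Callable[[], None],
    zero_hours: int = 2,
) -> bool:
    """Invoke the restart hook if the monitor looks stalled; report whether it ran."""
    if is_stalled(hourly_counts, zero_hours):
        restart()
        return True
    return False


if __name__ == "__main__":
    # Hypothetical hook; a real one would trigger the stop+start action.
    maybe_restart([2000, 1200, 300, 0, 0], restart=lambda: print("restart requested"))
```

This treats the symptom only. It would not replace the SF partition telemetry needed to diagnose the underlying cause.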
AzureSecurityIR: Exposed MCP Endpoints (MCPBurndown2)
Sev 2 · Active
Summary
Microsoft-wide MCP Burndown campaign. One incident covers customer endpoints (no action). The other is a misroute from API Center involving the intentionally public registry. Action: tag AIM.Assess, close per TSG, create dedicated TSG for future alerts.
AzSysLock: Git Binary Code Sign Policy Violations (Audited Only)
Sev 2 · Mitigated
Summary
AzureSecurityPack detected Git binaries violating code sign policy in audit mode. Non-blocking, no service impact. ● MEDIUM — will recur until Git binaries updated or allowlisted on affected VMs.
Other Mitigated & Resolved Incidents
10 closed
Sev 2 · Mitigated · LSI
Root Cause
App Service P1v3 capacity exhaustion in MWH2 datacenter. HTTP 409: "No available instances." Success rate dropped to 56.25%. Failures started May 1 but alert only fired May 5 (5-day gap).
Mitigation
App Service (Splinter Twin) team increased capacity. 91% failures were internal pool replenishment. Only 4 customer services affected. Alerting gap needs repair item.
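A longer-lookback check is one way to approach the alerting gap noted above. The sketch below assumes, without confirmation from this report, that the delay came from per-hour volume being too low for a short window to evaluate. The window sizes, the 0.95 SLA, and the minimum-attempts floor are all illustrative parameters.

```python
from typing import Sequence, Tuple

Sample = Tuple[int, int]  # (successes, attempts) for one hour


def below_sla(
    samples: Sequence[Sample],
    window: int,
    sla: float = 0.95,
    min_attempts: int = 10,
) -> bool:
    """Return True if the aggregate success rate over the window is below SLA.

    Windows with fewer than min_attempts attempts are treated as inconclusive.
    """
    recent = samples[-window:]
    successes = sum(s for s, _ in recent)
    attempts = sum(a for _, a in recent)
    if attempts < min_attempts:
        return False
    return successes / attempts < sla


def should_alert_multiwindow(
    samples: Sequence[Sample],
    short_window: int = 1,
    long_window: int = 24,
    sla: float = 0.95,
    min_attempts: int = 10,
) -> bool:
    """Alert when either the short or the long lookback window breaches SLA."""
    return below_sla(samples, short_window, sla, min_attempts) or below_sla(
        samples, long_window, sla, min_attempts
    )


if __name__ == "__main__":
    # 24 hours of low volume: 2 attempts per hour, half of them failing.
    hourly = [(1, 2)] * 24
    print(below_sla(hourly, window=1))        # too few attempts to judge
    print(should_alert_multiwindow(hourly))   # long window has enough data
```

Under these assumptions, the one-hour window never accumulates enough attempts to fire, while the 24-hour aggregate crosses the floor and flags the degradation within the first day.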
Root Cause
GatewayV2 race condition in ContinueWith callback for incomplete payload transfer. Bug 25298069. Fix exists in PR #15280162 but not yet rolled out to AOAI services.
Mitigation
Rolled back to Gateway V1 for affected services. Code fix pending deployment to AOAI fleet.
Summary
Customer CRI linked to cert chain emerging issue (791563622). Incomplete cert on default domain post-upgrade + traffic imbalance on instance _23. Related to VNET/NSG/reimage pattern.
Sev 2 · Mitigated · CRI
Summary
Customer-reported 504 timeout errors from Application Gateway to APIM backend. Investigated and mitigated by Platform on-call.
Sev 2 · Resolved · MSRC
Summary
MSRC report: PATCH endpoints (authorizationServers, openidConnectProviders, subscriptions) return cleartext secrets in response to users with RBAC. Resolved by Backend.
Sev 2 · Mitigated · CRI
Summary
Customer-reported scale-out operation failure. Investigated and mitigated by Backend on-call.
Sev 2 · Mitigated · CRI
Summary
Strategic customer reporting intermittent connection failures. Investigated and mitigated by Platform on-call.
Sev 2 · Resolved · CRI
Summary
Customer reported unexpected log entries appearing in their Application Insights instance connected to APIM. Resolved same day.
Sev 2 · Resolved · LSI
Summary
7 users experienced ServiceBlade load failure. Confirmed transient via Grafana — connection issue to external service, not regression. Self-resolved. (See Repeated Alerts section above.)
Sev 2 · Mitigated · LSI
Summary
Same pattern as above — transient blade failures, self-healed. Mitigated within 39 minutes.
Ownership Summary
| Owner | Count | Key Themes |
|---|---|---|
| maximagapov | 7 | Developer Portal Sev1 (highest impact), Portal blades, AzSysLock, CRI cert chain, scale-out, App Insights |
| savukyam | 5 | SKUv1 activations, RP orchestration stalls, APPGW timeout, connection failures — all Platform/internal |
| ethanlao | 2 | Security: MCP endpoint, MSRC secrets leak |
| glfeokti | 2 | Certificate chain emerging issue, SKUv2 activation (App Service capacity) |
| alzaslon | 1 | MCP endpoint (API Center misroute) |
| brucemoe | 1 | GatewayV2 DotNetty race condition (AOAI) |
| tehnoonr | 1 | GatewayV2 MAPI emerging issue |
| javierbo | 1 | Portal blade transient (AzurePortal team) |
Key Takeaways & Action Items
1. Highest customer impact: Developer Portal Registration failure (Sev1, 20+ cases, still accumulating). RCA and permanent fix urgently needed.
2. GatewayV2 stability: Two separate GatewayV2 bugs (HTTP.SYS MAPI 503 + DotNetty EncoderException). Both mitigated via V1 rollback. Fixes exist but rollout needs caution.
3. Certificate chain pattern: Incomplete cert after reimage generates CRIs. Platform should pre-cache certs or validate post-reimage.
4. Reduce toil: Portal Blades (~weekly, transient), SKUv1 Activation (~weekly, internal-only), EUAP orchestration stalls (~weekly). Adjust thresholds/severity.
5. SKUv2 alerting gap: WUS2 failures persisted 5 days before Sev2 fired. Need longer-lookback alert for earlier detection.
6. Security campaign: MCP Burndown alerts will continue. Dedicated TSG recommended. Current incidents are customer endpoints, so there is no APIM service risk.