API Management • ServicingLoop

API MANAGEMENT

Servicing Loop Operation — Incident reviews, weekly handoffs, alert trends, and outage tracking for the APIM on-call team.

Review Questions — Each Sev2 CRI
1
Customer Complaint — What was the complaint? Business impact, service/SKU/region?
2
Product Experience — What did the customer see? Error messages, behaviors?
3
CSS Telemetry — What did CSS find? Kusto tables, queries, signals?
4
Diagnosis & Fix — Root cause, classification, mitigation steps?
5
Monitoring Miss — Could we have detected proactively?
6
Repairs — Product fixes, new telemetry, improved alerts, docs?
What’s in a Weekly Review
1
On-Call Rotation — Incoming/outgoing IM, US Sloop, EU Sloop
2
Active Incidents — Open Sev2+ needing attention
3
Emerging Issues — Patterns generating multiple CRIs
4
Repeated Alerts — Monitors firing 2+ times (noise/action)
5
Ownership & Trends — Who owned what, recurring themes
6
Key Takeaways — Action items for the coming week
Top Alerts by Severity — Last 7 Days
MonitorSevCountStatusRoot Cause Pattern
AzurePortalWAWSAlert — Blades failed to load SEV2 2 Resolved Transient external connectivity (noise)
AzureSecurityIRAlert — Exposed MCP Endpoint SEV2 2 Active MCPBurndown2 campaign (compliance)
AzureSecurityPackProd — AzSysLock CI violation SEV2 2 Mitigated Git binary code sign policy (audit only)
SKUv1 Activation SuccessRate Below 95% SEV2 2 Mitigated SQL ConflictingServerOperation — internal RnR only
Eternal Orchestration — Health Monitor Stalled SEV2 2 Mitigated SF partition resource exhaustion (EUAP only)
SKUv2 Customer Activation Below 95% (westus2) SEV2 1 Mitigated App Service P1v3 capacity exhaustion
Developer Portal Registration Fails SEV1 1 Mitigated Platform change — 20+ customer cases
Severity Breakdown (7d)
1
Sev 1: 1 incident (Developer Portal — escalated from Sev2)
20
Sev 2: 20 incidents (3 active, 14 mitigated, 3 resolved)
1
Sev 3: 1 emerging issue tracked (GatewayV2 MAPI)
Noise Reduction Opportunities
!
Portal Blades — Raise threshold from 5 users; fires weekly as false alarm
!
SKUv1 Activation — Filter RnR services from SLA; always internal-only
!
EUAP Orchestration — Add auto-restart logic; 8 incidents/month in canary
PIR Status Tracker
PIR Pending
Developer Portal Registration Failure
IcM: 788655236 · Sev1 · 20+ cases
Status: RCA requested by CSS · Owner: maximagapov
PIR Due: TBD
PIR Complete
GatewayV2 MAPI 503 After Upgrade
IcM: 789293664 · Sev3 · Mitigated
Root cause: HTTP.SYS hostname binding gap
Fix: PR shipping in next release
PIR Complete
DotNetty EncoderException (AOAI)
IcM: 789605450 · Sev2 · Mitigated
Root cause: GatewayV2 race condition in ContinueWith
Fix: PR #15280162 pending AOAI rollout