API Management • ServicingLoop
Servicing Loop Operation — Incident reviews, weekly handoffs, alert trends, and outage tracking for the APIM on-call team.
Review Questions — Each Sev2 CRI
1. Customer Complaint: What was the complaint? Business impact, service/SKU/region?
2. Product Experience: What did the customer see? Error messages, behaviors?
3. CSS Telemetry: What did CSS find? Kusto tables, queries, signals?
4. Diagnosis & Fix: Root cause, classification, mitigation steps?
5. Monitoring Miss: Could we have detected this proactively?
6. Repairs: Product fixes, new telemetry, improved alerts, docs?
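To make the six questions easier to track per incident, a review can be held as a small record with one field per question. The following is a minimal sketch only; the class name `CriReview`, its field names, and the helper `unanswered` are hypothetical and not part of any APIM tooling.

```python
from dataclasses import dataclass, fields


@dataclass
class CriReview:
    """One Sev2 CRI review; each field answers one of the six questions.

    All names here are illustrative assumptions, not an existing schema.
    """
    customer_complaint: str = ""
    product_experience: str = ""
    css_telemetry: str = ""
    diagnosis_and_fix: str = ""
    monitoring_miss: str = ""
    repairs: str = ""


def unanswered(review: CriReview) -> list[str]:
    """Return the field names that are still blank, in question order."""
    return [f.name for f in fields(review) if not getattr(review, f.name).strip()]


# Placeholder answers only; real reviews would carry the actual findings.
review = CriReview(
    customer_complaint="placeholder: complaint, impact, service/SKU/region",
    diagnosis_and_fix="placeholder: root cause and mitigation",
)
print(unanswered(review))
```

A review is ready to close once `unanswered` returns an empty list.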
What’s in a Weekly Review
1. On-Call Rotation: Incoming/outgoing IM, US Sloop, EU Sloop
2. Active Incidents: Open Sev2+ needing attention
3. Emerging Issues: Patterns generating multiple CRIs
4. Repeated Alerts: Monitors firing 2+ times (noise/action)
5. Ownership & Trends: Who owned what, recurring themes
6. Key Takeaways: Action items for the coming week
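The "Repeated Alerts" item (monitors firing 2+ times) amounts to a count per monitor over the review window. A rough sketch follows, assuming each incident is a dict with a `monitor` key; that record shape and the function name `repeated_monitors` are assumptions made for illustration, not an IcM export format.

```python
from collections import Counter


def repeated_monitors(incidents, min_count=2):
    """Return {monitor: count} for monitors that fired at least min_count times."""
    counts = Counter(incident["monitor"] for incident in incidents)
    return {monitor: n for monitor, n in counts.items() if n >= min_count}


# Illustrative records; the dict shape is hypothetical.
incidents = [
    {"monitor": "SKUv1 Activation SuccessRate Below 95%"},
    {"monitor": "SKUv1 Activation SuccessRate Below 95%"},
    {"monitor": "Developer Portal Registration Fails"},
]
print(repeated_monitors(incidents))
```

Each monitor returned is then triaged as either noise (tune or suppress) or a real signal that needs an action item.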
Top Alerts by Severity — Last 7 Days
| Monitor | Sev | Count | Status | Root Cause Pattern |
|---|---|---|---|---|
| AzurePortalWAWSAlert - Blades failed to load | SEV2 | 2 | Resolved | Transient external connectivity (noise) |
| AzureSecurityIRAlert - Exposed MCP Endpoint | SEV2 | 2 | Active | MCPBurndown2 campaign (compliance) |
| AzureSecurityPackProd - AzSysLock CI violation | SEV2 | 2 | Mitigated | Git binary code sign policy (audit only) |
| SKUv1 Activation SuccessRate Below 95% | SEV2 | 2 | Mitigated | SQL ConflictingServerOperation, internal RnR only |
| Eternal Orchestration - Health Monitor Stalled | SEV2 | 2 | Mitigated | SF partition resource exhaustion (EUAP only) |
| SKUv2 Customer Activation Below 95% (westus2) | SEV2 | 1 | Mitigated | App Service P1v3 capacity exhaustion |
| Developer Portal Registration Fails | SEV1 | 1 | Mitigated | Platform change, 20+ customer cases |
Severity Breakdown (7d)
Sev 1: 1 incident (Developer Portal, escalated from Sev2)
Sev 2: 20 incidents (3 active, 14 mitigated, 3 resolved)
Sev 3: 1 emerging issue tracked (GatewayV2 MAPI)
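A breakdown like the one above can be tallied from a flat list of (severity, status) pairs. This is a sketch under assumptions: the tuple shape and the function name `severity_breakdown` are hypothetical. The sample list reproduces the counts stated above; the status of the Sev 1 and Sev 3 entries is not given in the breakdown itself, so `"mitigated"` is assumed for both.

```python
from collections import Counter


def severity_breakdown(incidents):
    """Map each severity to a Counter of incident statuses."""
    breakdown = {}
    for severity, status in incidents:
        breakdown.setdefault(severity, Counter())[status] += 1
    return breakdown


# Counts taken from the 7-day breakdown; Sev 1 and Sev 3 statuses are assumed.
incidents = (
    [(1, "mitigated")]
    + [(2, "active")] * 3
    + [(2, "mitigated")] * 14
    + [(2, "resolved")] * 3
    + [(3, "mitigated")]
)

for severity, statuses in sorted(severity_breakdown(incidents).items()):
    total = sum(statuses.values())
    detail = ", ".join(f"{n} {status}" for status, n in sorted(statuses.items()))
    print(f"Sev {severity}: {total} ({detail})")
```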
Noise Reduction Opportunities
Portal Blades: Raise threshold from 5 users; fires weekly as a false alarm
SKUv1 Activation: Filter RnR services from SLA; always internal-only
EUAP Orchestration: Add auto-restart logic; 8 incidents/month in canary
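The SKUv1 suggestion (filter internal-only services out of the success-rate signal) could look roughly like the sketch below. It assumes each activation attempt is a dict with boolean `succeeded` and `internal` keys; those field names, the function name, and the sample numbers are hypothetical and only illustrate how excluding internal traffic changes the computed rate against the 95% threshold.

```python
def activation_success_rate(attempts, exclude_internal=True):
    """Return the fraction of successful attempts, or None if none qualify.

    When exclude_internal is True, attempts flagged as internal are ignored.
    """
    considered = [
        a for a in attempts if not (exclude_internal and a["internal"])
    ]
    if not considered:
        return None
    return sum(1 for a in considered if a["succeeded"]) / len(considered)


# Made-up sample: 20 successful customer attempts, 2 failed internal attempts.
attempts = (
    [{"succeeded": True, "internal": False}] * 20
    + [{"succeeded": False, "internal": True}] * 2
)

THRESHOLD = 0.95  # alert threshold named in the monitor title
raw = activation_success_rate(attempts, exclude_internal=False)
filtered = activation_success_rate(attempts)
print(raw < THRESHOLD, filtered < THRESHOLD)
```

In this sample the unfiltered rate falls below the threshold while the filtered rate does not, which is the false-alarm pattern the suggestion targets.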
PIR Status Tracker
PIR Pending
Developer Portal Registration Failure
IcM: 788655236 · Sev1 · 20+ cases
Status: RCA requested by CSS · Owner: maximagapov
PIR Due: TBD
PIR Complete
GatewayV2 MAPI 503 After Upgrade
IcM: 789293664 · Sev3 · Mitigated
Root cause: HTTP.SYS hostname binding gap
Fix: PR shipping in next release
PIR Complete
DotNetty EncoderException (AOAI)
IcM: 789605450 · Sev2 · Mitigated
Root cause: GatewayV2 race condition in ContinueWith
Fix: PR #15280162 pending AOAI rollout