| Incident / DRI / Status | Questions & Answers | Feedback & Work Items |
|---|---|---|
| INC 788655236<br>Sev2, Mitigated, Outage Declared<br>Title: Dev Portal Registration Fails - "User registration is not supported."<br>DRI: Maxim A<br>Created: 2026-04-30<br>Impact: 2026-04-22 12:27<br>Tickets: 2604290050002739, 2604300050002605, 2604300030005286 | 1. Customer Complaint: Multiple customers - Dev Portal registration failing: "User registration is not supported." Entra ID users affected. 3 support tickets.<br>2. What customer saw: 400 ValidationError on registration via Entra ID. Even with valid Entra ID sign-in configured, registration blocked.<br>3. CSS telemetry: Regression in 0.51: PR enforcing registrationEnabled had a side effect. Disabling basic auth set registrationEnabled=false, blocking ALL registration including Entra ID. Impact started Apr 22 - 8 days before CRI.<br>4. Diagnosis & Fix: Developer Portal code bug in 0.51. Incorrectly linked basic auth status with overall registration. Ethan mitigated Apr 30 16:00 UTC via automated rollback. CSS filed "APIMSKUV1Rollback" ICMs.<br>5. Monitoring Miss? Yes - no test for non-basic-auth registration. 8-day detection gap.<br>6. Repairs: Decouple registrationEnabled from basic auth. Fix in 0.52. Monitor 400s on registration. Automated test coverage. | Feedback: (placeholder)<br>Work Items:<br>In Review: Bug 37757544 - Fix registrationEnabled blocks Entra ID<br>To Do: Task 37757611 - DevPortal automated tests |
| INC 51000001003804<br>Sev2, Mitigated<br>Title: APIM Scale out failure<br>DRI: Maxim A<br>Created: 2026-04-30<br>Service: apim-ads-cus-entbusops-prd-001 | 1. Customer Complaint: APIM scale-out operation failing.<br>2. What customer saw: Unable to provision additional units.<br>3. CSS telemetry: All nodes reporting 50x. Role instance count deviating. ProxyRequest: elevated 500s across all VMs. Determined not an APIM platform issue - customer/backend-side 500s.<br>4. Diagnosis & Fix: Not an APIM issue. Maxim A determined backend-side. Mitigated Apr 30.<br>5. Monitoring Miss? No - not a platform issue.<br>6. Repairs: N/A. | Feedback: (placeholder)<br>Work Items: None |
| INC 51000001002629<br>Sev2, Mitigated<br>Title: UNIFIED STRATEGIC \| Intermittent connection failure<br>DRI: Ethan<br>Created: 2026-04-30<br>Customer: Strategic/Unified | 1. Customer Complaint: Strategic/Unified customer - intermittent connection failures.<br>2. What customer saw: Requests intermittently failing. Failed requests absent from APIM GatewayLogs (never reached gateway).<br>3. CSS telemetry: ClientConnectionFailure errors concentrated on gwhost_2 ("Bad VM Variant B" - processing traffic at normal volume but generating errors). OS auto-upgrade correlated with VM restart.<br>4. Diagnosis & Fix: Faulty VM (Variant B). Ethan identified gwhost_2 as error source via per-VM analysis. Transferred to Platform for RCA. Mitigated Apr 30.<br>5. Monitoring Miss? Yes - per-VM error distribution not checked by monitoring. A faulty VM can process normal volume while generating errors.<br>6. Repairs: New tool: GetProxyErrorsByRoleInstance. Per-VM error analysis in investigation skill. "Bad VM Variant B" pattern documented. | Feedback: (placeholder)<br>Work Items: None |
| INC 51000001005278<br>Sev2, Mitigated<br>Title: Requests from APPGW to APIM - 504 Timeout<br>DRI: Ethan<br>Created: 2026-05-01<br>Service: apim-prod-eu-01 (Premium, Internal VNET, WEU) | 1. Customer Complaint: 504 timeout from Application Gateway to APIM.<br>2. What customer saw: HTTP 504 timeout. Requests not completing. Failed requests absent from APIM logs.<br>3. CSS telemetry: gwhost_33: ConnectionIdle errors in HttpSys + BackendConnectionFailures. ILB backend health unhealthy from 2026-04-28 23:15 UTC. Single VM network degradation.<br>4. Diagnosis & Fix: Single VM Network Degradation. Ethan: ReplaceVM on gwhost_33 (orchestration: apim-prod-eu-01_ManageCompute_63b54556). Mitigated May 1. Customer confirmed.<br>5. Monitoring Miss? Yes - upstream APPGW/ILB 504 from unhealthy VM not detected proactively.<br>6. Repairs: Enhanced "Single VM Network Degradation" pattern with APPGW/ILB/504 keywords. ConnectionIdle/HttpSys signature. ILB health check in investigation. | Feedback: (placeholder)<br>Work Items: None |
| INC 789605450<br>Sev2, Active<br>Title: Intermittent 500s DotNetty.EncoderException (AOAI)<br>DRI: Ethan<br>Created: 2026-05-01<br>Impact: 2026-04-30<br>Customer: AOAI (Partner)<br>Service: cognitivesecprod-02 + all AOAI | 1. Customer Complaint: AOAI team - intermittent 500 from APIM with DotNetty.Codecs.EncoderException in request forwarding. All APIM instances. Rate up to ~5%.<br>2. What customer saw: HTTP 500: Source: request-forwarder, Reason: GatewayFailure, Message: DotNetty.Codecs.EncoderException. No backend logs. Retries fail. Excel data cleaning (UK South) SLA below 95%.<br>3. CSS telemetry: Not model-specific. Multiple deployments/regions/subs. Correlates with traffic volume. 10-30 days ongoing with recent spike. Errors in request-forwarding stage (backend invocation).<br>4. Diagnosis & Fix: ACTIVE - hypothesis: backend connection/encoding issue, HTTP/2 or DotNetty pipeline. Hyena team engagement planned. Ethan/Joaquin investigating.<br>5. Monitoring Miss? Yes - existed 10-30 days before ICM. Below alert thresholds.<br>6. Repairs: TBD pending root cause. Track DotNetty errors as distinct failure class. Anomaly detection. | Feedback: (placeholder)<br>Work Items:<br>Done: Bug 25298069 - DotNetty EncoderException (Gateway v2) |
| INC 51000001005891<br>Sev2, Mitigated<br>Title: [CRI][LEGRAND] Incomplete cert chain + traffic imbalance<br>DRI: Maxim A<br>Created: 2026-05-02<br>Customer: LEGRAND FRANCE SA<br>Related: Recurrence of INC 51000000994948 | 1. Customer Complaint: LEGRAND FRANCE SA - APIM default domain incomplete cert chain + unbalanced traffic on instance _23 after May 1 07:00 UTC.<br>2. What customer saw: TLS failures for cert chain validation. 64% traffic on one instance. Behind AFD causing 502s.<br>3. CSS telemetry: Same root cause as INC 51000000994948: NSG blocking outbound TCP port 80 preventing CRL/AIA after OS upgrade. Storage:443 blocked causing dism.exe hangs and bootstrapper failure. ILB hash skew after VMs dropped.<br>4. Diagnosis & Fix: Recurrence. Maxim A: (1) disabled traffic, (2) scaled up, (3) waited for VMs, (4) fixed wrong DNS record for mgmt endpoint. Customer using second service. Mitigated May 2.<br>5. Monitoring Miss? Yes - second occurrence. First fix not comprehensive. No cert chain or NSG validation.<br>6. Repairs: Permanent cert chain fix needed. VNET dependency checks. NSG validation for required outbound ports. | Feedback: (placeholder)<br>Work Items: None |
| INC 783036395<br>Sev2, Active, Emerging Issue<br>Title: Suspended Services do not recover after subscription renewal<br>DRI: Omar<br>Created: 2026-04-20<br>Impact: Multiple customers stuck for days | 1. Customer Complaint: Customer services stuck and not recovering after subscriptions were re-enabled. Services not coming back for days.<br>2. What customer saw: Service shown as Deleted/Suspended in Portal for days after the subscription was unsuspended. No actual error or banner displayed to the customer.<br>3. CSS telemetry: Services not being created, no telemetry at all for 3 different CRIs. CSS opened an emerging issue. No progress visible in orchestration logs.<br>4. Diagnosis & Fix: Undelete queue blocked by poison pills. With SRE Agent help, identified that<br>5. Monitoring Miss? Yes - multiple gaps:<br>6. Repairs: • Done: Skip hotfix rolled out behind beta feature (poison pill bypass) | Feedback: (placeholder)<br>Work Items:<br>Done: Skip hotfix (poison pill bypass) - rolled out behind beta feature<br>To Do: Cleanup quota-exhausted SKUv2 preprovisioning RGs<br>To Do: Decouple undelete from SubscriptionLifecycleOrchestration<br>To Do: SLA metric / Alert on undelete consistently failing |
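
The repair listed for INC 788655236 ("Decouple registrationEnabled from basic auth") can be illustrated with a minimal sketch. This is not the Developer Portal's actual code: `PortalConfig`, `can_register`, `disable_basic_auth`, and the provider names are hypothetical, chosen only to show the intended behavior, where turning off basic auth leaves the overall registration switch untouched so Entra ID sign-up keeps working.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PortalConfig:
    """Hypothetical, simplified view of the portal's sign-up settings."""
    registration_enabled: bool  # overall sign-up switch
    basic_auth_enabled: bool    # username/password provider
    entra_id_enabled: bool      # Entra ID provider


def disable_basic_auth(config: PortalConfig) -> PortalConfig:
    # Only the basic-auth provider is toggled. The 0.51 regression described
    # in the table also forced registration_enabled to False at this point.
    return replace(config, basic_auth_enabled=False)


def can_register(config: PortalConfig, provider: str) -> bool:
    # Registration needs the overall switch AND the specific provider.
    if not config.registration_enabled:
        return False
    if provider == "basic":
        return config.basic_auth_enabled
    if provider == "entra_id":
        return config.entra_id_enabled
    return False  # unknown providers are rejected


config = disable_basic_auth(PortalConfig(True, True, True))
print(can_register(config, "entra_id"), can_register(config, "basic"))
```

A regression test along these lines (disable basic auth, then assert Entra ID registration still succeeds) would cover the "no test for non-basic-auth registration" gap noted under Monitoring Miss.
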

| Theme | Incidents | Description |
|---|---|---|
| Version 0.51 Regressions | 787095677, 788655236, 775932060 | Grandfathered limits wiped + Dev Portal registration broken. ~30+ rollback CRIs generated. Largest customer impact. |
| Certificate Lifecycle (5) | 51000000976727, 21000000984572, 51000000994948, 51000001005891, 21000000995987 | Workspace certs not renewed, default domain incomplete chains (recurring x2), managed cert expiry. Systemic gap. |
| ForceUpgrade Bypass | 777722043, 778045793 | Feature flag bypassed SDP, deployed unsupported version. AOAI/Cognitive Services outage. |
| DotNetty Encoder (ACTIVE) | 789605450 | Intermittent 500s in request forwarding, AOAI-wide. 10-30 day steady-state before detection. |
| Capacity / Quota (3) | 779770640, 780717094, 51000001003804 | DNS quota, prepool exhaustion (~3 weeks undetected), scale-out. No proactive monitoring. |
| Per-VM Health (4) | 51000000976163, 51000000976845, 51000001002629, 51000001005278 | Infra degradation undetected by probes. Auto-healing slow (~16h). "Bad VM Variant B" pattern. |
| Stuck Operations (2) | 51000000982831, 783036395 | Update stuck 6+ hours (VMSS + Redis deadlock). Suspended services not recovering (undelete backlog). |
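
The "Bad VM Variant B" pattern in the Per-VM Health theme (a VM that serves normal traffic volume while producing a disproportionate share of errors) suggests comparing each VM's error rate against the fleet rather than looking at totals. The sketch below is one possible way to do that and is not the GetProxyErrorsByRoleInstance tool itself: the function name, thresholds, and sample counts are all assumptions made up for illustration.

```python
from statistics import median


def flag_bad_vms(per_vm, ratio_threshold=5.0, min_requests=100, rate_floor=0.001):
    """Return role instances whose error rate is far above the fleet median.

    per_vm maps a role instance name to (requests, errors).
    All thresholds are illustrative assumptions, not tuned values.
    """
    # Skip VMs with too little traffic to give a meaningful rate.
    rates = {
        vm: errors / requests
        for vm, (requests, errors) in per_vm.items()
        if requests >= min_requests
    }
    if len(rates) < 2:
        return []  # nothing to compare against
    # The floor keeps a near-zero median from flagging every VM.
    baseline = max(median(rates.values()), rate_floor)
    return sorted(vm for vm, rate in rates.items() if rate > ratio_threshold * baseline)


# Made-up counts: similar request volume on every VM, errors concentrated on one.
sample = {
    "gwhost_0": (10000, 5),
    "gwhost_1": (9800, 4),
    "gwhost_2": (10100, 400),
}
print(flag_bad_vms(sample))  # ['gwhost_2']
```

Using the median as the baseline means a single faulty VM cannot pull the reference rate up and hide itself, which is the failure mode aggregate error-rate alerts missed in these incidents.
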
Generated by Azure SRE Agent - May 5, 2026