WEEK 4 - April 28 - May 4 | Sev2 CRI: 7 | RAs: TBD

Incident / DRI / Status | Questions & Answers | Feedback & Work Items
INC 788655236
Sev2 Mitigated
Outage Declared

Title: Dev Portal Registration Fails - "User registration is not supported."
DRI: Maxim A
Created: 2026-04-30
Impact: 2026-04-22 12:27
Tickets: 2604290050002739, 2604300050002605, 2604300030005286

1. Customer Complaint

Multiple customers - Dev Portal registration failing: "User registration is not supported." Entra ID users affected. 3 support tickets.

2. What customer saw

400 ValidationError on registration via Entra ID. Even with valid Entra ID sign-in configured, registration blocked.

3. CSS telemetry

Regression in 0.51: PR enforcing registrationEnabled had side effect. Disabling basic auth set registrationEnabled=false, blocking ALL registration including Entra ID. Impact started Apr 22 - 8 days before CRI.

4. Diagnosis & Fix

Developer Portal Code Bug in 0.51. Incorrectly linked basic auth status with overall registration. Ethan mitigated Apr 30 16:00 UTC via automated rollback. CSS filed "APIMSKUV1Rollback" ICMs.

5. Monitoring Miss?

Yes - no test for non-basic-auth registration. 8-day detection gap.

6. Repairs

Decouple registrationEnabled from basic auth. Fix in 0.52. Monitor 400s on registration. Automated test coverage.
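The decoupling repair can be sketched as follows. This is a minimal illustration, not the Developer Portal's actual code: the `PortalAuthConfig` type, its field names, and both functions are hypothetical, and the "regressed" function only approximates the behavior described in the telemetry section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortalAuthConfig:
    """Hypothetical settings; names are illustrative, not the portal schema."""
    basic_auth_enabled: bool
    entra_id_enabled: bool
    registration_enabled: bool  # explicit operator setting, never derived


def registration_allowed_regressed(cfg: PortalAuthConfig) -> bool:
    # Approximate shape of the 0.51 behavior: disabling basic auth
    # effectively disabled registration for every provider.
    return cfg.basic_auth_enabled


def registration_allowed_decoupled(cfg: PortalAuthConfig, provider: str) -> bool:
    # Registration follows its own flag plus the state of the provider
    # the user is actually registering with.
    if not cfg.registration_enabled:
        return False
    provider_enabled = {
        "basic": cfg.basic_auth_enabled,
        "entra_id": cfg.entra_id_enabled,
    }
    return provider_enabled.get(provider, False)


cfg = PortalAuthConfig(
    basic_auth_enabled=False, entra_id_enabled=True, registration_enabled=True
)
print(registration_allowed_regressed(cfg))              # False
print(registration_allowed_decoupled(cfg, "entra_id"))  # True
print(registration_allowed_decoupled(cfg, "basic"))     # False
```

The same three cases could serve as the starting point for the non-basic-auth registration test named in the monitoring miss.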

Feedback: (placeholder)
Work Items:
Bug 37757544 - Fix registrationEnabled blocks Entra ID
To Do Task 37757611 - DevPortal automated tests
INC 51000001003804
Sev2 Mitigated

Title: APIM Scale out failure
DRI: Maxim A
Created: 2026-04-30
Service: apim-ads-cus-entbusops-prd-001

1. Customer Complaint

APIM scale-out operation failing.

2. What customer saw

Unable to provision additional units.

3. CSS telemetry

All nodes reporting 50x. Role instance count deviating. ProxyRequest: elevated 500s across all VMs. Determined not an APIM platform issue - customer/backend-side 500s.

4. Diagnosis & Fix

Not APIM issue. Maxim A determined backend-side. Mitigated Apr 30.

5. Monitoring Miss?

No - not platform issue.

6. Repairs

N/A.

Feedback: (placeholder)
Work Items:
None
INC 51000001002629
Sev2 Mitigated

Title: UNIFIED STRATEGIC | Intermittent connection failure
DRI: Ethan
Created: 2026-04-30
Customer: Strategic/Unified

1. Customer Complaint

Strategic/Unified customer - intermittent connection failures.

2. What customer saw

Requests intermittently failing. Failed requests absent from APIM GatewayLogs (never reached gateway).

3. CSS telemetry

ClientConnectionFailure errors concentrated on gwhost_2 ("Bad VM Variant B" - processing traffic at normal volume but generating errors). OS auto-upgrade correlated with VM restart.

4. Diagnosis & Fix

Faulty VM (Variant B). Ethan identified gwhost_2 as error source via per-VM analysis. Transferred to Platform for RCA. Mitigated Apr 30.

5. Monitoring Miss?

Yes - per-VM error distribution not checked by monitoring. Faulty VM can process normal volume while generating errors.

6. Repairs

New tool: GetProxyErrorsByRoleInstance. Per-VM error analysis in investigation skill. "Bad VM Variant B" pattern documented.
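A rough sketch of the per-VM analysis idea, assuming request records are available as (role instance, is-error) pairs. This does not reproduce GetProxyErrorsByRoleInstance; the function names, record shape, and thresholds below are all hypothetical.

```python
from collections import Counter


def errors_by_role_instance(records):
    """records: iterable of (role_instance, is_error) tuples (assumed shape)."""
    totals, errors = Counter(), Counter()
    for instance, is_error in records:
        totals[instance] += 1
        if is_error:
            errors[instance] += 1
    return {inst: (errors[inst], totals[inst]) for inst in totals}


def suspect_instances(stats, min_share=0.8, min_errors=10):
    """Return instances that account for most of the errors.

    min_share and min_errors are illustrative thresholds, not tuned values.
    """
    total_errors = sum(err for err, _ in stats.values())
    if total_errors < min_errors:
        return []
    return sorted(
        inst for inst, (err, _) in stats.items() if err / total_errors >= min_share
    )


# Three VMs at the same volume; one produces nearly all the errors.
records = []
for inst, err in [("gwhost_0", 1), ("gwhost_1", 1), ("gwhost_2", 50)]:
    records += [(inst, True)] * err + [(inst, False)] * (1000 - err)

stats = errors_by_role_instance(records)
print(stats["gwhost_2"])         # (50, 1000)
print(suspect_instances(stats))  # ['gwhost_2']
```

Comparing error share rather than request volume is what lets this catch a VM that is serving normal traffic while failing.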

Feedback: (placeholder)
Work Items:
None
INC 51000001005278
Sev2 Mitigated

Title: Requests from APPGW to APIM - 504 Timeout
DRI: Ethan
Created: 2026-05-01
Service: apim-prod-eu-01 (Premium, Internal VNET, WEU)

1. Customer Complaint

504 timeout from Application Gateway to APIM.

2. What customer saw

HTTP 504 timeout. Requests not completing. Failed requests absent from APIM logs.

3. CSS telemetry

gwhost_33: ConnectionIdle errors in HttpSys + BackendConnectionFailures. ILB backend health unhealthy from 2026-04-28 23:15 UTC. Single VM network degradation.

4. Diagnosis & Fix

Single VM Network Degradation. Ethan: ReplaceVM on gwhost_33 (orchestration: apim-prod-eu-01_ManageCompute_63b54556). Mitigated May 1. Customer confirmed.

5. Monitoring Miss?

Yes - upstream APPGW/ILB 504 from unhealthy VM not detected proactively.

6. Repairs

Enhanced "Single VM Network Degradation" pattern with APPGW/ILB/504 keywords. ConnectionIdle/HttpSys signature. ILB health check in investigation.

Feedback: (placeholder)
Work Items:
None
INC 789605450
Sev2 Active

Title: Intermittent 500s DotNetty.EncoderException (AOAI)
DRI: Ethan
Created: 2026-05-01
Impact: 2026-04-30
Customer: AOAI (Partner)
Service: cognitivesecprod-02 + all AOAI

1. Customer Complaint

AOAI team - intermittent 500 from APIM with DotNetty.Codecs.EncoderException in request forwarding. All APIM instances. Rate up to ~5%.

2. What customer saw

HTTP 500: Source: request-forwarder, Reason: GatewayFailure, Message: DotNetty.Codecs.EncoderException. No backend logs. Retries fail. Excel data cleaning (UK South) SLA below 95%.

3. CSS telemetry

Not model-specific. Multiple deployments/regions/subs. Correlates with traffic volume. 10-30 days ongoing with recent spike. Errors in request-forwarding stage (backend invocation).

4. Diagnosis & Fix

ACTIVE - hypothesis: backend connection/encoding issue, HTTP/2 or DotNetty pipeline. Hyena team engagement planned. Ethan/Joaquin investigating.

5. Monitoring Miss?

Yes - existed 10-30 days before ICM. Below alert thresholds.

6. Repairs

TBD pending root cause. Track DotNetty errors as distinct failure class. Anomaly detection.
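One possible shape for the anomaly-detection repair is a baseline-relative check instead of a static threshold. The sketch below assumes daily error rates for the DotNetty failure class are available as plain floats; the function name, the 3-sigma cutoff, and the sample rates are hypothetical. It only flags a change against the trailing baseline, so a long steady-state error level would still need a separate absolute check.

```python
from statistics import mean, stdev


def is_anomalous(history, current, min_sigma=3.0):
    """Flag `current` if it sits at least `min_sigma` deviations above baseline.

    history: trailing daily error rates (assumed input); min_sigma is illustrative.
    """
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current > mu
    return (current - mu) / sigma >= min_sigma


history = [0.001, 0.002, 0.001, 0.002, 0.001, 0.002, 0.001]
print(is_anomalous(history, 0.002))  # False
print(is_anomalous(history, 0.010))  # True
```

In this example a 1% error rate is flagged even though it would sit below a static alert threshold set at, say, 2%.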

Feedback: (placeholder)
Work Items:
Done Bug 25298069 - DotNetty EncoderException (Gateway v2)
INC 51000001005891
Sev2 Mitigated

Title: [CRI][LEGRAND] Incomplete cert chain + traffic imbalance
DRI: Maxim A
Created: 2026-05-02
Customer: LEGRAND FRANCE SA
Related: Recurrence of INC 51000000994948

1. Customer Complaint

LEGRAND FRANCE SA - APIM default domain incomplete cert chain + unbalanced traffic on instance _23 after May 1 07:00 UTC.

2. What customer saw

TLS failures for cert chain validation. 64% traffic on one instance. Behind AFD causing 502s.

3. CSS telemetry

Same root cause as INC 51000000994948: NSG blocking outbound TCP port 80, preventing CRL/AIA retrieval after OS upgrade. Storage:443 also blocked, causing dism.exe hangs and bootstrapper failure. ILB hash skew after VMs dropped.

4. Diagnosis & Fix

Recurrence. Maxim A: (1) disabled traffic, (2) scaled up, (3) waited for VMs, (4) fixed wrong DNS record for mgmt endpoint. Customer using second service. Mitigated May 2.

5. Monitoring Miss?

Yes - second occurrence. First fix not comprehensive. No cert chain or NSG validation.

6. Repairs

Permanent cert chain fix needed. VNET dependency checks. NSG validation for required outbound ports.
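The NSG validation repair could look roughly like the check below. It is a simplified sketch: rules carry only priority, access, protocol, and destination ports, whereas real NSG evaluation also matches source and destination prefixes and service tags. The rule dictionary shape, the function names, and the `default_allow` fallback are assumptions; the two required ports come from the telemetry above.

```python
# Outbound dependencies named in this incident.
REQUIRED_OUTBOUND = {
    "CRL/AIA retrieval": ("Tcp", 80),
    "Storage": ("Tcp", 443),
}


def _port_matches(spec, port):
    """spec is '*', a single port such as '443', or a range such as '1000-2000'."""
    if spec == "*":
        return True
    if "-" in spec:
        low, high = spec.split("-")
        return int(low) <= port <= int(high)
    return int(spec) == port


def outbound_allowed(rules, protocol, port, default_allow=True):
    """First matching rule by ascending priority decides (simplified model)."""
    for rule in sorted(rules, key=lambda r: r["priority"]):
        if rule["protocol"] not in ("*", protocol):
            continue
        if any(_port_matches(spec, port) for spec in rule["ports"]):
            return rule["access"] == "Allow"
    return default_allow  # assumption: no custom rule matched


def blocked_dependencies(rules):
    return [
        name
        for name, (protocol, port) in REQUIRED_OUTBOUND.items()
        if not outbound_allowed(rules, protocol, port)
    ]


rules = [
    {"priority": 100, "access": "Allow", "protocol": "Tcp", "ports": ["443"]},
    {"priority": 200, "access": "Deny", "protocol": "*", "ports": ["*"]},
]
print(blocked_dependencies(rules))  # ['CRL/AIA retrieval']
```

Running a check like this before an OS upgrade or scale operation would surface the blocked port before VMs start failing to bootstrap.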

Feedback: (placeholder)
Work Items:
None
INC 783036395
Sev2 Active
Emerging Issue

Title: Suspended Services do not recover after subscription renewal
DRI: Omar
Created: 2026-04-20
Impact: Multiple customers stuck for days

1. Customer Complaint

Customer services stuck and not recovering after subscriptions were re-enabled. Services not coming back for days.

2. What customer saw

Service shown as Deleted/Suspended in Portal for days after the subscription was unsuspended. No error or banner displayed to the customer.

3. CSS telemetry

Services not being created, no telemetry at all for 3 different CRIs. CSS opened an emerging issue. No progress visible in orchestration logs.

4. Diagnosis & Fix

Undelete queue blocked by poison pills. With SRE Agent help, identified that SubscriptionLifecycleOrchestration runs 5 steps in order: (1) WarnSuspendedContainers, (2) SuspendWarnedContainers, (3) SuspendAndWarnActiveContainers, (4) TerminateContainersWhoseSubscriptionIsUnRegisteredOrDeleted, (5) ActivateContainersForSubscriptionWhichGotRegistered.

The orchestration was hardcoded to process at most 20 services per iteration. ~14 services were persistently failing to undelete for unrelated reasons (SKUv2 RG quota, P1v3 pool exhaustion). With 12-16 services consistently failing per cycle and the earlier steps consuming the rest of the 20-service budget, no capacity remained to process the healthy entries in the undelete queue.

Fix: Kriti unblocked 7 services failing with P1v3. Team unblocked services failing with SQL size and SKUv2 RG quota to free space and catch up. Nina made a hotfix to skip poison pills so they no longer block the queue completely.
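The starvation described above can be reproduced with a small simulation. This is a toy model, not the orchestration code: only the cap of 20 comes from the report, while the queue layout, the six slots used by earlier steps, and the way poison pills are identified (a known set passed in) are assumptions.

```python
from collections import deque

MAX_PER_ITERATION = 20  # per-iteration cap stated in the report


def run_cycle(earlier_step_items, undelete_queue, poison, skip_poison):
    """Simulate one iteration and return the services undeleted in it.

    earlier_step_items: slots already used by steps 1-4 (hypothetical input).
    poison: set of services whose undelete always fails (hypothetical input).
    """
    budget = max(MAX_PER_ITERATION - earlier_step_items, 0)
    recovered, still_failing = [], deque()
    while undelete_queue and budget > 0:
        svc = undelete_queue.popleft()
        if svc in poison:
            still_failing.append(svc)
            if not skip_poison:
                budget -= 1  # a failed attempt still burns a slot
            continue
        budget -= 1
        recovered.append(svc)
    # Failing services return to the head, so they are retried first next cycle.
    undelete_queue.extendleft(reversed(still_failing))
    return recovered


def make_queue():
    return deque([f"bad{i}" for i in range(14)] + [f"ok{i}" for i in range(5)])


poison = {f"bad{i}" for i in range(14)}
print(run_cycle(6, make_queue(), poison, skip_poison=False))  # []
print(run_cycle(6, make_queue(), poison, skip_poison=True))
# ['ok0', 'ok1', 'ok2', 'ok3', 'ok4']
```

Without the skip, the 14 failing services consume every remaining slot in each cycle and return to the head of the queue, so the healthy services behind them never get a turn.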

5. Monitoring Miss?

Yes — multiple gaps:
• No SLA or completion monitoring for undelete
• No check on how long a service has been waiting to be undeleted
• Undelete is buried inside SubscriptionLifecycleOrchestration as the last possible step
• SKUv2 preprovisioning RGs found with quota exhaustion (MSI 800/800, ServerFarms 100/100): preprovisioning creates RGs and leaves them in place regardless of undelete success/failure

6. Repairs

Done: Skip hotfix rolled out behind beta feature (poison pill bypass)
Needed: Cleanup of quota-exhausted SKUv2 preprovisioning RGs
Needed: Decouple undelete from SubscriptionLifecycleOrchestration
Needed: SLA metric / Alert on undelete consistently failing
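A wait-time check for the undelete SLA item might look like the sketch below. It assumes a mapping from service name to the time its subscription was re-enabled; the function name, the input shape, the sample services, and the 24-hour limit are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def overdue_undeletes(pending, now, max_wait=timedelta(hours=24)):
    """Return services waiting longer than max_wait, longest wait first.

    pending: dict of service name -> time the subscription was re-enabled
    (assumed shape). max_wait is an illustrative limit, not an agreed SLA.
    """
    waits = {svc: now - since for svc, since in pending.items()}
    overdue = [svc for svc, wait in waits.items() if wait > max_wait]
    return sorted(overdue, key=lambda svc: waits[svc], reverse=True)


now = datetime(2026, 4, 30, 12, 0, tzinfo=timezone.utc)
pending = {
    "svc-a": now - timedelta(days=3),
    "svc-b": now - timedelta(hours=2),
    "svc-c": now - timedelta(days=5),
}
print(overdue_undeletes(pending, now))  # ['svc-c', 'svc-a']
```

An alert on a non-empty result would cover both the missing completion monitoring and the missing wait-time check listed under the monitoring miss.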

Feedback: (placeholder)
Work Items:
Done Skip hotfix (poison pill bypass) - rolled out behind beta feature
To Do Cleanup quota-exhausted SKUv2 preprovisioning RGs
To Do Decouple undelete from SubscriptionLifecycleOrchestration
To Do SLA metric / Alert on undelete consistently failing

Key Themes - April 8 to May 4, 2026

Theme | Incidents | Description
Version 0.51 Regressions | 787095677, 788655236, 775932060 | Grandfathered limits wiped + Dev Portal registration broken. ~30+ rollback CRIs generated. Largest customer impact.
Certificate Lifecycle (5) | 51000000976727, 21000000984572, 51000000994948, 51000001005891, 21000000995987 | Workspace certs not renewed, default domain incomplete chains (recurring x2), managed cert expiry. Systemic gap.
ForceUpgrade Bypass | 777722043, 778045793 | Feature flag bypassed SDP, deployed unsupported version. AOAI/Cognitive Services outage.
DotNetty Encoder (ACTIVE) | 789605450 | Intermittent 500s in request forwarding, AOAI-wide. 10-30 day steady-state before detection.
Capacity / Quota (3) | 779770640, 780717094, 51000001003804 | DNS quota, prepool exhaustion (~3 weeks undetected), scale-out. No proactive monitoring.
Per-VM Health (4) | 51000000976163, 51000000976845, 51000001002629, 51000001005278 | Infra degradation undetected by probes. Auto-healing slow (~16h). "Bad VM Variant B" pattern.
Stuck Operations (2) | 51000000982831, 783036395 | Update stuck 6+ hours (VMSS + Redis deadlock). Suspended services not recovering (undelete backlog).

Generated by Azure SRE Agent - May 5, 2026