WEEK 3 - April 22-27 | Sev2 CRI: 7 | RAs: 2

Incident / DRI / Status | Questions & Answers | Feedback & Work Items
INC 787095677
Sev2 Mitigated
Outage Declared

Title: Grandfathered Limits not applied after 0.51
DRI: Macko
Created: 2026-04-27
Impact: 2026-04-27 02:00
Ticket: 2604270050003314

1. Customer Complaint

Multiple customers with grandfathered limits hitting operation limits after 0.51.27763.0 upgrade.

2. What customer saw

"You've reached the maximum number of operations" despite grandfathered limits. API creation/update blocked.

3. CSS telemetry

Custom settings storing limits wiped during 0.51 upgrade. Version 0.50 honored them correctly.

4. Diagnosis & Fix

Code Bug / RP - 0.51 didn't honor custom limits. Rollback to 0.50 via "APIMSKUV1Rollback", quarantine, fix in 0.52. Ashendre applied CustomSetting Upgrade Apr 28.

5. Monitoring Miss?

Yes - no test for limits persistence across upgrades.

6. Repairs

Fix in 0.52. Integration tests. Alert on "max operations" spike post-upgrade.
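A minimal sketch of the kind of integration check the repair item describes: compare custom settings captured before and after an upgrade and fail if any were dropped or changed. The `ServiceState` type, the `lost_settings` helper, and the `MaxOperations` setting name are hypothetical illustrations, not the actual RP data model.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceState:
    """Hypothetical snapshot of a service: build version plus custom settings."""
    version: str
    custom_settings: dict = field(default_factory=dict)


def lost_settings(before: ServiceState, after: ServiceState) -> dict:
    """Return settings present before the upgrade that are missing or changed after it."""
    return {
        key: value
        for key, value in before.custom_settings.items()
        if key not in after.custom_settings or after.custom_settings[key] != value
    }


def assert_limits_persist(before: ServiceState, after: ServiceState) -> None:
    """Raise AssertionError if the upgrade dropped or altered any custom setting."""
    lost = lost_settings(before, after)
    if lost:
        raise AssertionError(
            f"upgrade {before.version} -> {after.version} "
            f"lost custom settings: {sorted(lost)}"
        )


# Example: a grandfathered limit (hypothetical name) wiped by the upgrade.
before = ServiceState("0.50", {"MaxOperations": 5000})
after = ServiceState("0.51", {})
print(lost_settings(before, after))
```

In a real test the two snapshots would come from the service before and after a staged upgrade; here they are constructed by hand to show the comparison only.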

Feedback: (placeholder)
Work Items:
None linked
INC 21000000998761
Sev2 Resolved

Title: APIM service is down
DRI: Macko
Created: 2026-04-25

1. Customer Complaint

Customer reported APIM service completely down.

2. What customer saw

Service unreachable, all traffic returning 500s.

3. CSS telemetry

Auto OS rolling upgrade triggered destructive VMSS model update from scale-out. All VMs lost Redis. Health monitor deadlocked. 8x traffic ramp + pre-existing Redis errors contributed.

4. Diagnosis & Fix

Scale-out triggered destructive VMSS upgrade. ~7h outage. Macko mitigated with Reboot Apr 25.

5. Monitoring Miss?

Yes - cascading failure not caught until customer reported.

6. Repairs

Rolling upgrade guardrails. Redis resilience. Reboot as preferred first mitigation. Known pattern documented.

Feedback: (placeholder)
Work Items:
None
INC 51000000994948
Sev2 Resolved

Title: Front Door 502 - invalid cert chain
DRI: Macko
Created: 2026-04-23
Note: Recurred as INC 51000001005891

1. Customer Complaint

Front Door returning 502. Incomplete certificate chain on APIM default domain.

2. What customer saw

HTTP 502. Missing intermediate CA certs.

3. CSS telemetry

After OS upgrade, VMs stopped presenting full cert chain. Internal VNET with port 80 blocked, preventing CRL/AIA fetching.

4. Diagnosis & Fix

Gateway (Managed) - incomplete chain after OS upgrade on VNET services with port 80 blocked. Macko resolved Apr 23.

5. Monitoring Miss?

Yes - no cert chain completeness monitoring.

6. Repairs

Full chain always served. Cert chain alerting. VNET dependency checks for CRL/AIA.
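A rough sketch of what a chain completeness check could look like, assuming the presented certificates have already been extracted as (subject, issuer) pairs in the order the server sent them. The `chain_is_complete` function and the certificate names are hypothetical, and the check compares names only; it does not verify signatures, validity dates, or revocation.

```python
from typing import Sequence, Set, Tuple

Cert = Tuple[str, str]  # (subject, issuer), hypothetical simplified form


def chain_is_complete(presented: Sequence[Cert], trusted_roots: Set[str]) -> bool:
    """Return True if the presented chain links leaf -> intermediates -> trusted root
    without the client having to fetch any missing certificate (e.g. via AIA)."""
    if not presented:
        return False
    # Each certificate must be issued by the next one in the presented list.
    for (_, issuer), (next_subject, _) in zip(presented, presented[1:]):
        if issuer != next_subject:
            return False
    # The last presented certificate must be a trusted root or be issued by one.
    last_subject, last_issuer = presented[-1]
    return last_subject in trusted_roots or last_issuer in trusted_roots


roots = {"Example Root CA"}
full_chain = [
    ("service.example", "Example Intermediate CA"),
    ("Example Intermediate CA", "Example Root CA"),
]
leaf_only = [("service.example", "Example Intermediate CA")]

print(chain_is_complete(full_chain, roots))
print(chain_is_complete(leaf_only, roots))
```

The leaf-only case models the failure in this incident: a client that cannot fetch the intermediate itself has no path from the leaf to a trusted root.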

Feedback: (placeholder)
Work Items:
None
INC 21000000995987
Sev2 Active
Downgraded Sev3

Title: SSL CERTIFICATE_VERIFY_FAILED
DRI: Martin
Created: 2026-04-23
Customer: Network Rail (Mission Critical)

1. Customer Complaint

SSL certificate verification failures on backend calls.

2. What customer saw

SSL CERTIFICATE_VERIFY_FAILED intermittently.

3. CSS telemetry

Transferred to Platform for deeper RCA.

4. Diagnosis & Fix

Active - Martin investigating.

5. Monitoring Miss?

TBD.

6. Repairs

TBD.

Feedback: (placeholder)
Work Items:
None
INC 785535632
Sev2 Resolved

Title: V2 SKUs Management 502s
DRI: Macko
Created: 2026-04-24
Impact: 2026-04-23 01:30
Tickets: 2604240030005476, 2604240050001721

1. Customer Complaint

Multiple EA customers - intermittent 502 on management plane for V2 SKUs across regions since ~Apr 21.

2. What customer saw

Management API calls returned 502. Retries sometimes succeeded.

3. CSS telemetry

HttpIncomingRequests: 502 spike from Apr 21, correlated with SKUv2 deployment. Related: 51000000995495, 51000000996243, 784788127.

4. Diagnosis & Fix

SMAPI Regression. Macko rolled back SKUv2 Apr 27.

5. Monitoring Miss?

Yes - 3-day gap between spike start and declaration.

6. Repairs

Fix regression. Lower threshold for management 502 alert.
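To illustrate how the alert threshold affects detection latency, here is a small sketch that scans per-window request counts and reports the first window at which a sustained 502 rate would fire an alert. The function name, window shape, and all numbers are assumptions for illustration; they are not the actual HttpIncomingRequests alert definition.

```python
from typing import Optional, Sequence, Tuple


def first_alert_window(
    windows: Sequence[Tuple[int, int]],  # (total_requests, count_502) per window
    threshold: float,                    # 502 rate that counts as a breach
    consecutive: int = 3,                # breaches in a row needed to fire
    min_requests: int = 50,              # ignore windows with too little traffic
) -> Optional[int]:
    """Return the index of the window where the alert fires, or None if it never does."""
    streak = 0
    for index, (total, errors) in enumerate(windows):
        breached = total >= min_requests and errors / total >= threshold
        streak = streak + 1 if breached else 0
        if streak >= consecutive:
            return index
    return None


# Made-up data: a low-grade 502 rate (3-4%) followed by a large spike.
windows = [(1000, 2), (1000, 30), (1000, 35), (1000, 40), (1000, 200)]

print(first_alert_window(windows, threshold=0.02))
print(first_alert_window(windows, threshold=0.10))
```

With the lower threshold the sustained 3-4% error rate is enough to fire; with the higher one only the final spike breaches, and a single window does not satisfy the consecutive-breach requirement.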

Feedback: (placeholder)
Work Items:
None linked
INC 51000000996010
Sev2 Active
Sev 2->3: customer issue

Title: HM Electronics SSL CERT_VERIFY_FAILED
DRI: Macko
Created: 2026-04-23

1. Customer Complaint

HM Electronics (Sev A, Premium) - intermittent SSL cert verification failures.

2. What customer saw

Intermittent SSL CERT_VERIFY_FAILED on backend calls.

3. CSS telemetry

Classified as customer issue.

4. Diagnosis & Fix

Customer issue. Downgraded to Sev3.

5. Monitoring Miss?

N/A.

6. Repairs

N/A.

Feedback: (placeholder)
Work Items:
None
INC 51000000996953
Sev2 Active
Sev 2->3: reporting issue

Title: APIM does not scale out
DRI: Martin
Created: 2026-04-24

1. Customer Complaint

Customer reported being unable to scale out.

2. What customer saw

Scale-out appeared broken.

3. CSS telemetry

Customer actually has 60 instances. Problem is orchestration logs/container size reporting.

4. Diagnosis & Fix

Not a scaling issue - reporting/logging problem. Martin investigating.

5. Monitoring Miss?

No (reporting issue).

6. Repairs

Fix orchestration log reporting.

Feedback: (placeholder)
Work Items:
None
RA - INC 785238144
Sev2 RA

Title: Speech resources scale
Date: 2026-04-25
Requester: AOAI
Bridge: Ajinkya

~8-10% failed connections via private endpoint. ProxyRequest shows traffic reaching gateway, no server_5xx. Issue likely upstream (ExpressRoute).

Feedback: (placeholder)
Work Items:
None
RA - INC 787253075
Sev2 RA

Title: AOAI Model Group Unhealthy
Date: 2026-04-28
Requester: Foundry/Hyena
Bridge: (no show)

Requester didn't respond. Later: signs of recovery, DNS error fixed, RA no longer needed.

Feedback: (placeholder)
Work Items:
None