AWS Service Disruption Affecting Brillium Services

Incident Report for Brillium

Postmortem

AWS Regional Outage (October 20, 2025)

1. Executive Summary

On October 20, 2025, Brillium services experienced a significant disruption lasting approximately 3.5 hours, caused by a widespread regional outage within Amazon Web Services (AWS). The incident began at 3:00 AM EDT and affected service availability and performance for customers relying on the affected AWS region. The root cause was external to Brillium’s platform. During the incident, we focused on confirming the cause, communicating with customers, and restoring service quickly. AWS reported resolution of the upstream issue at 6:00 AM EDT, and we confirmed full restoration by 6:30 AM EDT.

2. Key Details

Incident Name: AWS Regional Service Disruption
Date: October 20, 2025
Duration: 3 hours, 30 minutes (3:00 AM EDT to 6:30 AM EDT)
Impacted Services: All core Brillium services (including API, Data Processing, and Web Frontend)
Root Cause: Widespread regional outage in AWS (External)
Resolution Status: Fully Resolved

3. Impact Analysis

During the incident window (3:00 AM - 6:00 AM EDT), customers experienced:

  • Service Unavailability: Difficulty accessing or connecting to various Brillium applications.
  • Performance Degradation: Increased latency and intermittent timeouts when services were partially available.
  • Data Processing Delays: Backend processing queues were backed up, leading to delays in scheduled tasks and data updates.

The primary customer impact was loss of service availability for the duration of the upstream AWS outage.

4. Root Cause

The root cause was confirmed to be a major service disruption impacting a critical AWS region upon which a portion of Brillium’s infrastructure relies. This was an external failure of the cloud provider’s infrastructure.

  • Brillium Action: The incident was immediately confirmed via AWS status pages and internal monitoring systems.
  • External Cause: An initial AWS failure (e.g., networking or power event) cascaded across availability zones within the region.

5. Incident Timeline (All Times EDT)

3:00 AM - Internal monitoring alerts triggered across multiple Brillium services. Incident declared.
3:15 AM - External AWS status page confirms a major regional incident affecting multiple services.
3:30 AM - Initial customer status update posted to status.brilium.com identifying the external AWS issue.
6:00 AM - AWS reports resolution of the underlying issue, and Brillium systems begin self-recovering.
6:15 AM - Service restoration update posted; Brillium enters extensive monitoring phase.
6:30 AM - Brillium monitoring confirms all services are stable, running within normal parameters, and fully functional. Final resolution update posted.

6. Corrective Actions and Lessons Learned

While the root cause was external, we identified opportunities to improve our monitoring and response to similar external events:

Area: Alerting
Action Item: Enhance specific alerting thresholds to differentiate between high load/internal issues and sudden, widespread external availability failures.
Target Date: End of Q2 2026

Area: Communication
Action Item: Create pre-drafted status page templates for common external dependency failures (e.g., AWS, other third-party providers) to expedite initial communication.
Target Date: Immediate

Area: Monitoring
Action Item: Implement synthetic transactions (probes) in a secondary, unaffected region to quickly confirm global service health during local regional outages.
Target Date: Q4 2025
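
As an illustration of the alerting and monitoring items above, the following is a minimal sketch of how synthetic probe results might be used to separate a sudden, widespread regional failure from an internal issue. It is not Brillium's implementation. The ProbeResult type, the classify_outage function, the 0.8 threshold, and the region names "region-a" and "region-b" are all hypothetical, and the sketch assumes each probe result records the region whose endpoints were exercised.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ProbeResult:
    """Outcome of one synthetic transaction (all field names are hypothetical)."""
    service: str  # e.g. "api" or "web-frontend"
    region: str   # region whose endpoints the probe exercised (assumption)
    ok: bool      # True if the synthetic transaction succeeded


def classify_outage(results: Iterable[ProbeResult],
                    primary_region: str,
                    widespread_ratio: float = 0.8) -> str:
    """Return "healthy", "suspected-external-regional", or "suspected-internal".

    Heuristic (an assumption, not a confirmed rule): if at least
    `widespread_ratio` of the primary-region probes fail while every probe
    in the other regions succeeds, the failure is probably regional.
    """
    results = list(results)
    primary = [r for r in results if r.region == primary_region]
    secondary = [r for r in results if r.region != primary_region]
    if not primary:
        raise ValueError("no probe results for the primary region")

    failed = [r for r in primary if not r.ok]
    if not failed:
        return "healthy"

    failed_ratio = len(failed) / len(primary)
    # Without any secondary-region evidence we cannot call the failure regional.
    secondary_healthy = bool(secondary) and all(r.ok for r in secondary)
    if failed_ratio >= widespread_ratio and secondary_healthy:
        return "suspected-external-regional"
    return "suspected-internal"


results = [
    ProbeResult("api", "region-a", False),
    ProbeResult("web-frontend", "region-a", False),
    ProbeResult("data-processing", "region-a", False),
    ProbeResult("api", "region-b", True),
]
print(classify_outage(results, primary_region="region-a"))
# prints "suspected-external-regional"
```

A classification like this would only be a first signal for the on-call engineer. Confirmation against the provider's status page, as in the timeline above, would still be needed before posting a customer update.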

We appreciate the patience of our customers during this disruption and are committed to implementing these actions to enhance the resilience of the Brillium platform.

Posted Oct 20, 2025 - 21:12 EDT

Resolved

Final Resolution: Full Service Restoration and Normal Operation

Current Status: Resolved

Time of Final Confirmation: Approximately 6:30 AM EDT

We are happy to confirm the full and sustained resolution of the earlier service disruption caused by the Amazon Web Services (AWS) regional outage.

Our intensive monitoring period has concluded, and we have verified that:
• All Brillium services are fully operational.
• All systems are running within normal performance parameters.
• There are no residual effects or lingering issues from the external AWS incident.

We consider this incident closed.

Thank you once again for your patience and understanding during this unplanned disruption. We appreciate your trust in Brillium services and remain committed to providing you with reliable performance.
Posted Oct 20, 2025 - 06:32 EDT

Update

Current Status: Resolved / Monitoring
Time of Resolution: Approximately 6:00 AM EDT
We are pleased to report that the widespread Amazon Web Services (AWS) regional issue that began around 3:00 AM EDT appears to have been resolved by AWS.

We are now seeing a steady return to normal operations and full service availability across all affected Brillium services.

Next Steps & What We Are Doing Now:
• Our engineering team has confirmed that all core services and customer-facing features are back online and operational.
• We are now in an extensive monitoring and stabilization phase. We will continue to closely watch system performance and metrics over the next several hours to ensure complete stability and prevent any potential lingering effects.

We know this outage caused significant disruption, and we sincerely appreciate your patience and understanding throughout this external incident.

If you continue to experience any unusual issues with a specific Brillium service, please do not hesitate to reach out to our support team.

We will provide one final wrap-up report once we have completed the monitoring phase and are 100% confident in the full, sustained restoration of service.
Posted Oct 20, 2025 - 06:10 EDT

Monitoring

Time of Initial Impact: Approximately 3:00 AM EDT
Brillium services are currently impacted by a widespread Amazon Web Services (AWS) regional outage that began around 3:00 AM EDT this morning.
This AWS disruption is affecting the availability and performance of various Brillium services. Our engineering and operations teams were immediately alerted and are working diligently to assess the full scope of the impact on our infrastructure.
What We Are Doing:
• We are in close and continuous contact with AWS to gather the latest updates on their progress.
• Our teams are actively exploring and implementing potential mitigating steps where possible.
• We are preparing for a swift return to full operational status once the underlying AWS issue is resolved.
We understand the criticality of our services to your operations and sincerely apologize for the inconvenience and disruption this unplanned outage is causing.
We are committed to providing you with regular updates and will post a new notification as soon as we have substantive information from AWS or when the incident is resolved.
Thank you for your patience and understanding as we navigate this external issue.
Next Update Expected: 15 minutes
Posted Oct 20, 2025 - 03:30 EDT
This incident affected: API & Integrations (API, Zapier Integration), Administration (User Administration and Authentication, Partner Central Custom Administration), Assessment Builder (Assessment Authoring, Assessment Delivery), and Talent (Invitation Management, Recruiter & Candidate Management).