On October 20, 2025, Brillium services experienced a significant disruption lasting approximately 3.5 hours due to an external, widespread regional outage within Amazon Web Services (AWS). The incident began at 3:00 AM EST and primarily impacted service availability and performance for customers whose services run in the affected AWS region. The core issue was external to Brillium’s platform. Our focus during the incident was on confirmation, communication, and swift restoration, which was completed by 6:30 AM EST after AWS reported their upstream resolution.

| Metric | Detail |
|---|---|
| Incident Name | AWS Regional Service Disruption |
| Date | October 20, 2025 |
| Duration | 3 hours, 30 minutes (3:00 AM EST to 6:30 AM EST) |
| Impacted Services | All core Brillium services (including API, Data Processing, and Web Frontend) |
| Root Cause | Widespread regional outage in AWS (External) |
| Resolution Status | Fully Resolved |

During the incident window (3:00 AM - 6:00 AM EST), the primary customer impact was loss of service availability for the duration of the upstream AWS outage.

The root cause was confirmed to be a major service disruption in a critical AWS region that hosts a portion of Brillium’s infrastructure. The failure was in the cloud provider’s infrastructure and external to Brillium’s platform.

| Time | Event |
|---|---|
| 3:00 AM | Internal monitoring alerts triggered across multiple Brillium services. Incident declared. |
| 3:15 AM | External AWS status page confirms a major regional incident affecting multiple services. |
| 3:30 AM | Initial customer status update posted to status.brillium.com identifying the external AWS issue. |
| 6:00 AM | AWS reports resolution of the underlying issue, and Brillium systems begin self-recovering. |
| 6:15 AM | Service restoration update posted; Brillium enters extensive monitoring phase. |
| 6:30 AM | Brillium monitoring confirms all services are stable, running within normal parameters, and fully functional. Final resolution update posted. |

While the root cause was external, we identified opportunities to improve our monitoring and response to similar external events:

| Area | Action Item | Target Date |
|---|---|---|
| Alerting | Enhance specific alerting thresholds to differentiate between high load/internal issues and sudden, widespread external availability failures. | End of Q2 2026 |
| Communication | Create pre-drafted status page templates for common external dependency failures (e.g., AWS, other third-party providers) to expedite initial communication. | Immediate |
| Monitoring | Implement synthetic transactions (probes) in a secondary, unaffected region to quickly confirm global service health during local regional outages. | Q4 2025 |
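
To make the Alerting and Monitoring items more concrete, the following is a minimal sketch of how one round of synthetic probe results might be triaged into "isolated, likely internal" versus "widespread, likely external". It is an illustration under stated assumptions, not Brillium's implementation: the names `ProbeResult`, `Cause`, and `classify`, the 0.8 threshold, and the `provider_incident` flag (for example, set by an on-call engineer after checking the provider's status page) are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Cause(Enum):
    HEALTHY = "all probes passing"
    ISOLATED = "isolated failures, likely internal or load-related"
    WIDESPREAD_UNCONFIRMED = "widespread failures, provider incident not confirmed"
    EXTERNAL = "widespread failures during a confirmed provider incident"


@dataclass(frozen=True)
class ProbeResult:
    service: str  # hypothetical service label, e.g. "api" or "web-frontend"
    ok: bool      # True if the synthetic transaction succeeded


def classify(
    results: list[ProbeResult],
    provider_incident: bool,
    widespread_ratio: float = 0.8,  # assumed threshold, tune per environment
) -> Cause:
    """Triage one round of probe results into a likely cause category."""
    if not results:
        raise ValueError("at least one probe result is required")

    failed = sum(1 for r in results if not r.ok)
    if failed == 0:
        return Cause.HEALTHY

    # Only a minority of services failing points toward an internal issue.
    if failed / len(results) < widespread_ratio:
        return Cause.ISOLATED

    # Most or all services failing at once: external only if the provider
    # has also confirmed an incident, otherwise flag for investigation.
    return Cause.EXTERNAL if provider_incident else Cause.WIDESPREAD_UNCONFIRMED
```

As a usage example, a round in which the API, Data Processing, and Web Frontend probes all fail while the provider reports an incident would be classified as `Cause.EXTERNAL`, whereas a single failing probe among the three would be classified as `Cause.ISOLATED`. Requiring provider confirmation before labeling a failure as external is a deliberate design choice in this sketch, so that a widespread internal failure is not dismissed as an upstream problem.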

We appreciate the patience of our customers during this disruption and are committed to implementing these actions to enhance the resilience of the Brillium platform.