502 Bad Gateway Error
Incident Report for Brillium
Postmortem

Background

On June 8 an event affected service to all customers between approximately 1300 and 1400 UTC. Immediate review of our monitoring and systems data suggested that the network between our systems and our database services suddenly stopped functioning. Our systems repeatedly tried to reestablish communication with these resources without immediate success. Service was restored after approximately one hour.
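For context, the reconnection behavior described above follows a standard retry-with-exponential-backoff pattern. The sketch below is a simplified, hypothetical illustration of that pattern; the function names and parameters are illustrative and are not taken from our production systems.

    import random
    import time

    # Hypothetical sketch: retry a connection attempt with exponential
    # backoff and jitter. Names and parameters are illustrative only.
    def reconnect_with_backoff(connect, max_attempts=10,
                               base_delay=1.0, max_delay=30.0):
        for attempt in range(1, max_attempts + 1):
            try:
                return connect()  # e.g., open a new database connection
            except OSError:  # network-level failure
                if attempt == max_attempts:
                    raise  # give up; callers may surface a 502 to clients
                # Cap the exponential delay and add jitter so many
                # instances do not retry in lockstep.
                delay = min(max_delay, base_delay * 2 ** (attempt - 1))
                time.sleep(delay + random.uniform(0, delay / 2))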

Further review of the available data indicated that communications between our systems and other Amazon AWS cloud services were not functioning reliably during this time. This presented itself as sporadic errors and temporary communication failures between key systems. Because these network systems are not within our domain of control, we do not have access to the detailed data that would confirm our conclusions with absolute certainty. Our systems were operating normally until network communication was suddenly lost.

We subsequently reported this to Amazon AWS Support, which later advised that there were elevated API latency issues around the period of this event. While this aligns with our evidence, Amazon AWS Support was not immediately able to confirm that these latency issues caused the specific problems we experienced. We have requested additional information.

Steps taken:

  • We communicated our status and findings, as we learned them, to all customers via our status page.
  • We provided additional details to customers through direct support communications.
  • We contacted Amazon AWS Support for additional information and confirmation of our findings.
  • We continued to closely monitor systems for a period after the event, to ensure ongoing stability.

Mitigation

Because this event originated outside our systems and beyond our control, no action on our part could have predicted or prevented it.

External network or Internet issues can and will affect access to our systems; however, these types of events are generally rare and often quickly resolved.

Posted Jun 19, 2022 - 18:03 EDT

Resolved
This incident has been resolved.
Posted Jun 08, 2022 - 21:39 EDT
Update
We are continuing to monitor for any further issues. At present, the evidence continues to indicate that the cause was related to the AWS services we use. We will confirm this with AWS and provide updates within the next 48 hours.
Posted Jun 08, 2022 - 10:17 EDT
Monitoring
We have begun ongoing system monitoring. Systems are operating normally.
Posted Jun 08, 2022 - 10:13 EDT
Update
All systems are recovered. We have begun monitoring operations while we gather additional information.
Posted Jun 08, 2022 - 10:10 EDT
Update
All systems are being tested now (smoke test). We expect an update within a few moments.
Posted Jun 08, 2022 - 10:06 EDT
Update
We are bringing affected customer instances back online. We currently estimate approximately 5 more minutes to address the remaining affected instances.
Posted Jun 08, 2022 - 09:42 EDT
Update
We estimate approximately 5 minutes to resolve the issue. We are continuing to identify the root cause. At present, it appears to be related to the AWS database infrastructure.
Posted Jun 08, 2022 - 09:16 EDT
Identified
The issue has been identified and the system operations team is working with the AWS team to remedy the situation.
Posted Jun 08, 2022 - 09:10 EDT
Investigating
We are currently investigating reports that some customers are unable to access their Brillium Assessment Builder instances and are receiving a "502 Bad Gateway" error.
Posted Jun 08, 2022 - 09:09 EDT
This incident affected: Assessment Builder, Administration System, API, and Partner Central.