502 Gateway Error Reports

Incident Report for Brillium

Postmortem

Background

On June 13, an outage very similar to the one on June 8 occurred between approximately 1300 UTC and 1800 UTC. It affected only a portion of our customers.

A review of our data indicated that specific Amazon AWS cloud service communications between our server systems and authentication resources were failing sporadically during this period, and at times appeared to fail altogether. Our monitoring and system information show that our systems repeatedly tried to communicate with external resources without success.
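For illustration only, the repeated and unsuccessful attempts described above resemble a bounded retry loop. The following is a minimal sketch, not our production code: the function name call_with_backoff, its parameters, and the choice of ConnectionError as the retryable failure are all hypothetical and chosen for the example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying on ConnectionError with capped exponential backoff.

    Hypothetical helper: `operation` stands in for any call to an external
    resource (for example, an authentication service).
    """
    last_error: Optional[ConnectionError] = None
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError as exc:
            last_error = exc
            if attempt == max_attempts - 1:
                break  # no attempts left, so do not sleep again
            # Delay doubles on each failure: 0.5s, 1s, 2s, ... capped at max_delay.
            sleep(min(max_delay, base_delay * (2 ** attempt)))
    # Surface a clear failure instead of retrying indefinitely.
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

A loop like this limits how long a system keeps retrying, but it cannot restore service while the external resource itself remains unreachable, which is consistent with what we observed during this event.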

All systems became operational later in the day.

Unlike the previous event on June 8, this one was publicly reported by a few outlets and users on social media. Amazon’s own services appear to have been impacted by the event.

Steps taken:

We communicated our status and findings, as we learned them, to all customers via our status page.
We provided additional details to customers through direct support communications.
We contacted Amazon AWS Support for additional information and confirmation of our findings.
We continued to closely monitor systems for a period after the event, to ensure ongoing stability.

Public Reports

Public reports of the Amazon outage can be found via the links below and these were shared with some customers:

Mitigation

As this event was external to our systems and outside of our control, we could not have predicted or prevented the issues through any direct action.

External network or Internet issues can and will affect access to our systems; however, these types of events are generally rare and often quickly resolved.

Posted Jun 19, 2022 - 18:14 EDT

Resolved

Monitoring shows that the issues have been addressed. A full report will be shared via the incident postmortem.

Posted Jun 13, 2022 - 22:52 EDT

Monitoring

We are continuing to monitor activity. Initial results and testing are positive.

Posted Jun 13, 2022 - 15:18 EDT

Update

Brillium Systems are currently back in service. Our internal tests show the external networking is presently stable. We will continue to monitor.

Posted Jun 13, 2022 - 14:38 EDT

Update

Investigations continue. At present, there appears to be a problem with the network routing that connects our systems, and it is causing sporadic failures. Reports have begun to surface that other Amazon systems are experiencing downtime as well, although we do not know the extent of those issues or their relationship to our specific issue.

As these systems are outside of our control, our attempted workaround(s) did not sufficiently address the problem. We are currently assisting engineers with further diagnosis in any way that we can, to help resolve the issue as quickly as possible.

Posted Jun 13, 2022 - 14:19 EDT

Update

The system operations team is currently consulting with AWS engineers to determine the root cause of the issue.

Posted Jun 13, 2022 - 13:35 EDT

Update

We are receiving intermittent reports that some users are receiving 503 errors.

Posted Jun 13, 2022 - 12:19 EDT

Update

We are continuing to investigate this issue.

Posted Jun 13, 2022 - 12:17 EDT

Update

We have implemented a workaround for these issues. The systems are available, and we continue to investigate and gather information about the root cause.

Posted Jun 13, 2022 - 12:02 EDT

Update

The systems operations team is provisioning a possible workaround to the issue, while they continue to investigate. An additional status update will be posted in approximately 15 minutes.

Posted Jun 13, 2022 - 11:27 EDT

Update

We are continuing to investigate the issue.

Posted Jun 13, 2022 - 11:03 EDT

Update

The system operations team is investigating whether an operating system level error in some of the AWS cloud systems is potentially affecting services. Currently, the system monitoring of Brillium services does not indicate anything unusual.
Although this may lie outside of our direct control, the team is investigating any opportunity to mitigate the effect from our side.
We continue to investigate.

Posted Jun 13, 2022 - 10:43 EDT

Investigating

We are currently investigating 502 Gateway errors reported by a portion of our customers.

Posted Jun 13, 2022 - 10:39 EDT

This incident affected: API & Integrations (API), Administration (User Administration and Authentication, Partner Central Custom Administration), and Assessment Builder (Assessment Authoring).