Service disruption
Incident Report for Castle

On March 30, 2021, Castle’s API became degraded during three distinct windows of time:

  • 12:02 UTC - 12:45 UTC
  • 12:59 UTC - 13:41 UTC
  • 14:48 UTC - 15:25 UTC

During these windows, some Castle API calls failed, including calls to our synchronous authenticate endpoint. The Castle dashboard remained up, but could not render data while the API was unavailable. Service was fully restored as of 15:25 UTC, and some data generated from requests to our asynchronous track and batch endpoints during the incident was recovered from queues and subsequently processed.

As we communicated to all active customers yesterday, we take this sort of incident very seriously, and we want to share some of the factors that led to it.

The root cause of the incident was a failure of one of our primary data clusters. The cluster is a multi-node, fault-tolerant commercial solution, and a complete cluster failure is extremely rare. Castle’s infrastructure team responded immediately to the incident and found an unbounded memory leak that caused every node to shut down simultaneously. Over the course of the incident, we learned that the leak was exacerbated by a specific class of background job that had begun running a day earlier but did not start leaking memory until some time later.
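Castle has not shared its monitoring tooling, but the "unbounded" nature of the leak described above suggests a simple heuristic an operator could apply: a memory series that climbs on nearly every sample looks like a leak, while normal allocation churn rises and falls. The sketch below is purely illustrative; the function name, window size, and threshold are assumptions, not Castle's actual alerting.

```python
def shows_unbounded_growth(samples, window=5, threshold=0.95):
    """Flag a memory-usage series that keeps climbing.

    `samples` is a list of memory readings (e.g. RSS in MB) taken at a
    fixed interval. If nearly every reading in the trailing window is
    higher than the one before it, the series looks like a leak rather
    than normal allocation churn. (Hypothetical heuristic, not Castle's.)
    """
    if len(samples) < window + 1:
        return False
    tail = samples[-(window + 1):]
    rises = sum(1 for a, b in zip(tail, tail[1:]) if b > a)
    return rises / window >= threshold

# A steadily climbing series trips the check; a sawtooth does not.
print(shows_unbounded_growth([100, 120, 141, 165, 190, 230]))  # True
print(shows_unbounded_growth([100, 160, 90, 150, 95, 155]))    # False
```

In practice such a check would feed a pager alert well before nodes exhaust memory, which matters here because the leaking jobs ran for a day before the cluster actually fell over.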

We detected the issue as soon as the incident began and immediately restarted the cluster. A full 'cold start' of the entire cluster takes around 40 minutes, which accounts for the first downtime window. After the cluster restarted, our fault-tolerant job scheduling system attempted to run the leaking jobs again, forcing two more full cold restarts as we worked to clear out the job queue and replicas.
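The restart loop above is a classic retry storm: a fault-tolerant scheduler faithfully re-runs the very job that crashed the cluster. One common mitigation is a failure-count guard (a simple circuit breaker) that refuses to reschedule a job class after repeated failures in a window. The sketch below is a generic illustration under assumed names and thresholds, not Castle's scheduler.

```python
import time

class RetryGuard:
    """Stop rescheduling a job class after repeated failures.

    After `max_failures` failures within `window_s` seconds, further
    runs are refused until the failures age out (or an operator resets
    the guard), so a bad job cannot keep knocking over the cluster on
    every restart. Hypothetical sketch, not Castle's actual system.
    """
    def __init__(self, max_failures=3, window_s=3600):
        self.max_failures = max_failures
        self.window_s = window_s
        self.failures = []

    def allow(self, now=None):
        now = time.time() if now is None else now
        # Drop failures older than the window, then check the budget.
        self.failures = [t for t in self.failures if now - t < self.window_s]
        return len(self.failures) < self.max_failures

    def record_failure(self, now=None):
        self.failures.append(time.time() if now is None else now)

guard = RetryGuard(max_failures=2, window_s=600)
guard.record_failure(now=0)
guard.record_failure(now=60)
print(guard.allow(now=120))   # False: guard tripped, job is not retried
print(guard.allow(now=900))   # True: failures aged out of the window
```

With such a guard in place, the second and third cold restarts described above would have required a human to explicitly re-enable the job class rather than happening automatically.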

At this time, we believe the memory leak stems from a bug in our data cluster provider's software: we have been able to reproduce the issue in a test environment and have a high-priority case open with their support team. In the meantime, we have audited all active background job systems to ensure that performance-affecting jobs are temporarily disabled or worked around.
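Temporarily disabling a class of jobs, as described above, is usually implemented as a central kill switch that every job checks before starting. The sketch below shows the idea; the job names and the `DISABLED_JOBS` set are invented for illustration and are not Castle's.

```python
# Hypothetical kill switch: job names and the DISABLED_JOBS set are
# illustrative, not Castle's actual job system.
DISABLED_JOBS = {"nightly_reindex", "bulk_backfill"}

def should_run(job_name):
    """Central gate every scheduled job checks before starting."""
    return job_name not in DISABLED_JOBS

print(should_run("nightly_reindex"))  # False: audited and disabled
print(should_run("send_webhooks"))    # True: unaffected jobs keep running
```

Keeping the switch in one shared place (a config store or feature-flag service) lets operators disable a misbehaving job fleet-wide without a deploy.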

Once again, we apologize for the impact of this interruption. Please feel free to contact us if you have any further questions.

Posted Mar 31, 2021 - 16:57 UTC

Systems are operating normally, and we have put mitigation measures in place to ensure the issue does not recur. We'll publish a full retrospective and root-cause teardown of the incident within the next few days.
Posted Mar 30, 2021 - 16:28 UTC
API endpoints are responsive again and the system is stabilizing. We're monitoring the situation.
Posted Mar 30, 2021 - 15:41 UTC
We are seeing degraded performance on API endpoints once more, and are working on restoring functionality as quickly as possible.
Posted Mar 30, 2021 - 14:53 UTC
The database cluster is operating normally and API endpoints are responding. We're continuing to monitor the situation.
Posted Mar 30, 2021 - 13:42 UTC
We are experiencing issues with our main database cluster, which affects all API endpoints. We're currently investigating.
Posted Mar 30, 2021 - 12:09 UTC
This incident affected: Dashboard and Legacy APIs.