Resolved
Nov 28 at 11:32am CET

Post Mortem - Database Cluster Overload

What happened

Blockbrain experienced a temporary outage affecting our production database cluster.
The primary node of the cluster suffered a CPU overload, which led to partial unavailability and degraded performance for some customers.
The overload occurred while we were running a data ownership transfer script.
The script ran queries against a collection with millions of records that lacked suitable indexes, so the queries scanned the entire collection and put excessive load on the database.

How we resolved it

We stopped the data ownership transfer script as soon as the impact was detected.
After stopping the script, the database cluster recovered automatically thanks to the mechanisms provided by our cloud database provider.

Preventive measures

We are implementing the following measures to avoid similar incidents:

Proper indexing of the database
- We will add and optimize indexes on the affected collections.
- This will ensure that large queries can be executed efficiently without overloading the CPU.
Safer query execution in the future
- Long-running or large queries will be executed in batches.
- We will ensure that queries on large datasets operate only on indexed fields.
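
The two measures above can be illustrated with a minimal sketch. It is not our production script: it uses Python's built-in sqlite3 module as a stand-in for the real database, and the table name `records`, the column `owner_id`, the index name, and the function names are all hypothetical. It shows an index on the field being filtered, and an ownership transfer that works through the data in small batches instead of one large query.

```python
import sqlite3
import time


def setup_demo(conn, n=2500):
    """Create a hypothetical 'records' table with an index on owner_id."""
    conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, owner_id TEXT)")
    # Index the field the transfer filters on, so lookups avoid a full scan.
    conn.execute("CREATE INDEX idx_records_owner ON records (owner_id)")
    conn.executemany(
        "INSERT INTO records (owner_id) VALUES (?)",
        [("alice" if i % 2 == 0 else "bob",) for i in range(n)],
    )
    conn.commit()


def transfer_ownership(conn, old_owner, new_owner, batch_size=500, pause_seconds=0.0):
    """Reassign records from old_owner to new_owner in small batches."""
    if old_owner == new_owner:
        return 0  # nothing to do, and the loop below would never finish
    total = 0
    while True:
        # Fetch only the next batch of matching ids via the indexed field.
        rows = conn.execute(
            "SELECT id FROM records WHERE owner_id = ? LIMIT ?",
            (old_owner, batch_size),
        ).fetchall()
        if not rows:
            break
        ids = [row[0] for row in rows]
        placeholders = ",".join("?" * len(ids))
        conn.execute(
            f"UPDATE records SET owner_id = ? WHERE id IN ({placeholders})",
            (new_owner, *ids),
        )
        conn.commit()  # commit per batch so each unit of work stays small
        total += len(ids)
        if pause_seconds:
            time.sleep(pause_seconds)  # optional breathing room for the database
    return total


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    setup_demo(conn)
    moved = transfer_ownership(conn, "alice", "carol", batch_size=200)
    print(f"moved {moved} records")
```

Committing after each batch keeps every unit of work short, and the optional pause gives the database time to serve other traffic between batches. Suitable batch sizes and pauses depend on the actual workload.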

Updated
Nov 28 at 11:31am CET

Web App recovered.

Created
Nov 28 at 11:17am CET

Web App went down.