Database Cluster Overload
Resolved
Nov 28 at 11:32am CET
Post Mortem - Database Cluster Overload
What happened
Blockbrain experienced a temporary outage affecting our production database cluster.
The primary node of the cluster suffered a CPU overload, which led to partial unavailability and degraded performance for some customers.
The overload occurred while we were running a data ownership transfer script.
The script's queries were not backed by an index, so each one read through an entire collection containing millions of records, which put excessive load on the database.
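To illustrate why a missing index matters, here is a minimal sketch that asks a database how it would run the same lookup before and after an index exists. It uses Python's built-in sqlite3 purely as a stand-in, since this post does not name the production database; the `records` table, the `owner_id` column, and the index name are hypothetical.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the planner's description of how SQLite would run `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row holds the human-readable plan detail.
    return [row[-1] for row in rows]


def build_demo_db():
    """Create an in-memory table with hypothetical ownership data."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE records (id INTEGER PRIMARY KEY, owner_id TEXT, payload TEXT)"
    )
    conn.executemany(
        "INSERT INTO records (owner_id, payload) VALUES (?, ?)",
        [(f"user-{i % 50}", "x") for i in range(5000)],
    )
    conn.commit()
    return conn


LOOKUP = "SELECT id FROM records WHERE owner_id = ?"

conn = build_demo_db()

# Without an index, the only option is to walk every row in the table.
plan_without_index = query_plan(conn, LOOKUP, ("user-7",))

# With an index on the filtered column, the planner can jump to matching rows.
conn.execute("CREATE INDEX idx_records_owner ON records (owner_id)")
plan_with_index = query_plan(conn, LOOKUP, ("user-7",))

print(plan_without_index)
print(plan_with_index)
```

The exact wording of the plan text varies between SQLite versions, but the first plan describes a scan of the whole table and the second a search through the index. Other databases expose the same idea through their own explain facilities.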
How we resolved it
We stopped the data ownership transfer script as soon as the impact was detected.
After stopping the script, the database cluster recovered automatically thanks to the mechanisms provided by our cloud database provider.
Preventive measures
We are implementing the following measures to avoid similar incidents:
- Proper indexing of the database
  - We will add and optimize indexes on the affected collections.
  - This will ensure that large queries can be executed efficiently without overloading the CPU.
- Safer query execution in the future
  - Long-running or large queries will be executed in batches.
  - We will ensure that queries on large datasets operate only on indexed fields.
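The two measures above can be sketched together as a batched ownership transfer that only filters on indexed fields. This is an illustration under stated assumptions, not our actual script: it uses Python's built-in sqlite3 as a stand-in database, and the `records` table, `owner_id` column, `transfer_ownership` function, batch size, and pause are all hypothetical.

```python
import sqlite3
import time


def transfer_ownership(conn, old_owner, new_owner, batch_size=500, pause_seconds=0.0):
    """Reassign records in small batches; return the number of rows updated."""
    total = 0
    last_id = 0
    while True:
        # Fetch the next slice of ids, resuming after the last id already handled.
        # Both filter columns are indexed (owner_id by the index, id as primary key).
        ids = [
            row[0]
            for row in conn.execute(
                "SELECT id FROM records WHERE owner_id = ? AND id > ? "
                "ORDER BY id LIMIT ?",
                (old_owner, last_id, batch_size),
            )
        ]
        if not ids:
            return total
        with conn:  # one short transaction per batch
            cursor = conn.execute(
                "UPDATE records SET owner_id = ? "
                "WHERE owner_id = ? AND id > ? AND id <= ?",
                (new_owner, old_owner, last_id, ids[-1]),
            )
        total += cursor.rowcount
        last_id = ids[-1]
        time.sleep(pause_seconds)  # optional breathing room between batches


# Demo data: 1200 rows owned by "alice" interleaved with 300 owned by "bob".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, owner_id TEXT)")
conn.execute("CREATE INDEX idx_records_owner ON records (owner_id)")
conn.executemany(
    "INSERT INTO records (owner_id) VALUES (?)",
    [("bob" if i % 5 == 0 else "alice",) for i in range(1500)],
)
conn.commit()

moved = transfer_ownership(conn, "alice", "carol", batch_size=500)
print(moved)
```

Each batch touches a bounded number of rows and commits on its own, so the work can be stopped between batches without leaving a long-running query behind. A non-zero `pause_seconds` trades total runtime for lower sustained load.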
Affected services
Updated
Nov 28 at 11:31am CET
Web App recovered.
Affected services
Created
Nov 28 at 11:17am CET
Web App went down.
Affected services