Event Store Cloud Ideas
Status Shipped
Categories Operations
Created by Alexey Zimarev
Created on Jan 4, 2024

Improve Cloud alerts

The current ES Cloud alerts about ESDB and node health can be misleading. For example, high CPU load and high memory consumption alerts are treated as critical. However, these conditions are often caused by temporary spikes in activity and can be considered normal. This already shows in practice, as such alerts are often closed soon after they open.

The proposal is to tune alerting: open lower-level alerts for CPU and memory load first, and promote them to high-level alerts only if the situation doesn't resolve within a set time.
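As a rough illustration of the idea, the promotion logic could look something like the sketch below. This is not how ES Cloud alerting is implemented; the class name `EscalatingAlert`, the threshold, the grace period, and the severity labels are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EscalatingAlert:
    """Hypothetical alert that starts low and is promoted only if it persists."""

    threshold: float       # assumed value, e.g. 0.9 for 90% CPU load
    grace_seconds: float   # assumed time a breach may last before promotion
    breach_started: Optional[float] = None

    def observe(self, value: float, now: float) -> str:
        """Return the severity for a metric sample taken at time `now` (seconds)."""
        if value < self.threshold:
            # The spike resolved on its own: close the alert and reset the timer.
            self.breach_started = None
            return "ok"
        if self.breach_started is None:
            # First sample above the threshold: remember when the breach began.
            self.breach_started = now
        if now - self.breach_started >= self.grace_seconds:
            # The breach outlasted the grace period: promote to a high-level alert.
            return "critical"
        # Still within the grace period: keep it as a lower-level alert.
        return "warning"
```

With a five-minute grace period, a two-minute CPU spike would only ever produce a lower-level alert, while a load that stays high for five minutes or more would be promoted.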

A high number of "critical" alerts puts strain on customers, as they need to verify and act on every critical alert. The current situation causes people to mute and ignore noisy alerts, so they risk missing something really important, such as a full disk alert, which indicates that the cluster will go down if the issue isn't resolved. Unlike a low disk space alert, which indicates that the cluster will become dysfunctional if the alert is ignored, high CPU and memory load (as an example) only indicate potential performance degradation and do not lead to the cluster going down.

Therefore, it is a good idea to re-evaluate the severity of those alerts and make a proper distinction between situations that can lead to a cluster becoming defunct and cases where some intervention might be required on the application side but the cluster won't go down.
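One way to picture that distinction is to assign the initial severity from the expected impact rather than from the metric itself. The sketch below is only an illustration: the alert names, the `Impact` categories, and the severity labels are hypothetical and do not reflect actual ES Cloud alert identifiers.

```python
from enum import Enum


class Impact(Enum):
    CLUSTER_OUTAGE = "cluster_outage"                    # cluster goes down if ignored
    PERFORMANCE_DEGRADATION = "performance_degradation"  # slower, but still up


# Hypothetical alert names mapped to the impact described in this idea.
ALERT_IMPACT = {
    "disk_space_low": Impact.CLUSTER_OUTAGE,
    "cpu_load_high": Impact.PERFORMANCE_DEGRADATION,
    "memory_usage_high": Impact.PERFORMANCE_DEGRADATION,
}


def initial_severity(alert_name: str) -> str:
    """Only alerts that threaten cluster availability start as critical."""
    impact = ALERT_IMPACT[alert_name]
    return "critical" if impact is Impact.CLUSTER_OUTAGE else "warning"
```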