Below are five sources of alert fatigue that can keep your SRE team from attending to real issues.
Services that are no longer in use, decommissioned projects, and idle system components all generate irrelevant alerts. It is important to turn these alerts off at the source. They can trigger floods of notifications that jam your inbox, and they can come from the many tools and systems that were once used by the retired projects or services. Periodically auditing your infrastructure to find such sources and switching off their alerts helps relieve the SRE team's alert fatigue.
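A periodic audit of this kind can be partly automated. The sketch below is only an illustration: it assumes you can export your alert rules as a list of records with hypothetical `name` and `service` fields and that you keep an inventory of services that are still active. Neither format comes from a specific tool.

```python
def find_orphaned_alert_rules(alert_rules, active_services):
    """Return the names of alert rules whose service is not in the active inventory."""
    active = set(active_services)
    return [rule["name"] for rule in alert_rules if rule["service"] not in active]


# Hypothetical export of alert rules and a hypothetical service inventory.
rules = [
    {"name": "checkout-5xx-rate", "service": "checkout"},
    {"name": "legacy-batch-disk-full", "service": "legacy-batch"},
    {"name": "search-latency-p99", "service": "search"},
]
orphans = find_orphaned_alert_rules(rules, ["checkout", "search"])
print(orphans)
```

Rules flagged this way are candidates for review rather than automatic deletion, since an inventory can itself be out of date.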
Low-priority alerts
Some alerts only provide extra context for preemptive incident management and are not directly related to the availability and performance of core systems. In day-to-day work these alerts add little value, but they can be recorded and used later to identify the root cause of incidents and events.
An alert is said to be flapping when it changes state multiple times in an hour. A specific event repeatedly triggers the same alert within a short time span, distracting your SRE team. Although these alerts can indicate growing problems in your systems, their notifications pile up and often hide important, unrelated issues.
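One way to spot flapping is to count state changes inside a sliding one-hour window. The following is a minimal sketch under stated assumptions: the `(timestamp, state)` history format, the function name, and the threshold of three changes are all illustrative choices, not values taken from any monitoring product.

```python
from datetime import datetime, timedelta


def is_flapping(transitions, now, window=timedelta(hours=1), max_changes=3):
    """Return True if the alert changed state at least max_changes times in the window.

    transitions: list of (timestamp, state) tuples sorted by timestamp.
    The threshold of 3 is an assumed value; tune it for your environment.
    """
    cutoff = now - window
    changes = 0
    previous_state = None
    for timestamp, state in transitions:
        # Count a change only when the state differs and it happened inside the window.
        if previous_state is not None and state != previous_state and timestamp >= cutoff:
            changes += 1
        previous_state = state
    return changes >= max_changes


now = datetime(2024, 1, 1, 12, 0)
noisy = [
    (now - timedelta(minutes=50), "firing"),
    (now - timedelta(minutes=40), "resolved"),
    (now - timedelta(minutes=30), "firing"),
    (now - timedelta(minutes=20), "resolved"),
    (now - timedelta(minutes=10), "firing"),
]
steady = [(now - timedelta(minutes=50), "firing")]
print(is_flapping(noisy, now), is_flapping(steady, now))
```

An alert identified as flapping can then be held back or summarized into a single notification until it settles.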
Duplicate alerts, or repeated alerts for the same event, distract the SRE team. They are usually the result of a faulty alert configuration in the monitoring system.
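Until the configuration is fixed, duplicates can be collapsed before they reach the team. The sketch below groups alerts by a fingerprint built from a few fields. The field names (`service`, `name`, `severity`) are assumptions made for illustration, and a real pipeline would pick whichever fields identify "the same event" in its own data.

```python
def deduplicate(alerts, key_fields=("service", "name", "severity")):
    """Collapse alerts sharing the same fingerprint into one entry with a count."""
    seen = {}
    for alert in alerts:
        fingerprint = tuple(alert.get(field) for field in key_fields)
        if fingerprint in seen:
            seen[fingerprint]["count"] += 1
        else:
            # Keep the first occurrence and start counting repeats.
            seen[fingerprint] = {**alert, "count": 1}
    return list(seen.values())


incoming = [
    {"service": "checkout", "name": "high-error-rate", "severity": "critical"},
    {"service": "checkout", "name": "high-error-rate", "severity": "critical"},
    {"service": "search", "name": "slow-queries", "severity": "warning"},
]
unique_alerts = deduplicate(incoming)
print(unique_alerts)
```

Keeping the count alongside the surviving alert preserves the information that the event repeated without sending a separate notification each time.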
To determine whether an alert is useful to the SRE team, assess it against four parameters:
- Arrival on time: Check whether the alert arrived promptly or too long after the event to be useful.
- Delivery: Check whether the alert was routed to the team or person responsible for solving the problem.
- Alert description: Assess whether the description clearly explained the incident and the resolution steps to take, or whether it was generic and unhelpful.
- Actionable: Assess whether the team or engineer actually worked on the alert or merely acknowledged it.
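The four parameters above lend themselves to a simple review checklist. Here is a hedged sketch of one: the `AlertReview` record, its field names, and the five-minute delay limit are hypothetical and only show how a post-incident review might be recorded.

```python
from dataclasses import dataclass


@dataclass
class AlertReview:
    delay_seconds: float       # time between the event and the alert arriving
    routed_to_owner: bool      # reached the team responsible for the fix
    description_helpful: bool  # described the incident and resolution steps
    acted_on: bool             # someone worked on it, not just acknowledged it


def review_failures(review, max_delay_seconds=300):
    """Return the parameters the alert failed; an empty list means it passed all four.

    max_delay_seconds is an assumed limit, not a recommended standard.
    """
    failures = []
    if review.delay_seconds > max_delay_seconds:
        failures.append("arrival on time")
    if not review.routed_to_owner:
        failures.append("delivery")
    if not review.description_helpful:
        failures.append("alert description")
    if not review.acted_on:
        failures.append("actionable")
    return failures


late_and_vague = AlertReview(delay_seconds=600, routed_to_owner=True,
                             description_helpful=False, acted_on=True)
print(review_failures(late_and_vague))
```

Alerts that repeatedly fail the same parameter are good candidates for tuning or removal.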
To work effectively, the SRE team can segregate alerts into three categories:
- Reactive alerts: These are generally SLA-based alerts, triggered when your business objectives are at immediate risk.
- Proactive alerts: These are triggered when your business goals may be at risk in the future.
- Investigative alerts: These are triggered to help ward off immediate risks, system failures, and threats to some future business objectives.
Unactionable alerts create a lot of noise and lead to burnout among your SRE engineers.
Whatever the alerting system, an SRE team needs to ensure that all systems and processes are monitored as a whole across the four golden signals: latency, traffic, errors, and saturation.
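A quick coverage check can show which services are missing one of these signals. The sketch below assumes a hypothetical mapping from service name to the signals it is currently monitored for. How you build that mapping depends on your own tooling.

```python
GOLDEN_SIGNALS = {"latency", "traffic", "errors", "saturation"}


def missing_signals(monitored):
    """Map each service to the golden signals it lacks; fully covered services are omitted."""
    gaps = {}
    for service, signals in monitored.items():
        missing = sorted(GOLDEN_SIGNALS - set(signals))
        if missing:
            gaps[service] = missing
    return gaps


# Hypothetical monitoring inventory.
coverage = {
    "checkout": ["latency", "traffic", "errors", "saturation"],
    "search": ["latency", "errors"],
}
gaps = missing_signals(coverage)
print(gaps)
```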