Good metrics and alerting systems are designed from the top down, not bottom up.
Lots of metrics are typically available, but almost all of them are noise.
Start with the business: what is important to the business? What kinds of failures are existential threats?
Then work your way down and design your metrics and alerts, instead of just throwing stuff at the wall.
I’ve had to push back so many times with teams whose manager at one point said “we need better monitoring / alerting” and who interpreted that to mean more metrics and alerts.
This is rarely the case.
I personally am really fond of just using a few alerts. The important thing is to know that something went wrong, not necessarily where, why, or how it went wrong.
And yes, inertia is real, and false or worthless alerts need to be killed immediately, without remorse. They are SRE’s cancer.
I agree that alerts should be just the vital ones. But in terms of monitoring and metrics, more is generally better. I joined a company where, when something broke, the only way to figure out what was wrong was to SSH in and hop through several services. It was a massive waste of time for something that basic OTel instrumentation would have made trivial to narrow down.
If you receive too many emails, alerts, warnings, and so on, you are only training yourself and the team to ignore them.
As you say, few is better. And a well chosen few.
Very few alerts, implemented around core business logic, incorporating as many edge cases as possible. This is the way.
My solution to this is to have leveled alerts. Some are... recommendations, the ones you look at at a glance to get a heads-up that something may be wrong. These are the ones OP would most likely claim cause alert fatigue.
Then I have a second level of this, the superpanic. This is the "true" alert, which means "drop all things, fix this now". Every superpanic comes with stricter routines that intentionally cause friction, such as creating tickets about it and potentially hosting post-mortems. This additional manual labour encourages tweaking the superpanic thresholds so that they are sometimes more lax, sometimes stricter, depending on the quality of the deployed services and the current load.
What signals a superpanic? Key valuable functionality being offline. Mostly off-site uptime checkers verifying that all primary domains resolve and serve traffic, plus cron-scheduled integration tests of core functionality. Stuff like that.
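For concreteness, a minimal Python sketch of that kind of off-site check; the domains, timeout, and paging hook are placeholders for whatever you actually run:

```python
# Minimal sketch of an off-site "superpanic" check: each primary domain must
# resolve and serve traffic, otherwise someone gets paged. Domains, timeout,
# and the paging hook are placeholder assumptions.
import socket
import urllib.request

PRIMARY_DOMAINS = ["example.com", "app.example.com"]  # hypothetical

def domain_is_up(domain: str, timeout: float = 5.0) -> bool:
    try:
        socket.getaddrinfo(domain, 443)  # does it resolve?
        with urllib.request.urlopen(f"https://{domain}/", timeout=timeout) as resp:
            return resp.status < 500      # does it serve traffic?
    except Exception:
        return False

def page_on_call(message: str) -> None:
    print(message)  # stand-in for your PagerDuty/Opsgenie/etc. integration

def run_superpanic_check() -> None:
    down = [d for d in PRIMARY_DOMAINS if not domain_is_up(d)]
    if down:
        page_on_call(f"SUPERPANIC: primary domains down: {down}")

if __name__ == "__main__":
    run_superpanic_check()
```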
> there are stricter routines which intentionally cause friction, such as creating tickets
While this sounds sensible, in my experience it often becomes just a convoluted punishment for the people involved when the alert fires. In general, people are lazy (sorry), and if an alert makes them fill out post-mortem forms and attend mandatory late meetings with management about why it triggered, 99% of people will push to remove the alert altogether, or at least lower its priority. I haven't found a solution that doesn't involve a complete overhaul of the organization.
For prior art on how to define alert conditions, see:
https://en.wikipedia.org/wiki/Nelson_rules
https://en.wikipedia.org/wiki/Western_Electric_rules
https://en.wikipedia.org/wiki/Westgard_rules
Now we use purely statistical measures, which requires a probabilistic model. The name of the game is calibration.
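To make that concrete, a rough Python sketch of two of the classic rules (one point beyond three sigma, and a long run on one side of the mean); the baseline window and run length are the textbook defaults, adjust as needed:

```python
# Sketch of two classic control-chart rules applied to a metric stream.
# `baseline` is a window of known-good samples; thresholds are the textbook
# defaults and purely illustrative.
from statistics import mean, stdev

def rule_beyond_three_sigma(values, baseline):
    """Nelson/Western Electric rule 1: any point more than 3 sigma from the mean."""
    mu, sigma = mean(baseline), stdev(baseline)
    return [v for v in values if abs(v - mu) > 3 * sigma]

def rule_long_run_one_side(values, baseline, run_length=9):
    """Nelson rule 2: `run_length` consecutive points on the same side of the mean."""
    mu = mean(baseline)
    run, last_side = 0, 0
    for v in values:
        if v == mu:
            run, last_side = 0, 0
            continue
        side = 1 if v > mu else -1
        run = run + 1 if side == last_side else 1
        last_side = side
        if run >= run_length:
            return True
    return False
```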
In my opinion the best method to reduce alerts is to work hard to get rid of the underlying problems or turn them into non-problems. If you do a good job, most remaining errors are third-party driven, which can indeed be hard to solve given company politics. But at that point you can always tell your boss how it could be solved and that you won't go on pager duty for stuff that is out of your control.
Depends on what you are monitoring, but let's assume an API endpoint. Collect and monitor the RED metrics with detailed dimensions, in combination with blackbox monitoring that simulates client transactions as realistically as possible, and alert only on those two types.
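Roughly, the instrumentation side can look like this (Python with the Prometheus client; metric names and label dimensions are just illustrative):

```python
# Sketch of RED (Rate, Errors, Duration) instrumentation for one endpoint
# using the Prometheus Python client. Metric names and label dimensions are
# illustrative assumptions, not a prescribed schema.
import time
from prometheus_client import Counter, Histogram

REQUESTS = Counter("api_requests_total", "Requests seen (rate)",
                   ["endpoint", "method", "status"])
ERRORS = Counter("api_errors_total", "Failed requests (errors)",
                 ["endpoint", "method"])
DURATION = Histogram("api_request_duration_seconds", "Request latency (duration)",
                     ["endpoint", "method"])

def handle_request(endpoint: str, method: str, handler):
    start = time.monotonic()
    try:
        status = handler()
    except Exception:
        ERRORS.labels(endpoint, method).inc()
        status = 500
        raise
    finally:
        REQUESTS.labels(endpoint, method, str(status)).inc()
        DURATION.labels(endpoint, method).observe(time.monotonic() - start)
    return status
```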
When one of those alerts fires, trigger a battery of diagnostic checks which you have collected over time to pinpoint the cause.
What if the diagnostic checks don't reveal the issue? There is still value: you know these are not the reason, so no time is wasted re-evaluating them. Where to get these diagnostic checks from? Well, what's the first thing responding engineers do? Open the CLI and troubleshoot. Those are your diagnostic checks. Collect them, automate them, capture the domain-specific knowledge, and democratize it.
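A sketch of what that battery can look like; the specific commands are placeholders for whatever your responders actually type:

```python
# Sketch of a diagnostic battery: the CLI commands responders would run by
# hand, captured and automated. Every command and service name here is a
# placeholder; substitute your own.
import subprocess

DIAGNOSTIC_CHECKS = {
    "disk": ["df", "-h"],
    "service_status": ["systemctl", "status", "my-api"],  # hypothetical unit name
    "upstream_health": ["curl", "-sS", "-m", "5", "https://upstream.internal/health"],  # hypothetical URL
}

def run_diagnostics() -> dict:
    """Run every collected check and return its output, pass or fail."""
    results = {}
    for name, cmd in DIAGNOSTIC_CHECKS.items():
        try:
            out = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
            results[name] = f"exit={out.returncode}\n{out.stdout}{out.stderr}"
        except Exception as exc:  # missing binary, timeout, etc.
            results[name] = f"check failed to run: {exc}"
    return results

# Attach run_diagnostics() output to the firing alert so responders start with
# the usual checks already answered, even when none of them turns out to be the cause.
```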
> Alerts should be actionable. If no action can or should be taken, then the alert is not needed.
Also, the best alerts come from looking at actual failures you had and not trying to make up "good alerts" from thin air. After you have an outage, figure out what alerts would have caught it, and implement those.
This is one category of good alerts, but not everything.
I think alerts are to ops what tests are to dev. You have "unit alerts" for some small thing like the disk usage on a single host, "integration alerts" like literally "does the page load?", and then what you describe are "regression alerts": trying to prevent something that went wrong once from going wrong again. These are great, but just as you wouldn't want a test suite that is 100% regression tests, I think it's also smart to try to get ahead of failures and have some common-sense alerts defined.
"looking at actual failures you had "
Also look at failures others had; prior experience from yourself and others contributes to good alerts. You don't have to wait for a failure to implement most of them. Most of that knowledge is also trained into most LLMs nowadays: just ask, verify the sources, then implement. If you get too many alerts, question whether you needed them or whether it's noise. It's constant trimming until you find the perfect alert setup.
I know something is going to happen if disk space runs out; I don't need to experience it first.
Sure, but for every alert there is an exception.
Elasticsearch, for example, can be configured with ILM policies to let the disk fill up and then start deleting old records. I don't need to be woken up for the disk filling up on those nodes.
Even worse is CPU/RAM alerts.
The number of times I've had to explain how the JVM heap works...
Alerts are for when things don't go as expected. You set up log rotation but an agent quietly breaks it or ES introduces a bug in it.
Not all alerts are created equal. You should generally have three levels of alerts: critical (pages somebody; time-to-fix should be ASAP), warning (creates a ticket; time-to-fix should be within a few days), and suspicious (does not notify; appears only on an alert dashboard). The suspicious alerts are there to help guide your investigation during a critical or warning alert.
Each critical and warning alert should link to an "interactive runbook" - a dashboard that combines text instructions along with graphs showing real-time data.
Doing this at scale, correctly, requires both alerts-as-code and dashboards-as-code, which almost nobody does because nobody treats higher-level configuration languages (jsonnet, CUE...) with the attention and respect they deserve /cries-in-yaml
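For the shape of it, a small Python sketch of alert definitions as code with those three levels and mandatory runbook links; the fields, queries, and URLs are made up, and in practice this is the sort of thing jsonnet or CUE would render into your backend's rule format:

```python
# Sketch of "alerts-as-code" with three severity levels and required runbook
# links for the ones that notify humans. Field names, example queries, and
# runbook URLs are all hypothetical.
from dataclasses import dataclass
from enum import Enum

class Level(Enum):
    CRITICAL = "critical"      # pages somebody, fix ASAP
    WARNING = "warning"        # creates a ticket, fix within days
    SUSPICIOUS = "suspicious"  # dashboard only, no notification

@dataclass
class Alert:
    name: str
    expr: str                        # query expression for the alerting backend
    level: Level
    runbook_url: str | None = None   # required for critical/warning

    def to_rule(self) -> dict:
        rule = {"alert": self.name, "expr": self.expr,
                "labels": {"severity": self.level.value}}
        if self.level in (Level.CRITICAL, Level.WARNING):
            assert self.runbook_url, f"{self.name}: critical/warning alerts need a runbook"
            rule["annotations"] = {"runbook": self.runbook_url}
        return rule

ALERTS = [
    Alert("CheckoutErrorRateHigh",
          'rate(api_errors_total{endpoint="/checkout"}[5m]) > 0.05',
          Level.CRITICAL, "https://dashboards.internal/runbooks/checkout-errors"),  # hypothetical
    Alert("DiskFillingWithinADay",
          "predict_linear(node_filesystem_avail_bytes[6h], 86400) < 0",
          Level.WARNING, "https://dashboards.internal/runbooks/disk"),  # hypothetical
]
```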
I certainly agree in spirit that the alerts are important and should be actionable. But I wouldn't start by just "looking at the service" and then trying to define the first set of alerts.
Instead I would move up a level and start with an SLO for the various "business level" metrics you might care about: things like "request latency", "successful requests", etc.
Then use the longer-lookahead "error budget" burndowns to see where your error budget is being spent, and from there decide (1) whether the SLO needs adjusting, and/or (2) whether an alert is appropriate.
To cleanly answer those questions and iterate you'll need metrics, dashboards, traces, and logs. So then you're not just making dashboards because "it's best practice"; you're creating them specifically to help you measure whether you're meeting your stated service objectives.
https://sre.google/sre-book/service-level-objectives/
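The budget math itself is small; a sketch assuming a 99.9% availability SLO over a 30-day window:

```python
# Sketch of error-budget / burn-rate math for an availability SLO.
# The SLO target and the 30-day window are illustrative assumptions.
SLO_TARGET = 0.999           # 99.9% of requests succeed
ERROR_BUDGET = 1 - SLO_TARGET

def burn_rate(errors: int, total: int) -> float:
    """How fast the budget is being spent: 1.0 = exactly on budget,
    >1.0 = burning faster than the SLO allows."""
    ratio = errors / total if total else 0.0
    return ratio / ERROR_BUDGET

# Example: 120 failed requests out of 50,000 in the last hour.
# Error ratio = 0.0024 against a 0.001 budget -> burn rate 2.4x, meaning the
# 30-day budget would be exhausted in about 30 / 2.4 ≈ 12.5 days at this pace.
print(burn_rate(120, 50_000))
```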
SLO windows are usually 7d, 30d, etc., no? They also often don't work that well for backend services, in my experience: they can't give you the reactivity that alerts defined directly on the things you care about give you. In those cases I'd argue for moving in that direction, upwards from alerts, to figure out what to aggregate and define SLOs around, rather than the other way around.
I used to believe in alert fatigue, because you’re frequently told to repeat the line: if you have too many alerts, eventually everyone will stop paying attention to them.
I have tons of alerts at work. They go to specialized Slack channels that I can look at if I need to. We have on-call escalation paths for the critical ones and housekeeping duties for the ones that require engineers to perform a maintenance task. We have the hell channels that are 99.99% flapping, if you ever need that.
I find that observability in general has an extremely linear marginal reward curve: it basically always justifies the effort you put into setting it up.
I like the ideas, but either it’s entirely LLM written or the writer has internalized “LLM voice”. At this point that is more distracting than helpful.
The prompt: "take existing decades-old knowledge about best practices in setting up alerts and spin it into a multi-page article presenting it as somehow novel, to assist our submarine marketing efforts".
Do we really need this comment for every article? Who cares if AI wrote it if you like the ideas?
The issue is, there’s not a lot of meat in this article. Anyone who’s done any amount of SRE can articulate alert fatigue perfectly well in far fewer words.
Yet the article doesn’t tackle the hard part at all: making alerts that are actually meaningful. They handwave it instead of giving actual advice. This post is a good intro, but I didn’t "walk away" with anything useful.
This is why, in this case, whether it’s AI matters. Someone puts in the effort to write a short article (if a bit wordy) that can be used by e.g. beginners or managers? Good! I’m not the target audience. But if it’s the output of AI, what’s the intent?
The article is a marketing page under a "Winning with us" section, right next to a "CEO Page" that describes a CEO pitch. I really don't think this article is very different from thousands of others like it that were published before AI.
Huh, thanks, I failed to see that on my phone. Into the trash the article goes, then.
I care.
Nobody cares that you care, because you're not adding anything to the conversation.
I’m answering a direct question posed by the parent post. What’s your excuse?
I wrote the parent post. It was a rhetorical question.
But it did not get the answer you expected. So a failed rhetorical question.
> The real core of infrastructure monitoring isn’t dashboards. It’s the alerts.
“it’s not X it’s Y”
At this point, when I see this pattern in writing I assume most if not all of it is AI generated; same with em-dashes.
This is not to discount the idea that alerts are more important than dashboards (I work directly in observability) - but just to say that I personally shut off reading anything else with these patterns because, generally speaking, the rest of the content is just not original or interesting.
I think that this is sad, because it is a useful pattern.
It is very common to find things that a majority of people wrongly believe are X when in fact they are Y.
In such cases, you must point out to them that "it's not X, it's Y".
There are a few alternative ways to formulate this, but the alternatives are typically longer and more complex.
The same happens with em-dashes, which have valid uses and one should not care that there exist some people who are not familiar with the classic ways of using punctuation.
I do not believe that the right solution is to use more convoluted expressions or inappropriate punctuation just to avoid being accused of being a clanker.
I'm not sure in this case it's AI per se so much as a change over time.
At the first role I ever had 10+ years ago, we had a TV in our team's office space constantly showing our dashboard for our critical services and health. We still had alerting monitors but it felt like those alarms were for important issues (like sev-2 or worse).
In the last couple of roles I've had, we don't constantly look at our dashboards unless our monitors keep ringing us with alerts. We have also had more monitors in general than at that first role. Occasionally, if another team asks whether we're affected by something, we'll look at the dashboards we have to make sure we don't have a monitoring gap.
I work writing analytics and monitoring for industrial equipment. We have hundreds of sensors sending back realtime data.
There was a period of time when people were writing alerts for the sake of it (i.e., we have this sensor, so when should we alert on it?).
Nowadays we're strictly failure-mode driven, which has meant lots of sensors aren't used in the analytics. They are, however, available for the experts to plot for a more holistic view if required.
I have some thoughts here.
I work for a startup; we have what I think is a fairly typical setup: metrics ingested from a variety of sources, fed into industry-standard metrics/dashboard solutions, triggering escalations to humans. It's fine and I'm happy we have it, but...
The highest value source of alerting right now is one of our growth marketers who pays close attention to our CRM and product analytics tool and notices when key product funnels are underperforming.
Our next highest value signals are a handful of ad hoc alerting channels, mostly in Slack, either directly from a partner telling us that something suspicious happened on their side (think: fraud) or from in-product instrumentation sent to a channel for non-engineering visibility. Members of our business/product/operations team pay attention in these places and make decisions based on their business context.
After that, our support team is increasingly able to filter customer issues and differentiate between bugs, missing features, etc.
I know someone is going to argue that these are all a sign that we haven't instrumented the right things. Fair, but also misses the point. The decision makers in these flows don't (and won't) live in traditional alerting systems and wouldn't have helped us understand breakages without these other, ad hoc processes.
My theory is that it's relatively easy to offer a technical product that moves alerts around or that manages escalation paths. It's quite hard to design a product that surfaces detail to a non-technical expert and makes it easy to build systematic rules.