You Don't Have an Alerting Problem, You Have an Ownership Problem

March 2026 · 7 min read

Every team I've talked to thinks they have an alert fatigue problem. Too many alerts, too much noise, on-call is miserable. The usual fix is to "tune the alerts" - raise some thresholds, silence some monitors, maybe consolidate a few notification channels.

It helps for about two weeks. Then you're right back where you started.

That's because alert fatigue is a symptom. The actual problem is that nobody owns the monitors. Not in a "we should assign owners" process way - in a "this specific alert fired and nobody on the team knows why it exists or what to do about it" way.

The ghost monitors

Here's a scenario that plays out at basically every company. An engineer joins, works on a service for a year, sets up a bunch of monitors for it. Good thresholds, good notification channels, good runbooks. Then they leave.

Six months later, those monitors are still running. The thresholds are based on traffic patterns from a year ago. The notification channel goes to a Slack room that the new team barely checks. The runbook references infrastructure that got migrated to Kubernetes three months ago.

But the monitors still fire. And every time one does, someone on the team glances at it, doesn't recognize it, assumes it's noise, and ignores it. That's not alert fatigue. That's an orphaned monitor doing exactly what it was told to do, with nobody around who understands it.

I've seen teams with 200+ monitors where fewer than half had anyone who could explain what the threshold was or why it was set there. The rest were ghosts - still firing, still paging, but functionally invisible.

Copy-paste monitoring

This is the other big one. New service gets spun up, someone looks at how the last service was monitored, copies the configs, changes the service name, and ships it. Done. Monitoring is "set up."

Except the thresholds are wrong. The original service handles 10x the traffic, so its error rate threshold of 5% makes sense. The new service gets maybe 100 requests a day - 5% is 5 errors total. That's one bad deploy, not an outage. But the monitor fires anyway, and now you've trained the team to ignore it.
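
To make the arithmetic concrete, here's a minimal sketch of that trap. The numbers and the minimum-request guard are hypothetical - this isn't a config from any particular monitoring product, just the reasoning spelled out:

```python
# Minimal sketch of why a copied 5% error-rate threshold behaves differently
# at different traffic levels. All numbers here are hypothetical.

ERROR_RATE_THRESHOLD = 0.05   # copied straight from the original service
MIN_REQUESTS = 1_000          # guard: don't evaluate the rate on tiny samples

def should_alert(errors: int, requests: int) -> bool:
    """Alert only when the error rate is high AND the sample is big enough."""
    if requests < MIN_REQUESTS:
        return False  # a handful of errors on low traffic is a bad deploy, not an outage
    return errors / requests > ERROR_RATE_THRESHOLD

# Original service: around a million requests per window, so 5% means something.
print(should_alert(errors=60_000, requests=1_000_000))  # True  -> page someone
# New service: ~100 requests a day. 6 errors would trip the copied 5% threshold,
# but the guard suppresses it.
print(should_alert(errors=6, requests=100))             # False -> sample too small
```

The guard isn't really the point. The point is that the copied threshold encodes assumptions about traffic that nobody re-checked when the config was pasted in.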

The worse version of this is when someone copies a monitor that was already misconfigured. Now you've got the same bad threshold propagating across services like a virus, and every instance of it is teaching people that alerts don't matter.

Why nobody fixes them

Here's the thing - everyone on the team knows which alerts are noisy. They'll tell you if you ask. "Oh yeah, that one fires every Tuesday during the batch job, just ignore it." Or "that threshold is way too low, it's been on my list to fix."

But it never gets fixed because it's nobody's job. There's no "monitor owner" field in Datadog. There's no quarterly review where someone asks "does this alert still make sense?" There's no process at all. Monitors get created and then they just... exist forever.
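
You don't need a product feature to start closing that gap. Here's a rough sketch of the kind of check a recurring review could run, assuming you adopt a team:<name> tag convention and can export your monitor definitions somehow - the field names, monitor names, and team roster below are made up for illustration:

```python
# Hedged sketch: flag monitors with no owner, assuming a "team:<name>" tag
# convention and monitor definitions exported as plain dicts. Everything
# here (field names, teams, monitors) is illustrative.

CURRENT_TEAMS = {"payments", "platform", "search"}

monitors = [
    {"name": "checkout error rate", "tags": ["team:payments", "env:prod"]},
    {"name": "legacy queue depth",  "tags": ["env:prod"]},          # no owner at all
    {"name": "cart p99 latency",    "tags": ["team:storefront"]},   # team no longer exists
]

def owner_of(monitor: dict) -> str | None:
    """Return the team name from a team:<name> tag, if there is one."""
    for tag in monitor.get("tags", []):
        if tag.startswith("team:"):
            return tag.removeprefix("team:")
    return None

for m in monitors:
    owner = owner_of(m)
    if owner is None:
        print(f"ORPHANED: {m['name']} has no team tag")
    elif owner not in CURRENT_TEAMS:
        print(f"STALE OWNER: {m['name']} points at '{owner}', which no longer exists")
```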

The "not my alert" problem

Even when alerts do fire for real issues, ownership matters. I've watched incident channels where an alert comes in and the first five minutes are just people figuring out who should look at it. The on-call person sees it, but it's for a service they don't work on. They ping the team that owns the service, but that team's on-call doesn't have access to the monitoring dashboard. By the time someone with the right context is actually looking at the problem, you've burned ten minutes.

This happens because the alert routing was set up based on who was available, not who could actually fix the issue. The escalation policy says "page the platform team," but the platform team doesn't own the application logic that's failing. They can see the CPU is fine, the database is fine, and they close the alert. Meanwhile, the checkout flow is broken.

Ownership means the person who gets paged can actually do something about it. If they can't, the alert is routed wrong, and that's a configuration problem, not a volume problem.
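
As a sketch of what "routed right" means in practice, here's a hypothetical service-to-owner map consulted before anyone gets paged. The service names and rotations are illustrative, not a real escalation config:

```python
# Hedged sketch: route an alert by who owns the failing service, not by who
# happens to hold the generic pager. Names are placeholders.

SERVICE_OWNERS = {
    "checkout":    "payments-oncall",
    "search-api":  "search-oncall",
    "k8s-cluster": "platform-oncall",  # infra-level alerts still go to platform
}

DEFAULT_ESCALATION = "platform-oncall"  # the catch-all that causes "not my alert"

def route(alert_service: str) -> str:
    target = SERVICE_OWNERS.get(alert_service)
    if target is None:
        # Falling back to a generic rotation is exactly the failure mode above:
        # surface the gap so it gets fixed instead of silently paging platform.
        print(f"WARNING: no owner mapped for '{alert_service}', using default")
        return DEFAULT_ESCALATION
    return target

print(route("checkout"))     # payments-oncall: someone who can fix the application logic
print(route("cron-runner"))  # warns about the missing mapping, then platform-oncall
```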

What ownership actually looks like

This isn't complicated. It's just not something most teams think about explicitly. Every monitor has a team that owns it, that team can explain why the threshold is set where it is and what to do when it fires, and the alert routes to someone on that team who can actually act on it. And if nobody will claim a monitor, treat that as the answer: it's a candidate for deletion.

The "delete it and see" approach

I know turning off monitors sounds scary. But think about it this way - if an alert fires and the team's response is consistently "ignore it," it's already functionally deleted. It's just still making noise. Turning it off just makes the reality official.

The real risk isn't deleting a monitor. It's keeping a hundred monitors that train your team to ignore alerts. Because when a real one fires, it looks exactly like all the noise they've learned to tune out.
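
If outright deleting still feels too risky, a middle ground is to mute first and delete later. Here's a hedged sketch of that quarantine step - the monitor names, dates, and 30-day quiet period are all placeholders you'd tune to your own comfort level:

```python
# Hedged sketch of "delete it and see" with a safety net: mute the monitor,
# record when, and delete only after a quiet period with nobody asking for it.
# The quarantine store and monitor names are illustrative.

from datetime import date, timedelta

QUIET_PERIOD = timedelta(days=30)

# monitor name -> date it was muted
quarantine = {
    "legacy queue depth": date(2026, 1, 15),
    "cart p99 latency":   date(2026, 2, 28),
}

def ready_to_delete(muted_on: date, today: date | None = None) -> bool:
    """A muted monitor nobody has missed for QUIET_PERIOD is safe to remove."""
    today = today or date.today()
    return today - muted_on >= QUIET_PERIOD

for name, muted_on in quarantine.items():
    verdict = "delete it" if ready_to_delete(muted_on, today=date(2026, 3, 10)) else "keep waiting"
    print(f"{name}: muted {muted_on}, {verdict}")
```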

Making ownership stick

The hardest part isn't setting up ownership - it's maintaining it. People change teams, services get handed off, reorgs happen. The monitors don't update themselves.

Some things that help:

- Tag every monitor with an owning team, not an individual. People leave; teams mostly don't.
- Put a recurring review on the calendar where someone asks "does this alert still make sense?" for every monitor, and deletes the ones nobody claims.
- Treat monitors as part of the handoff when a service changes teams, the same way you'd hand off the code and the pager.

This is part of why we built Cova the way we did - it doesn't just check that monitors exist, it checks whether they're configured properly, whether escalation policies have valid responders, and whether alerts actually route to someone who can act on them. The ownership gaps show up as coverage findings alongside the technical ones.

But even without tooling, just asking "who owns this monitor and what do they do when it fires?" will surface most of the problems. The answer is usually silence, followed by someone admitting they've been ignoring it for months.

Fix the ownership and the noise fixes itself. Not the other way around.

See which of your monitors are orphaned, misconfigured, or routing to the wrong team.

Try Cova - Free

Related: How to Audit Your Monitoring Stack - a practical checklist for finding blind spots before the next incident.