41% of Teams Still Learn About Outages From Their Customers

March 2026 · 10 min read

New Relic just published their 2025 Observability Forecast - a survey of 1,700 IT and engineering leaders across 20+ countries. It's the kind of report that's mostly written for enterprise buyers, but buried in there are some numbers that should make anyone running production systems deeply uncomfortable.

Let's skip the marketing language and look at what the data actually says about the state of monitoring in 2025.

The headline numbers

73% of organizations lack full-stack observability - leaving broad segments of their infrastructure and applications without comprehensive monitoring.

Let that sink in. Nearly three out of four organizations don't have complete visibility into their own systems. Not "could be better" visibility - actual blind spots where things can break and nobody will know until it's too late.

$2M per hour - the median cost of a high-impact outage for organizations without full-stack observability. That drops to $1M/hour for those that have it.

The cost difference between having full observability and not having it is literally double. And that $2M figure is the median - half of organizations are losing more than that per hour of downtime.

41% of leaders reported they still learn about service interruptions from customer complaints, incident tickets, or manual checks.

This is the one that hits hardest. Four in ten organizations are finding out about outages from their users. In 2025. With all the monitoring tools available today. Something is clearly broken, and it's not the tools themselves.

The tool sprawl trap

Here's where it gets interesting. The average organization runs 4.4 observability tools. That's actually down 27% from 6 tools in 2023 - teams are consolidating. But 4.4 tools is still a lot of places where configuration can drift, alerts can go silent, and coverage gaps can hide.

The report says 36% cite complex technology stacks as their top challenge, with 29% pointing to too many monitoring tools or siloed data. These are two sides of the same coin - every tool has its own configuration, its own alerting rules, its own way of defining "healthy".

And 52% of organizations plan to consolidate onto unified observability platforms in the next 12-24 months. The industry knows tool sprawl is a problem. But consolidation takes time, and in the meantime those 4.4 tools need to actually work together.

The gap between tools isn't a future problem to solve with consolidation - it's a present problem that's causing outages right now.

A third of engineering time goes to firefighting

This one's painful: 33% of engineering time is spent on reactive tasks and firefighting. Another 33% goes to maintenance and technical debt. That means roughly two thirds of your engineering team's time is spent on things that aren't building new features.

The report frames this as a productivity problem (which it is), but it's also a retention problem. Nobody went into engineering to spend their days chasing alerts at 3am. The top benefit practitioners cited from better observability was reduced alert fatigue at 59% - ahead of faster troubleshooting (58%) and better collaboration (52%).

Engineers don't just want better tools. They want to stop being woken up for things that could have been caught - or prevented - earlier.

Full-stack observability actually works (when you have it)

The data makes a strong case that comprehensive monitoring coverage pays off:

Outage frequency: 23% of teams with full-stack observability experience weekly high-impact outages vs 40% without
Detection speed: 28 minutes average MTTD with FSO vs 35 minutes without
Cost: $1M/hour vs $2M/hour for high-impact outages
ROI: 75% of businesses report positive returns, with 18% seeing 3-10x ROI

The 7-minute difference in detection time might not sound like much, but at $33,333 per minute of downtime, those 7 minutes cost roughly $233K. Per incident.

So why does the gap persist?

If the ROI is this clear, why are 73% of organizations still operating without full-stack observability? Based on the data, a few reasons:

1. Nobody owns monitoring holistically. Different teams own different tools. The SRE team manages PagerDuty, the platform team manages Datadog, the application team uses Sentry. Nobody has a cross-cutting view of whether everything is actually covered.

2. Coverage drifts silently. You set up monitoring for a service, it works fine for months, then someone changes a metric name during a migration, or a team member leaves and their PagerDuty schedule goes stale. The monitoring looks like it's there, but it's not actually doing anything.

3. New services ship without monitoring. The report mentions that software deployments cause 28% of high-impact outages. When teams move fast, monitoring is the thing that gets skipped. "We'll add it later" is one of the most expensive sentences in engineering.

4. "We have monitoring" isn't the same as "we have coverage." You can have Datadog, PagerDuty, Grafana, Sentry, and New Relic all running and still have endpoints with no checks, escalation policies with gaps, and dashboards referencing metrics that no longer exist. Having tools and having coverage are very different things.

What the report doesn't cover

The report is thorough on the macro picture, but it focuses on adoption rates and business outcomes. What it doesn't get into is the specific, practical question: how do you actually find the gaps?

Knowing that 73% of organizations lack full-stack observability is useful context. But what an engineering team actually needs to know is: which of our 40 endpoints have no monitoring? Which PagerDuty escalation policies have stale schedules? Which Datadog monitors haven't fired in 90 days - are they working or is the threshold wrong?

That's the layer between "we need better observability" and "here's specifically what's broken." It's also the layer that's hardest to do manually, especially across 4+ tools.

This is the problem Cova was built to solve - it connects to your existing monitoring tools and audits them for exactly these kinds of gaps. Not by replacing anything, but by telling you what's missing, what's stale, and what needs attention. It also scans PRs so new endpoints don't ship without coverage in the first place.

The takeaway

The data from 1,700 organizations paints a clear picture: most teams have monitoring tools but don't have monitoring coverage. The gap between "we use Datadog" and "all our critical services are actually monitored" is where the $2M/hour outages live.

If you want the full report, New Relic published it as the 2025 Observability Forecast. Worth reading, especially the sections on tool consolidation and AI-assisted troubleshooting.

And if you want to actually find the gaps in your setup - try the Cova demo. No signup, runs in your browser. It walks through a sample monitoring stack and shows you the kind of blind spots that 73% of organizations are sitting on right now.

See what 73% of teams are missing

Try Cova - Free

Related: How to Audit Your Monitoring Stack (Before the Next Incident Does It for You) - a practical checklist for finding blind spots.