- As telecom switches to software, reliability stops being a given and becomes a casualty
- Software velocity is outstripping carrier engineering teams’ ability to manage their infrastructure
- The LAN–WAN boundary is no more, replaced by a unified data utility where failures can cascade system-wide
When my phone stopped working last month during the latest calamitous network outage to hit the US, I assumed it was that pesky Amazon cloud POP in Northern Virginia playing up again. After all, it was AWS’s Virginia data centre that triggered three of the last big network failures in North America.
But no! This time it was Verizon’s telecom network that was the root cause.
Get ready for more of this kind of thing.
The industry’s rapid pivot from bespoke hardware to software-first telecom architectures—dominated by AI, Cloud-Native Network Functions (CNFs), Kubernetes, Open RAN and 5G-Advanced network slicing—is justifiably lauded for its promise to reduce OPEX, improve utilization, and streamline operations at scale.
But there is a price to be paid for this pace of change. The telco software revolution is operating at the absolute bleeding edge of technology, introducing a level of complexity into carrier networks previously seen only in nuclear submarines and ASML’s extreme ultraviolet (EUV) lithography machines.
Debugging, but no disclosure
Verizon hasn’t disclosed what caused the outage, aside from saying it was “a software issue.” (Clearly, there is a separate column to be written around the efficacy of a U.S. regulatory environment that permits Tier 1 carriers to obscure responsibility for critical infrastructure failures rather than providing rapid, substantive disclosure of what failed and why, but let’s stay on topic.)
Industry sources suggest the problem originated in the codebase for Verizon’s 5G standalone (5G SA) core network. That doesn’t really narrow things down, given that its 5G core is provided by an ecosystem of no fewer than five vendors — Casa Systems, Ericsson, Nokia, Oracle, Red Hat/OpenShift — running alongside Verizon’s own software.
In other words, this is a textbook example of the challenge facing the entire telecom industry. Carrier engineering teams that once managed hardware from a single supplier now operate across multiple OEMs, cloud platforms and overlapping generations of mobile technology — from 2G through 5G — layered on top of a 5G standard so vast that, if printed on A4 paper, it would reach the height of a six-storey building.
Team effort
The problem here does not lie with the engineering teams at Verizon or at other Tier 1 carriers such as AT&T, Vodafone, NTT, Deutsche Telekom and Orange.
Their technical staff are in an elite class whose peers are aerospace engineers or those responsible for maintaining national power grids — roles that involve high-wire acts of equal magnitude, where failures are public, costly and potentially fatal. It is not as though better engineering teams are waiting in the wings elsewhere.
The issue is structural. These teams are now facing technology headwinds that did not exist a decade ago, and are expected to operate software systems whose complexity is increasing faster than any organization’s ability to realistically understand, test, or control them.
The precise cause of the problem on Verizon’s network may not be known, but its impact is. Like all carriers, Verizon aspires to five nines (99.999%) reliability, which has been the gold standard for the telecom industry since the 1970s and equates to just 5.26 minutes of downtime per year. But Verizon’s core outage lasted 10 hours, which collapses that metric to two nines and change. That’s bad airport WiFi or AWS Virginia cloud POP-level availability.
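For readers who like to check the arithmetic, here is a minimal back-of-the-envelope sketch. It is purely illustrative, assuming a 365-day year and a single 10-hour outage, and is not drawn from Verizon’s own reporting:

```python
# Back-of-the-envelope check of the "nines" arithmetic (illustrative only;
# assumes a 365-day year and a single 10-hour outage).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

five_nines_budget = MINUTES_PER_YEAR * (1 - 0.99999)
print(f"Five nines allows {five_nines_budget:.2f} minutes of downtime per year")
# -> 5.26 minutes

outage_minutes = 10 * 60
availability = 1 - outage_minutes / MINUTES_PER_YEAR
print(f"A 10-hour outage caps annual availability at {availability:.3%}")
# -> 99.886%, i.e. two nines and change
```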
The price of failure
Outages like the one last month remain aberrations in the telecom world, but any erosion in reliability carries disproportionate consequences for the global economy.
At the same time as carrier networks become more software-defined and operationally complex, the traditional telco edge is moving outward — into industrial domains that historically operated in air-gapped OT silos, insulated from public network failures.
With the shift to Industry 4.0, carrier infrastructure will play a new role: underpinning real-time automation, control systems, and machine-to-machine coordination across manufacturing, energy, transport, and logistics. Outages that were once a survivable “I can’t post to TikTok” inconvenience become systemic and economic risks when operational data is pivotal and tightly coupled to physical processes. Industry 4.0 cannot function without carrier networks — and it cannot tolerate declining reliability as complexity increases.
Take a wild guess
This is an existential issue, particularly in the U.S., where hyperscalers are adding a new AI-driven uncertainty layer to the reliability risk matrix. Amazon Web Services, Microsoft Azure, Google Cloud, IBM Cloud, Oracle and Alibaba Cloud are no longer positioning LLMs as standalone answer engines. Instead, they are reframing them as probabilistic reasoning and orchestration layers that sit upstream of deterministic, sector-specific systems — including telecom networks and other industrial environments.
Across telecom, healthcare, finance, supply chains, energy, defense and transport, LLMs are increasingly tasked with interpreting unstructured data, generating scenarios, prioritizing actions and coordinating workflows, while deterministic systems are left to execute. Hyperscalers describe this as a hybrid model. In practice, it introduces a decision-shaping layer whose outputs are probabilistic, opaque, and increasingly trusted — upstream of systems that cannot afford to be wrong downstream.
In that world, five-nines network reliability becomes irrelevant if operational decisions are being guided by systems that can sound authoritative, persuasive and wrong at the same time.
Could do better
This all has serious implications for carriers. The traditional network separation between LAN and WAN, which helped isolate failures to one domain, no longer exists. Today’s networks are ubiquitous critical information utilities where a failure of any element can impact the rest of the digital biome.
The situation has left me wondering whether there is a silver lining to the existential shifts currently sweeping over the communications industry. Is it possible that, as carrier reliability erodes, cloud hyperscalers, which typically operate at three-nines reliability levels, could inject some telco discipline into their operating standards to raise their fault-tolerance game? Could aggregate uptime on interleaved cloud and telco networks perhaps meet in the middle, at four nines, say?
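As a rough illustration of why that would require the cloud side to genuinely improve rather than the numbers simply averaging out, availability composes multiplicatively when one network depends on the other. The three-nines and five-nines figures below are assumptions for the sketch, not published SLAs:

```python
# Illustrative composition of availability figures; the three-nines and
# five-nines values are assumptions, not published SLAs.
telco = 0.99999   # five-nines carrier core (assumed)
cloud = 0.999     # three-nines cloud layer (assumed)

# A serial dependency multiplies availabilities, so the aggregate is
# slightly worse than the weakest link.
print(f"Serial dependency today: {telco * cloud:.5%}")         # ~99.899%

# Only if the cloud layer itself improves does the aggregate move up.
print(f"With a four-nines cloud layer: {telco * 0.9999:.5%}")  # ~99.989%
```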
At first blush, that might seem unlikely. Hyperscalers are known for their move-fast-and-break-things culture, which is ill-suited when the “things” being broken are SLAs, KPIs, and QoS guarantees.
But a recent interview with the head of Google’s fiber network, John Keib, made it apparent that the hyperscaler has not only made room for telco-class oversight, but has embraced it as a core value.
Asked whether Google Fiber was racing toward Level 5 autonomy, Keib, who has serious telco chops, was disarmingly pragmatic: “I don’t know if we ever get to five, to be honest with you.”
Coming from a division of a hyperscaler, that restraint is both surprising and encouraging — an acknowledgement that, in networks bound by SLAs and physical infrastructure, full autonomy is not a virtue in itself, and that human oversight remains essential.
It gives me hope that ultra-reliable networks may not be a thing of the past.
Steve Saunders is a British-born communications analyst, investor and digital media entrepreneur with a career spanning decades.
Opinion pieces from industry experts, analysts or our editorial staff do not necessarily represent the opinions of Fierce Network.