Network management for dummies: analyzing a total outage

A major network operator's customers, including emergency services and banks, were offline for 26 hours. Two years later, the cause has been revealed.

Antennas on a mast (Image: Daniel AJ Sokolov)

This article was originally published in German and has been automatically translated.

Total outage at Rogers: On the morning of July 8, 2022, twelve million customers of the Canadian telecom market leader are suddenly offline, mobile and landline users alike. For 26 hours they cannot make calls, transfer data or even reach emergency services. Stores cannot sell anything because the cash registers are down. ATMs are out of action, as are the banks' transfer systems. The damage to Canada's economy runs into the billions. The Canadian government sounded the alarm and commissioned an investigation.

Two years later, the government agency CRTC (Canadian Radio-television and Telecommunications Commission) publishes a summary of a report by the telecom consulting firm Xona Partners. (The full version is currently being cleared of business secrets and will be published later, the authority told heise online.) The document puts its finger on three sore points at once: resilience, change management and crisis management. Laypeople will be amazed, and even experts will shake their heads.

From a technical point of view, one prerequisite for the total failure was that Rogers operated a single converged IP core network for landline telephony, Internet and mobile communications. This core network was essential for routing data within Rogers' own systems, for exchanging data with other network operators and for connecting to the public Internet. Unified ("converged") core networks for mobile and fixed-line services are common in the industry because they are powerful and cheaper to operate, but they also form a single point of failure. And so the collapse of the core network took down all of the Canadian market leader's telecom services at once.

In the weeks leading up to July 8, 2022, Rogers worked on a seven-phase upgrade of its IP core network. The network technicians completed the first five phases without incident. On July 8, phase six was due: an update of the distribution routers that handle traffic between the users (the access layer) and the core network. One of the tasks of the distribution routers is to decide, based on predefined rules (an access control list), which routing data to forward and how.
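How such a filter shapes what the core gets to see can be illustrated with a small sketch. The Python toy below uses invented names and prefixes, not Rogers' actual configuration: only routing entries that the list explicitly permits are announced towards the core.

```python
# Minimal sketch of a distribution router's route filter. All names and
# prefixes are invented for illustration; this is not Rogers' configuration.
from ipaddress import ip_network

def advertise_to_core(learned_routes, route_filter):
    """Pass on only the routing entries that the filter explicitly permits."""
    return [route for route in learned_routes if route in route_filter]

# Routes learned on the access side of a distribution router.
learned = [ip_network(p) for p in ("10.20.0.0/16", "10.20.5.0/24", "192.0.2.0/24")]

# The access control list: only this aggregate may be announced to the core.
allowed = {ip_network("10.20.0.0/16")}

print(advertise_to_core(learned, allowed))  # [IPv4Network('10.20.0.0/16')]
```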

In the course of the update, Rogers made a grave mistake: the access control lists were simply deleted. As a result, the distribution routers passed an unrestricted flood of routing information (instructions for processing IP data packets) on to the routers in the core network. There, a predefined quantity limit is supposed to protect the routers from overload: if more routing data arrives than they can process, they are supposed to discard the excess. Unfortunately, Rogers' network lacked this limit. The core routers simply relied on the distribution routers not sending too much routing data. When the distribution routers unleashed an avalanche on the core network anyway, the core routers collapsed under the load within minutes.
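Roughly why the missing limit mattered can be sketched as follows. This is a deliberately simplified toy with invented numbers, not Rogers' router software: a core router with a configured maximum-prefix style limit discards the excess and survives, while one without any limit accepts routes until its control plane gives out.

```python
# Toy model of the missing quantity limit on the core routers.
# Capacities, limits and route names are invented for illustration.

class CoreRouter:
    def __init__(self, capacity, max_prefixes=None):
        self.capacity = capacity          # routes the control plane can hold at most
        self.max_prefixes = max_prefixes  # protective limit, if one is configured
        self.table = []
        self.alive = True

    def receive_routes(self, routes):
        for route in routes:
            if self.max_prefixes is not None and len(self.table) >= self.max_prefixes:
                break                     # protected: discard the excess instead of dying
            self.table.append(route)
            if len(self.table) > self.capacity:
                self.alive = False        # control plane overloaded: the router goes down
                return

# The avalanche the distribution routers sent once the filter was gone.
flood = [f"route-{i}" for i in range(5000)]

protected = CoreRouter(capacity=1000, max_prefixes=800)
unprotected = CoreRouter(capacity=1000)   # the situation in Rogers' core

protected.receive_routes(flood)
unprotected.receive_routes(flood)
print(protected.alive, unprotected.alive)  # True False
```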

Everything came to a standstill.

The deletion of the access control lists was a well-meaning but ill-advised attempt to tidy up the configuration of the distribution routers. "Change management, which includes prior review of the parameters to be changed, failed to identify this error," the report states.

During the preparation phase, Rogers had classified the seven-phase update of the core network as a high-risk undertaking. However, after everything had gone well phase by phase, the risk assessment algorithm gave the sixth phase a "low risk" rating. As a result, the employees were not required to exercise any particular caution: they did not have to obtain approval from higher management or subject the changes to laboratory tests before rolling them out in the production system.
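The summary does not spell out how the risk score was calculated. Purely to illustrate the effect it describes, the hypothetical sketch below shows how a naive score that rewards successful earlier phases can drift below the threshold that triggers lab tests and management sign-off; the weights and threshold are invented, not taken from the report.

```python
# Hypothetical risk scoring, for illustration only. The decay factor and
# threshold are invented; the report does not disclose the actual rules.

def phase_risk(base_risk, successful_prior_phases, decay=0.15):
    """Naive scoring: every completed phase lowers the perceived risk."""
    return base_risk * (1 - decay) ** successful_prior_phases

HIGH_RISK_THRESHOLD = 0.5  # above this: lab tests and management approval required

for phase in range(1, 8):
    score = phase_risk(base_risk=1.0, successful_prior_phases=phase - 1)
    label = "high" if score >= HIGH_RISK_THRESHOLD else "low"
    print(f"phase {phase}: score {score:.2f} -> {label} risk")

# With these invented numbers, phase 6 is the first one rated "low",
# so the extra safeguards are skipped exactly when they are needed most.
```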

The damage was done, and the network technicians had to repair it as quickly as possible. This is exactly what a management network is for: a separate maintenance network that gives designated employees access to the routers so they can service them, find and fix faults and, if necessary, reboot the devices. Even, and especially, when the network carrying the actual traffic is down.

Or so you would think. Rogers' management network was set up in such a way that it, too, depended on the IP core network. As a result, the network technicians could not reach the crashed routers remotely. Nor was there a redundant connection via external data lines; Rogers relied entirely on its own lines. This delayed recovery considerably, because employees had to drive out to the routers in person.
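The design flaw boils down to a simple dependency question: does the path used to manage a router survive a failure of the production core? The sketch below uses an invented two-router topology purely to illustrate the difference between in-band management and a genuinely separate, out-of-band path.

```python
# Hypothetical topology for illustration; names are invented.
# A management path that runs over the production IP core shares its fate.

management_paths = {
    "core-router-1": ["mgmt-station", "ip-core", "core-router-1"],   # in-band
    "core-router-2": ["mgmt-station", "oob-line", "core-router-2"],  # out-of-band
}

def manageable_during_outage(router, paths, failed_component="ip-core"):
    """A router stays reachable only if its management path avoids the failed core."""
    return failed_component not in paths[router]

for router in management_paths:
    print(router, "reachable:", manageable_during_outage(router, management_paths))

# core-router-1 reachable: False  -> technicians have to drive to the site
# core-router-2 reachable: True   -> a separate external line keeps access alive
```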

At first, however, the technicians did not even know what was going on, or why, because Rogers' own mobile network was not working. Compounding the problem, key employees did not carry SIM cards from other network operators with which they could have communicated with one another in an emergency. Such precautions have been standard practice in critical-infrastructure industries for decades, and not just in the telecom sector. At Rogers, however, messengers first had to be sent out with SIM cards to reach the employees responsible for crisis management and damage repair. That cost yet more valuable time.

The consequences were severe. For 14 (fourteen) hours, the network technicians had no access to the log files, so they could not find out why the network was down in the first place. To make matters worse, several configuration changes had been made that day, and it was therefore initially unclear which one had caused the breakdown. Suspicion first fell on a change that was not actually responsible; rolling back this innocent change accordingly did not help, and yet more valuable time was wasted. Only when the actual error was found could the employees work through their recovery procedures correctly and bring the network back up.

Incidentally, the radio network itself was not affected. Rogers customers' cell phones therefore showed the usual reception, but they could not do anything with it: without the core network, no traffic got through. SMS, phone calls, data – nothing worked. The mobile base stations kept broadcasting their useless signal.

Unfortunately, this situation had a nasty side effect: because the devices were still receiving a signal from the Rogers network, they did not even try to register with other mobile networks. They would not have been able to make regular calls or transfer data there, but at least emergency calls would have been possible via other operators' networks. To place an emergency call, customers would have had to remove their SIM cards or deactivate their eSIMs, and very few consumers are aware of this.