Network management for dummies: analyzing a total outage

Page 2: The ten recommendations

The authors of the report have drawn up ten recommendations for telecoms network operators:

  1. Protect routers against overload
  2. Physically and logically separate the management network from the user data network
  3. Backup connections through other network operators for important network areas
  4. Check configuration changes before implementation, involving various departments (engineering, operations, project management) and, in the case of critical infrastructure, the suppliers
  5. Laboratory tests under realistic conditions for configuration changes before implementation
  6. Not too many changes at once
  7. Automatic reversal of failed changes
  8. Avoid alarm fatigue (configuration changes should only trigger alarms if they are important changes)
  9. Give employees SIM cards from other networks so that they can be reached in the event of a network failure
  10. Simulate network failures and practise remedial measures
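
Recommendation 7 (automatic reversal of failed changes) can be illustrated with a minimal sketch. The report summary does not describe a concrete mechanism, so everything here is an assumption made for illustration: the configuration is modeled as a plain dictionary, and the names `apply_with_rollback`, `management_reachable` and the key `mgmt_access` are hypothetical.

```python
from typing import Callable, Dict

Config = Dict[str, str]


def apply_with_rollback(
    running: Config,
    change: Config,
    health_check: Callable[[Config], bool],
) -> Config:
    """Return the changed config if it passes the health check, else the old one."""
    snapshot = dict(running)           # copy of the known-good state
    candidate = {**running, **change}  # config after the proposed change
    if health_check(candidate):
        return candidate
    return snapshot                    # automatic reversal of the failed change


def management_reachable(config: Config) -> bool:
    # Hypothetical post-change check: management access must stay enabled.
    return config.get("mgmt_access") == "on"


running = {"mgmt_access": "on", "mtu": "1500"}
ok = apply_with_rollback(running, {"mtu": "9000"}, management_reachable)
bad = apply_with_rollback(running, {"mgmt_access": "off"}, management_reachable)
```

In this sketch the first change is kept and the second is reverted to the previous state. On real equipment, the health check would have to run against the live device within a time limit rather than against a dictionary.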

Unfortunately, several simultaneous trends in the telecoms industry are undermining the reliability and resilience of networks. "These include the development towards network platforms in the cloud, virtualization and softwarization of networks, increasing use of artificial intelligence for automatic network configuration, preparation for IT security in the age of quantum computers (post-quantum security) and the convergence of terrestrial and other networks," states Xona Partners, which derives additional recommendations for technology and process optimization from this:

Technical recommendations:

  • Satellites in low Earth orbit should serve as a backup connection and enable emergency calls via a direct connection to standard smartphones.
  • The 3GPP standardization association is working on provisions for mobile roaming in the event of a disaster; network operators should prepare for their implementation.
  • Network operators should consider providing apps as an alternative to SMS/MMS or phone calls, including for emergency calls. This would help in the event of certain systems failing.
  • eSIMs are programmable; network operators should therefore use this option to enable roaming in competitors' networks in the event of outages.
  • If a competitor network takes over in the event of a network failure, it can become overloaded. New approaches such as the shared use of network capacities and the activation of frequency spectrum reserved for emergencies can help here.
  • In addition, cooperation (practiced in advance) with content delivery networks (CDN) and large multimedia providers (YouTube, Netflix, etc.) can help to reduce data volumes through dynamic traffic management in the event of network problems.
  • (More) redundancy when connecting critical infrastructure

At process level, responses to disruptions should be practiced in order to uncover weaknesses in plans and training. This also includes the collection of key performance indicators (KPIs) and a clear allocation of roles within the workforce. Network operators should calculate in advance what financial impact network outages could have on them. This helps to provide appropriate resources and ultimately protect their image and financial stability. During a network outage, providers should inform the public how to make emergency calls and receive warning messages.

Rogers Networks has taken a number of measures. The routers are now protected against being overloaded with IP routing data. There is now also a separate management network with redundant connections from independent network operators.
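
How such overload protection might work in principle is sketched below. This is not Rogers' implementation, which the report summary does not detail. It is a simplified model, assuming that a router caps the number of route prefixes it accepts and rejects the rest instead of exhausting its memory. The class `RouteTable` and the limit of three prefixes are hypothetical.

```python
class RouteTable:
    """Toy routing table that refuses new prefixes beyond a fixed limit."""

    def __init__(self, max_prefixes: int) -> None:
        self.max_prefixes = max_prefixes
        self.routes = set()   # accepted prefixes
        self.rejected = 0     # prefixes refused because of the limit

    def offer(self, prefix: str) -> bool:
        """Try to install a prefix; return True if it is in the table afterwards."""
        if prefix in self.routes:
            return True                        # already known, nothing to do
        if len(self.routes) >= self.max_prefixes:
            self.rejected += 1                 # limit reached: drop, do not crash
            return False
        self.routes.add(prefix)
        return True


table = RouteTable(max_prefixes=3)
flood = ["10.0.0.0/8", "10.1.0.0/16", "10.2.0.0/16", "10.3.0.0/16", "10.4.0.0/16"]
results = [table.offer(p) for p in flood]
```

With the hypothetical limit of three, the first three prefixes are installed and the remaining two are counted as rejected. Production routers offer comparable limits per routing session; whether to drop excess routes or shut down the session is a policy decision this sketch leaves out.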

Change management has been revised as well. A new algorithm is meant to assess risks better, cooperation between the company's teams is said to have improved, configuration changes are to be assessed by new software and tested in the laboratory before they are deployed, and there are new procedures for introducing new hardware.

In addition, the network operator has revised its incident response playbooks. They now take into account a wider range of possible failure scenarios, define responsibilities more clearly, provide for automatic reversion to the previous configuration when a change fails, and differentiate the priority of automatically triggered alarms. Finally, all employees responsible for incident response and crisis management now have redundant telecommunications services from other network operators.
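
Differentiating alarm priorities could look roughly like the following sketch. It assumes three priority levels and a rule that only alarms at or above a threshold page the on-call staff. The levels, the `Alarm` fields and the function `alarms_to_page` are illustrative assumptions, not taken from the report.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class Priority(IntEnum):
    INFO = 1
    WARNING = 2
    CRITICAL = 3


@dataclass
class Alarm:
    source: str
    message: str
    priority: Priority


def alarms_to_page(
    alarms: List[Alarm], threshold: Priority = Priority.CRITICAL
) -> List[Alarm]:
    """Keep only alarms important enough to interrupt on-call staff."""
    return [a for a in alarms if a.priority >= threshold]


# Hypothetical alarms raised during a maintenance window.
alarms = [
    Alarm("core-router-1", "description field changed", Priority.INFO),
    Alarm("core-router-1", "routing policy changed", Priority.CRITICAL),
    Alarm("edge-router-7", "interface flapped once", Priority.WARNING),
]
paged = alarms_to_page(alarms)
```

Lower-priority alarms are still available for later review in this model. They simply do not page anyone, which is one way to counter the alarm fatigue named in recommendation 8.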

Rogers is also setting up a separate core network for mobile communications to reduce the risk of both fixed and mobile networks going offline at the same time. This project has not yet been completed.

The authors of the report believe that these measures are sufficient to improve resilience and reliability and prevent a repeat of the outage in July 2022. Nevertheless, they have additional suggestions:

Emergency roaming with other mobile networks should be tested under more scenarios. In principle, Rogers customers can already make emergency calls via other networks. However, if the Rogers radio network is working but the transmission network is not (as happened in July 2022), this only helps insiders, because the cell phones do not automatically connect to the other networks.

In preparation for future disruptions, Rogers is to develop a process for detailed analysis to better identify the effects, cause(s) and remedial measures. Rogers will then share its findings with other network operators so that they too can better prepare themselves. Tests of configuration changes are to become stricter and more comprehensive. More testing tools are needed for this, especially as network technology is constantly evolving. Rogers is to expand its incident management exercises and provide its customers with better information on how to make emergency calls in the event of a network disruption.

(ds)