Analysis abounds as to why the Optus network failed last year, but as with any event of this kind, the lessons are the most important takeaway.
For organisations preparing to set up their own telecommunications platform, the 2023 Optus outage provides an opportunity to revisit and refine the essential steps in establishing a reliable and secure network.
Phil Martell, Head of Strategic Network Development at Vocus, said major outages highlight the need for robust network resilience strategies at all Australian companies.
“We’re seeing industry-wide discussion on what’s needed to build networks that can cope with failures and keep going despite them – something that didn’t always get enough focus previously,” Martell said. “Organisations and their telco suppliers need to work hand-in-glove to plan networks with that level of resilience engineered into them.”
The starting point for any organisation planning a platform is to create a comprehensive strategy.
“Understand the organisational functions you have, how critical they are to ongoing success and how they depend on communications, because a lot of people actually don’t follow that chain all the way through,” Martell said. “Having an articulated understanding of this really allows an organisation to focus on implementing networks.
“Organisations are often big and it’s not always easy for everybody to understand everything. So, having a clear, articulated strategy that you can flow down to people really helps make it clear, and then doing that dependency analysis is really critical.”
A communications provider can’t do that for you, Martell added. The strategy needs to be developed in-house.
Expert telecommunications commentator Associate Professor Mark Gregory FIEAust of RMIT University’s School of Engineering said compliance is also a priority for anyone creating a network.
“There are risk requirements set by the regulator that have to be met,” he said. “That’s the first thing that anyone looking to set up any sort of telecommunications network should look at: what are the regulatory and compliance outcomes that have to be achieved?”
Map and segment
Gregory said one of the main causes of outages is a management network that is not separated from the user network. Martell framed this distinction as IT versus operational technology (OT).
“Often they’re handled quite differently within organisations,” he said. “Different people look after IT and OT, but often the links are bought for both. So you’ve got, for example, a mission-critical facility and an email system and one link serves both.”
Looking at requirements that way, Martell said, usually shows that some sites need more resilient connections than others. While the first response is often to add a backup connection, he said the focus should be on resilience, which means diversity to avoid a single point of failure, also known as shared fate.
Diversity is not only about being able to isolate segments of the network; it’s also about using different technologies for different functions. Wireless, cable and satellite all have different requirements and capabilities.
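As a rough illustration of the shared-fate idea described above, the sketch below checks whether any two links serving a site depend on the same physical element. The link names, ducts, exchanges and carriers are all hypothetical, and a real audit would draw on the provider’s actual route and infrastructure data.

```python
from itertools import combinations

# Hypothetical inventory: each link lists the physical elements it relies on.
LINKS = {
    "primary_fibre": {"exchange_A", "duct_7", "carrier_X"},
    "backup_fibre": {"exchange_B", "duct_7", "carrier_Y"},
    "satellite": {"ground_station_1", "carrier_Z"},
}


def shared_fate(links):
    """Return {(link_a, link_b): shared elements} for every pair that overlaps."""
    overlaps = {}
    for name_a, name_b in combinations(sorted(links), 2):
        common = links[name_a] & links[name_b]
        if common:
            overlaps[(name_a, name_b)] = common
    return overlaps


if __name__ == "__main__":
    for pair, common in shared_fate(LINKS).items():
        print(pair, "share", sorted(common))
```

In this made-up inventory the two fibre links run through the same duct, so a single cut could take out both, while the satellite link shares nothing with either.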
Martell said to consider the layers of the network and the layers of technology that can be used to increase site resilience. It’s about thinking broadly.
“When you design your networks, you’ve got to really think about each of the characteristics of those technologies and solutions, and design your redundancy plan around ‘what if?’” Martell said.
“For example, most organisations run 10 Gbps up to 100 Gbps now. If you’ve set up your fibre to run at 100 Gbps and you end up dropping to 200 Mbps because the fibre has been cut, it’s actually really difficult.
“Did you put in place traffic management so that the really important traffic, like that stuff that keeps the factory going, was identified? To exploit this, your redundancy plan has to understand the resources and platforms that are going to be available to it. It’s about understanding what your networks look like in that fault condition.”
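One way to reason about the fault condition Martell describes is to work out in advance which traffic would still fit on the fallback link. The sketch below uses simple strict-priority allocation; the flow names, priorities and demand figures are illustrative assumptions, not figures from any real network, and production equipment would apply its own queuing policies.

```python
def allocate(capacity_mbps, flows):
    """Strict-priority allocation of link capacity.

    flows is a list of (name, priority, demand_mbps) tuples;
    a lower priority number is served first.
    """
    remaining = capacity_mbps
    plan = {}
    for name, _priority, demand in sorted(flows, key=lambda f: f[1]):
        granted = min(demand, remaining)
        plan[name] = granted
        remaining -= granted
    return plan


# Hypothetical traffic classes and demands (in Mbps).
FLOWS = [
    ("email", 3, 500),
    ("factory_control", 1, 50),
    ("video_conferencing", 2, 400),
]

if __name__ == "__main__":
    print("normal:", allocate(100_000, FLOWS))
    print("fault: ", allocate(200, FLOWS))
```

Under these assumptions, the factory traffic keeps its full allocation on the smaller link, video is squeezed and email gets nothing until the primary link is restored.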
It’s also about having the right tools in place to detect a catastrophic failure, and having the critical recovery processes documented so they can be implemented immediately.
Gregory used Optus as an example: the company suffered a cascade of systems taking themselves off the network to protect themselves from an overload of messages related to the rebuild of Border Gateway Protocol (BGP) routing information.
He said the lesson is that organisations should have systems that monitor BGP itself.
“BGP is one of the original protocols that was created at the very beginning of the internet,” Gregory said. “There’s less than a handful of the original protocols that we still use today; it’s not like BGP is something new.”
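As a minimal sketch of what monitoring BGP could look like in principle, the example below counts routing update messages in a sliding time window and flags when the count exceeds a threshold. It does not describe Optus’s systems or any vendor’s tooling: the class name, window and threshold are hypothetical, and the timestamps are assumed to arrive in non-decreasing order from some external feed.

```python
from collections import deque


class UpdateRateMonitor:
    """Sliding-window counter for routing update messages (illustrative only)."""

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._timestamps = deque()

    def record(self, timestamp):
        """Record one update; return True if the window count exceeds the threshold."""
        self._timestamps.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop updates that have fallen out of the window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) > self.threshold


if __name__ == "__main__":
    monitor = UpdateRateMonitor(window_seconds=10, threshold=3)
    for t in [0, 1, 2, 3, 20]:
        print(t, monitor.record(t))
```

An alert like this would ideally fire well before equipment reaches its own protective limits, giving operators time to act on a documented process.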
He also recommended that organisations building new networks consider using open systems and designing their business models around that.
“They need to look at new business models, not legacy business models, for how they’re going to actually conduct business,” Gregory said.
Invest in engineering
Martell believes that funding engineering through corporate investment programs and corporate staffing programs can sometimes look like a luxury.
“The issue really is that engineering can look very easy until it goes wrong,” he said. “Not every organisation needs an engineering culture, but when it is needed, having that process and that discipline is really quite important.
“I see sometimes that there’s a trend towards commoditisation that implies that the engineering discipline is not needed.
“Discipline really comes down to things like ISO standards and accreditation and the quality, and engineering organisations have done a great job over the years of developing standards such as ISO 27001, and you should follow them and use them to get the benefits.
“Unfortunately, in telco there’s sometimes a culture that it’s becoming a commodity of IT – and to some degree it is – but there are also areas where it isn’t. It’s sometimes easy to not understand the commitment to engineering that’s required.”
Gregory agreed that organisations need to be aware of the dangers of prioritising cost savings in a way that leads to a lack of funding for the engineering function.
“Organisations need to ensure that the engineering side of the business is not left destitute,” he said.
“I don’t think the engineering function is a priority for many organisations. The fact that Optus had to send engineers in to reboot the systems means that they didn’t have the ability to reboot them remotely, which is a capability that has been around for 30 years.
“That’s why for the last five years I’ve been saying we need minimum performance standards in the telco industry.”