Thursday, 7 October 2021

What Caused the Facebook Outage?


Firebrand's senior instructor and cyber security expert, Phil Chapman, has written a thought-provoking blog on the recent Facebook failure and its effect on all of their platforms.  

A lot of users, consumers and companies were hit by the recent Facebook outage in early October. Speculation ran high as to the cause, and a fair percentage of us most probably thought 'cyber attack' straight away.

Russians? Chinese? North Koreans? ISIS? Insulate Britain? The list is endless.

The motives behind attacks such as Denial of Service (DoS) are equally endless. Money, political gain, industrial espionage, terrorism and simple kudos.

The post-outage analysis will no doubt rumble on, but one thing caught my eye in the official statement from Facebook Engineering:

“People and businesses around the world rely on us every day to stay connected. We understand the impact that outages like these have on people’s lives, as well as our responsibility to keep people informed about disruptions to our services. 

We apologize to all those affected, and we’re working to understand more about what happened today so we can continue to make our infrastructure more resilient.”

And the word that leaps out at me is ‘resilient’.

In the world of ‘cyber’ we are often so embroiled in new technologies, emerging threats and their associated nerdy definitions that we focus only on the threat, forgetting one of the key parts of the whole risk equation: the impact.

The Facebook outage is just one example of what we all face, at every level, day-to-day on our networks. Ultimately it has a lot to do with the threat actors out there, but also with the vulnerabilities and, dare I say it, the fragilities of the networks we work with.

The first shout that went out after Facebook went down was “it’s a DNS problem”. This made me smile, as I remember from the early days of my career, long before I even understood what DNS was, that it always got the blame for everything on the network.

It’s almost like an IT Crowd response: rather than 'have you tried turning it off and on again?!', the answer was “It’s probably a DNS issue”. And it was often correct. DNS is only as good as the services around it and the administration put into it, like pretty much everything in the world of networking and IT.

Every network regardless of its shape and size is dependent upon the services, applications and more importantly the protocols which hold it together. Those of us that like to work in these areas understand the concepts of OSI and TCP/IP. 

But when you look at how old most of these protocols are and how modern and technologically advanced the systems are that rely on them, is it any wonder that occasionally something breaks?

It’s a bit like having a very modern, all-singing, all-dancing car with no petrol at the pumps. Technology relies on a lot of external forces to make it work; some of them are out of your control and some are as old as the hills. Break one element in the chain and it stops.

In the world of ‘Cyber’ and ‘Infosec’, this leads us to the more boring elements of resilience and redundancy, which allow us to put as big a tick as we can in the ‘Availability’ box.

Your systems can have the most Confidentiality and Integrity you can afford built in, and be as sophisticated as you like, but if the glue holding them together gives way then you have a problem.

Protocols and applications such as TCP, IP, BGP and DNS all work like the glue on a network. They remain transparent to us and we rely on them every day. But as soon as one of them goes wrong or is misconfigured the impact can be huge.
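
To put a rough number on the 'chain' idea, here is a minimal sketch in Python. It assumes every layer must be up for the service to work and that layers fail independently of one another. The availability figures and the layer names in `chain` are made up for illustration; they are not measurements of Facebook's network or of anyone else's.

```python
from math import prod


def serial_availability(components):
    """Availability of a chain in which every component must be up."""
    return prod(components)


def redundant_availability(single, copies):
    """Availability of `copies` independent replicas where any one is enough."""
    return 1 - (1 - single) ** copies


# Hypothetical per-layer availability figures, for illustration only.
chain = {"dns": 0.999, "bgp": 0.999, "tcp_ip": 0.9999, "application": 0.9995}

# The chain as a whole is less available than its weakest link.
overall = serial_availability(chain.values())
print(f"single chain:       {overall:.4%}")

# The same chain with the DNS layer doubled up (independent replicas assumed).
with_redundant_dns = dict(chain, dns=redundant_availability(chain["dns"], 2))
print(f"with redundant DNS: {serial_availability(with_redundant_dns.values()):.4%}")
```

The independence assumption is the catch. Redundancy helps when one box fails on its own, but a single misconfiguration pushed to every replica at once takes them all down together, which is why change management matters as much as spare hardware.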

The biggest threat to our networks is the Insider Threat, and by far the biggest type of Insider Threat is the accidental, non-malicious actor. A click on a link, letting in someone you don’t know, sending an attachment to the wrong addressee, pulling out the wrong cable: it's all far less sexy than blaming the Russians, Chinese etc. but far more likely.

So until I hear differently, I will go with internal error as it’s probably true.

The biggest task that Facebook has is not finding the culprit or resolving the problem, but learning from it and building some resilience into their change management processes.