Too Many Single Points Of Failure Threaten Our Digital Infrastructures — & They’re Multiplying

This article is more than 2 years old.

When your modem dies, so do you – professionally.  But that’s only one single point of failure (SPOF) in your personal technology infrastructure.  But what about your company?  How many does it have?  And the government?  The military?  OMG, there are thousands.  Redundancy is too expensive.  Hardware and software are too unstable.  What’s the answer?  There isn’t one.  Unfortunately, like so many of our computing and communications problems, SPOF problems may not be solvable.  In fact, given the complexity of our digital infrastructures, it may already be too late to “solve” anything.  We must learn to adapt to the complexity we’ve engineered into our personal and professional lives – and accept that we’ll all digitally die from time to time.  

What Happens When Your Modem Dies?

Just about everyone in the technology industry worked from home in 2020 for at least part of the year.  Many did so well into 2021, and many will work from home forever.  What happens to your productivity if your modem dies?  You probably use MS Office 365, Zoom, Webex or Teams, and you back up all of your work in someone else’s cloud – a cloud you do not manage or protect.  If your modem dies, or is hacked, and 10 restarts don’t do the trick, then you have to replace it.  You can wait for a new one to be sent to you or get in your car (assuming it’s not electric, or that you haven’t lost power) and drive to your provider to get a replacement.  But what if you live in a rural area, cannot drive, your car is broken, or it’s been snowing all night?  You scream out loud but no one can hear you.  You’ve lost a day’s work and maybe much more.  Your state of mind is horrible.  All of a sudden you hate everyone connected with your connectivity – especially your provider.

“They’re all idiots.”

“None of this stuff works when you need it most.” 

“Don’t talk to me!  I have to figure this out.” 

“I know, I’ll use my phone, but it’s out of power and now my power’s out.  Damn snowstorm.”

“I know I should have bought a generator.  How much does it cost?  Who even installs generators?  Do they really work?  Or maybe solar panels?  How quickly can they be installed?  I heard the government gives them away.”

“I should have kept my land line.”

“No one cares about the meetings I can’t make.  Or the report I can’t submit.  Or the test result I’ve been waiting for.  Or when I need to schedule my kids’ activities.  I’m off the grid, and, no, it doesn’t feel ‘liberating’ at all.  I’m all alone.” 

Venting when no one can hear you is unsatisfying.  Venting is all about sucking everyone into your vortex of digital frustration.  When you can’t do that, life really sucks.

Too much?  I don’t think so.  Your modem is just one single point of failure (SPOF).  You don’t keep a spare one around.  There’s no redundancy in your professional life support system.  You’re dead.  Until someone else saves you — when they – or you – get around to it.

So that’s your personal digital infrastructure.  How many single points of failure does it have?  Modems, desktops, laptops, phones, chargers, networks, access, wires – the list is pretty long.  If any of the components die (and you haven’t designed redundancy into your processes) – or are hacked – you’re out of the game.  They’re all single points of failure.  What about your company?  What about the government?  The military? Does anyone remember the GPS SPOF? (Fortunately, DARPA is working on adaptive satellite redundancy.)

Single Points of Failure

Systems engineers worry a lot about single points of failure, especially when there are lots of them:

“A single point of failure (SPOF) is a part of a system that, if it fails, will stop the entire system from working ... SPOFs are undesirable in any system with a goal of high availability or reliability, be it a business practice, software application, or other industrial system … the assessment of a potential SPOF involves identifying the critical components of a complex system that would provoke a total systems failure in case of malfunction.  Highly reliable systems should not rely on any such individual component.”
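The assessment described in that definition can be tried on a toy model.  What follows is a minimal sketch, not a real audit tool: it assumes you can draw your setup as a map of which component depends on which, and every component name in it (`wifi_router`, `phone_hotspot` and so on) is a made-up example.  It removes one component at a time and checks whether the laptop can still reach the Internet.  Any component whose removal cuts you off is a SPOF.

```python
from collections import deque

def reachable(graph, start, goal, removed):
    """Breadth-first search from start to goal, pretending `removed` is dead."""
    if removed in (start, goal):
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in graph.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def single_points_of_failure(graph, start, goal):
    """Return every intermediate component whose loss disconnects start from goal."""
    nodes = set(graph) | {n for nbrs in graph.values() for n in nbrs}
    candidates = nodes - {start, goal}
    return sorted(n for n in candidates if not reachable(graph, start, goal, n))

# Hypothetical home office with no backup path.
no_backup = {
    "laptop": ["wifi_router"],
    "wifi_router": ["modem"],
    "modem": ["isp"],
    "isp": ["internet"],
}

# Same office, plus a phone hotspot as a second route out.
with_hotspot = {
    "laptop": ["wifi_router", "phone_hotspot"],
    "wifi_router": ["modem"],
    "modem": ["isp"],
    "isp": ["internet"],
    "phone_hotspot": ["cell_network"],
    "cell_network": ["internet"],
}

print(single_points_of_failure(no_backup, "laptop", "internet"))
# ['isp', 'modem', 'wifi_router']
print(single_points_of_failure(with_hotspot, "laptop", "internet"))
# []
```

Note what the sketch leaves out: the laptop itself, the power feeding everything, and the person at the keyboard are all SPOFs that this simple map does not capture.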

Look around and count the number of points of failure.  Do it for your personal and professional digital infrastructures.  Research a little about how many single points of failure exist across the government infrastructure, especially the military.  Worse, understand the concept – and reality – of “cascading failures:”

“A cascading failure is a process in a system of interconnected parts in which the failure of one or few parts can trigger the failure of other parts and so on.  Such a failure may happen in many types of systems, including power transmission, computer networking, finance, transportation systems, organisms, the human body, and ecosystems.  Cascading failures may occur when one part of the system fails.  When this happens, other parts must then compensate for the failed component.  This in turn overloads these nodes, causing them to fail as well, prompting additional nodes to fail one after another.”
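The overload mechanism in that quote is easy to watch in miniature.  Here is a rough sketch under loudly stated assumptions: the node names, loads and capacities are invented, and the rule that a failed node’s load is split evenly among the survivors is a simplification I chose for illustration, not how any real grid or network rebalances.

```python
def simulate_cascade(loads, capacities, first_failure):
    """Return the order in which nodes fail after `first_failure` goes down.

    Assumption: load shed by failed nodes is shared equally by the survivors.
    """
    loads = dict(loads)          # work on a copy
    failed = []
    pending = [first_failure]
    while pending:
        # Take every node that fails this round out of service.
        shed = 0.0
        for node in pending:
            shed += loads.pop(node)
            failed.append(node)
        if not loads:
            break                # nothing left to carry the load
        share = shed / len(loads)
        for node in loads:
            loads[node] += share
        # Any survivor now over capacity fails in the next round.
        pending = [n for n in loads if loads[n] > capacities[n]]
    return failed

loads = {"a": 60, "b": 60, "c": 60}

# With generous headroom, the failure of "a" stops there.
print(simulate_cascade(loads, {"a": 100, "b": 100, "c": 100}, "a"))
# ['a']

# Trim the headroom on "b" and the same failure takes everything down.
print(simulate_cascade(loads, {"a": 100, "b": 80, "c": 100}, "a"))
# ['a', 'b', 'c']
```

The only difference between the two runs is 20 units of spare capacity on one node, which is the uncomfortable part.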

How many of these are there?  They’re everywhere.  In our homes, our offices, our business models, countless business processes, our governments and, yes, in our weapons systems.

What Have We Done?

We have engineered perhaps the most dangerous, vulnerable complexity imaginable.  Multiple single points of failure are everywhere – and growing.  Edge computing and the Internet of Things (IoT) only exacerbate the risks, even as they increase market opportunities for the vendors who provide the dangerous complexity.  Nor do we routinely assess complexity and vulnerability.  For a variety of reasons – mostly related to cost – we look the other way while adding layer after layer of systems complexity without engineering the redundancy, reliability or accessibility necessary to reduce the risks.  All of the events we hear about every day, all of the cybersecurity breaches, all of the ransomware, all of the power grid failures and all of the communication network outages are to a significant extent the result of the failure of single points of processing, connectivity and execution.

What Can We Do?

There are always steps we can take to reduce the risks.  Let’s start with cybersecurity.  At home (and work too) everyone should access the Internet via a Virtual Private Network (VPN).  Ideally, we should all have redundancy in our access devices – smartphones, tablets, laptops and desktops.  Keep a compatible spare modem around.  As for power, well, it can get expensive to install solar panels or a generator.  While I don’t want to turn anyone into an off-the-gridder, there’s something to be said for tilting in that direction. (Yeah, there are password and authentication problems too.)

Enterprise computing is a whole other challenge.  As more and more computing moves to the cloud we need to better understand accessibility, reliability, security and redundancy at all points in the process.  For example, we have no control over the replacement schedule for the “parts” that run the cloud supply chain.  Nor are all of the SPOFs identifiable, especially in public clouds.  Multi-tenancy is a SPOF waiting to happen, with only your service level agreement – and the capacity of your cloud provider – between you and downtime.  One way to mitigate availability risk is to optimize the use of “availability zones/regions” where you can “select” zones/regions for better access, back-up and recovery.  Another more obvious way is to run your applications on multiple cloud infrastructures, or, more arcanely, mirror applications and databases on-premises as a redundant copy of your cloud portfolio.  Of course, that’s a ridiculous strategy since it undermines every argument there is about why everyone should move to the cloud.  (Not to mention how many CFOs would suffer cardiac arrest if any company ran on-premises exactly what it wanted to leave behind to save “all that money” in the cloud.)
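Why do extra zones or a second cloud help, and why does every extra link in the chain hurt?  The textbook arithmetic is short.  The sketch below uses illustrative availability figures that are not taken from any provider’s service level agreement, and it assumes failures are independent, which is optimistic given everything said above about cascading failures.

```python
import math

HOURS_PER_YEAR = 8760

def serial_availability(parts):
    """Everything in the chain must work: multiply the availabilities."""
    return math.prod(parts)

def redundant_availability(availability, copies):
    """The service is down only if every independent copy is down at once."""
    return 1 - (1 - availability) ** copies

def downtime_hours(availability):
    """Expected hours of downtime per year at a given availability."""
    return (1 - availability) * HOURS_PER_YEAR

# Three components in a row, each available 99% of the time (made-up figure).
chain = serial_availability([0.99, 0.99, 0.99])
print(round(chain, 6), round(downtime_hours(chain), 1))
# 0.970299 260.2

# One 99% component deployed in two independent zones.
pair = redundant_availability(0.99, 2)
print(round(pair, 6), round(downtime_hours(pair), 1))
# 0.9999 0.9
```

The second zone buys two extra nines on paper.  It also doubles the bill, which is where the CFO comes in.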

All this starts with an assessment of where the SPOFs are, which means approving SPOF audits.  Qualified professionals should look at the hardware, software, networks (including network providers), databases, storage devices, architectures and anything else that comprises the computing and communications infrastructure that runs your business.  Then identify the accessibility, reliability and redundancy risks connected with each element, process and component.  And then, and here’s where it gets expensive, procure and engage redundant hardware, software, networks (including network providers), databases, storage devices, architectures and anything else in that infrastructure.
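At its very simplest, the first pass of such an audit is an inventory with a count next to each item.  This is only a sketch of that idea: the `Component` fields, the example entries and the cost figures are all hypothetical, and a real audit would have to weigh dependencies, recovery time and shared failure modes, not just count spares.

```python
from dataclasses import dataclass

@dataclass
class Component:
    name: str
    category: str              # e.g. hardware, network, database
    instances: int             # independent copies currently in service
    cost_per_spare: float      # rough price of adding one more copy

def audit(inventory):
    """Flag anything with fewer than two copies and total the cost of one spare each."""
    findings = [c for c in inventory if c.instances < 2]
    total = sum(c.cost_per_spare for c in findings)
    return findings, total

# Made-up inventory for illustration only.
inventory = [
    Component("modem", "hardware", 1, 120.0),
    Component("laptop", "hardware", 2, 1500.0),
    Component("network provider", "network", 1, 600.0),
]

findings, total = audit(inventory)
for c in findings:
    print(f"SPOF: {c.name} ({c.category})")
print(f"Cost of one spare for each: {total:.2f}")
```

Even this toy version makes the point of the paragraph above: finding the SPOFs is the cheap part, and the total at the bottom is the expensive part.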

How many SPOF audits have you seen?  They’re not best practice audits, that’s for sure.  Part of the problem is addressing the results of SPOF audits.  Enormous amounts of money would be required just to get through a short executive summary of the results.  SPOF audits also threaten profitable growth trajectories:  who wants to be the one who demands that their company have redundant devices, networks and cloud providers?  CIOs and CTOs who’ve argued aggressively for cloud adoption would have to explain why they’re taking their companies into rough SPOF waters.  Edge computing and IoT, among other emerging technologies, would also have to be re-imagined.  All this and cyberattacks of all kinds, all the time.

Unfortunately, like so many of our computing and communications problems, SPOF problems may not be solvable.  In fact, given the complexity of our digital infrastructures, it may already be too late to “solve” anything.  The best we can do is recover when SPOF failures occur – because they will.  The inter-dependencies built into our digital infrastructures are no different from the ones we’ve designed into our manufacturing supply chains, our energy pipelines, our international trading agreements and everything else that relies upon hardware, software, networks, people, processes and relationships over which we have limited control.  It is what it is, and will likely always be.  So expect failures, because they’re coming.
