The non-stop machine

I find failure utterly fascinating.  Why did a particular thing fail? Why was it that thing and not something else?  Failure is especially important in investing, but also in other aspects of life.  While there are a ton of articles (and a number on here) about failure in the personal or professional sense, I want to talk about failure in the mechanical sense.

It's fun to think about really hard problems and try to sketch a back-of-the-napkin solution.  Things like "how would you stretch a single piece of string around the world?"  Or "could you design a machine that never fails?"

The second question is a realistic one, and a discussion about it with a friend is what spurred this post.  She helped invent the ATM system and was describing the challenge of building a robust system that never lost a transaction.  And that set me on a journey thinking about mechanical failure.

In a physical system, failure can usually be reduced to the weakest link.  The weakest part will break down first.  But the ultimate failure might not be that part.  In some systems it's possible for a weak part to fail and the machine to continue, but the failure increases stress on another part, and that second part is the one that ultimately fails.

Think of an engine.  A simple $30 head gasket kit is the difference between a functioning engine and sitting on the side of the road.  You rarely hear of a piston failing.  It's a head gasket that cracks, a simple seal that goes bad.  But that failure creates a cascading effect.  A cracked gasket allows coolant into the oil, the coolant and oil mixture gunks up the pistons, and ultimately the engine seizes.  That $30 part can create a multi-thousand dollar repair.

So what about critical systems?  This is where things get exciting.  If a non-critical system fails, it is down until a replacement part can be procured.  We see this all the time in life.  A gas pump will have an out-of-order sign, or we'll be told "that machine isn't working today."  But what happens when it's a ventilator that goes bad? Or a core banking system? Or the guidance system for aircraft?

The usual solution is to build in redundancy.  This is what Boeing is facing with their 737 MAX.  The aircraft relied on a single critical sensor, and one bad reading from it could result in an error situation.  The obvious fix is to add a second or third sensor and cross-check the data between them to catch a bad reading.  In aircraft, redundancy is key.  Planes have multiple engines, multiple pilots, multiple electrical systems, etc.
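The idea of comparing several sensors can be sketched in a few lines of Python.  This is only an illustration, not how any real aircraft system works: the function name `vote`, the `max_spread` tolerance, and the sample readings are all made up for the example.  With three sensors, taking the middle value means a single wild reading gets outvoted by the two that agree.

```python
from statistics import median


def vote(readings, max_spread):
    """Return the median reading, or None if no majority of sensors agrees.

    Hypothetical helper for illustration; `max_spread` is an assumed
    tolerance for how far a healthy sensor may sit from the median.
    """
    if len(readings) < 3:
        raise ValueError("need at least three readings to outvote one bad sensor")
    agreed = median(readings)
    # Keep only the sensors that sit close to the median value.
    close = [r for r in readings if abs(r - agreed) <= max_spread]
    if len(close) <= len(readings) // 2:
        return None  # no majority agrees, so the value can't be trusted
    return agreed


# One sensor reads wildly high; the other two outvote it.
print(vote([5.0, 5.2, 40.0], max_spread=1.0))
# All three disagree; the function refuses to pick a winner.
print(vote([1.0, 20.0, 40.0], max_spread=1.0))
```

Notice that two sensors alone can tell you something is wrong, but not which one is wrong.  It takes a third to break the tie.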

In an aircraft you can make most things redundant because it is a completely isolated system.  An airplane in the sky carries everything it needs.

This redundancy concept is also used in computers.  Instead of a single hard drive, put in multiple drives that mirror each other.  Or put in multiple computers that mirror each other.  You can go all the way up to doubling everything (power, cooling, machines) and combining it all into a massive distributed system.
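Some back of the napkin arithmetic shows why mirroring is so attractive.  The sketch below assumes each copy fails independently of the others, which is a big assumption, and the 99% figure is made up for illustration.  The function name `combined_availability` is mine.

```python
def combined_availability(single, copies):
    """Chance that at least one of `copies` parts is working.

    Assumes every part has the same availability `single` (0.0 to 1.0)
    and that failures are independent of each other.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("need at least one copy")
    # The system is only down when every copy is down at the same time.
    return 1.0 - (1.0 - single) ** copies


# A part that works 99% of the time, mirrored once and then twice.
for n in (1, 2, 3):
    print(n, combined_availability(0.99, n))
```

Under those assumptions, each extra copy cuts the chance of total failure by a factor of a hundred.  The catch is the independence assumption: if the copies share a power supply or a cooling system, one failure can take them all out together.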

The concept of massively distributed systems is what dominates our computing now.  The Google idea of having millions of generic computers that can be replaced without disruption when they fail is popular.

But this concept of massive distribution, or clustering, hasn't always been the only way.  My friend who built the ATM network worked at a financial service provider along with some banks.  Their concept of "no failures" was quite different, and something I find utterly fascinating.

In the 1970s a company named Tandem was formed.  Tandem built computers that ran non-stop.  Once booted they never stopped.  In the late 90s Tandem was purchased by Compaq, which in turn was purchased by HP.  And like the computer systems this division has continued non-stop as well.  Now HP has non-stop computers.

The concept behind a non-stop computer is simple.  The entire system is designed for resiliency.  Everything is engineered to last as long as possible and built in a modular fashion.  This means when anything fails it can be removed and replaced without having to shut the machine off.  You can remove RAM, or a processor, all while the computer is running.  You can even swap out the motherboard of the machine while it's running, all without losing a single transaction.

And just like anything can be swapped out after a failure, it can be swapped out for an upgrade too.  I was told it isn't uncommon for these machines to be in continuous operation for 35+ years.  To me that's astounding.  Someone turned on a machine 35 years ago and it hasn't turned off or had any downtime since.

The reason these machines work so well is that they share nothing.  Each component is like the airplane, completely self-contained.  Components talk to each other, but if a single one fails it brings down nothing else.

What I've noticed is that a lot of newer clustered or fault-resistant systems have a lot of shared components.  And this shared-ness is usually the cause of a cascading failure.  Truly fault-tolerant or fault-resistant systems have hard barriers between all of their parts, so a failure is isolated and doesn't cause issues anywhere else.
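Here is a toy Python sketch of what "share nothing" looks like.  To be clear, this is not how a Tandem machine actually works.  The `Component` class and `process` function are invented for illustration.  The point is only that each component owns its own state, so when one goes down the work moves to another and nothing else is touched.

```python
class Component:
    """A self-contained unit: it owns its state and shares nothing."""

    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.handled = []  # private state, never touched by other components

    def handle(self, transaction):
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        self.handled.append(transaction)
        return self.name


def process(transaction, components):
    """Try each component in turn; one failure never stops the others."""
    for component in components:
        try:
            return component.handle(transaction)
        except RuntimeError:
            continue  # the failure stays inside that component
    raise RuntimeError("every component is down")


a, b = Component("a"), Component("b")
a.healthy = False            # simulate pulling a failed part
print(process("t1", [a, b]))  # the transaction is still handled, by "b"
```

Contrast this with a design where both components write to one shared list or one shared disk.  There, a fault in the shared piece stops both of them at once, which is exactly the cascading failure described above.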

It's interesting to think about this at a higher level.  Obviously almost everything in life is interconnected.  And everything is going to fail as well.  But I wonder how often we consider both of those things together and make sure whatever systems (or processes) we build handle failure and isolate it to the smallest and most replaceable component?

I think the reason we don't do this is that it's expensive.  It's expensive to design with failure in mind and expensive to build out redundancy.  But there is a cost to failure, and we are ignorant if we don't think things are going to fail.

The best systems are the ones where the designer sat down and said "how can this break?" before designing the final solution.

The same failure design thinking that goes into computers or machines can be extrapolated to systems, processes, businesses, or really anything.  You need to consider what can go wrong first, then build out redundancy and ways to isolate the failure before you can have a robust system.

There are too many things in life that work until they don't.  And the "they don't" is a result of short-term thinking or expecting things to always just work.  If you can't envision failure, it's impossible to ensure long-term success.

1 comment:

  1. Or it's a result of them being purposefully designed to fail (planned obsolescence).

    Just consider the light bulb in Livermore, CA that's been burning for 118 years and the subsequent Phoebus Cartel that conspired to make light bulbs burn out faster. Hopefully it's not happening with planes and ventilators, but mobile devices?......
