Resilient systems: Accepting Faults and Failures

There is no bullet-proof code.

The resilient systems are those which accept faults, and are able to auto-recover from the fault. These systems do not allow faults to escalate into failures.

Fault: A problem which deviated from the expected and/or intended behavior.

Failure: The inability to continue serve responses for incoming requests outside of the acceptable scope.

As per all warranties, it is best for developers to define what the warranty of their application would provide and design the system to deliver the warranty. Usually, the term SLA (Service Level Agreement) is used to communicate the warranty agreement.

Do not aim for 100%. A fault will bring you to the five-nines (99.999%). If the system is able to guarantee the five-nines, under documented assumptions (read: requirements), then the system is resilient as it can be.

blog comments powered by Disqus