YOW! September 2021 Recap
Cat Swetel
Cat Swetel kicked off YOW! September 2021 with a talk about Continuous Verification. Cat began by centring on what Chaos Engineering does:
“Chaos Engineering is the discipline of experimenting on a system in order to bring confidence in the system’s capabilities to handle failures in production.”
In traditional engineering, we do all we can to avoid failure. We plan, design and program according to specifications with the express intent of avoiding failure. Yet all of us know that failure happens - in many cases due to circumstances outside of our control. Thus:
“Why not treat failure as inevitable, and make more resilient systems?”
If Chaos Engineering’s job is to iterate over a system landscape and say, “is this safe?” then Continuous Verification asks, “What is safe?”.
At this point, Cat referred to some of the excellent work of Sidney Dekker:
“At the edge of chaos, systems have tuned themselves to the point of their maximum capability”
In other words, before a system starts failing (in whatever mode that may be) there is a state at which it is close to failure, but still operating correctly. In these states, the system is not only working, but working as hard as it can - a rope pulled close to its breaking tension but not quite at it. The “Rasmussen Triangle” is a way of helping to model this sort of scenario, and the conditions that lead to it.

In this triangle, there are three boundaries representing our pressures and constraints:
- Economic boundary: the cost of running our overall system
- Workload boundary: the amount of work by humans to maintain the system
- Performance boundary: how well the system is performing
Of note on the performance boundary is a safety margin: lying just in front of the performance boundary, it represents the maximum safe operation of the system. Key to this triangle is the idea of the pressures that act upon it. In most cases, organisations want to lower costs. Likewise, we want systems to process as much as they possibly can to reduce waste, and to minimise the work required of the humans who maintain them. These pressures in turn act on the safety margin that lies near the performance boundary.
Herein lies the problem of “Normalisation of Deviance”: the idea that we can perhaps shave _X_% off our safety margins or processes and squeeze out that bit of extra performance. As a simple example, imagine a server breaching a memory limit that triggers an alert. Once, the team mistakenly ignored the alert and, guess what, the application continued running just fine. So the next time the alert goes off, there’s an attitude of “well, that was fine in the past, so I don’t need to get to that right away”. Only it doesn’t always work out that way - many high-profile examples of the normalisation of deviance exist, the Space Shuttle program among them.
So, given that it is becoming impossible to model and build perfect systems for the environments they inhabit, Cat posits that we need tooling to monitor the drift of our safety margin over time - enter Continuous Verification. If there’s one thing the talk was a little light on, it was the actual tools to use; then again, Cat herself was keen to stress that it’s less about the tooling and more about the process of finding your safety margins.
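To make "drift of the safety margin" a little more concrete, here is a minimal sketch of my own, not something shown in the talk. It assumes the margin can be expressed as headroom below a single hard limit (a memory limit, say) and fits a straight line through the weekly headroom figures. The function names and sample numbers are invented for illustration.

```python
from statistics import mean


def margin_series(peaks, limit):
    """Headroom below the limit, as a fraction of the limit, per period."""
    return [(limit - peak) / limit for peak in peaks]


def drift_per_period(margins):
    """Least-squares slope of the margin against the period index.

    A negative slope means the safety margin is shrinking over time.
    """
    n = len(margins)
    if n < 2:
        return 0.0
    xs = range(n)
    x_bar = mean(xs)
    y_bar = mean(margins)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, margins))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


# Hypothetical data: weekly peak memory use (MB) against a 1000 MB limit.
weekly_peak_mb = [600, 640, 700, 760, 820]
limit_mb = 1000

margins = margin_series(weekly_peak_mb, limit_mb)
slope = drift_per_period(margins)
print(f"margin is changing by {slope:+.3f} of the limit per week")
```

With these made-up numbers the margin shrinks by a little over five percentage points a week, which is the sort of slow erosion that no single alert would flag.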
Yes, you can Chaos Monkey it up and rip out a database connection to see what happens to your app - but Cat argues it’s actually more interesting to look at what happened when things didn’t go disastrously wrong. She related how her team would regularly review ‘near miss logs’ to see actions that could have caused a big problem but, for one reason or another, didn’t. Those near misses begin to inform you as to where your safety margins lie.
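As a rough illustration of what a near-miss review could look like in code, here is a small sketch. It is my own assumption of the shape such a log might take, not the format Cat's team used: each entry records a peak value and the limit it was measured against, and a "near miss" is anything that came within 10% of its limit without crossing it. The `Observation` type, the `near_misses` function and the sample entries are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    name: str
    peak: float   # highest value seen during the review period
    limit: float  # value at which the system is expected to fail


def near_misses(observations, threshold=0.9):
    """Return observations that got close to their limit but stayed under it.

    Results are sorted so the closest calls come first.
    """
    hits = [
        o for o in observations
        if threshold * o.limit <= o.peak < o.limit
    ]
    return sorted(hits, key=lambda o: o.peak / o.limit, reverse=True)


# Hypothetical review-period data.
log = [
    Observation("db connections", peak=95, limit=100),
    Observation("queue depth", peak=400, limit=1000),
    Observation("memory MB", peak=990, limit=1000),
    Observation("disk GB", peak=120, limit=100),  # an actual breach, not a near miss
]

for o in near_misses(log):
    print(f"{o.name}: reached {o.peak / o.limit:.0%} of its limit")
```

The breach is deliberately excluded: incidents already get attention, whereas the entries that nearly went wrong are the ones that quietly show where the margin sits.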
Beyond the work of Dekker and Rasmussen, I thought this was one of the key takeaways from what was an interesting and engaging opening talk for YOW!: the idea that you don’t need fancy tooling or extensive performance testing in production in order to begin defining where your safety margin lies. Your teams might have had this information in their hands this whole time.