These are my thoughts on the conclusions Nicholas Tietz-Sokolsky drew in his blog post “Lessons from my first (very bad) on-call experience”. Although I agree with the majority of his points, I thought it useful to write my own views down.
I understand the desire here to reduce the likelihood of persistent interrupted nights, but I don't think this is a good strategy.
It makes leisure time difficult to schedule in the long term. Plans made far in advance can be thrown out by an unfortunate event the night prior, which leaves you on-call in place of the sod who came before you.
It dilutes the ownership of a problem. Ownership is good in two senses. Firstly, ownership comes with the responsibility to see the problem mitigated or resolved. This is a simple technical benefit in that a managed problem is better solved than an unmanaged problem.
However, the other merit of explicit ownership is less simple. Without explicit ownership over issues, engineers driven by either work ethic or investment in the success of the business will gravitate to a problem. I've seen this plenty, and you've probably seen it too - highly productive engineers who seemingly work constantly.
In extreme cases, we call those people “workaholics”. But there is an entire spectrum of unhealthy behaviour behind the term. Enforcing fair distribution of problems is an important measure that businesses should take, not just as a social good, but as long-term business planning. I think I see an element of this reflected in Nicholas’ first and second lessons.
A better alternative in my mind (and the one my employer uses) would be fixed rotas, with leeway given to work hours based on interrupted sleep. My preference is a weekly rota, with each engineer filling a 7-day period before hand-over, but a daily rota is no less possible. The core premise is that you know when you're on call ahead of time. Leeway for sleep lost is a minor concession on the part of a business, and one that is very hard to contest when you have worked hours outside of your regular shift.
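To make the "know ahead of time" premise concrete, here is a minimal sketch of a fixed rota in Python. The engineer names, the start date and the `on_call` function are all made up for illustration; this is not the tooling my employer uses, only one way such a rota could be computed.

```python
from datetime import date

# Hypothetical team and start date, purely for illustration.
ENGINEERS = ["alice", "bob", "carol"]
ROTA_START = date(2024, 1, 1)  # first hand-over day (a Monday)


def on_call(day, engineers=ENGINEERS, start=ROTA_START, shift_days=7):
    """Return who is on call on `day` under a fixed rotation.

    shift_days=7 gives a weekly rota; shift_days=1 gives a daily one.
    """
    if day < start:
        raise ValueError("day is before the rota starts")
    # Count whole shifts elapsed since the start, then wrap around the team.
    shift_index = (day - start).days // shift_days
    return engineers[shift_index % len(engineers)]


if __name__ == "__main__":
    # Anyone can check a date months ahead before making plans.
    print(on_call(date(2024, 6, 15)))
```

Because the answer depends only on the date, nothing that happens the night before can change who is on call on a given day.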
Well, yes. But I don't think this is as simple an option as presented. It is very difficult to tell a sales team to not present a solution they're selling on your behalf as unproven without telling them exactly how proven your solution is.
As engineers, we'd like to think customers operate along a similar axiom - proof of concept, local deployment, testing, scaling, full production. But more often than not they don't, and they're trained to forgo those sensible steps because there are ten alternative suppliers telling them they don't need to do any testing with their solution™. It's false, but finding a salesman who's good enough to tell a customer the competition is lying without sounding bitter is, as far as I can tell, impossible.
I've come across this decision a lot. Specifically whether to automate recovery of a service like OpenSIPS, or Asterisk, or PostgreSQL - or to ensure a human knows that it's failing and is forced to investigate the failure.
My concern with automated recovery is two-fold. The first concern is that by automating recovery, I run the risk of masking symptoms of larger problems. Using the example in Nicholas’ post, I'm not certain how restarting the service every time it failed, without human interaction, would have done more than mask the problem. These things tend to show up in reports that are dismissed or posted to thedailywtf - “database stops responding for 17 seconds whenever a select is made for username = ‘nancy’”.
The second concern is that as a system administrator, I always have things to do, and I prefer to do them in whichever way requires the least work. This is why I wrote Perl scripts. And why I now write Bash scripts, and why core parts of our infrastructure are held together by battle-tested shell scripts. I fear automating restarts after failure as the last line to cross before falling into a slothfulness where problems are hacked away by quick, dirty changes, rather than architected, and designed, and carefully built to last the tests of time.
But I admit, this is largely just a criticism of myself, rather than Nicholas’ point.
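For what it's worth, the two options are not strictly exclusive. The sketch below, in Python, shows one possible middle ground: allow a small number of automatic restarts per time window, then stop and make a human look. The `RestartBudget` class, its thresholds and the `"restart"`/`"page"` return values are all hypothetical; it is an illustration of the trade-off, not something taken from Nicholas’ post or from any particular service manager.

```python
import time
from collections import deque


class RestartBudget:
    """Permit a few automatic restarts per window, then demand a human."""

    def __init__(self, max_restarts=3, window_seconds=3600.0, clock=time.monotonic):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self.clock = clock          # injectable so the logic can be tested
        self._restarts = deque()    # timestamps of recent automatic restarts

    def on_failure(self):
        """Decide what to do about a failure: "restart" or "page"."""
        now = self.clock()
        # Forget restarts that have aged out of the window.
        while self._restarts and now - self._restarts[0] > self.window_seconds:
            self._restarts.popleft()
        if len(self._restarts) >= self.max_restarts:
            # The service keeps dying; restarting again would only mask it.
            return "page"
        self._restarts.append(now)
        return "restart"
```

A one-off crash gets restarted quietly, but a service that fails repeatedly within the window ends up in front of a person. It does nothing about my second concern, of course.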
As for the rest, I agree completely. Well, maybe not the ‘Use Rust’ point - I'm sure Rust is great, but I'm skeptical of Rust being the only way to solve the problem in question.