It’s 3:26 in the morning, and the cellphone on your nightstand is blaring the awful alert sound you knew would for sure wake you up. And it has, for the second time tonight. You awkwardly paw at your phone to try to shut it up before it wakes your partner (too late), then read the notification through bleary eyes. “At least it’s not the same alert waking me up again,” you think to yourself.
You trudge to your laptop in the other room, wake it up, and start trying to find some logs to see if you can figure out why this service you know little about decided to answer the last few health check pings with a 503 Service Unavailable response. It takes you a good 10 minutes of fumbling to sort out the right incantation in your log aggregator to find messages from the service in question.
Sure enough, you see a couple of 503 responses, but the last several requests have all been 200 OK. In the time it took you to get to your computer and find the logs, whatever was causing the 503s has sorted itself out. As you scroll back a bit to make sure there weren’t any earlier blips, you see nothing but health check pings since the last user request around 7:30p the night before. There’s no sense trying to find the problem right now for a service nobody’s even using at this hour.
“So glad I got out of bed for that,” you grumble to yourself as you trudge back to bed. You’ve only got a couple hours left before your alarm goes off at 6 to help get the kids out the door to school, and you know it’s going to come way too soon.
On call is a necessary part of the job for those of us who build and operate complex systems at scale. These systems fail in novel and unexpected ways, and with users depending on them, it’s up to us to keep them running. But so many on call rotations are unintentionally harmful to the engineers participating in them, resulting in evenings interrupted, hours of sleep lost, and so much stress and anxiety that could all be avoided.
While the story above isn’t true, it closely mirrors situations I’ve experienced firsthand and seen while coaching teams in improving their on call practices. If you’ve spent any time supporting a production system, there’s likely something in that story that resonates with you as well. The idea that on call is painful is so ingrained in engineering culture that we often just accept the pain. So instead of doing that, let’s look at some common on call antipatterns and some simple things we can do to avoid them.
Time outside of work is for friends, family, fun, and rest. Every time an on call engineer’s pager goes off, they’re being asked to give up some of that time for the good of the company. Most engineers will gladly agree to that…to a point. After all, some of their compensation (in the form of equity and bonuses) is predicated on the company’s success. But teams sometimes have a tendency to overrely on that generosity and default to paging anytime something might be wrong. Couldn’t hurt to have someone check it out to make sure, right?
But soon the pager is going off every night, sometimes multiple times a night. These teams move to 24h on call shifts because that’s the longest anyone can stand to be on call. They try to improve the situation page-by-page, but it’s a giant game of whack-a-mole to try to reduce the page volume even a little bit.
If the pager’s going off that frequently, it’s tempting to jump to the conclusion that your system is not reliable enough. And that might be part of it. But you’re likely also paging for a bunch of things that don’t actually impact your users in any meaningful way.
You should set an incredibly high bar for what’s allowed to wake someone up in the middle of the night, and you should focus on user impact in setting that bar. If your users can still do the primary things they need to do using your service, then whatever functional degradation they’re experiencing can likely wait until morning for a resolution. You should also configure your alerting system to wait long enough to be relatively sure a service actually needs intervention before you page, because it’s incredibly frustrating to be woken up in the middle of the night just to see that a system has recovered on its own.
If you’re trying to fix a system that’s paging too often, there are a couple of steps you should take:

- Audit every alert that can page someone. If the condition it detects doesn’t meaningfully impact your users, downgrade it to a ticket or a dashboard item that can wait for business hours.
- For the alerts that remain, add a wait duration so the alert only pages once the condition has persisted long enough that the system clearly isn’t going to recover on its own.
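As a concrete sketch, if your team happens to use Prometheus for alerting, both ideas map directly onto an alerting rule: the `expr` encodes user impact (error rate on real requests, not a single failed health check) and the `for` clause enforces the wait. The metric names, thresholds, and URLs below are illustrative placeholders, not a prescription:

```yaml
groups:
  - name: user-impact
    rules:
      # Page only on sustained, user-visible failure: more than 5% of
      # requests returning 5xx, and only after the condition has held
      # for 15 minutes. All names and numbers here are examples.
      - alert: HighUserErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 15m
        labels:
          severity: page
        annotations:
          summary: "Sustained elevated 5xx rate on user-facing requests"
          runbook_url: "https://wiki.example.com/runbooks/my-service"
```

A transient blip like the 503s in the opening story would never fire this rule, because the error rate recovers well before the `for` window elapses.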
Every engineer remembers the first time their pager went off in the middle of the night and they were up, by themselves, trying to figure out what was going on and scared to death that their attempts to fix it might make things worse. And most of us have had that exact thing happen at some point, taking a minor issue and accidentally turning it into a much larger one with a well-intentioned but incorrect attempt at a fix.
When the pager goes off, too often engineers find themselves thrown into the deep end without a lot of support. They’re expected to use their knowledge of the system and systems in general to figure out what’s going on and fix it. This makes it hard for earlier career engineers to confidently respond to pages, and it’s more stressful than it has to be for engineers at any level. It can feel scary and unsafe, and that drives us to make decisions based on gut and intuition rather than knowledge and facts.
Pilots have a checklist for everything. Getting ready for departure? Follow the checklist. Engine out in flight? Follow the checklist. Landing gear stuck? Follow the checklist. The reason pilots are so reliant on checklists is that it reduces their decision making load while under stress and makes it less likely that they’ll make a potentially dangerous situation worse with a bad decision. Sound familiar?
The analog in the software world is the runbook. A runbook is a compilation of common tasks and troubleshooting steps for a given system. In the story above, a runbook could’ve saved the 10 minutes it took to find the right query to get logs for that service. It also would’ve contained the most common failure scenarios and might have explained exactly why that 503 was happening and what to do about it.
You should have a runbook for every service your team operates. Think about how you’d gather context about the state of the service from the observability and log aggregation platforms your team uses, and record those steps. Think about all the ways the service is likely to fail and write down what steps you’d take in each situation. Think about what dependencies a service has and how to check if they’re healthy or might be part of the problem during an incident. It’s not a ton of work to put this information together, but if you take the time to do it, anyone who has to troubleshoot your system will feel much safer and more confident doing so, especially by themselves in the middle of the night.
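A runbook doesn’t need to be elaborate to be useful. A minimal skeleton covering the three areas above might look like the following, where every link, query, and command is a placeholder to be replaced with your own systems’ details:

```markdown
# Runbook: my-service

## Where to look
- Primary dashboard: <link>
- Logs: saved query in the log aggregator, e.g.
  `service:"my-service" AND level:>=WARN`

## Common failures
### Health checks returning 503
- Most often caused by <known cause, e.g. an exhausted connection
  pool>. Check <the relevant dashboard panel>.
- Remediation: <the command or action that resolves it>

## Dependencies
- <database>: health dashboard at <link>
- <upstream API>: owned by <team>, escalate via <paging service>
```

The point is that whoever is paged at 3 a.m. starts from a map instead of a blank terminal.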
The worst pages are the ones you can’t do anything about. Maybe they’re for a service you don’t have enough knowledge of to troubleshoot beyond the basics, or maybe you don’t have the system access you need to be able to figure out what’s going on. In either case, you’re up in the middle of the night playing human router, waking someone else up who either has the knowledge to troubleshoot the problematic service or can grant you the system access you need to troubleshoot it.
It’s even worse when you don’t have the agency or ability to work on the service that paged you. When something wakes you up in the middle of the night, you’re pretty strongly incentivized to make sure that thing can’t wake you up again. But if that system or service is owned by another team, sometimes it’s hard to take that energy and put it into making things better.
When you have the time, agency, and knowledge to address the issue that paged you, there’s a virtuous cycle that ensures your motivation gets put to good use and that paging issues get fixed, both in the moment and so that they don’t recur. It’s easy for a disconnect to happen as an organization grows, putting more access controls in place for production systems and spreading work out across multiple teams. Letting this disconnect go unaddressed can quickly make on call shifts much more onerous than they need to be.
Most companies wait too long to break their big on call rotation into multiple smaller, service- or area-specific rotations. They do this with good intentions, trying to keep engineers from needing to be on call too often, but waiting too long results in breaking the virtuous cycle because engineers who only regularly work on a portion of a system are being asked to provide production support for the whole system. Breaking into smaller rotations means people are on call more often, but their on call shifts will be quieter and less stressful because they’re likely to know something about the system paging them and to be able to implement any remediation necessary to keep the problem from happening again.
Similarly, most companies go far too long allowing engineers unfettered access to production systems. When they finally do start locking things down to improve their security posture, getting access to systems is often onerous, requiring manual processes to get access to production and adding friction to the incident response process. This is where a tool like Sym can help, allowing teams to build access workflows that improve their day-to-day security posture while still providing quick access to production systems when it’s needed.
It’s impossible to run a production system at scale without some kind of on call rotation to support it. But that doesn’t mean the on call rotation has to be painful. Avoiding the antipatterns above will go a long way towards making it better, but you and your team should also be regularly discussing your on call practice and making sure that it evolves and grows alongside the systems and engineering teams it supports.
Don’t settle for painful on call shifts just because it’s all you’ve ever experienced. It can and should be better, and it’s worth the work to get there.