
Pager anti-patterns and heuristics for successfully automating pages

Posted June 5, 2018
By Nick Hatt

At Redox we have been using PagerDuty for almost the entire history of the company. The ability to notify engineers automatically and instantly when issues arise is core to our support model. But turning your developers loose on the PagerDuty API can cause headaches, especially if you’re not striking the right balance between paging too much and paging too little. Andy, Blake, and I have put together some heuristics we use to hone new pages as we develop them, as well as a list of anti-patterns to recognize and account for.

Heuristics

We organize our heuristics for new pages around five questions: Who, What, When, Where, and Why. Each question has its own set of traits to consider.

Anti-Patterns

We’ve noticed some patterns that should be avoided when dealing with the pager. Here are some of the largest offenders.

Pager relay

Person A gets paged, which immediately causes Person B to get notified. This potentially doubles the number of people who need to be woken up in the middle of the night. Of course, if a developer doesn’t know what to do for a particular page, they should absolutely reach out. However, if this becomes a pattern, we should ask ourselves the following questions:
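Relays are easy to miss one night at a time but show up clearly in notification history. The sketch below assumes a hypothetical export of notification records (incident ID, person notified, timestamp) rather than any specific PagerDuty API response, and flags incidents where a second person was notified within 15 minutes of the first. The record shape, the `find_pager_relays` name, and the 15-minute window are all illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


# Hypothetical notification record; real PagerDuty log entries look different.
@dataclass
class Notification:
    incident_id: str
    person: str
    sent_at: datetime


def find_pager_relays(notifications, window=timedelta(minutes=15)):
    """Return incident IDs where a second person was notified within
    `window` of the first notification for the same incident."""
    by_incident = {}
    for n in notifications:
        by_incident.setdefault(n.incident_id, []).append(n)

    relays = []
    for incident_id, notes in by_incident.items():
        notes.sort(key=lambda n: n.sent_at)
        first = notes[0]
        # A different person notified shortly after the first responder counts as a relay.
        if any(n.person != first.person and n.sent_at - first.sent_at <= window
               for n in notes[1:]):
            relays.append(incident_id)
    return sorted(relays)


t0 = datetime(2018, 6, 5, 3, 0)
log = [
    Notification("INC-1", "alice", t0),
    Notification("INC-1", "bob", t0 + timedelta(minutes=5)),   # relay
    Notification("INC-2", "alice", t0),
    Notification("INC-2", "bob", t0 + timedelta(hours=2)),     # ordinary escalation
]
print(find_pager_relays(log))  # ['INC-1']
```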

Pager Snooze-fest / PagerDuty as a record-keeper

When we snooze a page, it’s generally because we’re waiting for something external to happen. If this becomes a pattern where we wait on a page without taking any action, the page should be re-examined. Ideally, pages go off only when there is action to be taken. If there is no action to take, should we downgrade the severity of the alert?
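Repeated snoozing without action is visible in incident history. This minimal sketch assumes a hypothetical per-incident summary recording how many times the incident was snoozed and whether anyone took action; it flags alerts where more than half of at least three incidents were snoozed with no action taken. The field names, the `snooze_only_alerts` helper, and both thresholds are assumptions for illustration, not PagerDuty fields.

```python
from collections import defaultdict
from dataclasses import dataclass


# Hypothetical summary of one incident; field names are illustrative only.
@dataclass
class Incident:
    alert: str
    snooze_count: int
    action_taken: bool


def snooze_only_alerts(incidents, min_incidents=3, threshold=0.5):
    """Return alerts whose incidents were mostly snoozed with no action taken."""
    totals = defaultdict(int)
    snoozed_no_action = defaultdict(int)
    for inc in incidents:
        totals[inc.alert] += 1
        if inc.snooze_count > 0 and not inc.action_taken:
            snoozed_no_action[inc.alert] += 1
    # Require a minimum number of incidents so one-off snoozes are not flagged.
    return sorted(
        alert for alert, total in totals.items()
        if total >= min_incidents and snoozed_no_action[alert] / total > threshold
    )


history = [
    Incident("queue-backlog", 2, False),
    Incident("queue-backlog", 1, False),
    Incident("queue-backlog", 3, True),
    Incident("db-down", 0, True),
    Incident("db-down", 1, True),
    Incident("db-down", 0, True),
]
print(snooze_only_alerts(history))  # ['queue-backlog']
```

Alerts that land on this list are the natural candidates for the severity question above.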

High-frequency pages

This is particularly dangerous because it can make people desensitized to the pager. If a page is going off too frequently, we should ask:
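One way to catch noisy pages early is to count how often each alert fires within a sliding window. In this sketch, pages are hypothetical `(alert_name, fired_at)` tuples, and an alert is flagged if it fired more than five times in any seven-day window; the `noisy_alerts` helper and both numbers are placeholder assumptions to tune for your own rotation.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def noisy_alerts(pages, window=timedelta(days=7), max_pages=5):
    """Return alerts that fired more than `max_pages` times within any `window`.

    `pages` is an iterable of (alert_name, fired_at) tuples (a hypothetical format).
    """
    by_alert = defaultdict(list)
    for alert, fired_at in pages:
        by_alert[alert].append(fired_at)

    noisy = []
    for alert, times in by_alert.items():
        recent = deque()
        for t in sorted(times):
            recent.append(t)
            # Drop pages that fall outside the window ending at t.
            while t - recent[0] > window:
                recent.popleft()
            if len(recent) > max_pages:
                noisy.append(alert)
                break
    return sorted(noisy)


start = datetime(2018, 6, 1)
pages = [("disk-space", start + timedelta(hours=6 * i)) for i in range(8)]
pages += [("cert-expiry", start + timedelta(days=10 * i)) for i in range(8)]
print(noisy_alerts(pages))  # ['disk-space']
```

A sliding window, rather than fixed calendar weeks, keeps a burst that straddles a week boundary from being split in two and slipping under the threshold.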

Final Thoughts

Page too much, and you risk burnout, frustration, and becoming numb to real issues. Page too little, and you miss important events that someone should be looking at. We hope this post has been a good starting point for finding that perfect middle ground.