At Redox, we have been using PagerDuty for almost the entire history of the company. The ability to automatically and instantly notify engineers when issues arise is core to our support model. But turning your developers loose on the PagerDuty API can cause headaches, especially if you’re not striking the right balance. Andy, Blake, and I have put together some heuristics we use to hone our development of new pages, as well as a list of anti-patterns to recognize and account for.
- Who gets the page?
- Who is required to resolve the issue (Redox and/or Customer)?
- How does the page fit into the bigger picture? (use the correct service!)
- Page goes to the person who can most efficiently solve the problem
- All relevant contacts are documented
- Page goes to customers as well, where appropriate
- The correct PagerDuty service is used for triggering the page
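One way to keep pages on the correct service is to give each monitored component its own Events API routing key, so an alert can never land on a catch-all service by accident. A minimal sketch – the component names and keys below are placeholders, not real values:

```python
# Map each component to the routing key of the PagerDuty service that owns it.
# Placeholder keys only; real keys come from each service's integration settings.
SERVICE_ROUTING_KEYS = {
    "api": "ROUTING_KEY_API_PLACEHOLDER",
    "database": "ROUTING_KEY_DB_PLACEHOLDER",
    "billing": "ROUTING_KEY_BILLING_PLACEHOLDER",
}

def routing_key_for(component: str) -> str:
    """Look up the routing key for the service that owns this component.

    Failing loudly on an unmapped component is deliberate: it forces us to
    decide which service (and which escalation path) owns a new alert.
    """
    try:
        return SERVICE_ROUTING_KEYS[component]
    except KeyError:
        raise ValueError(f"No PagerDuty service mapped for component: {component}")
```

The explicit mapping doubles as lightweight documentation of which team owns which alert.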
- What are the details of the problem?
- What additional research does the pagee need to do after they get the page?
- Does the system detect when a problem has automatically fixed itself?
- Page contains all details needed to triage in the page itself.
- Page is documented on a wiki page.
- Page can automatically resolve itself.
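The checklist above maps naturally onto PagerDuty’s Events API v2: put the triage details in `custom_details` so they travel with the page itself, and reuse a stable `dedup_key` so the same monitor can resolve the alert when the problem clears. A minimal sketch – the monitor name, summary, and detail fields are hypothetical:

```python
# Events API v2 endpoint (events are POSTed here as JSON).
PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

def build_trigger(routing_key: str, dedup_key: str, summary: str, details: dict) -> dict:
    """Build a trigger event whose body carries everything needed to triage."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "dedup_key": dedup_key,          # stable key so this alert can be resolved later
        "payload": {
            "summary": summary,
            "source": "queue-monitor",   # hypothetical monitor name
            "severity": "critical",
            "custom_details": details,   # triage details live in the page itself
        },
    }

def build_resolve(routing_key: str, dedup_key: str) -> dict:
    """When the monitor sees the problem clear, resolve the same dedup_key."""
    return {
        "routing_key": routing_key,
        "event_action": "resolve",
        "dedup_key": dedup_key,
    }
```

The monitor POSTs the trigger body when its check fails and the resolve body, with the same `dedup_key`, when the check passes again – which is how a page can resolve itself without waking anyone a second time.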
- Does the page need to go off in the middle of the night? Can it be worked during business hours?
- What kind of response time does the page need?
- Does the page become irrelevant past a certain point?
- Does the page need to exist?
- Uses an appropriate service – this is a PagerDuty construct
- PagerDuty service uses appropriate escalation paths – all the way up the phone tree
- Where does the pagee need to go to solve the problem? Dashboard, AWS, database, DataDog, New Relic, Logs, E-mail?
- Does the pagee have access to all the tools they need to solve the problem?
- Provide direct links to relevant reports, dashboard pages, etc.
- Document what services need to be used on the wiki
- Can the system be designed to handle the page automatically? Now? In the future?
- At what point do we stop paging for this issue?
- Page has a clear purpose, resolution path, and “endgame” – will we still page for this in 1 year? 2 years? 5 years?
- Any future enhancements documented as GitHub issues
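For the “direct links” item, the Events API v2 trigger body also accepts a top-level `links` list of `{href, text}` objects, so the pagee lands one click away from the runbook and dashboard instead of hunting for them. A sketch – the URLs are placeholders:

```python
def with_links(event: dict, runbook_url: str, dashboard_url: str) -> dict:
    """Attach direct links to the runbook and dashboard to a trigger event."""
    event = dict(event)  # shallow copy; don't mutate the caller's event
    event["links"] = [
        {"href": runbook_url, "text": "Runbook (wiki)"},
        {"href": dashboard_url, "text": "Service dashboard"},
    ]
    return event
```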
We’ve noticed some patterns that should be avoided when dealing with the Pager. Here, we list some of the biggest offenders.
Person A gets paged, which immediately causes Person B to get notified. This potentially doubles the number of people that need to be woken up in the middle of the night. Of course, if a developer doesn’t know what to do for a particular page, they should absolutely reach out. However, if this becomes a pattern, we should ask ourselves the following questions:
- Could we improve the documentation and/or training around this page?
- Is this page being routed to the appropriate person or team?
Pager Snooze-fest / PagerDuty as a record-keeper
When we snooze a page, it’s generally because we’re waiting for something external to happen. If this becomes a pattern where we wait on a page without taking any action, the page should be re-examined. Ideally, a page only goes off when there is action to be taken – if there is no action, should we downgrade the severity of the alert?
This is particularly dangerous because it can lead people to become desensitized to the pager. If a page is going off too frequently, we should ask:
- Can we downgrade this alert to a Slack message?
- If we cannot downgrade this alert, should the underlying issue be on a delivery team’s roadmap to fix? Perhaps it is an automation that needs to be written, or perhaps it’s an instability – either way, it should be prioritized. The alternative is pager fatigue.
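One way to implement the downgrade is a severity gate in the alerting code: only severities that demand an immediate human response go to PagerDuty, and everything else lands in a Slack channel for business-hours follow-up. A sketch, where the severity names and the two destinations are our assumptions:

```python
def route_alert(severity: str) -> str:
    """Decide where an alert goes.

    Only page for problems that need someone now; lower-severity alerts
    go to Slack (e.g. via an incoming webhook) and wake nobody up.
    """
    if severity in ("critical", "error"):
        return "pagerduty"   # trigger an Events API incident
    return "slack"           # post to a channel for business-hours review
```

A disk-space warning, for example, posts to Slack instead of paging, which keeps the pager reserved for alerts that genuinely require action.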
Page too much, and you risk burnout, frustration, and being numb to real issues. Page too little and you miss important events that someone should be looking at. We hope this post has been a good starting point for finding that perfect middle ground.