At Redox, we have been using PagerDuty for almost the entire history of the company. The ability to automatically and instantly notify engineers when issues arise is core to our support model. But turning your developers loose on the PagerDuty API can cause headaches, especially if you’re not striking the right balance. Andy, Blake, and I have put together some heuristics we use to hone our development of new pages, as well as a list of anti-patterns to recognize and account for.
- Who gets the page?
- Who is required to resolve the issue (Redox and/or Customer)?
- How does the page fit into the bigger picture? (use the correct service!)
- Page goes to the person who can most efficiently solve the problem
- All relevant contacts are documented
- Page goes to customers as well, where appropriate
- The correct PagerDuty service is used for triggering the page
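One way to keep pages on the correct service is to give each monitored component its own Events API routing key, so an alert can never land on a catch-all service by accident. A minimal sketch – the component names and keys below are placeholders, not real values:

```python
# Map each component to the routing key of the PagerDuty service that owns it.
# Placeholder keys only; real keys come from each service's integration settings.
SERVICE_ROUTING_KEYS = {
    "api": "ROUTING_KEY_API_PLACEHOLDER",
    "database": "ROUTING_KEY_DB_PLACEHOLDER",
    "billing": "ROUTING_KEY_BILLING_PLACEHOLDER",
}

def routing_key_for(component: str) -> str:
    """Look up the routing key for the service that owns this component.

    Failing loudly on an unmapped component is deliberate: it forces us to
    decide which service (and which escalation path) owns a new alert.
    """
    try:
        return SERVICE_ROUTING_KEYS[component]
    except KeyError:
        raise ValueError(f"No PagerDuty service mapped for component: {component}")
```

The explicit mapping doubles as lightweight documentation of which team owns which alert.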
- What are the details of the problem?
- What additional research does the pagee need to do after they get the page?
- Does the system detect when a problem has automatically fixed itself?
- Page contains all details needed to triage in the page itself.
- Page is documented on a wiki page.
- Page can automatically resolve itself.
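The checklist above maps naturally onto PagerDuty’s Events API v2: put the triage details in `custom_details` so they travel with the page itself, and reuse a stable `dedup_key` so the same monitor can resolve the alert when the problem clears. A minimal sketch – the monitor name, summary, and detail fields are hypothetical:

```python
# Events API v2 endpoint (events are POSTed here as JSON).
PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

def build_trigger(routing_key: str, dedup_key: str, summary: str, details: dict) -> dict:
    """Build a trigger event whose body carries everything needed to triage."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "dedup_key": dedup_key,          # stable key so this alert can be resolved later
        "payload": {
            "summary": summary,
            "source": "queue-monitor",   # hypothetical monitor name
            "severity": "critical",
            "custom_details": details,   # triage details live in the page itself
        },
    }

def build_resolve(routing_key: str, dedup_key: str) -> dict:
    """When the monitor sees the problem clear, resolve the same dedup_key."""
    return {
        "routing_key": routing_key,
        "event_action": "resolve",
        "dedup_key": dedup_key,
    }
```

The monitor POSTs the trigger body when its check fails and the resolve body, with the same `dedup_key`, when the check passes again – which is how a page can resolve itself without waking anyone a second time.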
- Does the page need to go off in the middle of the night? Can it be worked during business hours?
- What kind of response time does the page need?
- Does the page become irrelevant past a certain point?
- Does the page need to exist?
- Uses an appropriate service – this is a PagerDuty construct
- PagerDuty service uses appropriate escalation paths – all the way up the phone tree
- Where does the pagee need to go to solve the problem? Dashboard, AWS, database, DataDog, New Relic, Logs, E-mail?
- Does the pagee have access to all the tools they need to solve the problem?
- Provide direct links to relevant reports, dashboard pages, etc.
- Document what services need to be used on the wiki
- Can the system be designed to handle the page automatically? Now? In the future?
- At what point do we stop paging for this issue?
- Page has a clear purpose, resolution path, and “endgame” – will we still page for this in 1 year? 2 years? 5 years?
- Any future enhancements documented as GitHub issues
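For the “direct links” item, the Events API v2 trigger body also accepts a top-level `links` list of `{href, text}` objects, so the pagee lands one click away from the runbook and dashboard instead of hunting for them. A sketch – the URLs are placeholders:

```python
def with_links(event: dict, runbook_url: str, dashboard_url: str) -> dict:
    """Attach direct links to the runbook and dashboard to a trigger event."""
    event = dict(event)  # shallow copy; don't mutate the caller's event
    event["links"] = [
        {"href": runbook_url, "text": "Runbook (wiki)"},
        {"href": dashboard_url, "text": "Service dashboard"},
    ]
    return event
```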
We’ve noticed some patterns that should be avoided when dealing with the Pager. Here, we list some of the biggest offenders.
Person A gets paged, which immediately causes Person B to get notified. This potentially doubles the number of people that need to be woken up in the middle of the night. Of course, if a developer doesn’t know what to do for a particular page, they should absolutely reach out. However, if this becomes a pattern, we should ask ourselves the following questions:
- Could we improve the documentation and/or training around this page?
- Is this page being routed to the appropriate person or team?
Pager Snooze-fest / PagerDuty as a record-keeper
When we snooze a page, it’s generally because we’re waiting for something external to happen. If this becomes a pattern where we wait on a page without taking any action, the page should be re-examined. Ideally, a page only goes off when there is action to be taken – if there is no action, should we downgrade the severity of the alert?
This is particularly dangerous because it can lead people to become desensitized to the pager. If a page is going off too frequently, we should ask:
- Can we downgrade this alert to a Slack message?
- If we cannot downgrade this alert, should the underlying issue be on a delivery team’s roadmap to fix? Perhaps it is an automation that needs to be written, or perhaps it’s an instability – either way, it should be prioritized. The alternative is pager fatigue.
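One way to implement the downgrade is a severity gate in the alerting code: only severities that demand an immediate human response go to PagerDuty, and everything else lands in a Slack channel for business-hours follow-up. A sketch, where the severity names and the two destinations are our assumptions:

```python
def route_alert(severity: str) -> str:
    """Decide where an alert goes.

    Only page for problems that need someone now; lower-severity alerts
    go to Slack (e.g. via an incoming webhook) and wake nobody up.
    """
    if severity in ("critical", "error"):
        return "pagerduty"   # trigger an Events API incident
    return "slack"           # post to a channel for business-hours review
```

A disk-space warning, for example, posts to Slack instead of paging, which keeps the pager reserved for alerts that genuinely require action.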
Page too much, and you risk burnout, frustration, and being numb to real issues. Page too little and you miss important events that someone should be looking at. We hope this post has been a good starting point for finding that perfect middle ground.