Reduces white noise, which has reduced engineer fatigue
What is our primary use case?
We are a 24-hour online business. We use it to schedule our on-call engineers and to make sure there is follow-the-sun or round-the-clock coverage for alerting and network operations. It ingests all our alert paths, i.e., anything that generates an alert of any description, such as Splunk, AWS, and internal applications. We feed all our events into it, and it generates alerts, each with a description, that need a response from an engineer. Another thing is that its built-in scheduling is pretty much hands-off for our on-call engineers unless somebody goes on holiday. That is the only time…
Pros and Cons
"It reduces the amount of white noise. If something comes through, then it will alert somebody. However, if it's a bit of white noise that comes through at night, then it gets dealt with the next day. Everything is visible to everybody. It's not just a single person getting an SMS, then going, 'Oh, I'm not going to worry about that.' The visibility to everybody on the team is one of the great things about it because it reduces the white noise."
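To make the behaviour described in this quote concrete, here is a minimal sketch of that kind of triage rule. It is not PagerDuty's API or configuration: the `Alert` class, the `route` function, the night window of 22:00 to 07:00, and the three outcome labels are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass
class Alert:
    """Hypothetical alert record; field names are illustrative only."""
    source: str          # e.g. "Splunk", "AWS", or an internal application
    summary: str
    urgent: bool         # assumed flag separating real incidents from white noise
    received_at: datetime


def is_night(ts: datetime, start: time = time(22, 0), end: time = time(7, 0)) -> bool:
    """True if ts falls in the assumed overnight window, which wraps past midnight."""
    t = ts.time()
    return t >= start or t < end


def route(alert: Alert) -> str:
    """Decide what happens to an alert under the assumed rules."""
    if alert.urgent:
        # Something real came through: page the on-call engineer at any hour.
        return "page_on_call"
    if is_night(alert.received_at):
        # Overnight white noise waits until the next day.
        return "defer_to_next_day"
    # Daytime low-priority alerts stay visible to the whole team (assumed behaviour).
    return "team_queue"
```

Whatever the rule decides, the point made in the quote still holds: the alert remains visible to the whole team rather than sitting on one person's phone.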
"Because of the way you have to structure the rosters, if an engineer has to go on leave, you can't just go in and reassign this person or take them out of all of the different rosters that they're in. You have to go into each of the rosters and take them out. There might be a roster for business hours, one for the after-hours rotation, and one for monitoring deployments. Each time we need to take an engineer out of the pool, e.g., if they're sick or on leave, we have to go and touch all of those rosters, updating them and replacing the engineer. Whereas, if we could just take the person out and have it automatically fill in the rostering, then that would make life a lot easier for managing it."
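The bulk operation this quote asks for can be sketched as follows. This works on a plain in-memory dictionary, not on PagerDuty's schedules; the function name `remove_from_all_rosters`, the roster names, and the engineer names are hypothetical and exist only to show the idea of one call touching every roster.

```python
from typing import Dict, List, Optional

# Hypothetical model: roster name -> ordered list of engineer names.
Rosters = Dict[str, List[str]]


def remove_from_all_rosters(
    rosters: Rosters, engineer: str, replacement: Optional[str] = None
) -> List[str]:
    """Take an engineer out of every roster in one pass, editing rosters in place.

    If a replacement is given, it takes over the engineer's slots;
    otherwise the slots are simply dropped. Returns the names of the
    rosters that were changed.
    """
    touched = []
    for name, members in rosters.items():
        if engineer not in members:
            continue
        if replacement is None:
            members[:] = [m for m in members if m != engineer]
        else:
            members[:] = [replacement if m == engineer else m for m in members]
        touched.append(name)
    return touched


# Example data (made up): three rosters like the ones described above.
rosters = {
    "business_hours": ["ana", "ben", "chi"],
    "after_hours": ["ben", "dev"],
    "deployments": ["ana", "chi"],
}
changed = remove_from_all_rosters(rosters, "ben", replacement="eli")
```

After the call, `changed` lists the two rosters that contained "ben", and "eli" holds those slots. A real tool would also need to handle the return from leave, which this sketch leaves out.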
What other advice do I have?
Alerting plays such a critical role in incident management that, whatever solution you use for it, you need to be able to rely on it. PagerDuty allows us to do that. Ensure you sit down and identify what you want in any alerting platform, whether it's PagerDuty or OpsGenie. Define what you want, particularly around your scheduling, what alerts you want to be able to ingest or handle, who you want to be able to process or send those alerts to, and any other bits and pieces that you may need, before you sit down and look at an alerting platform of any…