What is our primary use case?
We are a 24-hour online business. We use it for scheduling our on-call engineers and making sure that there is follow-the-sun or round-the-clock coverage for alerting and network operations.
It ingests all our alert paths, i.e., anything that generates an alert of any description, such as Splunk, AWS, and internal applications. We feed all our events into it, then it generates alerts with a description, which need a response from an engineer. Another thing is that its built-in scheduling is pretty much hands-off for our on-call engineers unless somebody goes on holiday. That is the only time that we have to jump in there and make any changes.
How has it helped my organization?
One of the things with our incident flow is that it generates Jira tickets for us. Its Jira integration is critical because we need to have incidents logged for compliance in a separate ticketing system. Having it go into Jira is great, because we can generate hard copies of the alerts and all the events around them. Also, you only have to update one location: you can update Jira and that information goes across to PagerDuty, or you can update PagerDuty and it goes back to Jira. The integration that they have now is great. For example, if you are in the middle of a major event, where you have multiple incidents coming at you, the way it correlates events into a single incident is great.
It reduces the amount of white noise. If something comes through, then it will alert somebody. However, if it's a bit of white noise that comes through at night, then it gets dealt with the next day. Everything is visible to everybody. It's not just a single person getting an SMS, then going, "Oh, I'm not going to worry about that." The visibility to everybody on the team is one of the great things about it because it reduces the white noise.
What is most valuable?
The scheduling feature is the most valuable one for us, because scheduling was previously costing us time. For example, when I was doing the scheduling for the rosters, I would be spending maybe a day out of a month getting the rosters all sorted out. It was rather intense, and a fair chunk of time each month was dedicated to the schedules.
We also value the flexibility in what we can send to it: emails, custom webhooks, and things like that.
We have our production and development environments. If an alert fires in the development environment, it's generally treated as a low priority. However, if anything goes down or alerts us from the production environment as a critical or high priority, then an engineer has to stop and fix it straightaway.
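The environment-based triage described above can be sketched roughly as follows. This is only an illustration of the rule, not PagerDuty configuration: the `Alert` class, the label strings, and the "normal" fallback for lower-severity production alerts are assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical severity labels; these are not PagerDuty field names.
URGENT_SEVERITIES = {"critical", "high"}


@dataclass
class Alert:
    environment: str  # "production" or "development"
    severity: str     # e.g. "critical", "high", "low"
    summary: str


def response_priority(alert: Alert) -> str:
    """Map an alert to how quickly an engineer should respond."""
    if alert.environment == "production" and alert.severity in URGENT_SEVERITIES:
        # An engineer has to stop and fix it straightaway.
        return "immediate"
    if alert.environment == "development":
        # Development alerts are generally treated as low priority.
        return "low"
    # Assumption: anything else in production is handled in the normal queue.
    return "normal"
```

With this rule, a critical production alert comes back as "immediate", while the same alert from development comes back as "low".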
What needs improvement?
Because of the way you have to structure the rosters, if an engineer has to go on leave (or something), you can't just go in and reassign/take this person out of all of the different rosters that they're in. You have to go into each of the rosters and take them out. There might be a roster for business hours, after hours rotation, and monitoring deployments. Each time we need to take an engineer out of the pool, e.g., if they're sick or on leave, then we have to go and touch all of those rosters, updating and replacing them. Whereas, if we could just take the person out and have it automatically fill in the rostering, then that would make life a lot easier for managing it.
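As a rough sketch of the behavior we would like, assuming each roster is just an ordered rotation of names: taking a person out once removes them everywhere, and the remaining people fill the shifts automatically. The roster names and both functions are hypothetical and do not reflect how PagerDuty stores schedules.

```python
from typing import Dict, List


def remove_from_all_rosters(
    rosters: Dict[str, List[str]], engineer: str
) -> Dict[str, List[str]]:
    """Return a copy of every roster with the engineer taken out."""
    return {
        name: [person for person in rotation if person != engineer]
        for name, rotation in rosters.items()
    }


def on_call(rotation: List[str], shift_index: int) -> str:
    """Pick who covers a shift by cycling through the rotation in order."""
    if not rotation:
        raise ValueError("rotation is empty, nobody can cover the shift")
    return rotation[shift_index % len(rotation)]
```

For example, with rosters for business hours, after hours, and deployments, one call to `remove_from_all_rosters(rosters, "ben")` would take that engineer out of all three, and `on_call` would then cycle through whoever is left.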
We have an on-call phone number. However, at the moment, it is routed to a static voicemail. We would actually like to be able to have that phone follow whoever is on-call.
For how long have I used the solution?
I have used it for five or six years, possibly longer.
What do I think about the stability of the solution?
The stability is pretty good. There was one incident where push notifications stopped, but it failed over to SMS and phone calls, so it really didn't make much of a difference. Even then, because we didn't get that many alerts through at the time that they were having push notification issues, it didn't bother us. It was resolved very quickly (in about an hour). The only reason we noticed it was because they told us about it, not because we found it.
We haven't had any issues where PagerDuty caused an impact to us from their maintenance. Using their product, we have been able to set our alerts into maintenance, which is good. There has been no downtime from them being offline, or anything like that.
Before, we would have needed to do a lot of manual alert path management, then go through afterward, enabling and disabling them. Whereas, within PagerDuty, it's so much easier. You just go in there and put a service into maintenance, then it automatically does it all for you in the background. We don't have to sit there and think about it. So, it's quick and simple. This is saving us a good hour a month, because that would be one engineer sitting there going through, updating alert paths, etc.
Because we are a payment processor, we can't go offline. We need to be very on the ball and on point with any issues that come up. Having PagerDuty there means we're able to do that.
What do I think about the scalability of the solution?
The scalability is pretty good. I haven't seen anything that would restrict it. Because it's a SaaS platform, you can pretty much plug anything you want into it. I haven't had any restrictions on what I can feed into it from an alert perspective, and we can just keep adding more users as we see fit.
It is an integral part of our operations environment, so we wouldn't want to change or reduce it in any way. If our production environment increased and we had to add more services to it, then it's easy enough to do. It's not as though that is a major problem.
We have eight users in the organization. We also have a couple of stakeholder licenses so we can notify stakeholders of major events. These are not actually interactive. Stakeholders don't get alerts, but they get notifications if we allow it, such as when we add them to an incident. These stakeholders are internal. We don't have external clients in that loop because information management is something that we keep a tight handle on, and it is very manual.
There are two other DevOps engineers who maintain it. There is redundancy if I'm sick. However, I still take the lead on a lot of the stuff.
How are customer service and technical support?
We worked with technical support at one stage when we were trying to get a mail filter. We wanted to set up a complex mail filter with some rules around it. That is when we contacted them, though this is not an ongoing requirement. They were pretty good and very informative. They were to the point, without being blunt.
Which solution did I use previously and why did I switch?
At the time of implementation, the solution was to replace our SMS-based solution, taking the rostering and management of the SMS rotation and making it easier. This was a bunch of homegrown shell scripts that had a little modem card, which would send SMSs to us.
We switched to PagerDuty mainly because of the maintenance and inflexibility of our original solution. We had to maintain it ourselves, paying for the upkeep of the modem, SMS account, etc., then making sure that we could send the information to various phones on different carriers. By going to PagerDuty, we were able to come up with multiple paths to get those alerts, not just SMS.
Previously, we were manually copying and pasting the information. Per incident, it was taking us maybe half an hour, because someone would have to sit there and copy things backwards and forwards, making sure it was all in sync at the end of the incident.
When we first started looking around for a product to replace the existing alerting process, we found that this product made alerts more visible. After a while, that visibility naturally reduced the quantity of alerts. It also made it easier to deal with issues because everybody saw the alerts, not just one person.
How was the initial setup?
The initial setup was really easy. We just went in there and clicked a couple buttons, then away we went.
Anytime you need to set something up, the initial setup is great, quick, and easy. It's when you get into some of the nuances, like rostering, where you have to take a person out of a roster, then put them back in. That sometimes adds a bit of complexity. However, the initial setup was one of the things that sold us on it since it was so quick and easy. That is because it is a SaaS-based solution.
When we initially started it, it was like me fiddling around on one weekend. I said to the guys, "Look, I've got this going," then it pretty much went from there. So, it might have been an hour at the most. It did not take long at all.
What about the implementation team?
I was the only person who deployed it.
What was our ROI?
The main flexibility and return on investment we get is that we don't have to do the maintenance on the products that we previously had. It's just seamless. It's like, "Oh yeah, it's reliable. We don't have to do anything else." Whereas, previously it was, "Ah, is the pager actually working?" This reduces worry and everybody's comfortable with the fact that it's going to work. So, the return on investment is more a comfort factor, knowing that we're able to rely on it and not worry that, "Oh, hang on, the alerting's not working," then go and chase up what's wrong with the alerting as well as chase any other problems which come up.
The best thing that we've had is that we get alerted before things happen rather than after the customer's having a problem or notices the problem.
As a result of the reduced white noise, we have reduced engineer fatigue. This means that because the engineers are not tired, their work throughput increases. It is definitely noticeable. If an engineer gets called after hours every night, then when they come in to do their shift, they're tired because they've had interrupted sleep. Whereas, if we make sure we don't have the white noise and everything else coming through, they're still able to get through their normal workload as well.
What's my experience with pricing, setup cost, and licensing?
If you add more people, then you have to pay more, which is always the case with SaaS solutions.
PagerDuty's pricing seems competitive. At one point, we were looking at OpsGenie because part of their current pricing includes the call routing that we wanted to include. It was actually cheaper to get that plus the call routing than it is on PagerDuty at the moment. However, we would have to go and buy an extra module to go with it. What we have at the moment is solid, and it would be a hard sell to say, "We'll go to something else that we're not familiar with."
If we wanted phone calls or additional SMSs, we would have to pay for those. They give us so many per month per user, then we have to pay extra if we go over that.
Which other solutions did I evaluate?
Over the years, we have looked at other solutions: OpsGenie and VictorOps. There was another one, but they faded away. We were also using Pingdom at one point. Some of them are still a little bit green in this space. They're definitely coming up to speed.
So far, we're settled on PagerDuty because they were the leader and only one around at the time we were evaluating solutions. Since then, we've started looking at other products just to make sure that they're still on point with what we need.
The alerting functionality is not too bad. I have evaluated other competitive products for the way you can set different types of alerts, e.g., for non-critical or critical. PagerDuty will alert you differently based on those settings, which is an advantage that we like. It will also try multiple paths, so you can set it up to email you the alert, send you an SMS, phone you, or just send a push notification to your phone. Between those four mechanisms, the engineer will get notified one way or another. If that doesn't work, it automatically escalates to the next person in the alerting path.
We do have a project in the pipeline for probably the beginning of next year to go through and do another review to make sure that the solution has everything there. We also want to compare what other options are available, check what's on offer, make sure the pricing is still competitive, and so on.
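The multi-path notification and escalation behavior described above can be sketched as a simple loop. This is a minimal illustration and not PagerDuty's implementation: the channel order, the function name, and the `try_channel` callback (which is assumed to return True once the engineer responds on that channel) are all made up for the example.

```python
from typing import Callable, List, Optional, Tuple

# Assumed channel order; in practice each engineer sets their own preferences.
CHANNELS = ["push", "sms", "phone", "email"]


def notify_with_escalation(
    escalation_path: List[str],
    try_channel: Callable[[str, str], bool],
) -> Optional[Tuple[str, str]]:
    """Try every channel for each engineer, escalating until someone responds.

    Returns (engineer, channel) for the first response, or None if the
    whole alerting path is exhausted without a response.
    """
    for engineer in escalation_path:
        for channel in CHANNELS:
            if try_channel(engineer, channel):
                return engineer, channel
        # Nobody responded on any channel: escalate to the next person.
    return None
```

If the first engineer on the path does not respond on any of the four channels, the loop moves on to the second engineer, which is the escalation step.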
What other advice do I have?
For whatever solution you have for alerting, and it being such a critical role in incident management, you need to be able to rely on it. PagerDuty allows us to do that.
Ensure you sit down and identify what you want in any alerting platform, whether it's PagerDuty or OpsGenie. Sit down and define what you want, particularly around your scheduling, what alerts you want to be able to ingest or handle, who you want to be able to process or send those alerts to, and any other possible bits and pieces in there that you may need before you sit down and look at an alerting platform of any description. Because sometimes, depending on what it is, there may be another way of doing it when you actually go and talk to the salespeople or pre-sales engineers. They'll go, "Oh, well, you can do this, this, or this." This will avoid bright light problems where, "Oh, that's a nice, shiny light. Yeah, we need that." You actually have in front of you what you need, not necessarily what they're trying to sell you.
We have looked at the solution's analytics, but haven't gone much into them. At the time that we were looking at them, we didn't see any real benefit since we are only a small team. A larger organization would get more benefit out of them. However, because we're such a small team, everybody knows how many alerts are coming through. It's not as though we need to do a full-on detailed, analytical review of things.
I would rate this solution a nine out of 10. It is a reliable solution that works.