What is our primary use case?
Our most common use case is handling alerts from a monitoring system, like New Relic or Nagios, that we define as critical. These are alerts where we need someone to get on a bridge or to start working on them during the night. When such an alert fires, it triggers a PagerDuty alert that pages whoever is currently on call according to the PagerDuty schedule.
The on-call person acknowledges the alert, looks into it to understand what is going on, and posts status updates via PagerDuty. The updates are sent to all the groups that are part of the PagerDuty schedule until the issue is resolved.
We mostly integrate it with monitoring tools like New Relic or Nagios, or we use its email integration so that on-call processes can page people in groups. We also use it for Sev 1 issues raised by alerts from New Relic, Nagios, or other monitoring systems.
How has it helped my organization?
When my team is needed immediately, it's easier to use this kind of service than to have people trying to catch someone by phone or email during off-hours. People can just fire an email to an on-call email address and it will reach the current on-call, who knows he has to be available at that time.
Also, because we are not a large group and we do not have our eyes on glass 24/7, we need to have one on-call available for several projects. The current on-call may not always understand why a project is firing an alert, but he will know how to easily reach the person who is the focal point for the project in question.
Also, most of the time, the teams that want to engage my team are not so fluent in English, and it's easier to understand someone via email. But my team is not always in front of their email. PagerDuty bridges the gap between the email that asks for help and the people who can provide it. PagerDuty calls our on-call, he answers the phone, and he understands that there is a real issue. After that, he reads the email or looks in the body of the pager message, gets an understanding of what the issue is, and engages the focal point.
What is most valuable?
It's a tool for incident management that helps us understand what happened during an alert. A cool feature is that it shows the flow of the alert. If the alert went to the current on-call and he didn't catch the call or didn't notice it for any reason, it is escalated automatically to other teammates, according to the escalation schedule. You can see the flow very easily on your phone or via the website if you want to do a post-mortem.
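To make the escalation flow concrete, here is a minimal sketch of how an unacknowledged alert could walk through an escalation chain. It is only an illustration of the behavior described above, not PagerDuty's implementation; the `EscalationLevel` class, the `escalate` function, the responder names, and the timeouts are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class EscalationLevel:
    responder: str        # hypothetical responder name
    timeout_minutes: int  # how long to wait for an acknowledgement


def escalate(levels: List[EscalationLevel],
             acked_by: Optional[str] = None) -> List[Tuple[int, str]]:
    """Return (minutes since trigger, responder paged) for each page sent."""
    timeline = []
    elapsed = 0
    for level in levels:
        timeline.append((elapsed, level.responder))
        if level.responder == acked_by:
            break  # someone acknowledged, so escalation stops here
        elapsed += level.timeout_minutes
    return timeline


# Hypothetical three-level chain
policy = [
    EscalationLevel("primary on-call", 15),
    EscalationLevel("secondary on-call", 15),
    EscalationLevel("team manager", 30),
]

# The primary misses the page; the secondary acknowledges it.
timeline = escalate(policy, acked_by="secondary on-call")
for minutes, responder in timeline:
    print(f"+{minutes} min: paged {responder}")
```

A timeline like this is roughly what you would review in a post-mortem: who was paged, in what order, and how long it took until someone picked it up.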
The solution's alerting functionality is very good. It does the job. It's not that it only works sometimes. It works every time it needs to. It also automatically closes alerts that have been closed in the monitoring system, and you can easily close and acknowledge alerts from your phone even if you don't have the mobile app. You can do it with an SMS. So at 2:00 a.m., it's very easy to navigate an incident.
The email-for-alerting integration is also valuable. If there is a team that needs my team, they can easily send an email with a subject and an explanation of why they want us on board to start investigating an issue. In the past, they would call the on-call number and try to explain what was going on. Now they just send an email and it pages the current on-call who is scheduled. It's very nice and easy.
While using PagerDuty hasn't resulted in a decrease in issues, it has allowed us, in combination with the monitoring systems, to know about issues before customers alert us. If a monitoring system only sent emails, those emails could be missed among thousands of others. But if we create alerts in New Relic, which integrates with PagerDuty, and we get a call from PagerDuty, it's much better. Because we don't miss the alert, we can engage with other teams during working hours or resolve the issue without causing problems for our customers. Issues can be resolved before someone notices.
It is more the monitoring systems that can point out problems to be addressed before they become worse, but those systems are not really able to do more than send us an email. Without the integration to PagerDuty, issues that are defined as critical could be missed.
What needs improvement?
There is room for improvement with the on-call schedule. The way the schedule currently works is that you assign all the team members to one schedule and it automatically spreads them throughout the schedule. Because of that, I need to do extra work to adjust it for specific team needs or for how I'm staffing my team. It would be better to be able to edit the schedule and place my team members where I want, or at least to have that option in addition to the automatic process. I find myself redoing the schedule often. Every month I need to make another schedule. It's not so bad but it could be improved.
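As a rough illustration of what I mean, the sketch below spreads team members automatically across weekly shifts but lets a manual placement win for specific weeks. This is a hypothetical model of the requested behavior, not PagerDuty's scheduler; `build_schedule`, the names, and the dates are all made up for the example.

```python
from datetime import date, timedelta
from typing import Dict, List, Optional


def build_schedule(members: List[str], start: date, weeks: int,
                   overrides: Optional[Dict[date, str]] = None) -> Dict[date, str]:
    """Map each week's start date to an on-call person.

    Members are rotated round-robin; a manual override for a given
    week takes precedence over the automatic rotation.
    """
    overrides = overrides or {}
    schedule = {}
    for i in range(weeks):
        week_start = start + timedelta(weeks=i)
        automatic = members[i % len(members)]
        schedule[week_start] = overrides.get(week_start, automatic)
    return schedule


# Hypothetical team and a single manual placement
members = ["Avi", "Dana", "Noam"]
manual = {date(2024, 1, 8): "Noam"}  # Dana is unavailable that week
schedule = build_schedule(members, date(2024, 1, 1), weeks=4, overrides=manual)

for week_start, person in schedule.items():
    print(week_start.isoformat(), person)
```

With something like this, the automatic spread stays as the default and I would only touch the weeks that actually need a change, instead of rebuilding the whole schedule every month.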
For how long have I used the solution?
I've been using PagerDuty for more than three years.
What do I think about the stability of the solution?
It's always up. We haven't faced any issues with the PagerDuty platform. In that sense, it hasn't affected our operations at all. But if there were an issue with PagerDuty, I can see how it might be like Murphy's Law and that the issue would happen at a time when we needed PagerDuty to be working. That would not be good for a group like ours that operates several main projects, projects which impact a lot of customers all over the world. So the availability is very important for us.
What do I think about the scalability of the solution?
We haven't had any issues with scaling up.
Currently we don't have plans to expand our usage; we are good with what we have. But we use it very often, mostly for alerts on the weekend. And when there are crises, we get alerts that come through PagerDuty.
How are customer service and technical support?
I myself have not had to work with support very much, but I understand from my team that they are good and have solutions. Someone in particular from my team had to work with their technical team and they helped him a lot. If we find issues or we have suggestions for improving the solution, they're very responsive.
How was the initial setup?
I wasn't involved when they implemented PagerDuty, but I don't think the company had to implement anything here. It's a SaaS service and the integrations are done through integration keys, which is something I set up for each project. It's simply that you have the service, you can log in, and you do what you want to do.
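For anyone who hasn't seen what an integration key is used for, here is a minimal sketch of a trigger event and a matching resolve event in the shape of PagerDuty's Events API v2. The integration key, the `dedup_key`, the summary, and the helper names `build_event` and `send_event` are placeholders I made up; check the field details against PagerDuty's documentation before relying on them.

```python
import json
import urllib.request

# Events API v2 endpoint
EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def build_event(routing_key, action, dedup_key,
                summary="", source="", severity="critical"):
    """Build an event body; routing_key is the service's integration key."""
    event = {
        "routing_key": routing_key,
        "event_action": action,  # "trigger", "acknowledge", or "resolve"
        "dedup_key": dedup_key,  # ties later events to the same alert
    }
    if action == "trigger":
        event["payload"] = {
            "summary": summary,
            "source": source,
            "severity": severity,
        }
    return event


def send_event(event):
    """POST the event. Not called here: it needs a real integration key."""
    request = urllib.request.Request(
        EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Placeholder key and alert details
trigger = build_event("YOUR_INTEGRATION_KEY", "trigger", "db-01-disk",
                      summary="Disk full on db-01", source="nagios")
resolve = build_event("YOUR_INTEGRATION_KEY", "resolve", "db-01-disk")
print(json.dumps(trigger, indent=2))
```

Sending the resolve event with the same `dedup_key` is how a monitoring system can close the alert it opened. In practice the built-in New Relic and Nagios integrations do this for you; you only paste the key.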
They just gave us the license key, we got access, and we brought our team into PagerDuty by sending them each an email to log in.
The integrations with our monitoring tools took five minutes. It's very easy. And they have a lot of integrations. If you have a specific tool that you need to integrate with, you can always use their email integration, where your tool will send an email to a specific address and PagerDuty will fire the alarm.
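As a sketch of the email route, this is roughly what a tool (or a person) would send. It only builds the message with Python's standard library; the sender, the integration address, and the function name `build_page_email` are placeholders, and the real address is the one PagerDuty shows for your service's email integration.

```python
from email.message import EmailMessage


def build_page_email(sender, integration_address, subject, body):
    """Build the email that would page the current on-call."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = integration_address
    message["Subject"] = subject  # short description of the problem
    message.set_content(body)     # details the on-call reads after the page
    return message


# Placeholder addresses and content
msg = build_page_email(
    sender="support-team@example.com",
    integration_address="oncall-service@example.pagerduty.com",
    subject="Sev 1: checkout API returning 500s",
    body="Started 02:10 UTC. Customers cannot complete orders. Please join the bridge.",
)
print(msg)

# To actually send it, you could hand msg to smtplib, for example:
#   smtplib.SMTP("smtp.example.com").send_message(msg)
```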
And we don't need to maintain PagerDuty. It's a SaaS service so the only thing we need to do is create a schedule and, if there is a new integration, to set up the integration. It's not something that you need to be doing every day.
What was our ROI?
I think we have had a return on our investment but I can't give you actual numbers. It has prevented a lot of potential crises for our customers. We catch things before anyone else knows about them. We are based in Israel while 90 percent of our customers are in the U.S., so we know about customer-facing issues, in our local time, before they are felt in the U.S. The main functionality is that it calls us for critical issues and outages. It's very helpful and has reduced customer complaints and issues that could cause us to struggle.
What other advice do I have?
I don't use the solution's analytics very much. I only use them at the end of the year if management wants to see usage figures and the capacity of my team.
We have about 60 to 80 users of the solution. Most of them are support engineers, developers, and some managers.