What is our primary use case?
The first use case is technology operations tools. We are a best-of-breed monitoring shop. We have all kinds of tools that monitor things, like storage, network, servers, applications, and all types of stovepipes that do domain-specific monitoring. Each one of those tools was sold to us with what they called a single pane of glass for their stovepipe. However, none of the tools are actually publishing or sharing any of the events that they have detected. So, we have been doing a poor job of correlating events to try and figure out what's going on in our operations.
Our use case was to leverage that existing investment. For about a year, we have been proving that we can build publishing adapters from these legacy monitoring tools, which are each valid in their own right, like storage monitoring tools, network monitoring tools, and application monitoring tools (like Dynatrace), some more modern than others. We have been building publishing adapters from those things so we can transport those events to an event aggregation and event correlation service. We're still trying to run through our list of candidates for what our event correlation will be, but the popular players are Splunk, Datadog, and Moogsoft, then ServiceNow has its own event management module.
From an IT systems management perspective, our use case is to have a common event transport fabric that spans multiple clouds and is WAN optimized. What is important for me is topic wildcarding and prioritization/QoS. We want to be able to set some priorities on IT events versus real business events.
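To illustrate what prioritizing business events over IT events could look like, here is a minimal sketch using only the Python standard library. It is not Solace's API. The class name, topic strings, and the assumption that business events outrank IT operations events are all hypothetical and chosen only for illustration.

```python
import heapq
from itertools import count

# Hypothetical priority levels: a lower number is delivered first.
BUSINESS = 0
IT_OPS = 1


class PriorityTransport:
    """Toy in-memory transport that delivers higher-priority events first."""

    def __init__(self):
        self._heap = []
        self._seq = count()  # tie-breaker keeps FIFO order within one priority

    def publish(self, priority, topic, payload):
        heapq.heappush(self._heap, (priority, next(self._seq), topic, payload))

    def deliver_next(self):
        _priority, _seq, topic, payload = heapq.heappop(self._heap)
        return topic, payload


transport = PriorityTransport()
transport.publish(IT_OPS, "ops/storage/warning", "disk 80% full")
transport.publish(BUSINESS, "retail/payments/created", "payment 123")
transport.publish(IT_OPS, "ops/network/critical", "firewall down")

first = transport.deliver_next()
second = transport.deliver_next()
third = transport.deliver_next()
```

In this sketch the business event is delivered first even though it was published second, and the two IT events keep their original order.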
The second use case is more of an application focus. I'm only a contributor on the app side. I'm more of an infrastructure cloud architect and don't really lead any of the application modernization programs, but I'm a participant in almost all of them. E.g., we have application A and application B side by side sitting in our on-prem data center, and they happen to use IBM MQ Hub to share our data as an integration. Application A wants to move to Azure. They are willing to make their investment to modernize the app, not a forklift, but some type of transformation event. Their very first question to us is, "I need to bring IBM MQ with me because I need to talk to app B who has no funding and is not going to do anything." Therefore, our opening position is, "Let's not do that. Let's use cloud-native technology where possible when you're replatforming your application. Use whatever capability you have for asynchronous messaging that Azure offers you. Let's get that message onto the Azure Event Hub. Don't worry about it arriving where it needs to arrive because we'll have Solace do some protocol transformation with HybridEdge, essentially building a bridge between the Azure Event Hub and MQ Hub that we have in our data center."
The idea is to build bridges between our asynchronous messaging hubs, and there's only a small handful of them, where Azure Event Hub is the most modern. We have an MQ Hub that runs on a mainframe and IBM DataPower appliances that serve as our enterprise service bus (ESB). Therefore, if we build bridges between those systems, then our app modernization strategy is facilitated by a seamless migration to Azure.
The most recent version is what we installed about three weeks ago.
The solution is deployed on Azure for now. We will be standing up some nodes in our on-prem data centers during the next phase, probably in the next six months.
The plan is to use event mesh. We're not using it as an event mesh yet, as we are only deployed with Azure. We want to position a Solace event mesh for enterprise, but we're just now stretching into Azure. We're a little slow on the cloud adoption thing. We've got 1200 applications at CIBC with about four of them hosted in clouds: one at AWS and three at Azure. So, we're tiptoeing into Azure right now. We're probably going to focus our energy on moving stuff into Azure. However, for now, because the volume is so low on stuff that's outside of our data center, the concept of a mesh has been socialized. There's not a ton of enthusiasm for it, even though I might be shouting from the rooftops saying, "It's a foundational capability in a multicloud world." It looks like we're putting that funding on the back burner for using it as an event mesh.
How has it helped my organization?
This solution has increased our application design productivity compared to other solutions. There is a ton of momentum in our application development space for leveraging Dynatrace with Solace's monitoring tool. We have made the investment in getting Dynatrace to publish events that it detects, mostly application performance related events. The app development teams have taken a liking to implementing the application monitoring tool early in their development cycles, maybe not in development, but in their performance testing cycles. We can practice what a code drop stack shift would look like if they're shifting from stack A to stack B or if they're doing rolling reboots on some of their app servers as they're doing upgrades. We get to exercise that and see what the monitoring patterns look like correlated with servers going up and down along with web services coming up and down. That's been helpful to the development community to see that automation occur in mid-environments.
There have been quite a few incidents where a test infrastructure has become unavailable because of some change going on, and the app developers aren't on the hook for fixing the problem in the UAT environment because they know the outage was caused by the fact that they did a code drop 20 minutes earlier, which is a legitimate server outage. We are seeing some benefit, but it's more of an optimization of incident management resources. E.g., we have somewhere between five and 10 Internet-facing applications, and when something goes bump on the firewall that's behind them, we have DMZs (or different zones) where we put our web tier and app tier. Therefore, when something goes bump in our network tier, we've got 10 application teams that are all fired up, and say, "What's going on?" Then, they all spin up their own tech bridges. Meanwhile, the firewall guys are working on a problem that we just don't know about. So, we have wasted a lot of time and energy trying to figure out things that aren't our problem. The bad scenario hasn't happened to us in production, but in our test environment, it has happened once where a couple of app dev teams have been able to stand down because we were correlating events correctly.
We struggle with mean time to resolution on things. We do have a lot of change control rigor, but the solution hasn't changed our organization yet. The idea is that when we're getting the events from our service provider for operating systems, servers, storage, and network correlated intelligently together with our application changes, application performance monitors, and application availability monitoring tools, then we'll make more intelligent decisions about root cause, where problems lie, and be able to react more intelligently. This will reduce mean time to resolution, but we're not there yet.
The division who has been using Solace for years has a mature costing estimator model for internal projects. That model certainly will be leverageable for the technology operations guys. We haven't crossed that bridge yet because we're still in PoC mode. It's very likely that once we hit prod, we'll have ease of solution design when we have a protocol-agnostic message transport in place, and that our solutions will be easier to craft and give cost estimates.
It is easy for architects and developers to extend their design and development investment to new applications using this solution. In our architecture practices, we are always documenting compositions. We care a lot about the data exchanges between applications or the integrations. We have a lot of contractors and other integrations that we care about. Having transmission facilitators definitely makes the architect's life a lot easier when we just put a message on the queue and it's going to get transported by the facilitators to wherever it needs to go. It is definitely easier when we have Solace and an event mesh up and running. Today, when we have integrations that don't leverage those transmission facilitators, like an MQ Hub or Solace event mesh, those integrations are much harder to get approved because we have to dive into the security, access controls, encryption, and all that other stuff.
What is most valuable?
The most useful features have been the WAN optimization and probably the HybridEdge, which requires some third-party adapters or plugins. The idea that we can position Solace as a protocol-agnostic message transport fabric is key to our company having all manners of asynchronous messaging protocols from MQ, Kafka, JMS, etc. I really like the WAN optimization: Send once over a WAN, then distribute locally as many times as there are subscribers.
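The "send once over a WAN, then distribute locally" idea can be made concrete with a small count of WAN transmissions per published event. This is only an illustrative sketch: the site names and subscriber counts are made up, and the function models the concept rather than any Solace feature or API.

```python
def wan_sends(subscribers_per_site: dict, fan_out_locally: bool) -> int:
    """Count WAN transmissions needed to deliver one event to every subscriber.

    subscribers_per_site maps a remote site name to its number of subscribers.
    With local fan-out, each remote site with at least one subscriber receives
    the event once and distributes it locally. Without it, every remote
    subscriber gets its own copy over the WAN.
    """
    if fan_out_locally:
        return sum(1 for n in subscribers_per_site.values() if n > 0)
    return sum(subscribers_per_site.values())


# Hypothetical remote sites and subscriber counts.
sites = {"azure-canada-central": 3, "on-prem-dc": 5}

with_fan_out = wan_sends(sites, fan_out_locally=True)
without_fan_out = wan_sends(sites, fan_out_locally=False)
```

With these made-up numbers, local fan-out needs 2 WAN transmissions per event instead of 8, and the gap widens as subscribers are added at each site.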
I don't think we have yet unleashed the full potential of topic wildcarding. That is a silver bullet that we haven't yet maximized the value on because we don't have a ton of subscribers yet. Coming up with a topic naming convention in our large company has been difficult. However, once we start forking data over to some of our data lakes, enterprise data hub, and security event depositories, it will become a useful feature in the future.
What needs improvement?
The storytelling about the benefits needs improvement. We have four major lines of business in our company: our retail, capital markets, and internal corporate center lines of business, along with technology operations, which is more of a cost center. Technology operations are not innovators, but more a keep-the-lights-on arm of the business. One of the areas of improvement would be if we could tell the story a bit better about what an event mesh does or why an event mesh is foundational to a large enterprise that has a wide diversity of applications that are homegrown and a small number off the shelf. I wish we were better able to tell the story in a cohesive way to multiple lines of business, but that's more of a statement of our own internal structure and how we absorb or adopt new technology than it is about Solace or the product itself.
It has been a bit of a tough slog to try and get everybody to see event meshes are foundational in a multi-data center, multicloud landscape, when we're not there yet. Our company has most of our applications in two data centers that are close to each other. There is no real geo-redundancy, but everything we've ever done has been on-prem with only a small handful of Azure adoptions. Therefore, having folks see the benefit of an event mesh has been tough. I wish we could improve our storytelling a little bit.
We have struggled in a sort of perpetual PoC mode internally. This is no fault of Solace's. It's just that the only executive looking to benefit here is our technology operations team, and they have no money for investments. They're a cost center internally, so they have to be able to make the case that we're going to improve efficiency by leveraging this tech. Thus, the adoption has been slow.
For how long have I used the solution?
We have three different lines of business in our company. One of them has been using Event Broker for about six or seven years.
Personally, I have been engaged in a proof of concept for about 18 months.
What do I think about the stability of the solution?
Solace has been incident free in HA deployment for seven years. I did an analysis before we started our PoC for the technology operations team, looking for a lot of incidents. One of the pieces of work I did internally was to figure out our app stabilization, and I couldn't find anything Solace related in terms of the bumpiness. It had a clean track record, unlike our DataPower appliances which have gotten us in the newspapers a couple of times in the last three years.
When I did my analysis, I found a lot of dependencies on our file transmission hub and the product that we use. I found a lot of victims of our DataPower appliances. I found no victims nor incidents related to our Solace hardware appliances under the coverage. There was not a single incident in six years. I went back to the well to try and see if I can find more, but I can speak to the hardware appliances and how stable they have been. They were only deployed within a single line of business, so it didn't have the complexity of an enterprise shared service in multi-LOB mode. However, the stability has been really good with a good track record.
What do I think about the scalability of the solution?
If we deploy this the right way, we get a presence on each cloud at each data center and the full mesh effect. Plugging them into each other or making them part of the same ecosystem so they are aware of each other is not complicated for the guy whom we have working on this. He's not deploying it that way yet for our technology operations use case.
As we start to generate a little more momentum for our event correlation engine, we're probably going to uplift ourselves to a Tier 1 capability that has more of these nodes deployed throughout our various geographies around the globe. But, for now, it's only in one region of Azure Canada Central.
The group who has been using the solution for six or seven years has the physical appliances. Within the last two years, they refreshed on physical appliances again. We're probably not going to do it all. The physical appliances have been in the control of a single line of business in our company who have been able to self-manage. There wasn't really an enterprise-wide adoption that required a lot of coordination in our change process. We've done a lot of change management rigor in our company, so when a service is wholly contained within a particular line of business, then the ease of getting stuff done is a lot higher.
We have a small set of publishers, probably eight or 10 publishers, with maybe two subscribers. We haven't had the need to get into a whole bunch of granularity. The scope of our program: All publishers are sending to the two subscribers. There is really not a need to get very granular about who sends to where.
Today, in IT operations, the usage number is still zero because we are not live. The benefit will be probably 2000 operations staff across our own company and our service provider DXC. It's a 50/50 split. DXC has hundreds of guys doing incident management and operations for servers and below. We have retained services in the application space who are application operators and security operators. Those are retained people who will be working more efficiently as well.
How are customer service and technical support?
I have not personally dealt with their technical support. They are always responsive. I know I like to talk with them on emails that go back and forth, but it's really about sales, e.g., trying to get statuses on our proof of concept and how it's going. We've not had any reason to reach out to them for tech support issues.
Occasionally, we have needed help for HybridEdge when we were trying to build a new protocol transformation adapter, then we will reach out to them. However, this is not in incident mode. It's always in a sort of a how-to mode for a PoC. We have never had to reach out to them for urgent requests.
Which solution did I use previously and why did I switch?
We have protocol-specific message transport hubs, like an SFTP hub or IBM MQ Hub, but we never had a tech that has been protocol-agnostic. Therefore, the solution is kind of new.
Our IBM DataPower appliances have had the capability to do protocol transformation, but we've never done it. We've always just used it for REST and XML type stuff.
Our enterprise data hub has been essentially a big data lake for business data, customer information, etc. They are in year three of the enterprise data hub program. For the first three years, they had been receiving data only by file transfer, which was yesterday's data at best. Only because I'm a participant in different projects, I happen to know that two months ago they enabled real-time event streaming by Cloudera Kafka from our customer information repository. When a customer update happens and changes their street address, for example, we publish through Kafka to get that information into our enterprise data hub in near real-time, as opposed to waiting for tomorrow's file transfer. My understanding of that tech is that it requires a queue to be defined between the source and destination, which may not scale. It kind of reminds me of the early days of MQ when we had point-to-point MQ happening all over the place. We got about 150 queues in and realized, "Oh my God! Having a hub would be nice." Then, we implemented IBM MQ hub and waited for the next best opportunity to get folks to talk to the hub.
I'm thinking the same thing will probably happen with Kafka emerging through our enterprise data hub service, with teams individually setting up queues to get events into the enterprise data hub. Getting these individual messages one by one for 600 applications will become onerous for the operations and support teams. I suspect that before we get to that number, an event mesh will garner more attention.
How was the initial setup?
The initial setup was straightforward. We were a bit lucky because we have a guy on our technology operations team who did the initial setup of the physical appliances. When it came time to get the software and run it on servers, like Azure, it was relatively easy. Because we outsourced our infrastructure operations and monitoring tools to a service provider, the most complicated part was getting the firewall rules figured out for the publishers from the legacy systems. The complexity of setting up their product had nothing to do with Solace.
We are not live yet, but we're deploying using Azure with the intent to build our first bridge to the Azure Event Hub. The applications are hosted with Azure so we're recommending that they leverage cloud-native messaging technology, or Azure native messaging tech. We'll listen in on the messages that traverse the Azure Event Hub and fork them over to a Splunk (probably). The strategy is sort of non-disruptive and not mission-critical. In technology operations, we are just looking to see what events occur at Azure and trying to correlate them with events that are happening on-prem, since our customer information and account information are all stored in mainframes, NonStop environments, and platforms which are not moving to Azure. The implementation strategy is to insert Solace as means of transporting events into common spots so we can have a view of what's happening.
In a company that does rigorous change management, the initial setup took one of our guys probably three or four weeks. He was already supporting the physical appliances, so he had a bit of a running start. However, every time we cut a change record in our company, we need two weeks' lead time: Two weeks to get our server infrastructure provisioned, then two weeks to get our firewall rules implemented. After four weeks, we were done.
Maintenance needs about a quarter of the time of that same person, who is also supporting the physical appliances.
What about the implementation team?
I have two techie guys who work on installing it. I am more of the enterprise architect, PowerPoint guy.
On use case number one, we struggle with our mean time to resolution in technology operations. We've outsourced a lot of our data center operations and server, storage, and network operations to a third party (DXC), which was formerly HPE Enterprise Services. They manage our data centers, OSs, and servers. CIBC applications are mostly homegrown, so we support and maintain our applications. We do code drops, code changes, DevOps toolchains, etc. So, when something goes bump, there is a lot of finger-pointing.
We have DXC publishing their events now. Going forward, we need to figure out which tools we correlate those events to and start recognizing some of the benefits.
What was our ROI?
We have not seen ROI.
The operational efficiencies that we intend to gain should result in a reduced internal chargeback of tech resources. That's really the ROI that we're going after: operational efficiency and better mean time to resolution for our incidents.
What's my experience with pricing, setup cost, and licensing?
We have been really happy with the product licensing rates. It has been free for us, up to 100,000 transactions per second, and all we have to do is pay for support. Making their product available and accessible to us has not been a problem at all.
Having a free version is critical for our technology operations use case. This is primarily because our technology operations team is a cost center in our company. They are not profit drivers, and having a free version for installation will probably meet our needs. Even for production, it'll support up to 100,000 messages per second. I don't think in technology operations that we have that many events and alerts from our detection tools. Even if I have 20 or 30 event detection products out there, they're only going to publish the things which are critical or warnings. I don't think we'll ever reach 100,000 messages per second.
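As a rough back-of-the-envelope check of that headroom, the sketch below compares an assumed peak event rate against the 100,000 messages per second figure quoted above. The tool count of 30 comes from the upper end of the estimate in this review, but the per-tool rate of 50 events per second is purely a hypothetical number, since no actual rates have been measured here.

```python
# Limit as quoted in this review (messages per second).
FREE_TIER_LIMIT = 100_000


def utilization(tool_count: int, events_per_tool_per_sec: float,
                limit: int = FREE_TIER_LIMIT) -> float:
    """Return the fraction of the limit used by the combined event rate."""
    return tool_count * events_per_tool_per_sec / limit


# 30 tools, each assumed (hypothetically) to burst at 50 events per second.
peak = utilization(tool_count=30, events_per_tool_per_sec=50)
print(f"Peak utilization: {peak:.1%}")
```

Even under this generous assumption, the combined rate is 1,500 events per second, a small fraction of the quoted limit.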
We have been dealing with the free version for the better part of 18 months now. There have been no allergic reactions. You should expect maintenance costs, but we've not really needed that because we're not live yet in production for our first use case. For our physical appliances, capital markets folks were happy to get a big discount on the last version of the physical appliances. I've heard no complaints about what they're being charged for the Solace product that they've had in use for seven years. However, they haven't modernized any of their applications into Azure yet.
Which other solutions did I evaluate?
When we were searching for protocol-agnostic event meshes, I wasn't the one doing the research. It was our integration domain architect. He had experience with Solace already. When he was doing market research for protocol-agnostic event meshes, his input to me was that there was only one player, a Canadian company based out of Ottawa. Therefore, we didn't do a bake-off with anything else.
Other lines of business in our company have been using things like MQ Hub and IBM DataPower appliances. Our technology operations division has a program that I'm working on right now for trying to start getting our tools to interact together using Solace Event Broker.
Our company is pretty passionate about making sure that we have vendor support. When we do use open source products, we go out and get third-party support. When compared to some other messaging hubs that we do have, I have to admit that our IBM MQ Hub has been also incident free for many years while running on a mainframe, but our IBM DataPower experience has not been good. I would say that Solace fits right up there with the best that we have for message transport in our company.
Topic wildcarding implies that if we had a set hierarchy for our topic naming convention, we could deliver events to subscribers based on wildcards, which is something that differentiates it from Kafka. We're not leveraging topic wildcarding, but my understanding of the tech is that it would allow our security tools (for example) to be able to poke their nose into topics of interest to them using authorizations that Solace would control.
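To show what wildcard delivery over a hierarchical topic naming convention could look like, here is a simplified matcher. It is a toy model, not Solace's implementation: it assumes `/`-separated levels, a trailing `*` matching the rest of a single level, and a final `>` matching one or more remaining levels. The topic names are hypothetical.

```python
def matches(subscription: str, topic: str) -> bool:
    """Return True if a published topic matches a wildcard subscription.

    Simplified rules assumed for this sketch:
      - levels are separated by '/'
      - a level ending in '*' matches any level with that prefix
      - '>' as the last level matches one or more remaining levels
    """
    sub_levels = subscription.split("/")
    topic_levels = topic.split("/")
    for i, sub in enumerate(sub_levels):
        if sub == ">" and i == len(sub_levels) - 1:
            # '>' needs at least one topic level left to consume.
            return len(topic_levels) > i
        if i >= len(topic_levels):
            return False
        level = topic_levels[i]
        if sub.endswith("*"):
            if not level.startswith(sub[:-1]):
                return False
        elif sub != level:
            return False
    return len(sub_levels) == len(topic_levels)


# Hypothetical naming convention: ops/<domain>/<severity>
published = [
    "ops/storage/critical",
    "ops/network/warning",
    "ops/network/critical",
    "retail/payments/created",
]

# A security tool interested in every critical operations event.
critical_ops = [t for t in published if matches("ops/*/critical", t)]

# An event correlation engine interested in everything under ops.
all_ops = [t for t in published if matches("ops/>", t)]
```

The point of the sketch is that subscribers select by pattern over an agreed hierarchy, so publishers do not need to know who is listening.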
Kafka is really the only other competitor. We have IBM DataPower, but that's not really a fair comparison. We aren't intending to do format or data transformations with this tech. We're only looking at protocol transformations and message transport. Kafka has gotten a lot of momentum: whenever our app developers Google that stuff, they get a lot of support and hits. Trying to find some momentum for Solace has been a bit difficult, but the idea of having Solace be our protocol-agnostic message transport system is the plan. However, when we only have a small number of applications hosted in the cloud right now, the point-to-point message delivery is not unmanageable. Building a Kafka interface to something with Azure is tolerable and manageable when we have fewer than five subscribers.
When we realize that a message would be best consumed by something that talks a different language, then we'll start recognizing that Solace is important, instead of publishing a message twice in two different protocols. We'll be able to do it by publishing to the Azure Event Hub, not worrying about what language our subscribers talk. We've been juggling between: Do we do Kafka or do we do Solace? Right now, the momentum for Solace is not yet there because the volumes of applications modernizing are so low. But that tide is changing, and we're gaining some speed.
In technology operations, we have no use cases that are Kafka-centric. That's mostly because our enterprise tooling doesn't exchange data with anything. There are just these stovepipes of monitoring data.
What other advice do I have?
Get folks in various stovepipes to recognize that their data is valuable to aggregate for the entire enterprise. The biggest lesson learnt for me in use case number one has been to get various support organizations to realize that publishing your data is not about pointing fingers and finding culprits. It's about efficiency of restoring service.
The solution got us to look internally at how we operate and we behave as a split-brain support organization, where we have some of it on the inside and some of it outsourced. That has been a benefit to us.
I would rate this solution as a 10 (out of 10).