PubSub+ Event Broker Review

We can add an application or users in the middle of the day, with no disruption to anyone


What is our primary use case?

We do a lot of pricing data through here, market data from the street that we feed onto the event bus and distribute out using permissioning and controls. Some of that external data has to have controls on top of it so we can give access to it. We also have internal pricing information that we generate ourselves and distribute out. So we have both server-based clients connecting and end-user clients from PCs. We have about 25,000 to 30,000 connections to the different appliances globally, from either servers or end-users, including desktop applications or a back-end trading service. These two use cases are direct messaging; fire-and-forget types of scenarios.

We also have what we call post-trade information, which is the guaranteed messaging piece for us. Once we book a trade, for example, that data, obviously, cannot be lost. It's a regulatory obligation to record that information, send it back out to the street, report it to regulators, etc. Those messages are all guaranteed.

We also have app-to-app messaging where, within an application team, they want to be able to send messages from the application servers, sharing data within their application stack. 

Those are the four big use cases that make up a large majority of the data.

But we have about 400 application teams using it. There are varied use cases and, from an API perspective, we're using Java, .NET, C, and we're using WebSockets and their JavaScript. We have quite a variety of connections to the different appliances, using it for slightly different use cases.

It's all on-prem across physical appliances. We have some that live in our DMZ, so external clients can connect to those. But the majority, 95 percent of the stuff, is on-prem and for internal clients. It's deployed across Sydney, Hong Kong, Tokyo, London, New York, and Toronto, all connected together.

How has it helped my organization?

With the old platforms we were coming from, if we wanted to make changes, some of those changes were intrusive to make. For example, to add a new application into the environment, we would have to make a change that might cause some disruptions to the environment. We only have very limited downtime for our environment on a Saturday after midnight and before midnight again on Sunday. That is our only change-window for the week, if we have to do something intrusive. That limited us to when we could truly make changes. On a lot of other vendors' platforms, to add things, you've got to restart components and cause disruption. 

The benefit of Solace is that we can add an application in the middle of the day, with no disruption to anyone. It's purely based on our access-control list and permissioning. We can add an application in with zero disruption. We can onboard applications during the middle of our business day. It's still under change control, but there's zero impact by doing it. For us, that is super-powerful. Whether we're adding users or adding applications, we can do it, without causing any disruption. For a lot of other products, that's not the case. That's been a huge win for us.

In terms of application design, I've seen applications go live in less than a week, from coding the first line of code to putting something into production. It depends on how complex the application is. We have a central team where we support the wrappers on top of the vendor's API and we have some example code bases where we show a simple application built using our wrapper on top of Solace's API. A developer who joins our company knowing nothing about Solace, can walk through our documentation, have a look at our wrappers, take some of our example code, and get up and running and off to the races pretty quickly. Getting up to speed is definitely not difficult.

We might get a new user in our bank who is familiar with other messaging systems and who has preconceived ideas on how they want to do things. They might ask us, "How do I get access to this messaging system that I used to use with my old organization? That's what I'm familiar with." Sometimes we have to do sessions with those people and say, "Okay, we're familiar with the systems you're talking about. We supported them in the past. Talk us through what your use case is, what it is you are trying to achieve." Once they explain their use cases, we can say, "Okay, great. We actually have this and here's some example code and this is how to do it." Within a day, that person has gone from knowing nothing about it to saying, "Okay, you're absolutely meeting my application needs and now I'm educated on how this works." They're off and running very quickly.

We take all kinds of data onto the environment to share. Because the event bus is the place that every application always needs to start, they're no longer building an application now within the capital markets organization without putting their data onto our bus in some way. It's definitely a way of lowering the barrier to sharing data and getting things up and running quickly. Similarly, they can take data from other teams, once they find out what's available. Someone might say, "I need all the FX prices in the bank. Oh, I can just subscribe to them from here. I don't even need to talk to the FX team." Teams can get up and running very quickly without having to spend a lot of time working with other groups to make that happen.

By having all of that data together in one place, Event Broker has definitely reduced the amount of time it takes to get a new application onboarded. We came from a place with six or seven different systems, where we might have bridged some of those together in some way, but it wasn't one common environment. Now, we've got application A that comes online and starts putting that data out for application B to get up to speed and to start looking at that data. That is very quick and easy for us to do. All the messaging that we do is self-describing. They can look at the payload of a message and understand it without even needing to talk to the upstream application. We can have applications starting to look at data where they didn't even have to speak to the upstream application. We've gone from 8 x 1 Gig, 10 years ago, to 8 x 10 Gigs today, and the reason for that is because we keep putting more and more data and applications on here. That continues to grow exponentially. If it wasn't easy to do, the data wouldn't be going up and we wouldn't have all these applications now on here. It's hard for me to say it has definitely increased the productivity, because I don't own the application development piece but, anecdotally, I would say it has.

Another area of benefit is that we're in the process of containerizing all of our applications at the moment, whether they'll be run on-prem or in the public cloud. The underlying piece is that these containers, wherever they run, are going to need to share data between the different applications and then back to the users. The Solace event mesh or event brokers are the underlying lifeblood among all of these containers. They need to have some way of communicating with each other and we see Solace as being that connection among all of them. All the different cloud environments have their own messaging and we don't want to build applications that are specific to any one cloud; we want to be cloud-agnostic. To do that, we need to have a messaging system that is equally agnostic. Given that we already have a huge investment on-premise for all of our Solace stuff, we see that the future of containerizing our applications goes hand-in-hand with our messaging strategy around Solace, so we can be totally cloud-agnostic.

Technology, in the last 10 years, has probably become a lot more stable generally, but I can say that with the amount of data we put through these appliances, and route globally every day, if our environment were down, capital markets wouldn't be operating for the bank. That's how critical it is. We can't afford to have any issues. At the same time, literally no application can run in our front office without this. If I look back 10 years ago, we might have had six or seven different distributed systems, all with their own problems. Now that we've consolidated all that, there's a huge efficiency by sharing all our data between the different groups. It means we can get up to speed very quickly, but also, what we're enabling from a business perspective, by sharing 95 billion messages a day, is hugely valuable to our front office.

What is most valuable?

I've been running messaging systems for most of my career, getting on toward 16 or 17 years. The most valuable feature is the ability of the appliances to cope in a way that I haven't seen other vendors do. You always get into types of message-loss states that can't be explained with some other products that are out there. You raise tickets with the vendors and they'll give you an explanation. But in the 10 years that we've been in production with Solace, we've never had something that cannot be explained. I've got tickets open with the likes of IBM that have never been resolved, for years. The Solace product's stability is absolutely essential. 

There is also the ability to have so many things layered in, where we're doing guaranteed messaging and direct messaging layered into the same appliance.

There is also the interoperability. We've built a lot of products into it and it's been quite easy to feed market data onto the systems and put entitlements and controls around that. That was a big win for us when we were consolidating our platforms down. Trying to have one event bus, one messaging bus, for the whole globe, and consolidate everything over time, has been key for us. We've been able to do that through one API, even if it's across the different languages. We support a wrapper on top of the vendor's API and we enforce certain specifications for connecting to our messaging environment. That way, we've been able to have that common way of sending and sharing data across all the groups. That has been very important for us. 

In terms of ease of management, from a configuration perspective you can have all your appliances within one central console. You can see your whole estate from there. And you can configure the appliances through API calls so you can be centrally polling and managing and monitoring them, and configure them as you need to. There are certain things where that's a little more tricky to do, but at a general level we have abstracted things like user-permissioning into other systems. So we just have a front-end where we change the permissioning and push it to the appliance in whatever region and it updates the permissioning. From a central management and configuration point of view, it's been extremely easy to interact, operate, and support.
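To make the "configure the appliances through API calls" point concrete, here is a minimal sketch of what a front-end might push when onboarding a user. It assumes the broker exposes Solace's SEMP v2 REST management API; the host name, message VPN, user name, ACL profile, and credentials are all hypothetical, and the request is only built, not sent. It is not the reviewer's actual tooling.

```python
import base64
import json
import urllib.parse
import urllib.request


def build_create_client_username_request(base_url, msg_vpn, username,
                                         acl_profile, admin_user, admin_password):
    """Build (but do not send) a SEMP v2 request that creates a client username.

    Assumption: the target broker supports the SEMP v2 config API. All argument
    values used below are placeholders for illustration only.
    """
    vpn = urllib.parse.quote(msg_vpn, safe="")
    url = f"{base_url}/SEMP/v2/config/msgVpns/{vpn}/clientUsernames"
    body = {
        "clientUsername": username,
        "aclProfileName": acl_profile,  # ties the user to an existing ACL profile
        "enabled": True,
    }
    token = base64.b64encode(f"{admin_user}:{admin_password}".encode()).decode()
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


# Hypothetical values; a real front-end would look these up per region.
request = build_create_client_username_request(
    base_url="https://broker-ny.example.internal",
    msg_vpn="capital-markets",
    username="fx-pricing-app",
    acl_profile="fx-readers",
    admin_user="admin",
    admin_password="change-me",
)
print(request.get_method(), request.full_url)
# To apply the change against a real broker: urllib.request.urlopen(request)
```

Because the change is a single management call rather than a restart, it fits the "add a user in the middle of the day" workflow described earlier in the review.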

When it comes to granularity, you can literally do anything regarding how the filtering works. It has a caching product that sits on top of that, so depending on the region that you're trying to filter, the caching level can make it a bit more difficult than the real-time streaming. But on a real-time stream, you can pretty much filter at any level or component and it's extremely flexible in that regard.

What needs improvement?

We have various items on the docket with Solace. We've pointed out some things with the DMR piece, the event mesh, in edge cases where we could see a problem. Something like 99 percent of users wouldn't ever see this problem, but it has to do with if you get multiple bad clients sending data over a WAN, for example. That could then impact other clients. In our current state, we've architected around that with Solace. We can see, in the future, with the event mesh DMR, that there is potential for several bad clients to cause problems for other clients. We actually had a design session yesterday with the head of engineering where we started working on how to solve that 1 percent "corner case." We're working on the basis that if it can happen it will, even if it's very unlikely. We design for those kinds of days as opposed to, "Oh, this will never happen." 

It's really about multiple streams and guaranteed messaging between multiple regions over WAN potentially causing one user a problem. We're trying to solve for that kind of stuff. It's very specific, but that's just how we think. Being a financial organization that is obviously regulated, we can't afford downtime. So we try to look at everything through that lens.

For how long have I used the solution?

We've been using Solace products for over a decade.

What do I think about the stability of the solution?

The uptime on the appliances is huge. We just don't have problems with these appliances in production. For example, we have just gone through the whole COVID-19 situation and the markets went crazy during that. Our previous maximum of data through the appliances in any one day was about 67 billion messages. During the COVID-19 period of February and March, we hit 95 billion messages a day. That was roughly a 40 percent increase on the data rates and the environment coped just fine. We didn't have any problems. There was zero business disruption. I don't know of any other system where, if I threw an extra 20 or 30 billion messages at it, without adding anything, without having to change anything, it would just cope. If it wasn't able to cope with that, the amount of money that might have been lost to the organization would have been exponential. It's definitely paying for itself.
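The figures above can be checked with a few lines of arithmetic. This sketch only uses the two daily peaks quoted in the review; the per-second figure is a flat average over 24 hours, which is an assumption, since real market traffic is bursty.

```python
# Daily peaks quoted in the review
previous_peak = 67e9  # messages per day before the COVID-19 spike
covid_peak = 95e9     # messages per day during February and March

extra_messages = covid_peak - previous_peak
percent_increase = extra_messages / previous_peak * 100

# Assumption: traffic spread evenly across the day (it will not be in practice)
SECONDS_PER_DAY = 24 * 60 * 60
average_per_second = covid_peak / SECONDS_PER_DAY

print(f"Extra messages per day: {extra_messages / 1e9:.0f} billion")
print(f"Increase over previous peak: {percent_increase:.1f}%")
print(f"Average rate at peak: {average_per_second:,.0f} messages per second")
```

The increase works out to a little over 40 percent, in line with the figure quoted, and the extra 28 billion messages falls inside the "extra 20 or 30 billion" range mentioned.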

Going back 10 years ago — I want to be real clear, not recently — there were some issues with disks that were in the devices. It was just a faulty batch of disks from their supplier. We had to change the disks. But everything is resilient. So when we had these failures — they were more common than you would expect — we might have a HA failover but not an outage, per se. But that was a very long time ago. 

The only other thing that causes issues, and I use that term loosely, is that these are the biggest things on our network within the bank. An 80-Gig appliance is the biggest thing that talks on the network, and it's sending an awful lot of traffic. What you tend to then get into are problems with your own network not being able to cope. You may not have built your network to cope with the volume of traffic you want to try putting over it. As a company we have definitely experienced that over the last few years. It's not a Solace issue, but more a pure core-networking issue. That's a common issue that I know Solace's clients deal with. I meet other Solace clients through various events and they're all having challenges with their network team actually providing a good network to be able to cope. You've got a very strong messaging product that sits on top of the network. It's the biggest thing on the network. Is your network then able to cope with it? So we've had Solace's engineers on calls with our network team, walking them through. That's probably the biggest pain point we have, but it's not a Solace fault.

What do I think about the scalability of the solution?

The scalability of these over time has been very good. When we started on them 10 years ago, we were 8 x 1 Gig appliances, so we had 8 Gig of capacity. We're now doing 8 x 10 Gigs. In 10 years we've grown our footprint by 10 times in terms of volume. And the number of servers, the appliances in our data centers, hasn't really increased. They've obviously continued to grow the capacity of the appliances over that 10 years, without us needing to buy another 20 or 30 appliances to continue to build out. They have the ability to scale.

In terms of users of the solution, there are about 9,000 people in capital markets, of which I'd say about 6,000 or 7,000 of them are using it across the different geographies. Each of those users might be running multiple applications and making multiple connections to the appliances for different applications. A user might have four different applications on their desktop, and they would be making four connections. That works out to about 20,000 to 30,000 actual connections to the appliances. And we have about 5,000 servers in our data centers. A good 80 percent of those are making connections to Solace.

The amount of messaging that we put through it grows every year. We're constantly looking at the volume of data that goes through there and deciding if we need to stripe out the number of appliances to support that. Or, if Solace produces a bigger appliance, do we need to be buying it from a pure networking or volume-of-traffic point of view?

We are in the process of working through what our cloud implementation is going to look like with them. It's going to be a mixture of some of their messaging-as-a-service piece and some of us running our own Docker engines of the software version. There's going to be a bit of a mix as we bridge data between the public cloud, as we stand that up, and our existing on-prem appliances. We don't see the on-prem appliances going away anytime soon. There's no key to getting rid of those. We're putting so much traffic through them, it's massive. But, as some of our workload moves to the cloud, so will some of that traffic and we will need to be able to support that.

But every year the messaging rates only ever go up, as does the number of applications that come on. Last week we added another 1,300 users for a new application across three or four geographies and that was all completely seamless. It's continually growing. It's like the blood that pumps around the body, to be honest.

How are customer service and technical support?

Solace is truly the best company that we have to deal with when it comes to tech support. In the role that I have I deal with about 100 different vendors, everything from market data exchanges to software vendors, through the likes of IBM and Microsoft, etc. Ten years ago, when we first started dealing with them, Solace was obviously a much smaller company. They've grown. They were only some 50 or 60 people at the time and I think there are a couple of hundred now. All their support guys who were there originally are still there, they've added more over time, and they are excellent. They know everything about their API and their environment.

If I reach out to IBM, for example, I'm going to get passed to six help desks before anyone I reach even knows what product I'm talking about. I support Cloudera for our company, as well. Cloudera has sold its support to IBM and when I raise a ticket with IBM, I wait a week to get a response. I have had some pretty shocking support experiences.

We always felt that Solace's support wasn't going to survive as they grew as a company. It was so good. That was one issue I kept raising because it was so good I couldn't see how it would scale. Surely it couldn't. But I can tell you, 10 years later, Solace is still the only company where I have zero outstanding issues, or unknown items, or support tickets that they haven't resolved. If you have a problem, they jump on a WebEx with you and, within minutes, we know what it is. Whereas I can't even get IBM to respond to a support ticket.

I deal with a lot of different people in my role and I can genuinely put my hand on my heart and say they're the best support company that we deal with.

Which solution did I use previously and why did I switch?

We had TIBCO EMS, TIBCO RV, IBM MQ, and Informatica's LBM. The latter used to be a company called 29West and Informatica bought them. We also had Thomson Reuters RMDS platform, which is now called TREP, sending messages around the planet.

We were using Thomson Reuters RMDS (Reuters Messaging Data System) as a generic messaging bus at the time. Even though you can put their data onto the platform, you can also use it to send your own data around the world. That was a big platform for us at the time and it was coming from two of the underlying systems. You could publish any message onto that bus and send it around. I worked at another bank before the one I'm at now, and we did exactly the same thing there. We were putting a lot of our own internal data onto their messaging bus. It was a good message bus and it still is.

But Thomson Reuters, at the time, now Refinitiv, decided to license it differently. They said that if you put your own data on their platform, they wanted to be paid by every message you sent. We thought, "Okay, well that's crazy. If we buy something from you and pay you a million dollars for it, and then send a hundred messages or a million messages with it, that's nothing to do with you and we're not going to pay you for it." They tried across the entire street to change their pricing model and they really shot themselves in the foot. A lot of people walked away from them over it.

We knew at that point we needed to do something else. We had TIBCO RV, TIBCO EMS; we had so many different systems that we were trying to bridge and connect together, but the RMDS platform along with TIBCO RV dwarfed all the others. Those two together made up 90 percent of all the traffic. That really pushed us to go out.

How was the initial setup?

We spent about two to three months designing out our topic hierarchy when we started this 10 years ago. In the last 10 years we've made very few changes to our topic hierarchy and schema. But we sat with Solace and designed it out. We created a 90-page manual for how we wanted to stand up our event mesh at the time. Bear in mind that our first implementation was not guaranteed messaging, but direct messaging. It was between Sydney, Hong Kong, Tokyo, London, New York, and Toronto. We had primary and secondary data centers in every region. I would never characterize it as simple because of the overall scale of what we were putting in place. The actual configuration, and working with Solace to implement that originally, wasn't the difficult piece of it. Actually standing it up, once we had the appliances in our data centers and all on the network, hooking them up and making them work together, wasn't complex.

What was more complex was the fact that we were meshing up six regions at the same time, and turning on a brand new environment. We didn't stay in one region. We didn't just turn London on. We went big from day one, so it was complex from a geographies perspective, but not complex from a Solace-configuration perspective.

We paid for their heads of engineering to come and sit onsite with us and work through that document. I've actually recommended to Solace that they shouldn't sell their product to anyone without doing that design work upfront because I think it's extremely valuable.

This is true of any system. If you take a good system and don't architect it well, then you can make a good system really bad. Two years down the road you've got people saying, "Okay, I want to go somewhere else," because we've done a bad job of this. Anecdotally, I was talking to the CEO of Confluent about six to nine months ago, and he told me that a large, well-known company has redone its Kafka implementation three times in two years, because they hadn't architected it properly. You can take any technology and make it bad.

Our deployment took about six months, start to finish, from initial discussions and purely white-boarding through to being live in six regions. The first five years after it was implemented, we weren't allowed to build any net-new application that didn't go onto the bus. Every application has a three-year life cycle within the bank. In that five years, a good 80 percent of our applications had been completely rewritten, at which point we only had 20 percent left on our old environment to force over and bridge between old and new environments. After a couple of years of doing that, we didn't have to run any of the old environments anymore and just had one major platform that everyone connects to. That has been the state for the last five or six years.

I speak to other Solace clients occasionally, new ones who are looking at starting up, and they say, "Well, can we be done in a year?" And I say, "Well, your Solace can be done. That's not the issue. It's your life cycle of applications. If anyone tells you you're going to switch all your applications in one year, it's nonsense." Yes, it depends on the scale. If you're a small company, sure. But if you're a company of our size, you've got hundreds of applications and you're not going to rewrite them all overnight. But we did a migration of JMS users from TIBCO EMS a few years ago and that was actually very simple. It was two or three lines of code for each of the 200 applications that were connected. Within about three months we'd moved 200 applications. So it is easy to do pure JMS conversions, for example. But if you've actually got to rewrite the application completely, because you're changing how it operates, that's very different.

In that three months of discussion that I mentioned, we were working on our topic hierarchy and making sure that we didn't have any pitfalls. The rest was that it takes a long time to get things set up at data centers, racked and networked and dealing with the firewalls. But the actual configuration of the appliances between all the regions was only about two weeks' worth total, for 12 different data centers. That was not the lion's share of the work. The planning for doing it across multiple regions was the lion's share of that.

The topic hierarchy is hugely flexible, but you do have to put time in to plan your hierarchy and try to think through all the eventualities of how you're going to use it. Otherwise, it can become a bit of a free-for-all if you don't govern and control it in some way. You need a good onboarding process for how you want to use things. If you leave it totally open to your teams to choose, you're going to end up with a bit of a mess.

For naming, we start everything with a region and go from there:

  • where the data is coming from or to
  • what business area the data is related to
  • what type of data it is
  • what application team
  • what instances they're coming from
  • then we get into the actual data name itself.

There are six or seven layers of our topic schema that we have published. After that, the application teams can be specific on how they want to name the seventh or eighth level. But the first several levels are defined by us and we say, "Okay, if you're this, you're going to be choosing New York, you're going to be choosing fixed income, you're going to be choosing that this is market-data price, and then you're going to be choosing that your application name is this, and the datatype is real-time. And the message instrument itself is X and the data it contains is Y." So we've already mapped out our schema for all those levels, and then they can put their payload in at that level.
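A layered schema like this lends itself to a small helper in a wrapper library. The sketch below is purely illustrative: the level names, the sample values, and the `build_topic` function are hypothetical and are not the bank's actual schema or wrapper. The 255-character limit is the figure quoted in this review and should be checked against the broker documentation.

```python
# Hypothetical centrally governed levels, in order. The real schema is not published here.
CENTRAL_LEVELS = ("region", "business_area", "data_type", "application", "feed_type")

# Limit as quoted in the review; verify against the broker documentation.
MAX_TOPIC_LENGTH = 255

# Characters that would clash with the level separator or subscription wildcards.
RESERVED = set("/*>")


def build_topic(central, app_levels):
    """Join the governed levels and the team-specific levels into one topic string."""
    missing = [name for name in CENTRAL_LEVELS if name not in central]
    if missing:
        raise ValueError(f"missing governed levels: {missing}")

    levels = [central[name] for name in CENTRAL_LEVELS] + list(app_levels)
    for level in levels:
        if not level or RESERVED & set(level) or any(ch.isspace() for ch in level):
            raise ValueError(f"invalid topic level: {level!r}")

    topic = "/".join(levels)
    if len(topic) > MAX_TOPIC_LENGTH:
        raise ValueError(f"topic is {len(topic)} characters, limit is {MAX_TOPIC_LENGTH}")
    return topic


# Sample values are made up for illustration.
topic = build_topic(
    {
        "region": "NY",
        "business_area": "FI",
        "data_type": "MDPRICE",
        "application": "PRICER",
        "feed_type": "RT",
    },
    ["BOND123"],
)
print(topic)  # NY/FI/MDPRICE/PRICER/RT/BOND123
```

Rejecting malformed levels at publish time is one way to stop the hierarchy from drifting into the "free-for-all" the review warns about.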

This way, it becomes really easy if you're trying to wildcard things at a higher level. You can say, "I just want to see all the market data prices." I can wildcard three levels and be able to pick those up without having to know anything else. I can look at pretty much any topic name that someone has. And you've got 255 characters to choose from. I've seen people who try to map everything, but then it becomes unreadable. Unless you've got a guide to figure out what the topic schema looks like, it becomes very difficult for a human to interpret. It has to be readable to them. Six to eight levels works, without needing some sort of decoder to work out what things mean.
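To show why a fixed level order makes wildcarding easy, here is a simplified matcher for Solace-style topic subscriptions, where `*` stands in for a single level (or the tail of one) and a trailing `>` matches one or more remaining levels. It is a teaching sketch that approximates the broker's matching rules, not the broker's implementation, and the topic names are made up.

```python
def topic_matches(subscription, topic):
    """Return True if `topic` matches `subscription` (simplified wildcard rules).

    Supported: `*` as a whole level, `prefix*` within a level, and `>` as the
    final level meaning "one or more further levels".
    """
    sub_levels = subscription.split("/")
    topic_levels = topic.split("/")

    for i, pattern in enumerate(sub_levels):
        if pattern == ">" and i == len(sub_levels) - 1:
            # Trailing '>' needs at least one more topic level at this position.
            return len(topic_levels) > i
        if i >= len(topic_levels):
            return False
        level = topic_levels[i]
        if pattern.endswith("*"):
            if not level.startswith(pattern[:-1]):
                return False
        elif pattern != level:
            return False

    # No trailing '>': the level counts must be identical.
    return len(sub_levels) == len(topic_levels)


# "All market data prices", wildcarding the levels around the data type.
all_prices = "*/*/MDPRICE/>"
print(topic_matches(all_prices, "NY/FI/MDPRICE/PRICER/RT/BOND123"))   # True
print(topic_matches(all_prices, "LDN/FX/MDPRICE/FXENGINE/RT/EURUSD"))  # True
print(topic_matches(all_prices, "NY/FI/TRADE/BOOKING/RT/T-1001"))      # False
```

The subscriber only needs to know which position carries the data type; region, business area, and everything after it can stay wildcarded.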

In terms of staff involved in the deployment at the time, we had about 16 people, globally, across the different regions. But this wasn't the only thing they were doing. We also support 20 or 30 different systems because we look after the market data system for the bank as well. Solace isn't our only job. In addition to those 16 people, for the initial implementation we had 30-something appliances across Prod, QA, and Dev, etc.

Today, the number of people we have doing maintenance on it is in the high 20s. We haven't exponentially grown our staff around what we're charging back to the business for the true staffing of this. The only thing we have grown out a little bit, over time, is our development team that supports the applications, as we've had 400 applications come on. They have general, day-to-day questions. We only have three people in that Dev team, but they're acting like a first-responder before we raise a question to Solace's support team around API issues. A lot of the questions people ask are common questions that we've answered two times already. We have a lot of Confluence pages with basic how-to and FAQs. But sometimes people just want to jump on a call, go WebEx, and walk through what they were thinking of doing. We only had one developer doing that originally and we've got three now.

We're just going through an upgrade at the moment. We've been trying out a few of the version 9s, version 9.1, 9.2, 9.3. Version 9.5 is the one we're planning to roll out in production at the moment.

What about the implementation team?

Although we didn't do so on day one, we now work with three companies in this ecosystem. There is a company called BCCG, a Germany-based company. We originally wrote some feed-handlers with Solace to bring market data from companies like Refinitiv and Bloomberg onto the platform. We didn't want to own those, long-term. We felt it was something that could be out on the street. So we partnered up with this company, BCCG, who Bloomberg recommended to us. They're a small startup company and they now own the feed-handlers and the permissioning agents and are selling those as a product on the street. They have a partnership with Solace.

We also partnered with a company called MDX Technology and that was really for an Excel plugin. We have a lot of users who use Excel sheets and we want to be able to send and receive data from and to Excel. So MDXT wrote a plugin for Solace. They have plugins to a lot of other messaging environments. They just created one for Solace and, again, they're selling it out on the street. They built it based on us and now they have sold it to plenty of other Solace clients.

We also partnered with ITRS, which is a monitoring company, to build plugins on top of Solace's environment. ITRS is our monitoring system. Every major bank uses them. They have plugins into all the different systems that you might have. We worked with ITRS and Solace to create monitoring for Solace. Again, ITRS has then sold that to a whole bunch of Solace's customers.

The only other one is a company called CJC, which is more of a consultancy and support company. During Asia-PAC hours, they look after first-line support of the whole platform, including the market data as well as the Solace platform. They're doing level-one and level-two during the day in Hong Kong. That's not in any way expensive. They're the company that actually supports Refinitiv's platform so they already have people and staff there.

What was our ROI?

Capital markets couldn't operate today if Solace were down. Our turnover on a daily basis is significant. To put a dollar value on it would be very difficult. But by not having 500 servers across the globe and having about 54 appliances at the moment instead, we've got a 10-to-one footprint, so in pure infrastructure costs we have hard-dollar savings. By having the appliances in, we've enabled the business to make millions on a daily basis.

Which other solutions did I evaluate?

We did an RFP and pulled all the vendors in, including Thomson Reuters, TIBCO, and a whole bunch of others such as Informatica, and we did a proper vendor evaluation. It came down to Informatica and Solace, head-to-head, in the final decision.

The choice to go with the Solace appliances has actually paid off massively in savings from an infrastructure point of view. The reason is that, in our old platforms, for example our RMDS Thomson Reuters platform, we had about 500 servers around the globe sending all the data to each other, meshed up in a huge administrative nightmare. The Informatica solution was going to be very similar, as in commodity hardware that you would mesh up to send all the data. We looked at that and said, "Well, a server in our data center is going to cost us $20,000 a year to run," so if we still had 500 of those, you can do the math. If we were to buy the Solace appliances, working out to about $100,000 each, we would then only have to pay support and maintenance on them for the next two or three years, at about $20,000 a year. We only needed 30 of them, compared to the 500 servers. This has been a huge cost saving for us. The 500 servers that we used to have are all gone, and we have replaced them with 30 to 40 appliances. The cost of running things in the data center has, therefore, shrunk significantly.
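For anyone who wants to "do the math", here is a rough sketch using only the round numbers quoted above. It assumes the $20,000 support figure is per appliance per year, and it ignores server purchase prices, appliance hosting costs, and staffing, none of which are given in this review, so treat the output as an order-of-magnitude comparison only.

```python
# Figures quoted in the review (all approximate)
SERVER_COUNT = 500
SERVER_RUN_COST_PER_YEAR = 20_000     # data center running cost per server

APPLIANCE_COUNT = 30
APPLIANCE_PURCHASE_PRICE = 100_000    # one-off cost per appliance
APPLIANCE_SUPPORT_PER_YEAR = 20_000   # assumption: per appliance, per year


def server_cost(years):
    """Cumulative running cost of the commodity-server estate."""
    return SERVER_COUNT * SERVER_RUN_COST_PER_YEAR * years


def appliance_cost(years):
    """Cumulative cost of buying the appliances plus yearly support."""
    per_appliance = APPLIANCE_PURCHASE_PRICE + APPLIANCE_SUPPORT_PER_YEAR * years
    return APPLIANCE_COUNT * per_appliance


for years in (1, 2, 3):
    servers = server_cost(years)
    appliances = appliance_cost(years)
    print(f"Year {years}: servers ${servers:,} vs appliances ${appliances:,} "
          f"(difference ${servers - appliances:,})")
```

Under these assumptions the server estate costs about $10 million a year to run, while 30 appliances cost $3 million up front plus $600,000 a year, which is consistent with the "saved millions a year" claim.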

Although people do view Solace as being this premium product you pay a lot of money for, if you're going to put a lot of data through these things, the number of servers you need to do that with is also extremely costly. We have saved millions a year by having the appliances, and that was something we picked up right at the beginning. We said, "If we go down this path and these appliances can truly do what they say, then the footprint in our data center is going to shrink 10-to-one, and the cost of running this in our data center is going to be significantly less."

We also support multiple instances of Kafka. There's an enterprise version within our bank, which is the biggest one, and we have some small pockets of it within capital markets. The configuration and support around Kafka, and the quantity of components needed to keep it going, are a configuration nightmare. We use the software broker for development. In our non-production environments we have a non-appliance based version running in things like Docker. But the ability to have one component that does everything, as opposed to having to layer in multiple components to be able to build the ecosystem for messaging or storage, is extremely powerful from a support perspective. The time spent on keeping Kafka running, compared to Solace, is not in the same league.

We have a lot of problems with Kafka, generally, that we do not have on Solace. The enterprise runs the majority of the Kafka, the stuff that we support for our regular Cloudera stack. To try to give an idea of scale, the enterprise bank is doing, maybe, a few million messages a day on its Kafka environment, which is still a big environment for them. But we're doing 95 billion messages, so we're not even in the same swim lanes. We know they have a lot of problems on that. And in our own Cloudera Kafka, we have problems with Cloudera, period, and their IBM stuff. We're paying an onsite consultant from Cloudera, and have been for the last nine months, to try and fix their stuff. It's just awful. Whereas our Solace stuff is bulletproof.

Kafka has its place. There's absolutely no question about that. There is some stuff that it does really well, like some of the elastically expanding storage concepts that people have where they want to keep storing everything forever. They can keep elastically expanding their Kafka brokers to do that. Whereas, with a Solace appliance, you are going to have a SAN storage connected to it and you're limited by the size of the SAN you can put on there, or you're going to need to buy another appliance and buy another SAN. With their software broker you could elastically expand that, but you still have the storage issues. 

The one real positive with Kafka is that you have a big community of people, and this is something I've spoken to Solace about too. There is this groundswell of community around it, where there are a lot of adapters that are off-the-shelf to a lot of other things. It's a double-edged sword. Sometimes we have new users join the bank who say, "Yeah, but Kafka has a SQL adapter off-the-shelf." We say, "Okay, but we already have written a SQL adapter for Solace. Here you go. It was 10 minutes' work." At the same time, it is nice to have a catalog of 200 adapters that you can use on Kafka. That is definitely a benefit of Kafka, with the community around it. But at the same time, when you scratch the surface of it, the amount of work to do a plugin isn't actually much more, and with the Kafka stuff you need six or seven different components to run it. 

In my last design overview with the Confluent guys they said, "And then we're going to add this component, and if you want global..." and I said, "Well, actually, all our stuff is global. We don't do anything that's just one region." They said, "Well, we haven't gotten our global solution built yet, so you could run two versions and start copying data." I said, "Well, I don't really want to do that. We want you to be able to replicate data between regions, under the covers." They're now doing that. They're getting up to speed on some of those things. It all depends on what your use case is.

We even have some stuff where, at the edge of our environment, we might bridge data between Solace and Kafka and we've got a bridge component to do that. It would be when there's a very specific use case around what someone wanted to do. For example, if a third-party vendor is only supporting Kafka, we'll plug in Kafka there, but we don't want people then connecting to Kafka because there's no need for it. So we'll then bridge from Kafka to Solace so the data is all on Solace. There are definitely use cases for Kafka. It's just that the scale of Kafka, depending on what the use case is, is a little bit different. I feel people use Kafka because they're just trying to lazily store everything as a long-term retention process.

The implementation of Kafka compared to Solace is very different. As I mentioned, there are multiple components to build up Kafka. I can tell you that our Confluent contract is not cheap because we're really employing Confluent employees to come and help configure half the stuff and do hand-holding all the time. We don't really have those kinds of challenges on the Solace environment. We're far more comfortable supporting the Solace environment than our Kafka environment.

What other advice do I have?

If I was coming into this cold, and knowing what I know today, the one thing we would do differently is we'd have the network team involved throughout the whole process of bringing it into the bank. Bring your network team on that journey with you, because if it's going to become like it has with us — the biggest thing on the network — then you want to have the network team at the table from day one. That way, networking knows things are coming. We're putting these huge things into the data centers and they're going to send huge amounts of data around. That team needs to be ready, so they need to be at the table. 

In terms of the onboarding and governance processes, fortunately we did think ahead and plan that stuff. But I speak to other customers that didn't and they're struggling with having the right onboarding processes and the right governance around things. At the end of the day, if you've got 95 billion messages going around and you don't have a good onboarding and governance process, you could just have a 95-billion-message mess. We don't have that because we had good governance and a good architecture to begin with.

As I mentioned, I've suggested to Solace that they shouldn't sell their products without enforcing a bit of the architectural piece to begin with. The problem is that everyone has their own budgets and thinks, "Oh, I don't need you guys to help me, and I don't want to pay for it," figuring that Solace is trying to push its Professional Services a bit. But that small investment in Professional Services, when you first stand it up, could be hugely important to the success of your platform. The Solace Professional Services that we've experienced, and the general value we got out of them, are worth the dollars you pay for them.

From a maintenance point of view, every time Solace releases a new version of the API, we review what has changed in that and whether it affects us in any way. Sometimes a release is for something specific that another client has asked for and that doesn't have any value to us. We don't force applications to upgrade every time a version changes. We tend to do a yearly request of the application teams to upgrade their API to the latest one that we vetted. It's like a yearly maintenance to update the API. And to do that work, to integrate the new API version, it's generally not more than half an afternoon's work to put it in. It might take longer than that to QA, test, and validate your application to put it into production, but the actual coding piece takes an hour or two at most. It's not a huge overhead to be able to do that.

In terms of the event mesh feature, we're a bit of a "halfway house." They have multiple things. One is called dynamic message routing (DMR) and another is multi-node routing (MNR). We use the multi-node routing piece. We are testing out the DMR piece of it, which is their newest function for public cloud use. We're in a proof of concept with them around using that for expanding out into Azure and AWS.

Internally, we're using their MNR so it's all an event mesh and everything is automatic. If you publish a message in Sydney and you want to subscribe to it in New York, we have to do nothing to get that message from A to B. You subscribe and it gets there. Depending on which terminology you're using around event mesh, we consider ourselves to be on event mesh, but we have not deployed that for guaranteed messaging for our general population. We're still using their multi-node routing, which means direct messages fly on demand, and we have to bridge guaranteed messaging.

The clustering feature is really designed around trying to make things easier for clients on configuration, so that you don't have to look at things as an HA pair and a DR device; instead, they're represented as a cluster node. This is all work related to trying to make things easier from a support perspective. Today, if you make a change on an HA pair, you can then force-sync that to DR. It automatically happens to the HA box so you only make a change on the primary; it syncs to the backup. You can then choose whether you want to sync that to the DR device or not by putting it into a cluster node. They're just making it simpler for people. It's definitely a positive. We've actually been involved in helping them design that because we were one of their first and one of their bigger customers. We sit in with their engineering at least every six months and they walk through things they've got coming down the road and we talk about how they go about implementing stuff.

As for the free version of Solace, at the time, 10 years ago, the free version — that's the software version — didn't exist. With the software version there are limits to the number of messages, something like 10,000 messages a second. We're doing 1,000,000 messages a second. We could run lots of 10,000 messages-a-second instances, but then we would need a lot of commodity servers to run them on. If you are a small company that has some messaging requirements and you are looking for a good way to do that, the free version is absolutely an option. It doesn't come with any support either, obviously. You can pay for support on top of that version, but it's only going to do you 10,000 messages a second. At the scale we have, that wouldn't work. For non-production, giving that to a developer to run on their machine, to play around with, absolutely. So we don't really pay for any of the Dev stuff that we have. We're only paying for the physical production appliances and the reason we need those is just the scale of messaging that we do.
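
As a rough check on that scale argument, the numbers can be worked through in a few lines of Python. The 10,000 and 1,000,000 messages-a-second figures are the ones quoted above. Reading the 95 billion messages mentioned earlier as a daily total is an assumption (the review implies it but does not state it), and the function name is purely illustrative.

```python
import math

FREE_TIER_LIMIT_PER_SEC = 10_000       # approximate limit quoted in the review
OUR_RATE_PER_SEC = 1_000_000           # rate quoted in the review

# Assumption: the 95 billion figure is a per-day total.
MESSAGES_PER_DAY = 95_000_000_000
SECONDS_PER_DAY = 24 * 60 * 60


def instances_needed(rate_per_sec: int, limit_per_sec: int = FREE_TIER_LIMIT_PER_SEC) -> int:
    """Minimum number of rate-limited instances needed to carry the given rate."""
    return math.ceil(rate_per_sec / limit_per_sec)


# Average rate implied by the daily total, if the per-day assumption holds.
average_rate = MESSAGES_PER_DAY / SECONDS_PER_DAY

if __name__ == "__main__":
    print(f"Instances for 1,000,000 msg/s: {instances_needed(OUR_RATE_PER_SEC)}")  # 100
    print(f"Average rate from daily total: {average_rate:,.0f} msg/s")  # about 1.1 million
```

So the "lots of instances" would be on the order of 100 software brokers, before allowing for any headroom or redundancy. The daily total and the per-second figure are also consistent with each other under the per-day assumption.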

Which deployment model are you using for this solution?

On-premises
Disclosure: IT Central Station contacted the reviewer to collect the review and to validate authenticity. The reviewer was referred by the vendor, but the review is not subject to editing or approval by the vendor.