Melissa Data Quality Scalability

GaryM
Data Architect at World Vision
We can run exact matches on 9 million customer records in 10 minutes using 5 partitions/parallel dataflows; survivorship takes another 50 minutes. This is on an 8-vproc VM, and I'm sure you could run faster with dedicated hardware and more parallel dataflows. The tool starts to slow down exponentially once you pass about 2 million customers in a single dataflow, so it's best to keep each dataflow at or under that number, although mileage will vary depending on the complexity of your matching.

In fact, this tool is orders of magnitude faster than the last matching tool I used, and that one wasn't a simple plug-in to an ETL tool. I recently heard of another matching tool that takes as long to match just a few thousand records as this tool takes to run millions of customers.

Note: we probably run higher volumes than most organizations. For B2B and daily matching, you could probably process a delta in a matter of minutes with this tool, so the complexities described here may not apply to your situation. I suspect an essential ingredient when considering scalability is whether you're calling a web service for matching or running on-prem; their SSIS component is on-prem only.

Combining survivorship and matching in the same dataflow slows performance. We got much better performance by splitting them into two separate runs: the first for matching only, and a second for survivorship only, re-using the grouping numbers produced by the first match. That split is what made it perform to our requirements.
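To illustrate the two-pass idea in that last paragraph, here is a rough sketch in generic Python/pandas. It is not the Melissa SSIS component or its API; the column names, the exact-match key (email), and the "most recently updated wins" survivorship rule are all assumptions of mine. The point is only that pass 1 assigns group numbers and pass 2 can run later, consuming those numbers instead of re-matching.

```python
# Hypothetical sketch of a two-pass match + survivorship flow (not Melissa's API).
# Pass 1: exact matching assigns a group number per match key.
# Pass 2: survivorship runs separately, re-using the pass 1 group numbers.
import pandas as pd

customers = pd.DataFrame({
    "customer_id":  [1, 2, 3, 4],
    "email":        ["a@x.org", "a@x.org", "b@x.org", "b@x.org"],
    "last_updated": ["2023-01-01", "2024-06-01", "2022-05-01", "2024-01-01"],
})
customers["last_updated"] = pd.to_datetime(customers["last_updated"])

# Pass 1: exact matching only. Every record sharing the match key gets the
# same group_id; these group numbers would be persisted between runs.
customers["group_id"] = customers.groupby("email", sort=False).ngroup()

# Pass 2: survivorship only, keyed on the group_id from pass 1. Here the
# survivor is simply the most recently updated record in each group.
survivor_idx = customers.groupby("group_id")["last_updated"].idxmax()
survivors = customers.loc[survivor_idx]
print(survivors)
```

In the real tool the matching pass would also cover fuzzy rules and could be hash-partitioned on the match key across parallel dataflows; the sketch only shows that the group numbers from the first pass are the handoff that lets survivorship run as its own pass.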