Originally published at https://www.linkedin.com/pulse/microsoft-aps-hub-spoke-architecture-stephen-c-folkerts
Using Microsoft APS to Build a Hub and Spoke EDW Architecture
Scalable, cost-effective Enterprise Data Warehouse (EDW) implementation is an elusive goal for many large organizations. Three common approaches include centralized EDW or the ‘monolithic’ approach. Decentralized collections of data marts, and attempts at Hub and Spoke architectures that combine the two. Microsoft APS combines the best features of both EDW and decentralized data marts.
The monolithic approach has become more common in large EDW installations. However, centralized EDW tends to be extremely expensive and very inflexible. As a result, business units become frustrated at the inability of the EDW to meet their needs within a reasonable cost and timeframe.
Decentralized EDW & Data Marts
A divide-and-conquer strategy is a natural and effective approach to addressing large-scale problems such as creating an integrated, enterprise-wide data warehouse. Decentralized EDW architectures align well with this approach and fit the many compartmentalized demands presented by large organizations. Decentralized data marts are more responsive to business unit needs, but often result in many versions of the same data that are very difficult to keep consistent across the enterprise. Each approach, centralized or decentralized, tends to evolve or degenerate into the other, but neither is a tenable long-term solution for most large organizations.
Microsoft APS & Hub and Spoke Architecture
A Microsoft APS appliance enables a true ‘hub and spoke’ architecture, where the centrally managed ‘hub’ contains detailed enterprise data, and departments or business units use ‘spokes’ to exchange data with the hub according to their unique schemas. This architecture can exist across many connected SQL Server MPP databases.
Microsoft expands the Hub and Spoke concept to include not only MPP appliances but also standard symmetric multi-processing (SMP) instances of SQL Server and SQL Server Analysis Services, (SSAS), allowing either to be viewed as nodes within a grid. The result is a highly flexible, affordable, and scalable platform that makes large-scale Hub and Spoke EDW architectures a practical reality. They combine the benefits of central control and governance with the agility of decentralized data marts, but without the inherent delivery pains, headaches, risks, and costs associated with previous strategies.
Hub and Spoke Architectures
Hub and Spoke architectures match the structure of most large enterprises by offering a combination of a centralized EDW and a set of dependent data marts. The EDW hub allows the entire enterprise to set and enforce common standards while answering questions that cut across business units. The data mart spokes allow business units to meet their own needs quickly and at relatively low cost while still conforming to the needs of the overall enterprise.
The Hub and Spoke architecture allows business units to set their own budgets and priorities, while contributing as necessary to the central EDW. This close fit between the architecture of the business and the architecture of the DW platform means Hub and Spoke systems are widely regarded as the best overall approach. In practice, Hub and Spoke systems have been notoriously difficult to implement.
Distributing data from a centralized EDW reliably and quickly enough to meet the needs of the business units is a big challenge in the face of growing data volumes. To try to compensate for this, complex and cumbersome ETL processes are developed to transfer data, between the hub and spokes, resulting in high maintenance costs and an inability to change with the business. In general, efforts to build a Hub and Spoke architecture have quickly degenerated into a set of siloed data marts after being torn apart by conflicting business units and requirements.
One response to the difficulties of building a Hub and Spoke architecture has been to simply centralize everything onto one monolithic EDW. A centralized EDW platform quickly becomes overloaded with conflicting use cases. Solving any one problem requires evaluation of all existing dependencies, which drives rigid change control processes and ultimately impacts cost and time-to-delivery for projects. And if virtualized data marts are used, all the queries and I/O, execute physically in the hub. Business units become frustrated by the inability of IT to quickly meet new requirements with the central EDW and start building their own independent physical data marts as a result.
With a decentralized approach, which the other two tend to degenerate into anyway, business units simply build their own independent data marts. Although such an approach is obviously responsive to business needs, it doesn’t allow management to answer cross-enterprise questions easily or quickly. Keeping all copies of data across a decentralized infrastructure current and accurate can become overwhelming. The problem becomes worse as relatively low bandwidth data movement options drive complex data transformations that scale poorly. And it’s very difficult to apply any real measure of enterprise-wide standards, controls or regulatory compliance.
Microsoft’s EDW Platform
A Microsoft APS appliance can be viewed as a highly-specialized grid of servers being pulled together to collectively form an EDW platform. Taking this view, it is a small step to think of PDW as both a grid of appliances and a grid of nodes. Moving data across this grid of appliances is incredibly efficient, since data can be moved directly from node to node within the grid. This maximizes parallelism across the environment and minimizes the conversion overhead associated with export and load operations. Such a grid of appliances can be used to implement a data warehousing Hub and Spoke architecture.
Microsoft expands the Hub and Spoke solution to include not only MPP appliances but also standard SMP instances of SQL Server and SSAS to be viewed as nodes within a grid. A grid of SMP databases and MPP appliances can be used as the basis for any large-scale data warehouse environment or architecture. However, it is particularly suitable for a Hub and Spoke architecture.
MPP for Hub and Spoke
Microsoft PDW with high speed parallel database copy is fundamental to solving one of the most intractable problems in large-scale data warehousing. Building an effective, scalable, and affordable Hub and Spoke solution. The basic idea is to take a divide-and-conquer approach to building an EDW. This avoids performance problems due to conflicts between queries from different business units. Provides a dedicated, high-speed, network interconnecting all hub and spoke databases. And business analysts view the appliance as a set of separate data marts, but can drill into detailed data on the hub where required.
The Microsoft Hub and Spoke Solution
Imagine a fairly large MPP appliance acting as the hub for a set of MPP appliance and SMP database data marts. The hub holds detailed data, probably in a normalized schema, for a number of business units or the entire enterprise. The hub is loaded in near real time or in daily batches from source systems leveraging a preferred ETL solution. Data is then transformed or restructured to a denormalized structure (star, cube, etc.), as needed, and transferred to the appropriate data mart(s) via the high speed grid for consumption by end users. If a data mart requires data from sources that are not covered by the hub, this data is loaded independently using standard ETL tools. However, most of the data required (both fact and dimensions) comes from the hub.
Users connect to the independent data mart appliances as usual for running queries. This allows each data mart to be tuned for the needs of a particular set of users and sized to handle the required level of performance and concurrency. While the data marts can be independently designed to meet the needs of each business, it will be possible to leverage existing data mart applications such as Microsoft Analysis Services, Reporting Services, Excel, or other BI vendor products.
Bandwidth within the grid is large enough to enable the direct copy of detailed fact data or entire data marts. This can greatly simplify the data mart creation and update process by using a publish-subscribe model as opposed to complex transformation logic that, coupled with expensive export and load scenarios, creates significant challenges for traditional federated approaches. The end result is an EDW platform that can handle a very complex workload while being extremely scalable at a sensible cost.
Disaster Recovery and High Availability
The Microsoft EDW platform provides the capability to set alternate database systems within the dedicated high speed network as failover targets. As an example, a user attempting to connect to a spoke that is currently unavailable would automatically be redirected to an alternate spoke specified within the standard connection protocol. This simple approach becomes very powerful when combined with the Hub and Spoke architecture. The high-speed and bandwidth of the grid copy facility allows full copies of end-user data marts to be moved to multiple spokes. This effectively recreates the end-user view of the data on multiple spoke systems, each a valid failover option for the other in an outage scenario.
This concept can also be leveraged across multiple data centers to provide an effective disaster recovery architecture. Individual appliances can be replicated on a second site and automatically kept up-to-date. Note that not all of the appliances on a grid would need to be replicated. In most scenarios only the hubs need to be replicated, as spokes can be recreated from the hubs. This provides the flexibility for each business unit to decide whether or not to provide a disaster recovery capability, based on their own service-level agreements (SLAs).
Microsoft’s Grid-Enablement Strategy
This approach offers customers an attractive alternative to centralized, monolithic approaches. Data marts can be tailored to meet the individual needs of business units (both in terms of capacity and performance). Furthermore, customers can buy into the Microsoft EDW approach with the deployment of a few stand-alone data marts on standard SQL Server SMP reference architectures. From this relatively low-cost start point, you can scale into the hundreds of terabytes while delivering manageable flexibility without sacrificing cost and performance.
Microsoft Analytics Platform System (APS)
See my article Microsoft Analytics Platform System (APS) for a more in-depth look at Microsoft APS.
These views are my own and may not necessarily reflect those of my current or previous employers.