Microsoft Parallel Data Warehouse Review

Originally published at https://www.linkedin.com/pulse/microsoft-parallel-data-warehouse-pdw-stephen-c-folkerts

What’s the Difference Between Microsoft’s Parallel Data Warehouse (PDW) and SQL Server?

In this post, I’ll provide an in-depth view of Microsoft SQL Server Parallel Data Warehouse (PDW) and differentiate PDW from SQL Server SMP.

SQL Server is a scale-up, Symmetric Multi-Processing (SMP) architecture. It doesn’t scale out to a Massively Parallel Processing (MPP) design. SQL Server SMP runs queries sequentially on a shared-everything architecture. This means everything is processed on a single server that shares CPU, memory, and disk. To get more horsepower out of your SMP box as your data grows, you need to buy a brand-new, larger, more expensive server with faster processors, more memory, and more storage, and find some other use for the old machine.

SQL Server PDW is designed for parallel processing. PDW is a scale-out, MPP, shared-nothing architecture with multiple physical nodes. MPP architectures allow multiple servers to cooperate as one, enabling distributed and parallel processing of large scan-based queries against high-volume tables. Nearly all appliances centered on data warehouse workloads leverage MPP architecture in some form. Each PDW node runs its own instance of SQL Server with dedicated CPU, memory, and storage. As queries go through the system, they are broken up to run simultaneously across the physical nodes. The primary benefits are the breathtaking query performance gains MPP provides, and the ability to add hardware to your deployment to scale out linearly to petabytes of data, without the diminishing returns of an SMP architecture.


The Grey Zone: When to Use SQL Server PDW & What About Netezza, Teradata, Exasol, & a Hundred Others?

Often we’re not dealing with petabyte-scale data. Not even close; we’re just in the terabytes, in a ‘grey zone’ where SQL Server SMP overlaps with PDW. Or the data is all relational and well structured, and there’s no Big Data business need such as social media analysis or fraud detection, and no requirement to combine structured and unstructured data from internal and external sources.

The xVelocity columnstore indexes and other in-memory features and performance enhancements in SQL Server SMP should be explored before recommending a PDW appliance. A combination of technologies, all native to SQL Server SMP, may be the answer if you’re dealing purely with relational data problems. The distinction between these sister products will always be blurry, and the underlying question of when to use SQL Server SMP versus PDW will persist, especially since SQL Server capabilities keep clawing up into PDW territory while PDW keeps growing. It is wise to understand the important differences before making a decision.

Organizations demand results in near real time, and they expect their internal systems to match the speed of an Internet search engine and analyze virtually all data, regardless of its size or type. Once non-relational, unstructured, or semi-structured data is thrown into the mix, you suddenly have a very different story. Business analysts struggle to figure out how to add the value of non-relational Hadoop data into their analysis. As a result, they’re held back from making the faster, more accurate data-driven decisions needed to compete in the marketplace. This is the current data challenge.

If you want to get into Netezza and Teradata, see my article Should I Choose Netezza, Teradata or Something Else?

Microsoft Parallel Data Warehouse (PDW)

Microsoft PDW is the rebuilt DATAllegro appliance with hardware and software designed together to achieve maximum performance and scalability. If you’d like to know more, see my article Microsoft PDW History, DATAllegro.

Query processing in PDW is highly parallelized. Data is distributed across processing and storage units called Compute Nodes. Each Compute Node has its own direct attached storage, processors, and memory that run as an independent processing unit. The Control Node is the brains of PDW and figures out how to run each T-SQL query in parallel across all of the Compute Nodes. As a result, queries run fast!

Microsoft SQL Server is foundational to PDW and runs on each Compute Node. PDW uses updateable in-memory clustered columnstore indexes for high compression rates and fast performance on the individual Compute Nodes.

The first rack of the PDW appliance is called the base rack. Every appliance has at least one base rack with 2 or 3 SQL Server Compute Nodes, depending on vendor hardware. As your business requirements change, you can expand PDW by adding scale units to the base rack. When the base rack is full, PDW expands by adding additional racks, called expansion racks, and adding scale units to them.

The base rack has two InfiniBand and two Ethernet switches for redundant network connectivity. A dedicated server runs the Control Node and the Management Node. A spare server ships in the rack for failover clustering. Optionally, you can add a second spare server.

With PDW’s MPP design, you don’t need to buy a new system to add capacity. Instead, PDW grows by adding to the existing system. PDW is designed to expand processing, memory, and storage by adding scale units consisting of SQL Server Compute nodes. By scaling out, you can easily expand capacity to handle a few terabytes to over 6 petabytes in a single appliance.

You don’t need to over-buy storage you won’t use, and if you under-buy, you can quickly add capacity when your data grows faster than projected. When one rack is full, you can purchase another rack and start filling it with Compute nodes.

You also don’t need to migrate your data to a new system to add capacity. You can scale out without redesigning your application or re-engineering the distribution mechanism, and without restructuring your database files to accommodate more nodes. PDW takes care of redistributing your data across the additional Compute nodes.

In-Memory xVelocity Clustered Columnstore Indexes Improve Query Performance

If MPP provides the computing power for high-end data warehousing, columnar has emerged as one of the most powerful storage architectures. For certain kinds of applications, columnar provides both accelerated performance and much better compressibility. Teradata picked up columnar capabilities with its acquisition of Aster Data. HP acquired Vertica, which gave it a columnar MPP database.

PDW uses in-memory clustered columnstore indexes to improve query performance and to store data more efficiently. These indexes are updateable, and are applied to the data after it is distributed. A clustered columnstore index stores, retrieves and manages data by using a columnar data format, called a columnstore. The data is compressed, stored, and managed as a collection of partial columns, called column segments. Columns often have similar data, which results in high compression rates. In turn, higher compression rates further improve query performance because SQL Server PDW can perform more query and data operations in-memory. Most queries select only a few columns from a table, which reduces total I/O to and from the physical media. The I/O is reduced because columnstore tables are stored and retrieved by column segments, and not by B-tree pages.
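
To make this concrete, here is a minimal sketch of how a distributed table with a clustered columnstore index is declared in PDW’s T-SQL dialect. The table and column names are hypothetical:

    CREATE TABLE dbo.FactSales
    (
        SaleKey      bigint        NOT NULL,
        DateKey      int           NOT NULL,
        ProductKey   int           NOT NULL,
        CustomerKey  int           NOT NULL,
        Amount       decimal(18,2) NOT NULL
    )
    WITH
    (
        DISTRIBUTION = HASH (SaleKey),   -- each row is hashed to one distribution
        CLUSTERED COLUMNSTORE INDEX      -- table stored as compressed column segments
    );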

SQL Server PDW provides xVelocity columnstores that are both clustered and updateable, which saves roughly 70% of overall storage use by eliminating the row-store copy of the data entirely. The hundreds or thousands of terabytes of information in your EDW can be built entirely on xVelocity columnstores. Updates and direct bulk loads are fully supported on xVelocity columnstores, simplifying and speeding up data loading and enabling real-time data warehousing and trickle loading, all while maintaining interactive query responsiveness.

Combining xVelocity and PDW integrates fast, in-memory technology on top of a massively parallel processing (MPP) architecture. xVelocity technology was originally introduced with PowerPivot for Excel. The core technology provides an in-memory columnar storage engine designed for analytics. Storing data in xVelocity provides extremely high compression ratios and enables in-memory query processing. The combination yields query performance orders of magnitude faster than conventional database engines. Both SQL Server SMP and PDW provide xVelocity columnstores.

Fast Parallel Data Loads

Loads are significantly faster with PDW than SMP because the data is loaded, in parallel, into multiple instances of SQL Server. For example, if you have 10 Compute Nodes and you load 1 terabyte of data, you will have 10 independent SQL Server databases each compressing and bulk inserting 100 GB of data at the same time. This is roughly 10 times faster than loading 1 TB into one instance of SQL Server.

PDW ships with a data loading tool called dwloader, which is the fastest way to load data into PDW; set-based transformations can then be performed in-database. You can also use SQL Server Integration Services (SSIS) to bulk load data into PDW. The data is loaded from an off-appliance loading (staging) server into the PDW Compute nodes. Informatica also works with PDW.
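
As a rough sketch of a dwloader invocation (the appliance name, credentials, paths, and delimiters below are placeholders, and only a few of the available options are shown; consult the dwloader documentation for the full list):

    dwloader.exe -S MyAppliance ^
        -U load_user -P ********** ^
        -i \\loadserver\stage\FactSales_20140601.txt ^
        -T SalesDW.dbo.FactSales ^
        -t "|" -r "\r\n" ^
        -R \\loadserver\stage\FactSales_rejects.log ^
        -M append

Here -M selects the load mode (a plain append in this sketch), and rejected rows are written to the file named by -R.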

Scalable, Fast, and Reliable

With PDW’s Massively Parallel Processing (MPP) design, queries run in minutes instead of hours, and in seconds instead of minutes, compared to Symmetric Multi-Processing (SMP) databases. PDW is not only fast and scalable; it is designed with high redundancy and high availability, making it a reliable platform you can trust with your most business-critical data. PDW is designed for simplicity, which makes it easy to learn and to manage. PDW’s PolyBase technology for analyzing Hadoop HDInsight data, and its deep integration with Business Intelligence tools, make it a comprehensive platform for building end-to-end solutions.

Fast & Expected Query Performance Gains

With PDW, complex queries can complete 5-100 times faster than data warehouses built on symmetric multi-processing (SMP) systems. 50 times faster means that queries complete in minutes instead of hours, or seconds instead of minutes. With this performance, your business analysts can perform ad-hoc queries or drill down into the details faster. As a result, your business can make better decisions, faster.

Why Queries Run Fast in PDW

Queries in PDW run on distributed and highly parallelized data. To support parallel query processing, PDW distributes fact table rows across the Compute Nodes and stores the table as many smaller physical tables. Within each SQL Server Compute Node, the distributed data is stored in 8 physical tables, each on an independent disk pair. Each independent storage location is called a distribution. PDW runs queries in parallel on each distribution. Since every Compute Node has 8 distributions, the degree of parallelism for a query is determined by the number of Compute Nodes. For example, if your appliance has 8 Compute Nodes, your queries will run in parallel on 64 distributions across the appliance.

When PDW distributes a fact table, it uses one of the columns as the key for determining the distribution to which the row belongs. A hash function assigns each row to a distribution according to the key value in the distribution column. Every row in a table belongs to one and only one distribution. If you don’t choose the best distribution column, it’s easy to re-create the table using a different distribution column.
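
The usual pattern for that is CREATE TABLE AS SELECT (CTAS), which PDW supports natively. A sketch, reusing the hypothetical FactSales table from earlier:

    -- Rebuild the table hashed on CustomerKey instead of SaleKey
    CREATE TABLE dbo.FactSales_New
    WITH
    (
        DISTRIBUTION = HASH (CustomerKey),
        CLUSTERED COLUMNSTORE INDEX
    )
    AS SELECT * FROM dbo.FactSales;

    -- Swap the names once the copy is verified, then drop the original
    RENAME OBJECT dbo.FactSales TO FactSales_Old;
    RENAME OBJECT dbo.FactSales_New TO FactSales;
    DROP TABLE dbo.FactSales_Old;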

PDW doesn’t require that all tables get distributed. Small dimension tables are usually replicated to each SQL Server Compute Node. Replicating small tables speeds query processing since the data is always available on each Compute Node and there is no need to waste time transferring the data among the SQL Server Compute Nodes in order to satisfy a query.
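
In T-SQL, replication is just a different DISTRIBUTION option; a minimal sketch with a hypothetical dimension table:

    CREATE TABLE dbo.DimProduct
    (
        ProductKey   int           NOT NULL,
        ProductName  nvarchar(100) NOT NULL
    )
    WITH
    (
        DISTRIBUTION = REPLICATE,        -- full copy stored on every Compute Node
        CLUSTERED INDEX (ProductKey)
    );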

PDW’s Cost-Based Query Optimizer

PDW’s cost-based query optimizer is the brain that makes parallel queries run fast and return accurate results. A result of Microsoft’s extensive research and development efforts, the query optimizer uses proprietary algorithms to choose a high-performing parallel query plan. The parallel query plan contains all of the operations necessary to run the query in parallel. As a result, PDW handles all the complexity that comes with parallel processing and processes the query seamlessly behind the scenes. The results are streamed back to the client as though only one instance of SQL Server ran the query.

PDW Query Processing

Here’s a look into how PDW query processing works ‘under the covers’. First, a query client submits a T-SQL query to the Control Node, which coordinates the parallel query process. After receiving a query, PDW’s cost-based parallel query optimizer uses statistics to choose a query plan, from many options, for running the user query in parallel across the Compute Nodes. The Control Node sends the parallel query plan, called the dsql plan, to the Compute Nodes, and the Compute Nodes run the parallel query plan on their portion of the data.

The Compute Nodes each use SQL Server to run their portion of the query. When the Compute nodes finish, the results are quickly streamed back to the client through the Control node. All of this occurs quickly without landing data on the Control node, and the data does not bottleneck at the Control node.
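
You can inspect the dsql plan yourself: PDW accepts an EXPLAIN statement that returns the parallel plan as XML without executing the query. For example, against the hypothetical tables sketched earlier:

    EXPLAIN
    SELECT d.ProductName, SUM(f.Amount) AS TotalSales
    FROM dbo.FactSales AS f
    JOIN dbo.DimProduct AS d ON f.ProductKey = d.ProductKey
    GROUP BY d.ProductName;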

PDW relies on co-located data, which means the right data must be on the right Compute Node at the right time before running a query. When two tables use the same distribution column they can be joined without moving data. Data movement is necessary though, when a distributed table is joined to another distributed table and the two tables are not distributed on the same column.
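
For illustration, suppose dbo.FactSales and a hypothetical dbo.FactReturns are both distributed on HASH (CustomerKey). A join on that column is co-located and needs no data movement:

    -- Co-located join: both tables are hashed on CustomerKey, so each
    -- distribution joins its own rows locally, with no shuffle.
    SELECT s.CustomerKey,
           SUM(s.Amount) AS TotalSold,
           SUM(r.Amount) AS TotalReturned
    FROM dbo.FactSales AS s
    JOIN dbo.FactReturns AS r ON s.CustomerKey = r.CustomerKey
    GROUP BY s.CustomerKey;

    -- Joining these tables on a non-distribution column instead would
    -- require redistributing rows before the join could run.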

PDW Data Movement Service (DMS) Transfers Data Fast

PDW uses the Data Movement Service (DMS) to efficiently move and transfer data among the SQL Server Compute Nodes, as necessary, for parallel query processing. Data movement is necessary when tables are joined and the data isn’t co-located. DMS only moves the minimum amount of data necessary to satisfy the query. Since data movement takes time, the query optimizer considers the cost of moving the data when it chooses a query plan.

Microsoft Analytics Platform System (APS)

See my article Microsoft Analytics Platform System (APS) for a more in-depth look at Microsoft APS.

These views are my own and may not necessarily reflect those of my current or previous employers.

Disclosure: I am a real user, and this review is based on my own experience and opinions.