Originally published at https://www.linkedin.com/pulse/microsoft-analytics...
In April 2014, Microsoft announced the future vision for their data and analytics platforms. Microsoft Parallel Data Warehouse (PDW) was rebranded as Microsoft Analytics Platform System (APS) with additional appliance component offerings. APS combines MPP SQL Server data warehouse with HDInsight, Microsoft’s 100% Apache Hadoop component directly into the appliance. APS is a big data analytics appliance capable of analyzing data of any type, structured or unstructured, and of any size.
Microsoft ASP integrates data from SQL Server PDW with unstructured big data from Hadoop through the PolyBase data querying technology. Polybase gives APS a huge advantage over the competition because it you to talk with big data in the regular T-SQL language you already use and understand.
With APS, NoSQL doesn’t replace relational databases. Structured and unstructured data technologies complement one another, and queries can be executed across both universes.
By integrating Hadoop into the same rack as the relational data warehouse, organizations can save on consulting, development and configuration costs for Hadoop with an integrated appliance.
Why Microsoft APS is an EDW Game Changer
While ASP provides many innovations and improvements, four stand out as strategic game changes CIOs should consider when evaluating data warehouse and analytics strategies. SQL Server PDW, HDInsight, Polybase and xVelocity Columnstores as a Platform for Data Mining and Analysis.
By combining both Microsoft Parallel Data Warehouse (PDW) and unstructured Hadoop analytical capabilities in a single, easy-to-manage EDW appliance, Microsoft APS is well positioned to help organizations use information to enhance their competitive position.
Microsoft Parallel Data Warehouse (PDW)
Microsoft SQL Server Parallel Data Warehouse (PDW) and xVelocity Columnstores are covered in my article Microsoft Parallel Data Warehouse (PDW).
HDInsight & Hortonworks
HDInsight is Microsoft’s 100% Apache Hadoop distribution based on Hortonworks Data Platform. HDInsight is the phoenix that emerged from the ashes of Dryad. Dryad was Microsoft’s own proprietary and competing version of Hadoop that Microsoft tinkered with for 5 years before abandoning it.
PolyBase & Big Data Hadoop Integration
It’s not enough to store data in Hadoop. Businesses today need to figure out how they can analyze Hadoop data fast and seamlessly in order to make more informed business decisions. Unstructured and high volume data are the two fastest growing types of enterprise data.
Organizations are using Apache Hadoop to store process non-relational data from sources like blogs, clickstream data that is generated at a rapid rate, social sentiment data with different schemas customer feedback, sensor data, or telemetry data feeds. Most of this data is not suitable for relational database management systems and often ends up isolated from business users because it is not integrated with data in the traditional data warehouse.
Technologies, like Hadoop are generally used, but implementing Hadoop with traditional data warehouse and business intelligence platforms pose new challenges. Hadoop is both open source, Java-based and manages non-relational data across many nodes. It’s easy to add data to Hadoop, but not so quick to extract and analyze it. The idea is that, if the data is there, it may take a while to retrieve it, but at least the data is stored somewhere in the system. MapReduce doesn't have to be implemented in Java, however.
Big Data is not only about figuring out how to store, manage, and analyze data from non-relational sources, but also about mashing together various non-relational data with an organization’s relational data to gain business insight. See my articles on Big Data Data Lakes & Don’t Drown in the Data and 360 Degree View & Unifying Enterprise Data in a Sea.
PolyBase is the Microsoft APS query tool that enables you to easily query PDW and HDInsight data using T-SQL, without investing in Hadoop-based skills or training.
Microsoft PolyBase is a fundamental breakthrough on the data processing engine which enables integrated query across Hadoop and relational data. PolyBase opens up a whole new world of data analysis and integration possibilities. This integration allows organizations to merge large volumes of non-relational data stored within Hadoop with their traditional enterprise data. Customers can continue to use their existing analytics tool set to analyze their organization’s big data.
Without manual intervention, PolyBase Query Processor can accept a standard SQL query and join tables from a relational source with tables from a Hadoop source to return a combined result seamlessly to the user. Queries that run too slow in Hadoop can now run quickly in PDW, data mining queries can combine Hadoop and PDW data, Hadoop data can be stored as relational data in PDW, and query results can be stored back to Hadoop.
By using the power of Microsoft APS to run queries on Hadoop data in HDInsight, it is now possible to do more in-depth data mining, reporting, and analysis without acquiring the skills to run MapReduce queries in Hadoop. PolyBase gives you the flexibility to structure the Hadoop data you need, when you need it, as it’s brought into PDW for fast analysis. You can seamlessly select from both Hadoop data in HDInsight and PDW data in the same query, and join data from both data sources. To satisfy a query, PolyBase transfers data quickly and directly between PDW’s Compute Nodes and Hadoop’s Data Nodes.
APS uses external tables to point to data stored in text files on a Hadoop HDFS cluster. Once an external table is created, the table can be used in a select statement in the same manner as a PDW table. PolyBase uses a single Transact-SQL query interface to leverage PDW and Hadoop, so you don’t need to learn a host of new skills to run MapReduce queries in Hadoop. PolyBase hides all the complexity of using Hadoop so most business users do not need to know anything about Hadoop.
PolyBase uses ‘predicate pushdown’ to Hadoop that generates map-reduce jobs behind the scenes to do the work on the Hadoop side instead of distributed query data movement when necessary.
With PolyBase, organizations can take advantage of flexible hybrid Hadoop solutions and query across Hortonworks, Cloudera, and even into the cloud with Microsoft Azure HDInsight. PolyBase is only available in Microsoft APS and is not available in SQL Server SMP at this time.
Integration with Business Intelligence Tools
APS has deep integration with Microsoft’s BI tools and other leading non-Microsoft tools, making it simple to use the BI tools you are familiar with to perform analysis. APS’s deep integration with Business Intelligence (BI) tools makes APS a comprehensive platform for building end-to-end data mining and analysis solutions. APS integrates with the Microsoft BI Stack including Reporting Services, Analysis Services, PowerPivot for Excel, and PowerView. But, APS also integrates with a growing list of leading non-Microsoft BI platforms, such as Business Objects, Cognos, SAP Data Integrator, Tableau, MicroStrategy, QlikView, Oracle Business Intelligence, and TIBCO Spotfire.
Easy to Use & Manage
APS is designed for simplicity. The complexity is already engineered into the appliance so that you don’t have to handle the details. The appliance arrives with the hardware and software already configured and installed. PDW handles all the plug and play details of distributing the data across the appliance nodes, performs all the extra steps required to process queries in parallel, and manages the low-level hardware and software configuration settings. No tuning is required because the appliance is already built and tuned to balance CPU, memory, I/O, storage, network, and other resources.
Minimal Learning Curve
EPS has a minimal learning curve. There’s no need to hire new talent in order to move from SQL Server SMP to SQL Server PDW and EPS. DBAs who already know T-SQL can easily transfer their SQL Server SMP knowledge to PDW. Some T-SQL query statements are added or extended to accommodate the MPP architecture. There’s less DBA maintenance. You don’t need to create indexes besides a clustered columnstore index. DBAs can spend more of their time as architects and not baby sitters. In my opinion, the alignment of APS with existing IT skills may be its biggest competitive advantage.
The appliance model is key to getting great performance. Tuning a large database using traditional approaches is extremely difficult and requires highly skilled DBAs. One of the main problems with the SMP model, is the difficulty of understanding and tuning the interface between the DBMS software and the underlying OS and hardware platform. With SMP, there are a plethora of tuning parameters and options for the DBA and OS administrator to setup. In the appliance model, the entire software and hardware stack from SQL to storage is automatically controlled. As a result, virtually all the complexity is removed.
Microsoft APS has manageable costs. APS has lower price/terabyte over other companies by a significant margin. About 2x lower than Teradata, Oracle, Greenplum and others. It’s worth noting that Microsoft’s offering is cheaper than the competition not because of lower quality or missing capabilities, but because of a different business strategy. The strategy of commoditizing markets and then selling higher volumes to make up for lower margins. Given that SQL Server is one of the most popular enterprise databases on the planet, and APS falls under the SQL umbrella, it has enough of a relative advantage that it could easily become the biggest Big Data appliance player of all.
Microsoft APS & Hub and Spoke Architecture
See my article Microsoft APS & Hub and Spoke Architecture about using Microsoft APS to Build a Hub and Spoke EDW Architecture.
These views are my own and may not necessarily reflect those of my current or previous employers.