On the surface it may look similar to an ordinary SQL database, but inside, Azure SQL Data Warehouse is a tremendously powerful Massively Parallel Processing (MPP) engine. It started out as the SQL Server Parallel Data Warehouse (PDW) appliance, which cost in the $1 million USD range (depending on the number of compute nodes), and was later renamed Analytics Platform System (APS). In 2015 it also became available as a cloud service (first as a preview) under the name Azure SQL Data Warehouse. So with Azure SQL Data Warehouse you can get the capacity of a multi-million-dollar server.
The figure describes the general architecture of Azure SQL Data Warehouse. It might look complicated, but most of what the figure shows is internal plumbing you don’t need to worry about.
Creating your first Azure SQL Data Warehouse
You can easily create an Azure SQL Data Warehouse like any other resource through the Azure Portal or through Azure Resource Manager. Just select Databases, and it should appear in the list.
The options are similar to creating an ordinary Azure SQL Database. You need to select a server (which can be a server you already created for your ordinary Azure SQL Databases) and you need to specify a performance tier. There are two main levels of performance:
- Gen 1 / Optimized for Elasticity is the standard level
- Gen 2 / Optimized for Compute is the premium level (more expensive and even faster than the standard level)
The performance also depends on the number of DWUs (Data Warehouse Units). The minimum is 100 DWU, which gives you one compute node. At 200 DWU you get two compute nodes (roughly twice the performance of a single compute node), and so on. Azure SQL Data Warehouse always splits your data into 60 distributions (databases), which are spread evenly across your compute nodes.
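If you are curious how the 60 distributions map onto your compute nodes, you can query the sys.pdw_distributions catalog view (assuming it is available on your instance, as documented for the MPP engine):

```sql
-- Counts distributions per compute node.
-- At 200 DWU (2 compute nodes) this should show 30 distributions on each node.
SELECT pdw_node_id, COUNT(*) AS distributions
FROM sys.pdw_distributions
GROUP BY pdw_node_id;
```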
Note that you can easily change the performance tier later. For your first tests, I recommend starting with at least 200 DWU, so you get at least two compute nodes.
You can pause your Azure SQL Data Warehouse at any time to suspend compute costs (although you still pay for storage) and then resume it when you need it again. So you can use it to process a batch of data and pause it once you have finished. If you still need the data to be readable, you can scale down to 100 DWU instead of pausing.
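Scaling can also be done with plain T-SQL. For example, connected to the master database on your server, and assuming your warehouse is named MyDW (a placeholder name):

```sql
-- Run against the master database; 'MyDW' is a placeholder for your warehouse name.
-- Scales the warehouse down to 100 DWU (data stays readable, unlike pausing).
ALTER DATABASE MyDW MODIFY ( SERVICE_OBJECTIVE = 'DW100' );
```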
Creating your Azure SQL Data Warehouse should take less than 10 minutes.
Finally, if you have created a new server, you will have to configure the firewall so that you can connect to your Azure SQL Data Warehouse. Click on your newly created server and then select “Firewalls and virtual networks”. Then add necessary client IP addresses.
You can connect through the Azure portal, but my preferred way to connect to an Azure SQL Data Warehouse is SQL Server Management Studio. You can connect to Azure SQL Data Warehouse just like to any other database.
To test the connection you can select your database and run this command:
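```sql
SELECT @@VERSION;
```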
It should return something like “Microsoft Azure SQL Data Warehouse – 10.0.1091.15 Apr 4 2018 05:39:08 Copyright (c) Microsoft Corporation”.
Creating your first billion rows
CREATE TABLE dbo.Billion
WITH
(
    DISTRIBUTION = ROUND_ROBIN
   ,CLUSTERED COLUMNSTORE INDEX
)
AS
WITH ten(a) AS
(
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
    UNION ALL
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
)
SELECT r = 1
FROM ten AS t1
CROSS JOIN ten AS t2
CROSS JOIN ten AS t3
CROSS JOIN ten AS t4
CROSS JOIN ten AS t5
CROSS JOIN ten AS t6
CROSS JOIN ten AS t7
CROSS JOIN ten AS t8
CROSS JOIN ten AS t9;
This should take around 3-4 minutes to run at 400 DWUs.
Note the settings DISTRIBUTION = ROUND_ROBIN and CLUSTERED COLUMNSTORE INDEX. We’ll talk about these later.
Check the distribution of your rows
To check the space used and distribution of your rows, run this command:
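```sql
DBCC PDW_SHOWSPACEUSED('dbo.Billion');
```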
The ROWS column shows the number of rows in your table. RESERVED_SPACE tells you how many KB are consumed, and PDW_NODE_ID tells you which node each distribution is stored on. Since we used ROUND_ROBIN distribution, the rows will be almost perfectly evenly distributed.
Distribution and table type
When you run the CREATE TABLE AS SELECT statement, you can choose distribution type and table type.
- ROUND_ROBIN (table is spread evenly among the distributions)
- HASH (table is distributed using a hash key)
- REPLICATE (the whole table is copied to all nodes, which means it will take more diskspace)
In most cases you should use HASH for fact tables and REPLICATE for dimension tables. The choice of HASH key is very important.
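For example, a fact table could be hash-distributed on its customer key, so that all rows for one customer land in the same distribution (the table and columns below are hypothetical, just to show the syntax):

```sql
-- Hypothetical example: rows are placed in distributions by hashing CustomerID,
-- which helps when joins and aggregations use CustomerID.
CREATE TABLE dbo.Fact_Orders
(
    OrderID    BIGINT        NOT NULL
   ,CustomerID INT           NOT NULL
   ,Amount     DECIMAL(18,2) NOT NULL
)
WITH ( DISTRIBUTION = HASH(CustomerID), CLUSTERED COLUMNSTORE INDEX );
```

A good hash key has many distinct values, little skew, and appears frequently in joins.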
- CLUSTERED COLUMNSTORE INDEX
- CLUSTERED INDEX
- HEAP

The table types are the same as for ordinary SQL Server databases. HEAPs are usually fastest for writes, while CLUSTERED COLUMNSTORE INDEXes are usually fastest for reads.
Note that the MPP architecture doesn’t allow PRIMARY KEYS or FOREIGN KEYS on your tables. Also IDENTITY columns should be avoided if possible.
Creating a fact table
Let’s now create a fact table. We can use the previous table (dbo.Billion) as the source to generate a billion fact rows.
CREATE TABLE dbo.Fact_Sales
WITH
(
    DISTRIBUTION = ROUND_ROBIN
   ,CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT
    SalesID    = a
   ,SalesDate  = DATEADD(DAY, b, CAST('2017-01-01' AS DATE))
   ,CustomerID = b / 2
   ,ChannelID  = b / 50 + 1
   ,Amount     = CAST(b AS BIGINT)
FROM dbo.Billion
CROSS APPLY ( SELECT a = NEWID() ) AS ca1
CROSS APPLY ( SELECT b = CAST(CAST(a AS BINARY(1)) AS INT) ) AS ca2;
The fact table should take 7-8 minutes to create at 400 DWUs.
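You can run the same space-usage check on the fact table to see how the rows and reserved space are spread across the nodes:

```sql
DBCC PDW_SHOWSPACEUSED('dbo.Fact_Sales');
```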
Creating dimension tables
For simplicity, I will only use T-SQL to create our sample dimension tables.
CREATE TABLE dbo.Dim_Customer
WITH
(
    DISTRIBUTION = REPLICATE
   ,CLUSTERED INDEX (CustomerID)
)
AS
WITH sixteen(i) AS
(
    SELECT 0  UNION ALL SELECT 1  UNION ALL SELECT 2  UNION ALL SELECT 3
    UNION ALL SELECT 4  UNION ALL SELECT 5  UNION ALL SELECT 6  UNION ALL SELECT 7
    UNION ALL SELECT 8  UNION ALL SELECT 9  UNION ALL SELECT 10 UNION ALL SELECT 11
    UNION ALL SELECT 12 UNION ALL SELECT 13 UNION ALL SELECT 14 UNION ALL SELECT 15
),
twohundredfiftysix(i) AS
(
    SELECT t1.i * 16 + t2.i
    FROM sixteen AS t1
    CROSS JOIN sixteen AS t2
)
SELECT
    CustomerID   = i
   ,CustomerName = 'Customer ' + CAST(i AS VARCHAR(3))
FROM twohundredfiftysix;
And the second dimension table:
CREATE TABLE dbo.Dim_Channel
WITH
(
    DISTRIBUTION = REPLICATE
   ,HEAP
)
AS
SELECT ChannelID, ChannelName
FROM
(
    SELECT ChannelID = 1, ChannelName = 'Webshop'
    UNION ALL SELECT ChannelID = 2, ChannelName = 'Telemarketing'
    UNION ALL SELECT ChannelID = 3, ChannelName = 'Distributor'
    UNION ALL SELECT ChannelID = 4, ChannelName = 'Mailorder'
    UNION ALL SELECT ChannelID = 5, ChannelName = 'VIP store'
    UNION ALL SELECT ChannelID = 6, ChannelName = 'Educational'
) AS t;

DBCC PDW_SHOWSPACEUSED('dbo.Dim_Channel');
The whole list of tables should now look like this. Note that the icon for DISTRIBUTED tables is different from the one for REPLICATED tables.
Querying your data
You can now try your first query. It should complete in less than a second.
SELECT COUNT(*) FROM dbo.Fact_Sales;
Let’s also aggregate the Amount column.
SELECT [Amount] = SUM(Amount), [Count] = COUNT(*) FROM dbo.Fact_Sales;
Finally, let’s do some joins and GROUP BY.
SELECT
    CustomerName
   ,ChannelName
   ,[Amount] = SUM(Amount)
   ,[Count]  = COUNT(*)
FROM dbo.Fact_Sales AS s
JOIN dbo.Dim_Customer AS cu ON s.CustomerID = cu.CustomerID
JOIN dbo.Dim_Channel  AS ca ON s.ChannelID  = ca.ChannelID
GROUP BY cu.CustomerName, ca.ChannelName;
This last query will be a little slower the first time you run it (because the replicated tables have to be copied to all nodes), so run it twice. The second time it should take about 1 second. Not bad for joining, grouping and aggregating 1 billion rows!
ETL/ELT Data processing
With Azure SQL Data Warehouse, it is better to perform ELT (Extract, Load, Transform) than ETL. To get the best possible performance, you want to execute the transformation logic in stored procedures inside the Azure SQL Data Warehouse.
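A minimal sketch of this ELT pattern, assuming a hypothetical staging table dbo.Stage_Sales with the same columns as the fact table (the staging table and procedure name are illustrative, not from this article):

```sql
-- Hypothetical ELT procedure: rebuild the fact table from staged data with CTAS
-- (so the transformation runs inside the MPP engine), then swap it in.
CREATE PROCEDURE dbo.Load_Fact_Sales
AS
BEGIN
    CREATE TABLE dbo.Fact_Sales_New
    WITH ( DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX )
    AS
    SELECT SalesID, SalesDate, CustomerID, ChannelID, Amount
    FROM dbo.Stage_Sales;            -- assumed staging table

    RENAME OBJECT dbo.Fact_Sales     TO Fact_Sales_Old;
    RENAME OBJECT dbo.Fact_Sales_New TO Fact_Sales;
    DROP TABLE dbo.Fact_Sales_Old;
END;
```

The CTAS-plus-rename swap keeps the old fact table queryable until the new one is fully built.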
Azure SQL Data Warehouse works with SSIS, and there is even a specially optimized SSIS PDW connector available. Data can also be loaded through Azure Data Factory, bcp, or PolyBase.
More information on how to get your data into your Azure SQL Data Warehouse is available here.
Using a BI client tool
Azure SQL Data Warehouse supports most BI tools, since it is largely compatible with an ordinary SQL Server. You can also load the data into SQL Server Analysis Services (SSAS), and even run SSAS in DirectQuery mode to get real-time data and push the query logic down so it executes inside the Azure SQL Data Warehouse.
With Power BI, it’s very simple to connect. Just click on Get Data in the bottom left corner.
Monitoring your queries
There are many Dynamic Management Views (DMVs) for monitoring the progress and status of your queries.
You can show all running queries with this command:
SELECT * FROM sys.dm_pdw_exec_requests;
Or you can get a list of your nodes by using this query:
SELECT * FROM sys.dm_pdw_nodes;
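To drill into an individual query, take its request_id from sys.dm_pdw_exec_requests and look at the distributed steps that make it up (the request id below is a placeholder):

```sql
-- 'QID1234' is a placeholder; use a real request_id from sys.dm_pdw_exec_requests.
SELECT *
FROM sys.dm_pdw_request_steps
WHERE request_id = 'QID1234'
ORDER BY step_index;
```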