Introduction to Azure SQL Data Warehouse

April 12, 2018

On the surface it may look similar to an ordinary SQL Database, but inside, Azure SQL Data Warehouse is a tremendously powerful Massively Parallel Processing (MPP) engine. Originally there was the SQL Server Parallel Data Warehouse (PDW) appliance, which cost in the $1 million USD range (depending on the number of compute nodes). It was later renamed Analytics Platform System (APS). In 2015 it also became available (first as a preview) as a cloud service under the name Azure SQL Data Warehouse. So, with Azure SQL Data Warehouse you can get the capacity of a multi-million dollar server.

The figure below describes the general architecture of Azure SQL Data Warehouse. It might look complicated, but most of it consists of internals you don't need to worry about.

Azure SQL Data Warehouse architecture

Creating your first Azure SQL Data Warehouse

You can easily create an Azure SQL Data Warehouse like any other resource through the Azure Portal or through Azure Resource Manager. Just select Databases, and it should appear in the list.

Azure Portal create SQL data warehouse

The options are similar to creating an ordinary Azure SQL Database. You need to select a server (which can be a server you already created for your ordinary Azure SQL Databases) and you need to specify a performance tier. There are two main levels of performance:

  • Gen 1 / Optimized for Elasticity is the standard level
  • Gen 2 / Optimized for Compute is the premium level (more expensive and even faster than the standard level)

The performance also depends on the number of Data Warehouse Units (DWUs). The minimum is 100 DWU, which gives you one compute node. At 200 DWU you get two compute nodes (roughly twice the performance of a single node), and so on. Azure SQL Data Warehouse always splits your data into 60 distributions (databases), which are spread evenly among your compute nodes.

Note that you can easily change the performance tier later. For your first tests, I recommend starting with at least 200 DWUs, so you get a minimum of two compute nodes.
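If you prefer T-SQL over the portal, you can also change the performance tier with ALTER DATABASE. This must be run while connected to the master database on the logical server, and the database name MyDW below is just a placeholder for your own data warehouse name:

```sql
-- Run while connected to the master database on the logical server.
-- 'MyDW' is a placeholder for your data warehouse name.
ALTER DATABASE MyDW MODIFY (SERVICE_OBJECTIVE = 'DW200');

-- Check the current service objective (also from master):
SELECT db.name, ds.service_objective
FROM sys.database_service_objectives AS ds
JOIN sys.databases AS db ON ds.database_id = db.database_id;
```

The scale operation takes the data warehouse briefly offline while the distributions are reattached to the new set of compute nodes.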

You can pause your Azure SQL Data Warehouse at any time to suspend compute costs (you still pay for storage) and resume it again when you need it. So you can use it to process a batch of data and then pause it once you have finished. If you still need the data to be readable, you can scale down to 100 DWUs instead of pausing.

Azure Portal create SQL data warehouse 2

Creating your Azure SQL Data Warehouse should take less than 10 minutes.

Finally, if you have created a new server, you will have to configure the firewall so that you can connect to your Azure SQL Data Warehouse. Click on your newly created server, select “Firewalls and virtual networks”, and add the necessary client IP addresses.


Connecting

You can connect through the Azure portal, but my preferred way to connect to Azure SQL Data Warehouse is SQL Server Management Studio (SSMS). You can connect to it just like any other SQL Server database.

Azure SQL data warehouse connect

To test the connection you can select your database and run this command:

SELECT @@VERSION

It should return something like “Microsoft Azure SQL Data Warehouse – 10.0.1091.15 Apr 4 2018 05:39:08 Copyright (c) Microsoft Corporation”.

Creating your first billion rows

CREATE TABLE dbo.Billion
WITH (
 DISTRIBUTION = ROUND_ROBIN
 ,CLUSTERED COLUMNSTORE INDEX
)
AS
WITH ten(a) AS
(
 SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
)
SELECT r = 1
FROM ten AS t1 CROSS JOIN ten AS t2 CROSS JOIN ten AS t3 CROSS JOIN ten AS t4 CROSS JOIN ten AS t5 CROSS JOIN ten AS t6 CROSS JOIN ten AS t7 CROSS JOIN ten AS t8 CROSS JOIN ten AS t9

This should take around 3-4 minutes to run at 400 DWUs.

Note the options DISTRIBUTION = ROUND_ROBIN and CLUSTERED COLUMNSTORE INDEX. We will talk about these later.

Check the distribution of your rows

To check the space used and distribution of your rows, run this command:

DBCC PDW_SHOWSPACEUSED('dbo.Billion');

Azure SQL Data Warehouse dbcc showspaceused

The ROWS column shows the number of rows in each distribution of your table. RESERVED_SPACE tells you how many KB are consumed, and PDW_NODE_ID tells you which compute node each distribution is stored on. Since we used ROUND_ROBIN distribution, the rows are almost perfectly evenly distributed.
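If you prefer a queryable alternative to DBCC, a similar per-distribution row count can be obtained from the catalog views and DMVs. This is a sketch of the commonly documented join pattern; the exact column set can vary between service versions:

```sql
-- Per-distribution row counts for dbo.Billion via the catalog DMVs.
SELECT
 ps.distribution_id
 ,ps.pdw_node_id
 ,SUM(ps.row_count) AS row_count
FROM sys.dm_pdw_nodes_db_partition_stats AS ps
JOIN sys.pdw_nodes_tables AS nt
 ON ps.object_id = nt.object_id
 AND ps.pdw_node_id = nt.pdw_node_id
 AND ps.distribution_id = nt.distribution_id
JOIN sys.pdw_table_mappings AS tm
 ON nt.name = tm.physical_name
JOIN sys.tables AS t
 ON tm.object_id = t.object_id
WHERE t.name = 'Billion'
GROUP BY ps.distribution_id, ps.pdw_node_id
ORDER BY ps.distribution_id;
```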

Distribution and table type

When you run the CREATE TABLE AS SELECT statement, you can choose distribution type and table type.

Distribution types:

  • ROUND_ROBIN (table is spread evenly among the distributions)
  • HASH (table is distributed using a hash key)
  • REPLICATE (the whole table is copied to all nodes, which means it will take more diskspace)

In most cases you should use HASH for fact tables and REPLICATE for dimension tables. The choice of hash key is very important: pick a column with many distinct values that is frequently used in joins or GROUP BY, so the rows spread evenly across the 60 distributions and joins can run without data movement.
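As a sketch, a hash-distributed copy of the fact table we create later in this article could look like this (using CustomerID as the distribution key is just an illustration; pick the key that fits your own join patterns):

```sql
-- Hash-distributed copy of the fact table (run after dbo.Fact_Sales exists).
CREATE TABLE dbo.Fact_Sales_Hashed
WITH (
 DISTRIBUTION = HASH(CustomerID)
 ,CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT SalesID, SalesDate, CustomerID, ChannelID, Amount
FROM dbo.Fact_Sales;
```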

Table types:

  • CLUSTERED COLUMNSTORE INDEX
  • HEAP
  • CLUSTERED INDEX

The table types are the same as for ordinary SQL Server databases. Heaps are usually the fastest for writes; clustered columnstore indexes are usually the fastest for reads.

Note that the MPP architecture doesn't allow PRIMARY KEY or FOREIGN KEY constraints on your tables. IDENTITY columns should also be avoided if possible.
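If you need surrogate keys without IDENTITY, one common workaround is to generate them with ROW_NUMBER() in a CTAS. This is only a sketch; the staging table dbo.Stage_Sales is a hypothetical source you would replace with your own:

```sql
-- Generate surrogate keys without IDENTITY (dbo.Stage_Sales is hypothetical).
CREATE TABLE dbo.Sales_WithKeys
WITH (
 DISTRIBUTION = ROUND_ROBIN
 ,CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT
 SalesKey = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
 ,*
FROM dbo.Stage_Sales;
```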

Creating a fact table

Let’s now create a fact table. We can use the previous table (dbo.Billion) as a source to generate a billion fact rows.

CREATE TABLE dbo.Fact_Sales
WITH (
DISTRIBUTION = ROUND_ROBIN
,CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT
SalesID = a
,SalesDate = DATEADD(DAY, b, CAST('2017-01-01' AS DATE))
,CustomerID = b / 2
,ChannelID = b / 50 + 1
,Amount = CAST(b AS BIGINT)
FROM dbo.Billion
CROSS APPLY (
SELECT a = NEWID()
) AS ca1
CROSS APPLY (
SELECT b = CAST(CAST(a AS BINARY(1)) AS INT)
) AS ca2;

The fact table should take 7-8 minutes to create at 400 DWUs.

Creating dimension tables

If you want more realistic dimension tables, you could upload some text files into BLOB Storage using Azure Storage Explorer and import them into Azure SQL Data Warehouse using PolyBase T-SQL queries.
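A PolyBase load follows the pattern sketched below. The storage account name, container, credential secret, column list and file layout are all placeholders you would replace with your own:

```sql
-- One-time setup: credential and external data source pointing at BLOB Storage.
CREATE MASTER KEY;

CREATE DATABASE SCOPED CREDENTIAL BlobCredential
WITH IDENTITY = 'user', SECRET = '<storage account key>';

CREATE EXTERNAL DATA SOURCE BlobStorage
WITH (
 TYPE = HADOOP
 ,LOCATION = 'wasbs://mycontainer@mystorageaccount.blob.core.windows.net'
 ,CREDENTIAL = BlobCredential
);

CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT, FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

-- External table over the uploaded files, then load with CTAS.
CREATE EXTERNAL TABLE dbo.Ext_Customer (
 CustomerID INT NOT NULL
 ,CustomerName VARCHAR(100) NOT NULL
)
WITH (LOCATION = '/customers/', DATA_SOURCE = BlobStorage, FILE_FORMAT = CsvFormat);

CREATE TABLE dbo.Dim_Customer_Loaded
WITH (DISTRIBUTION = REPLICATE, CLUSTERED INDEX (CustomerID))
AS SELECT * FROM dbo.Ext_Customer;
```

The CTAS at the end is where the parallel load actually happens: each compute node reads its share of the files directly from storage.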

For simplicity, I will only use T-SQL to create our sample dimension tables.

CREATE TABLE dbo.Dim_Customer
WITH (
DISTRIBUTION = REPLICATE
,CLUSTERED INDEX (CustomerID)
)
AS
WITH sixteen(i) AS
(
SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL
SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL
SELECT 8 UNION ALL SELECT 9 UNION ALL SELECT 10 UNION ALL SELECT 11 UNION ALL
SELECT 12 UNION ALL SELECT 13 UNION ALL SELECT 14 UNION ALL SELECT 15
),
twohundredfiftysix(i) AS
(
SELECT t1.i * 16 + t2.i
FROM sixteen AS t1 CROSS JOIN sixteen AS t2
)
SELECT
CustomerID = i
,CustomerName = 'Customer ' + CAST(i AS VARCHAR(3))
FROM twohundredfiftysix;

And the second dimension table:


CREATE TABLE dbo.Dim_Channel
WITH (
DISTRIBUTION = REPLICATE
,HEAP
)
AS
SELECT
ChannelID
,ChannelName
FROM (
SELECT ChannelID = 1, ChannelName = 'Webshop' UNION ALL
SELECT ChannelID = 2, ChannelName = 'Telemarketing' UNION ALL
SELECT ChannelID = 3, ChannelName = 'Distributor' UNION ALL
SELECT ChannelID = 4, ChannelName = 'Mailorder' UNION ALL
SELECT ChannelID = 5, ChannelName = 'VIP store' UNION ALL
SELECT ChannelID = 6, ChannelName = 'Educational'
) AS t;

DBCC PDW_SHOWSPACEUSED('dbo.Dim_Channel');

The full list of tables should now look like this. Note that the icon for distributed tables differs from the icon for replicated tables.

SQL Data Warehouse tables

Querying your data

You can now try your first query. It should complete in less than a second.

SELECT COUNT(*) FROM dbo.Fact_Sales;

Let’s also aggregate the Amount column.

SELECT [Amount] = SUM(Amount), [Count] = COUNT(*) FROM dbo.Fact_Sales;

Finally, let’s do some joins and GROUP BY.

SELECT CustomerName, ChannelName, [Amount] = SUM(Amount), [Count] = COUNT(*)
FROM dbo.Fact_Sales AS s
JOIN dbo.Dim_Customer AS cu ON s.CustomerID = cu.CustomerID
JOIN dbo.Dim_Channel AS ca ON s.ChannelID = ca.ChannelID
GROUP BY cu.CustomerName, ca.ChannelName;

This last query will be a little slower the first time you run it (because the replicated tables must first be copied to all compute nodes), so run it twice. The second time it should take about one second. Not bad for joining, grouping and aggregating a billion rows!
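To see how a query will be executed across the distributions, you can prefix it with EXPLAIN, which returns the distributed query plan as XML instead of running the query:

```sql
EXPLAIN
SELECT CustomerName, [Amount] = SUM(Amount)
FROM dbo.Fact_Sales AS s
JOIN dbo.Dim_Customer AS cu ON s.CustomerID = cu.CustomerID
GROUP BY cu.CustomerName;
```

Look for data movement steps (such as shuffle or broadcast moves) in the plan; a well-chosen distribution strategy minimizes them.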

ETL/ELT Data processing

With Azure SQL Data Warehouse, it is better to perform ELT (Extract, Load, Transform) than ETL. To get the best possible performance, you want to execute the transformations and business logic in stored procedures inside the Azure SQL Data Warehouse.
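An ELT transformation step might look like the sketch below: load the raw data first, then transform it inside the warehouse with a CTAS-based stored procedure and swap the result in with RENAME OBJECT. The table and procedure names here are illustrative:

```sql
CREATE PROCEDURE dbo.Transform_Sales
AS
BEGIN
    -- Rebuild the target table from staged data with CTAS
    -- (dbo.Stage_Sales is a hypothetical staging table).
    CREATE TABLE dbo.Fact_Sales_New
    WITH (DISTRIBUTION = HASH(CustomerID), CLUSTERED COLUMNSTORE INDEX)
    AS
    SELECT SalesID, SalesDate, CustomerID, ChannelID, Amount
    FROM dbo.Stage_Sales
    WHERE Amount IS NOT NULL;

    -- Swap the new table in, then drop the old one.
    RENAME OBJECT dbo.Fact_Sales TO Fact_Sales_Old;
    RENAME OBJECT dbo.Fact_Sales_New TO Fact_Sales;
    DROP TABLE dbo.Fact_Sales_Old;
END;
```

CTAS is the idiomatic transformation primitive here because it writes in parallel across all distributions, unlike row-by-row INSERT/UPDATE logic.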

Azure SQL Data Warehouse works with SSIS, and there is even a special optimized SSIS PDW connector available. Data can also be loaded through Azure Data Factory, bcp, or PolyBase.

More information on how to get your data into your Azure SQL Data Warehouse is available in Microsoft's documentation.

Using a BI client tool

Azure SQL Data Warehouse supports most BI tools, since it is largely compatible with an ordinary SQL Server. You can also load the data into SQL Server Analysis Services (SSAS), which you could even run in DirectQuery mode to get real-time data and push the query logic down so it executes inside the Azure SQL Data Warehouse.

With Power BI, it’s very simple to connect. Just click on Get Data in the bottom left corner.

Azure SQL Data Warehouse Power BI

Azure SQL Data Warehouse Power BI 2

Azure SQL Data Warehouse Power BI 3

DMVs

There are many Dynamic Management Views (DMVs) for monitoring the progress and status of your queries.

You can show all running queries with this command:

SELECT * FROM sys.dm_pdw_exec_requests;

Or you can get a list of your nodes by using this query:

SELECT * FROM sys.dm_pdw_nodes;
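A useful pattern is to label your queries so you can find them again in the DMVs; OPTION (LABEL = ...) exists for exactly this purpose:

```sql
-- Tag a query with a label...
SELECT COUNT(*) FROM dbo.Fact_Sales
OPTION (LABEL = 'Fact_Sales count');

-- ...then find its status and elapsed time in the request DMV.
SELECT request_id, status, total_elapsed_time, command
FROM sys.dm_pdw_exec_requests
WHERE [label] = 'Fact_Sales count';
```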
