SCD Using MERGE and Table Data Types in SSIS - 1 of 2
Download Source Files: Here
In the first of what hope to be many posts. I’ve taken my first step into really utilising some of the many enhancements delivered in the most recent release of Microsoft SQL Server.
Two of the most eagerly awaited enhancements to come out from Microsoft have been the advent of table data types and the new MERGE syntax.
Table data types now allow SQL developers to pass table variables directly into stored procedures as a parameter, in the past ingenious way had been devised to do this with the use of temp tables or with XML, but table data types now make these process redundant and offer great performance improvements.
For the data warehouse developers out there the MERGE statement now allows INSERT, UPDATE and DELETE statements to be applied to a table in a single pass, simplify the SCD (Slowly Changing Dimension) process.
The Challenge
The challenge I set my self was to populate a table variable with data representing new dimensional records, pass through the variable into a stored procedure that would then use the table variable to compare against the current records in a dimension using the new MERGE syntax and apply the appropriate upserts (updates and inserts). Going beyond that I wanted to be able to call this stored procedure directly within an SSIS package without writing the new records out to a temporary staging table.
Creating the sample database environment
The Customers dimension for a data warehouse is a typical candidate for a type 2 slowly changing dimension. For this example I will use the Customers table in the Northwind database as my source data set. I’ve created my own sample database where I’ve created a DimCustomer table that will be the destination for our data, I will also create the table data type and stored proc on this database.
Lets get started, first create the sample database:
IF DB_ID('Sample') IS NULL
CREATE DATABASE Sample;
GO
Now create the DimCustomer dimension table, this will be the target for our records.
IF OBJECT_ID('DimCustomer',N'U') IS NOT NULL
DROP TABLE DimCustomer;
CREATE TABLE DimCustomer
(
CustomerID VARCHAR(5),
ContactName VARCHAR(40),
ContactTitle VARCHAR(30),
Address VARCHAR(60),
City VARCHAR(15),
Region VARCHAR(15),
PostalCode VARCHAR(10),
Country VARCHAR(15),
Phone VARCHAR(24),
Fax VARCHAR(24),
CurrentRecord BIT,
ValidFromDate DATE,
ValidToDate DATE
);
GO
You will notice that this table has a valid from and to date that will be used to track the period that the data is valid, there is also a flag CurrentRecord that will indicate the current record for each customer.
Now we need to create the table data type that will be used to pass into our stored procedure
IF TYPE_ID('DimCustomer') IS NOT NULL
DROP TYPE dbo.DimCustomer
CREATE TYPE dbo.DimCustomer AS TABLE
(
CustomerID VARCHAR(5),
ContactName VARCHAR(40),
ContactTitle VARCHAR(30),
Address VARCHAR(60),
City VARCHAR(15),
Region VARCHAR(15),
PostalCode VARCHAR(10),
Country VARCHAR(15),
Phone VARCHAR(24),
Fax VARCHAR(24),
ValidFromDate DATE
);
GO
Notice that I have left out the CurrentRecord, ValidToDate attributes from the table data type, these fields will be managed within the stored procedure itself.
Now to create the actual stored procedure, and thanks to Warren Thornwaite of the Kimball Group who posted a really great article on using the MERGE statement for SCD, which can be found here:
http://www.kimballgroup.com/html/08dt/KU107_UsingSQL_MERGESlowlyChangingDimension.pdf
IF OBJECT_ID('sp_SCDCustomer',N'P') IS NOT NULL
DROP PROC sp_SCDCustomer
GO
CREATE PROC dbo.sp_SCDCustomer(@NewCustomer AS dbo.DimCustomer READONLY)
AS
BEGIN
INSERT INTO DimCustomer --These are the new updated records
(
CustomerID,
ContactName,
ContactTitle,
Address,
City,
Region,
PostalCode,
Country,
Phone,
Fax,
CurrentRecord,
ValidFromDate,
ValidToDate
)
SELECT
CustomerID,
ContactName,
ContactTitle,
Address,
City,
Region,
PostalCode,
Country,
Phone,
Fax,
CurrentRecord,
ValidFromDate,
ValidToDate
FROM
(
MERGE DimCustomer AS ExistingCustomer
USING @NewCustomer AS NewCustomer
ON ExistingCustomer.CustomerID = NewCustomer.CustomerID
WHEN NOT MATCHED
THEN --Brand New Customers
INSERT
(
CustomerID,
ContactName,
ContactTitle,
Address,
City,
Region,
PostalCode,
Country,
Phone,
Fax,
CurrentRecord,
ValidFromDate,
ValidToDate
)
VALUES
(
NewCustomer.CustomerID,
NewCustomer.ContactName,
NewCustomer.ContactTitle,
NewCustomer.Address,
NewCustomer.City,
NewCustomer.Region,
NewCustomer.PostalCode,
NewCustomer.Country,
NewCustomer.Phone,
NewCustomer.Fax,
1, -- CurentRecord
NewCustomer.ValidFromDate, -- ValidFromDate
'2050-01-01' -- ValidToDate
)
WHEN MATCHED -- Expired records
AND ExistingCustomer.CurrentRecord = 1
AND ExistingCustomer.CustomerID = NewCustomer.CustomerID
AND( ExistingCustomer.ContactName != NewCustomer.ContactName
OR ExistingCustomer.ContactTitle != NewCustomer.ContactTitle
OR ExistingCustomer.Address != NewCustomer.Address
OR ExistingCustomer.City != NewCustomer.City
OR ExistingCustomer.Region != NewCustomer.Region
OR ExistingCustomer.PostalCode != NewCustomer.PostalCode
OR ExistingCustomer.Country != NewCustomer.Country
OR ExistingCustomer.Phone != NewCustomer.Phone
OR ExistingCustomer.Fax != NewCustomer.Fax
)
THEN
UPDATE
SET CurrentRecord = 0,
ValidToDate = CAST(DATEADD(d,-1,NewCustomer.ValidFromDate) AS DATE)
OUTPUT --Newly updated records, used by the outer insert statement
$Action Action_Out,
NewCustomer.CustomerID,
NewCustomer.ContactName,
NewCustomer.ContactTitle,
NewCustomer.Address,
NewCustomer.City,
NewCustomer.Region,
NewCustomer.PostalCode,
NewCustomer.Country,
NewCustomer.Phone,
NewCustomer.Fax,
1 AS CurrentRecord,
NewCustomer.ValidFromDate AS ValidFromDate,
CAST('2050-01-01' AS DATE) AS ValidToDate
) AS MERGE_OUT
WHERE MERGE_OUT.Action_Out = 'UPDATE';
END
Now you have created all the plumbing on the SQL Server, we can turn our attention to building the SSIS package that will call this.
Creating the SSIS Package
First we are going to have to create to connections, one to the northwind database called Northwind.OLEDB and the other to the Sample database that we just created called Sample.NET. Its important you use .Net Provider for this second connection otherwise the script task will fail later.
Now that we’ve created the connection we need to declare two package level variable
1) objTable an OBJECT data type that will store our transformed record from the Northwind database,
2) dtDate of DATETIME data type, which wil lstore a data that represent the run date of the EL process. This will be used as the validFromDate on our transformed records.
Drag a dataflow component and a script component onto the design surface.
In the Data flow we are going to add an OLE DB Source that will pull the data from the Northwind database. A derived column that will use the dtDate variable to assign a ValidFromDate and a Recordset Destination that will use the ojbTable variable to hold all the data to be passed into the stored procedure.
In the OLE DB Source use the following SQL to select data from the Northwind database:
SELECT
Customers.CustomerID,
Customers.ContactName,
Customers.ContactTitle,
Customers.Address,
Customers.City,
Customers.Region,
Customers.PostalCode,
Customers.Country,
Customers.Phone,
Customers.Fax
FROM
Customers WITH (NOLOCK)
In the Derived Column component drag the dtDate variable into the expression column, ensure to rename the column ValidFromDate
Finally in the RecordSet Destination select all the columns that are available in the pipeline.
Now go back to the Control flow and the Script Task, ensure that you map the objTable variable as a ReadOnlyVaraible. Now enter into the script editor.
In the sub main past the following code. Having done this the package is now ready to run.
Public Sub Main()
Dim conn As New SqlConnection
Dim cmd As New SqlCommand
Dim dt As New DataTable
Dim ad As New Data.OleDb.OleDbDataAdapter
'Ensure this is an ADO.NET connection
conn = DirectCast(Dts.Connections("Sample.NET").AcquireConnection(Dts.Transaction), SqlClient.SqlConnection)
ad.Fill(dt, Dts.Variables("User::objTable").Value)
cmd = conn.CreateCommand()
cmd.CommandType = System.Data.CommandType.StoredProcedure
cmd.CommandText = "dbo.sp_SCDCustomer"
cmd.Parameters.AddWithValue("@NewCustomer", dt)
cmd.ExecuteNonQuery()
Dts.TaskResult = ScriptResults.Success
End Sub
Execute the package, you should now see all the data populated in the DimCustomer table.
Make some changes to the data back in the Northwind database and increment the value in the dtDate variable back in the SSIS package, and now re-run the package. You should find that the changed records are inserted into the DimCustomer table and the old records ValidToDate and CurrentRecord flag have been updated.
In a future post I’m going to show you how to do the upsert directly in the data flow, without using the recordset object.
Thanks for sharing a good article. This is very interesting and I like this type of article only. I have always read important article like this.
ReplyDeleteC C++ Training in Chennai
C Training in Chennai
core java training in chennai
javascript training in chennai
javascript course in chennai
core java training in chennai
core java training
Hello! This is my first visit to your blog! We are a team of volunteers and starting a new initiative in a community in the same niche. Your blog provided us useful information to work on. You have done an outstanding job.
ReplyDeleteC and C++ Training Institute in chennai | C and C++ Training Institute in anna nagar | C and C++ Training Institute in omr | C and C++ Training Institute in porur | C and C++ Training Institute in tambaram | C and C++ Training Institute in velachery