Sunday, 29 October 2017

Calculating Hash values in SSIS

Case
I want to calculate a hash value for a couple of columns in SSIS. In T-SQL you can use HASHBYTES, but that doesn't work for other sources like flat files, and in SQL Server 2012/2014 the input is limited to 8000 bytes. Is there an alternative for HASHBYTES?
[Screenshot: Calculating a hash value over multiple columns]

Solution
There are several alternatives for the T-SQL HASHBYTES function. First of all, there are various custom components available for SSIS, like SSIS Multiple Hash on CodePlex, but if you don't want to (or cannot) use custom components, you can accomplish the same result with a little .NET scripting. If you really want to stick to T-SQL, you can also stage your files in a table first and then calculate the hash with T-SQL afterwards. This blog will show you the scripting solution.

But first, why do you need a hash? When you want to keep track of history with a Persistent Stage, Data Vault or Data Warehouse, you want to know whether the record from the stage layer is different from the one you have in your historical data layer. You could check each column one by one, but with a whole bunch of columns that is a lot of work and a bit slow.

A hash in ETL is used to generate a single, corresponding (but shorter) value for a whole set of columns. It is stored in the stage table as a new column. If one character changes in one of those columns, the hash value will also be different. When comparing the two records (one from the stage layer and one from the historical layer) you now only have to compare the hash values. If the hash did not change, you know you don't have to process the record in your historical layer. Since you only want to calculate the hash once (in the stage package), you also store it in the historical layer.
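In C# the comparison boils down to a single check (stagedHash and historicalHash are just illustrative names, not part of the package):

// Compare the hash calculated in the stage package with the
// hash stored in the historical layer to detect changes.
if (stagedHash != historicalHash)
{
    // At least one of the hashed columns changed:
    // process this record in the historical layer.
}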

Now it is time to explain the scripting solution.

1) Starting point
The starting point of this example is a Data Flow Task with a Flat File source component.
[Screenshot: Flat File Source]

2) Script Component - Input Columns
Add a new Script Component (transformation) to the Data Flow Task. Give it a suitable name and connect it to your flow. Then edit it and, on the Input Columns pane, select all columns you want to hash. Since we are not changing the existing columns, you can keep the default Usage Type 'ReadOnly'.
[Screenshot: Script Component Input Columns]

Which columns do you want to hash? The three most common options:
  1. If you do not know the Primary Key: select all columns to calculate the hash.
  2. If you do know the Primary Key: select all columns except the Primary Key to calculate the hash.
  3. If the Primary Key consists of multiple columns you could even calculate a separate hash for the key columns only.
3) Script Component - Output Column
We need to store the calculated hash in a new column. Go to the Inputs and Outputs pane and add a new column to Output 0. The data type is string and the size depends on which hash algorithm you want to use. For this example we use the MD5 algorithm, which returns a 128-bit hash. That is 16 bytes, and each byte is formatted as two hexadecimal digits, so as a string it becomes 32 characters.
[Screenshot: Script Component Inputs and Outputs]

4) Script Component - The script preparation
Now we are almost ready to add the actual script. Go to the Script pane and select your scripting language; this example will be in C#. Then hit the Edit Script... button to start the VSTA environment. This is a new instance of Visual Studio and will take a few moments to start.
[Screenshot: Edit Script...]

Optional:
I always start by removing all unnecessary methods and comments to keep the code clean. For this example we do not need the PreExecute and PostExecute methods, and I do not want to keep the default help comments.
[Screenshot: Clean up before start]

5) Script Component - The code
First we need to add two extra usings to shorten the code. Unfold the Namespaces region at the top and add the following usings:
using System.Security.Cryptography;
using System.Text;

Then locate the method called Input0_ProcessInputRow and add a new GetMd5Hash method below it (below its closing }). The new method is copied from this MSDN page; I only changed the encoding to Unicode (see note 1):
static string GetMd5Hash(MD5 md5Hash, string input)
{
    // Convert the input string to a byte array and compute the hash.
    byte[] data = md5Hash.ComputeHash(Encoding.Unicode.GetBytes(input));

    // Create a new Stringbuilder to collect the bytes
    // and create a string.
    StringBuilder sBuilder = new StringBuilder();

    // Loop through each byte of the hashed data
    // and format each one as a hexadecimal string.
    for (int i = 0; i < data.Length; i++)
    {
        sBuilder.Append(data[i].ToString("x2"));
    }

    // Return the hexadecimal string.
    return sBuilder.ToString();
}


And finally, change the code of the existing method Input0_ProcessInputRow to:
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    string Separator = "|";

    // Concatenate all columns with a separator between them,
    // replacing NULL values with an empty string.
    string RowData = (Row.Title_IsNull ? "" : Row.Title.ToString()) + Separator
                   + (Row.FirstName_IsNull ? "" : Row.FirstName.ToString()) + Separator
                   + (Row.MiddleName_IsNull ? "" : Row.MiddleName.ToString()) + Separator
                   + (Row.LastName_IsNull ? "" : Row.LastName.ToString());

    using (MD5 md5Hash = MD5.Create())
    {
        Row.Hash = GetMd5Hash(md5Hash, RowData);
    }
}


The code above first concatenates all columns with a separator between them (see note 2). It also checks whether each value is NULL, because reading a NULL column value in the script would cause an error (see note 3). You will see that it repeats this piece of code for each column before calling the hash method:
(Row.Title_IsNull ? "" : Row.Title.ToString()) + Separator
For the first record in our example it will hash the following text: Mr.|Syed|E|Abbas
And for the third row, which contains a NULL value, it will hash this text: Ms.|Kim||Abercrombie

6) Testing the code
After closing the VSTA editor and clicking OK in the Script Component to close it, add a dummy Derived Column behind it and add a Data Viewer to see the result.
[Screenshot: Similar result to T-SQL HASHBYTES]

Note 1:
When you want the exact same result as with T-SQL HASHBYTES, you have to make sure you use the same encoding; otherwise you get a different hash. In the GetMd5Hash method, on the first line of code, you see Encoding.Unicode.GetBytes(. There are more options besides Unicode, for example ASCII, UTF7, UTF8 and UTF32. However, as long as you don't have to compare hashes generated by two different methods (T-SQL and .NET), it doesn't matter. In this Stack Overflow post you will find more examples.
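A quick way to see why the encoding matters is to compare the byte arrays before they are hashed. This little standalone console sketch (just an illustration, not part of the package) shows that the same text produces different bytes per encoding, and different bytes mean a different hash:

using System;
using System.Text;

class EncodingDemo
{
    static void Main()
    {
        string input = "Mr.|Syed|E|Abbas";

        // Unicode (UTF-16) uses 2 bytes per character, matching T-SQL NVARCHAR.
        byte[] unicodeBytes = Encoding.Unicode.GetBytes(input);
        // UTF8 uses 1 byte per character for plain ASCII text.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(input);

        Console.WriteLine(unicodeBytes.Length); // 32
        Console.WriteLine(utf8Bytes.Length);    // 16
    }
}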

Note 2:
The column separator is added to prevent unwanted matches. Suppose you have these two records with two columns:
Column1  Column2
123      456
12       3456
Without the separator both records get concatenated to 123456 and therefore generate the same hash. With the separator you have two different values to hash: 123|456 and 12|3456. Choose your separator wisely: the number 3 would not be a wise choice in this case.
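You can verify the effect with a few lines of standalone C#:

using System;

class SeparatorDemo
{
    static void Main()
    {
        // Without a separator both records concatenate to the same string...
        Console.WriteLine("123" + "456"); // 123456
        Console.WriteLine("12" + "3456"); // 123456

        // ...but with a separator they stay distinct, so the hashes will differ.
        Console.WriteLine("123" + "|" + "456"); // 123|456
        Console.WriteLine("12" + "|" + "3456"); // 12|3456
    }
}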

Note 3:
In the code you see that the columns are checked for NULL values, because reading a NULL column value would cause an error. The NULL values are replaced with an empty string. However, this shows a bit of an imperfection of this method, because a column with a NULL value isn't the same as an empty string. To overcome this you could use a replacement string that is unlikely to occur in your data. For Numeric and Date data types you can just add an empty string, something like:
(Row.MyNumberColumn_IsNull ? "" : Row.MyNumberColumn.ToString()) + Separator
(Row.MyDateColumn_IsNull ? "" : Row.MyDateColumn.ToString()) + Separator
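If the difference between NULL and an empty string matters in your case, a possible variation inside Input0_ProcessInputRow looks like this (just a sketch; the ~NULL~ marker is an arbitrary example, pick something that will not occur in your data):

// Replace NULLs with a marker that is unlikely to occur in the data,
// so a real empty string and a NULL no longer produce the same hash.
string NullMarker = "~NULL~";
string RowData = (Row.Title_IsNull ? NullMarker : Row.Title.ToString()) + Separator
               + (Row.FirstName_IsNull ? NullMarker : Row.FirstName.ToString());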


Note 4:
MD5 only uses 128 bits and there are better, safer (but also a bit slower) methods to calculate hashes:
SHA and SHA1 - 160 bits
SHA2_256 - 256 bits
SHA2_512 - 512 bits

Safer? As long as you don't use it to hash passwords, you are still OK with MD5.
Better? In rare cases two different strings could return the same MD5 hash, but you have a higher chance of winning the galaxy lottery.

Rather use SHA2_512? Just use this code instead (note that the output column now needs to hold 128 hexadecimal characters instead of 32, because 512 bits is 64 bytes):
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    string Separator = "|";

    // Concatenate all columns with a separator between them,
    // replacing NULL values with an empty string.
    string RowData = (Row.Title_IsNull ? "" : Row.Title.ToString()) + Separator
                   + (Row.FirstName_IsNull ? "" : Row.FirstName.ToString()) + Separator
                   + (Row.MiddleName_IsNull ? "" : Row.MiddleName.ToString()) + Separator
                   + (Row.LastName_IsNull ? "" : Row.LastName.ToString());

    using (SHA512 shaHash = new SHA512Managed())
    {
        Row.hash2 = GetShaHash(shaHash, RowData);
    }
}

static string GetShaHash(SHA512 shaHash, string input)
{
    // Convert the input string to a byte array and compute the hash.
    byte[] data = shaHash.ComputeHash(Encoding.Unicode.GetBytes(input));

    // Create a new Stringbuilder to collect the bytes
    // and create a string.
    StringBuilder sBuilder = new StringBuilder();

    // Loop through each byte of the hashed data
    // and format each one as a hexadecimal string.
    for (int i = 0; i < data.Length; i++)
    {
        sBuilder.Append(data[i].ToString("x2"));
    }

    // Return the hexadecimal string.
    return sBuilder.ToString();
}

Too many columns => too much coding?
In my book Extending SSIS with .NET Scripting you will find a script component example that just loops through all columns to generate the hash. No money to buy it? I used this code as the base for that script, and a rough sketch of the looping idea follows below.
Another alternative is to generate the Script Component and its code with BIML. Here is an example of a Script Component in BIML; getting the hash to work is a bit of a challenge, but doable.
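The looping sketch (my own illustration, not the code from the book): it assumes the output column is named Hash and needs two extra usings, System.Linq and System.Reflection. Test it carefully; special column types such as BLOBs need extra handling.

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    string Separator = "|";
    StringBuilder RowData = new StringBuilder();

    // Take all column properties of the buffer, skipping the _IsNull helpers
    // and the Hash output column itself. Sort by name so the column order
    // (and therefore the hash) is deterministic.
    var columns = Row.GetType().GetProperties()
                     .Where(p => !p.Name.EndsWith("_IsNull") && p.Name != "Hash")
                     .OrderBy(p => p.Name);

    foreach (PropertyInfo column in columns)
    {
        // Check the matching <Column>_IsNull property before reading the value.
        PropertyInfo isNullProperty = Row.GetType().GetProperty(column.Name + "_IsNull");
        bool valueIsNull = isNullProperty != null && (bool)isNullProperty.GetValue(Row, null);

        RowData.Append(valueIsNull ? "" : column.GetValue(Row, null).ToString());
        RowData.Append(Separator);
    }

    using (MD5 md5Hash = MD5.Create())
    {
        Row.Hash = GetMd5Hash(md5Hash, RowData.ToString());
    }
}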

Tuesday, 29 August 2017

Azure Data Lake Store in SSIS

Case
Microsoft just released a new Azure Feature Pack for SSIS with ADLS connectivity. What's new?
[Screenshot: Azure Data Lake Store File System Task]

Solution
It contains four new items and a new connection manager:
- Azure Data Lake Store File System Task
- Foreach ADLS File Enumerator
- Azure Data Lake Store Source
- Azure Data Lake Store Destination

Azure Data Lake Store File System Task
This task only allows you to upload files to or download files from the Azure Data Lake Store. This is similar to the Azure Blob Upload / Download Task. In the near future new operations will be added; a delete file or delete folder operation would be a handy addition.

1) Start
First download and install the new Azure Feature Pack. Then check the Azure Folder in the SSIS Toolbox and drag the Azure Data Lake Store File System Task to the surface. Give it a suitable name.
[Screenshot: Azure Data Lake Store File System Task]

2) Operation and source
Edit the new task and select an Operation. For this example I will use the CopyToADLS operation. First we need to specify where the files are located on the local machine. This is a hardcoded path, but it can be overwritten with an expression. The FileNamePattern is a wildcard using ? or *; I use *.csv to upload all CSV files in that folder. SearchRecursively allows you to find files in subfolders.
[Screenshot: Specify local source]

3) Destination - Connection manager
Next we need to create a new ADLS connection manager or select an existing one.
[Screenshot: ADLS Connection Manager]

As Host you can use the URL property from the ADLS Overview page. Go to the Azure Portal and copy that URL.
[Screenshot: URL = ADLS Host]

For this example I will use the easier authentication option: Azure AD User Identity. It uses your email address and password from Azure. The Azure AD Service Identity will be handled in a later post.
[Screenshot: ADLS Connection Manager]

When hitting OK or Test Connection, an Azure login page will open, where you need to log in and confirm that SSIS can connect to that ADLS.

4) Destination - ADLS folder
Next we need to specify a folder name or path. You can either specify the name of an existing folder or a new folder name that will be created on execution. To find out which folders already exist, you can use the Data Explorer page in ADLS.
[Screenshot: Data Explorer]

[Screenshot: Specify Folder]

The FileExpiry option lets you specify the date that will be used to expire the files in ADLS. You can leave it empty to never expire them.

5) The result
Now run the task/package to see the result. Use Data Explorer in ADLS to see the actual result.
[Screenshot: Data Explorer]

Foreach ADLS File Enumerator
The Foreach ADLS File Enumerator is a new enumerator for the Foreach Loop Container. It allows you to loop through an ADLS folder and return the paths of the files. It is very similar to the Azure Blob Enumerator. You can use this enumerator with the Azure Data Lake Store Source in the Data Flow Task.

1) Select Enumerator
When you select the ADLS File Enumerator, you need to specify the Connection Manager (see step 3 of the task above), the remote folder (use the Data Explorer to find an existing folder), the wildcard, and the Search recursive option.
[Screenshot: Collection]

2) Variable Mappings
In the Variable Mappings pane you need to map the first item of the collection (zero-based, so index 0) to an SSIS string variable.
[Screenshot: Variable Mappings]

3) The Result
To show the content of the variable during execution, I added a simple Script Task and a little C# code: MessageBox.Show(Dts.Variables["User::filepath"].Value.ToString());
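For completeness, a minimal sketch of the Script Task code (it assumes the variable User::filepath is selected as a ReadOnlyVariable on the task and that System.Windows.Forms is referenced):

public void Main()
{
    // Show the file path that the Foreach ADLS File Enumerator
    // put in the mapped variable for this iteration.
    System.Windows.Forms.MessageBox.Show(Dts.Variables["User::filepath"].Value.ToString());

    Dts.TaskResult = (int)ScriptResults.Success;
}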
[Screenshot: MessageBox.Show]

Azure Data Lake Store Source
This allows you to use files from the Azure Data Lake Store as a source in SSIS. Again very similar to the Azure Blob Source.

1) Edit Source
Drag the Azure Data Lake Store Source to the surface and give it a suitable name. Then edit the source and specify the connection manager, file path and format. You cannot specify the data type or size; in this first test everything became (DT_WSTR,100).
[Screenshot: Azure Data Lake Store Source]

2) The Result
To test the result (with a very small file) I added a dummy Derived Column and a Data Viewer.
[Screenshot: Azure Data Lake Store Source]

Azure Data Lake Store Destination
This allows you to stream your Data Flow Task data to Azure Data Lake Store. Again very similar to the Azure Blob Destination.

1) Edit Destination
Add an Azure Data Lake Store Destination after your source or transformation and give it a suitable name. You can specify the connection manager, file path and file format option.
[Screenshot: ADLS destination - test3.csv]

2) The Result
To test the result, run the package and open the Data Explorer in ADLS to see the result.
[Screenshot: Data Explorer]

Conclusion
A much needed ADLS extension of the Azure Feature Pack, but nothing spectacular compared to the Blob Storage items in this feature pack. Hopefully the Azure Data Lake Store File System Task will soon be extended with new actions, and perhaps they could also introduce an Azure Blob Storage File System Task.