Using HASHBYTES() to compare columns
Recently, while at a client engagement, I was building some SSIS packages an issue came up where they didn?t want to use the CHECKSUM() function in TSQL to do column comparisons because the results can be inaccurate on some rare occasions. I personally have never come across this but others here at Pragmatic Works have. So we have two options freely available to work around this issue. The first is the third party component plugin that you can get free at codeplex called Multiple Hash. The client wasn?t comfortable with having to install this component on multiple servers throughout the environment so that option wasn?t available to me. Instead I had to use the HASHBYTES() function in TSQL.
HASHBYTES() is much more reliable than checksum when it comes to producing accurate results, but it comes at a slight cost.
The first thing to note is how to construct the HASHBYTES() function. In the first part you tell the function which algorithm you are going to use. I?m using SHA1, but be aware that they single tics ? ? are required followed by a comma. Then you must concatenate the columns you wish to use together as seen below.
There you can see its already a bit more arduous than using CHECKSUM(), but not that big of a deal to concatenate a bunch of columns. Lets look at the results.
Uh-Oh here is our first problem. HASHBYTES() doesn?t work with NULL values inside any columns. So we?ve got to handle the Nulls in our query using the ISNULL() function.
Now the results look like this:
Then next thing that you have to look at is how HASHBYTES() handles(or more accurately doesn?t handle) Data types. Here the ID column is an INT data type, but the same holds true for any non-string data type.
We get an error saying that the data type is wrong for the HASHBYTES() function
So now we have to CAST every column that is a non-string data type.
Now after this fix our results look better.
So as you can see already there will be a good deal more T-SQL coding involved with using HASHBYTES then with CHECKSUM(). But this isn?t all. The last little gotcha isn?t quite as obvious as the first two. Lets go back to our Null handling query.
I?ve gone and edited the data some for this example. Please also note that HASHBYTES() is case sensitive meaning that if you have the same spelling but different casing at an individual character level the hash value returned will be different.
The rows are different from one another but when concatenated together for the HASBYTES() function they produce the same exact value. So to handle this we have to update our code again. We are going to add a rarely used character to the concatenation so ensure that they results will return correctly.
Here I chose to use a pipe to basically delimit the columns thereby making them different from each other. The result is much better.
If we take a quick look at results of the two concatenations we can see why we get the different results
The delimited column is obviously different from one row to the next, and the Non-Delimited column is exactly the same for each row.
While HASHBYTES() is definitely more accurate and reliable than using CHECKSUM(), there are some hoops to jump through to insure you get the results you need. I hope this helps you guys out.