Update: This post discusses shrinking the data file of a Microsoft SQL Server database, needed in this case because the initial size of the database had been misconfigured. For shrinking the log file you must use type 1, not 0, in the query. Also, a very pertinent comment from NULLable warns of the performance issues caused by the file fragmentation that results from shrinking database files.
I had this situation when the available space on the SQL database disk was less than the size of the database, in this case the temp database. Someone had wrongly configured the database to have an initial size of 64GB. Changing the size of the file in SQL Server Management Studio doesn't work, because it tries to create a different file, fill it with the data and then replace the original file. No space for that. Also, it is damn slow, even if you have the space (I have no idea why). Shrink doesn't work either, because the database will not go smaller than the configured initial size. Time to do it command line style. Well, with SQL queries, but you know what I mean.
DBCC SHRINKFILE (tempdev, 3000); -- new file size in MB
As you can see, you need to know not only the name of the database, but also the logical name of the database file that you want to shrink. It is not even a string; it behaves like a keyword in the DBCC SHRINKFILE command. Even though this works, one would benefit from encapsulating it into a stored procedure. Here is the final code:
CREATE PROC ShrinkDatabase (@DbName NVARCHAR(100), @SizeMB INT)
AS
BEGIN
  DECLARE @filename NVARCHAR(255)
  DECLARE @sql NVARCHAR(MAX) = 'SELECT @filename = dbf.name FROM [' + REPLACE(@DbName,'''','''''') + '].sys.database_files dbf WHERE dbf.[type]=0'
  EXEC sp_executesql @sql, N'@filename NVARCHAR(255) OUTPUT', @filename OUTPUT

  SET @sql = 'USE [' + REPLACE(@DbName,'''','''''') + ']; DBCC SHRINKFILE (' + REPLACE(@filename,'''','''''') + ', ' + CONVERT(NVARCHAR(100), @SizeMB) + ');'
  EXEC sp_executesql @sql
END
Create it in the master database and use it like this:
EXEC master.dbo.ShrinkDatabase 'tempdb',3000
Take note that you cannot use this to "shrink up" the database: if the value you set is larger than the current size, the file keeps its current size and so does the initial size setting. Also note that this stored procedure only shrinks the data file, not the log file (dbf.[type]=0).
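As the update above mentions, the log file can be shrunk the same way by targeting the file with type 1 instead of type 0. A minimal sketch for tempdb, assuming the default logical log name templog (check sys.database_files first):
SELECT name, type_desc FROM tempdb.sys.database_files; -- verify the logical names and file types
USE tempdb;
DBCC SHRINKFILE (templog, 1000); -- new log file size in MB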
Relational databases have the great advantage of being able to use indexes. These are constructs that trade space and the speed of inserts and updates for query speed. This introduces a problem: what do you do when you have a high input and high output application? How can you make your database fast for both changes and queries?
The traditional solution is a write-database and a read-database. There are background services that ensure that the read-database is updated as fast as necessary with the new data, most of the time also doing extra computation and caching for the information one needs extracted. OLAP systems are a common solution, where the aggregated data required by various reports is computed and stored, ready to be read by your applications. I am not saying this is a bad idea, in fact I am saying this is the best idea for this scenario. The problem that I have with it is that you can hardly automate the process. You need to know what you want to read, you need to write the software to create the data and aggregate it into what you need.
So I decided to try to build a system that obeys the following rules:
The speed of insert and update operations needs to be unhindered by indexes. In fact, changes to the original data structures should be avoided altogether.
Select operations need to benefit from the indexing system.
The system must be able to determine by itself the data structures needed to cover the first two rules.
Optionally, there should be a way to encapsulate this new behaviour into a separate data structure from the original database.
In order to insert and update as fast as possible, I needed tables that only had primary keys: identity integers rather than uniqueidentifiers, since uniqueidentifiers lead to fragmentation of clustered indexes. In order to not only index the columns used in WHERE and JOIN conditions, but also encapsulate the behaviour in some other data structure, I decided to create shadow tables containing only the columns needed for querying. These tables I would then index in order to improve selects. The connection between the original insert table and the select table would be made via primary keys, and the synchronization between the two types of tables would be done in the background, whenever needed. Best of all, analysis of execution plans could automatically determine the columns needed for this system, create the required tables and indexes, then suggest improvements for the observed queries.
In order to test my theory I created the following tables (a DDL sketch for the first two follows the list):
OriginalTable - with a primary key identity ingeniously called Id and two other columns called Latitude and Longitude, containing random decimal(18,6) values
LocationIndexTable - with a refId integer primary key column that links to the OriginalTable Id - without being a foreign key - and two Latitude and Longitude columns that are indexed by the same non clustered index
LocationIndexTable2 - with a refId integer primary key column that links to the OriginalTable Id - without being a foreign key - and a locId integer column that links to another table called LocationTable and has its own non clustered index
LocationTable - with a primary key identity column and Latitude and Longitude columns. I know that this looks identical to LocationIndexTable, but this is the more complex case where there are multiple records with the same location. LocationTable holds the distinct Latitude and Longitude values
LocationIndexTable3 - with a refId integer column that links to the OriginalTable Id and is indexed by its own nonclustered index - without being a foreign key - and two Latitude and Longitude columns that are indexed by a clustered index
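For reference, here is a minimal DDL sketch of the first two tables, matching the descriptions above (the exact column types and the index name are my assumptions):
CREATE TABLE OriginalTable (
  Id INT IDENTITY(1,1) PRIMARY KEY, -- clustered by default
  Latitude DECIMAL(18,6),
  Longitude DECIMAL(18,6)
);
CREATE TABLE LocationIndexTable (
  RefId INT PRIMARY KEY, -- links to OriginalTable.Id, deliberately not a foreign key
  Latitude DECIMAL(18,6),
  Longitude DECIMAL(18,6)
);
CREATE NONCLUSTERED INDEX IX_LocationIndexTable_LatLong ON LocationIndexTable (Latitude, Longitude);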
With 77,179,072 rows in the original table, I ran the queries for each case, careful to clear the cache and memory buffers before each run. Here are the results:
SELECT count(1) FROM OriginalTable WHERE Latitude BETWEEN 45.5 AND 46 AND Longitude BETWEEN 8.5 AND 9 - no indexes whatsoever. Result: 30 seconds
SELECT count(1) FROM OriginalTable ot INNER JOIN LocationIndexTable lit ON lit.RefId=ot.Id WHERE lit.Latitude BETWEEN 45.5 AND 46 AND lit.Longitude BETWEEN 8.5 AND 9. Result: 17 seconds
SELECT count(1) FROM OriginalTable ot INNER JOIN LocationIndexTable2 lit2 ON lit2.RefId=ot.Id INNER JOIN LocationTable lt ON lit2.LocId=lt.Id WHERE lt.Latitude BETWEEN 45.5 AND 46 AND lt.Longitude BETWEEN 8.5 AND 9. Result: 41 seconds
SELECT count(1) FROM OriginalTable ot INNER JOIN LocationIndexTable3 lit ON lit.RefId=ot.Id WHERE lit.Latitude BETWEEN 45.5 AND 46 AND lit.Longitude BETWEEN 8.5 AND 9. Result: 22 seconds
Unexpectedly for me, the most comfortable solution also seems to be the fastest. One caveat is that I didn't have duplicate location data in the database, so there was no advantage in adding a new table to link locations to the original table. That means that in cases where the indexed data has many duplicates, it might be advantageous to experiment with a "distinct" table, although the indexing system should take this case into account as well. The clustered index being slower than the nonclustered one was a surprise. Also, the indexing only made the query twice as fast - I had expected more. Of course, all this benchmarking was done after clearing the cache and buffers with the commands DBCC FREEPROCCACHE; DBCC DROPCLEANBUFFERS. It is interesting to see how fast the queries go without this clearing: the first, unindexed query takes 3 or 4 seconds, while all the others take less than 2.
There is one more thing that needs to be addressed: moving these tables into another database. Are the indexes just as fast? They should be, but we must test cross-database communication. This would allow moving the entire system outside the target database, truly encapsulated, really adding value to a database that remains unaffected. My tests show the following results: 28, 18, 29 and 18 seconds, respectively. Yes, you read that right: the joins seem faster when the indexed tables are in another database on the same server! Just to make sure, I reran the original tests on the same database and got approximately the same results: 29, 19, 43 and 24 seconds, respectively. The only thing I didn't try (and I don't intend to) is creating primary-foreign key associations, as this means modifying the original tables, something I wish to avoid.
So, after all this I believe my idea has been validated. A lot more work has to be done in order to automate this system, but at least a no-effort manual solution is possible. Let's recap:
The speed of row inserts and updates remains unchanged: the only index is the original primary key on the identity integer column, which should exist in any table anyway.
The speed of selects is improved by creating tables that have an integer primary key that links to the original table, and only the columns used in queries, over which indexes are created.
Queries can be used to determine the columns needed to index. Even better, computed columns can be used instead of the original columns, which adds a little more performance.
Encapsulation is achieved by not only creating other tables for reading, but also moving these tables into another database, after which, unexpectedly, the performance is even better.
The end result is a system similar to indexing, but which takes space in another database and which is updated on demand, not when the system deems it necessary. Any SQL statements that worked before will continue to work unaffected, but faster alternatives for selects will become available. Overall, I am very satisfied, although I had expected better performance improvements than observed.
Just noting this down for future work. I wanted to replace GetDate() default column values with SysUtcDatetime(). This is the script used:
-- declare a string that will hold the actual SQL executed
DECLARE @SQL NVARCHAR(MAX) = ''
SELECT @SQL = @SQL +
  -- drop the default value constraint, then add another with SYSUTCDATETIME() as the default value
  N'ALTER TABLE [' + t.name + '] DROP CONSTRAINT [' + o.name + ']; ALTER TABLE [' + t.name + '] ADD DEFAULT SYSUTCDATETIME() FOR [' + c.name + ']; '
FROM sys.all_columns c -- get the name of the columns
INNER JOIN sys.tables t -- get the name of the tables containing the columns
  ON c.object_id = t.object_id
INNER JOIN sys.default_constraints o -- we are only interested in default value constraints
  ON c.default_object_id = o.object_id
WHERE o.definition = '(getdate())' -- only interested in the columns with getdate() as the default value
-- review the generated script with PRINT @SQL, then run it:
EXEC sp_executesql @SQL
Recently I created a framework for translating JSON requests from a REST API to entities sent to the database. For simple objects, it was very easy, just create an SQL parameter for each property. However, for complex objects - having other objects as properties - this was not a solution. So I used a DataContractSerializer to transform the object to XML, send it as an XML SQL parameter and get the values from it in the stored procedures. Then I noticed date time inconsistencies between the two approaches. What was going on?
Let's start with the code. The DateTime object created from the JSON is a date and time value with a timezone, like 16:00 UTC+1. That is 15:00 in universal time. Once you send it as a parameter to a stored procedure, the value received by the stored procedure is 16:00 (the server has the same timezone). In SQL Server, the DATETIME and DATETIME2 types don't store timezone information. However, when sent through XML, the value looks like this: 2015-03-09T16:00:00.0000000+01:00. Using SELECT [Time] = T.Item.value('@Time','DATETIME2') FROM @Xml.nodes('//Location/SDO') as T(Item), the value returned is 15:00! You only get 16:00+01 if you extract it as a DATETIMEOFFSET.
So let's recap: When you send a DateTime with timezone offset as an SQL parameter, the value reaching the SQL server is the local time. When you extract a textual value with timezone offset from an XML into a DATETIME, using the .value method, the value you get back is basically the Universal Time.
Solutions? Well, if you are going to use DateTime, you might as well accept that servers and devices are not always in the same timezone, so always translating values to universal time is a good idea. Another solution is to extract from the XML into a DATETIMEOFFSET type, which holds both the datetime and the timezone offset. Converting that value to DATETIME or DATETIME2 removes the timezone (warning: it does NOT give you the local time, unless you are in the same zone as the one in the DATETIMEOFFSET value).
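Here is a minimal sketch of that second solution; the XML shape mirrors the query above and the declared value is just illustrative:
DECLARE @Xml XML = '<Location><SDO Time="2015-03-09T16:00:00.0000000+01:00" /></Location>';
SELECT
  T.Item.value('@Time','DATETIMEOFFSET') AS WithOffset, -- 16:00 +01:00, nothing lost
  CONVERT(DATETIME2, T.Item.value('@Time','DATETIMEOFFSET')) AS OffsetDropped, -- 16:00, the offset is simply removed
  CONVERT(DATETIME2, SWITCHOFFSET(T.Item.value('@Time','DATETIMEOFFSET'), '+00:00')) AS AsUtc -- 15:00
FROM @Xml.nodes('//Location/SDO') as T(Item);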
I was trying to solve another problem: operations that create, alter or drop databases cannot be run inside a .Net transaction. Therefore I created a new TransactionScope inside the main one, using the TransactionScopeOption.Suppress option, but then I started getting this weird exception: The transaction associated with the current connection has completed but has not been disposed. The transaction must be disposed before the connection can be used to execute SQL statements. But I did Complete and Dispose all my transaction scopes, so what was going on? Long story short, this really confusing message hides a simple problem: your transaction probably timed out. Create the transaction scope with a large TimeSpan as the Timeout and it will get through. If you can, use a try/catch/finally block in which you Dispose the transaction (just remember that a using construct is basically a try/finally block). In my case, I was conditionally creating this new TransactionScope, so I wasn't using using. The transaction would fail and bleed into the outer transaction, where the confusing exception was being thrown.
I have another SQL quirk for you today, particular to Microsoft's SQL Server. It concerns the AVG function when the SUM of the averaged values would overflow their data type: AVG fails with the same error. Let's imagine for a moment that we didn't have an AVG function. In that case, our code would use SUM and COUNT to average values. For an integer column x, the naive SUM(x)/COUNT(x) would fail when the sum goes over the INT maximum value (I am not sure SUM should fail either, but that is another discussion). So our solution would be something like CONVERT(INT,SUM(CONVERT(BIGINT,x)))/COUNT(x). Obviously, this is also the solution for the AVG issue: just average the values converted to the biggest type available. So it seems that in their flagship SQL Server, Microsoft implemented the naive version, hmm... To be fair, this behaviour is documented on the T-SQL function page: If the sum exceeds the maximum value for the data type of the return value an error will be returned.
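Here is a minimal sketch of the workaround, on a hypothetical table BigTable with an INT column x:
-- fails with an arithmetic overflow once the internal sum passes the INT maximum:
-- SELECT AVG(x) FROM BigTable;
-- works, because the intermediate sum is computed as BIGINT:
SELECT AVG(CONVERT(BIGINT, x)) AS AvgX FROM BigTable;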
As I see it, it is a bug, as this should have been handled internally, and one is logged in Microsoft Connect here: AVG function causes Arithmetic overflow error. Read the funny comment from some Microsoft employee who says there is a solution proposed, but they don't have time to implement it in SQL Server 2008, so probably they will do that in a future version. I'm trying on SQL Server 2014; it's 7 years later, Microsoft!!
Just for kicks, I tried the same in MySQL and it works there. More than that, it doesn't fail for SUM, either, as it automatically returns a BIGINT value, pushing the issue to an eventual insert or update in an INT column.
I was trying to write a simple query in MySQL that only inserted a row if there wasn't one already there, or else updated the existing one. MySQL has a system to take care of that for you if you have a primary key or a unique index on the table you want to insert into or update. You can choose between INSERT ... ON DUPLICATE KEY UPDATE or REPLACE. However, what if you don't want to add an index on your table just for that?
T-SQL (Microsoft SQL Server) has a syntax that works like this:
IF EXISTS(SELECT * FROM MyTable WHERE y=2)
  UPDATE MyTable SET x=x+1 WHERE y=2;
ELSE
  INSERT INTO MyTable (x,y) VALUES (1,2);
MySQL also has an IF...THEN syntax as well as an EXISTS statement. The problem is that they work in a completely different way. IF has an extra THEN keyword, uses ELSEIF instead of ELSE IF and needs to end with END IF. EXISTS works only in WHERE clauses. So, let's translate the query above into MySQL and hope for the best:
INSERT INTO MyTable (x,y)
  SELECT 0,2 FROM DUAL WHERE NOT EXISTS(SELECT * FROM MyTable WHERE y=2);
UPDATE MyTable SET x=x+1 WHERE y=2;
Notice several things: I am not using any IF and I am updating rows all the time after the conditional insert. I am selecting values from DUAL because any select with a where clause needs a table in MySQL and DUAL is a dummy one built into the engine. (SELECT 1 WHERE 2=2; is not valid in MySQL). Also, I am inserting a value, then updating it, which is not the most efficient.
There is another way, closer to the original T-SQL query, but it doesn't use EXISTS. It looks like this:
IF (SELECT 1=1 FROM MyTable WHERE y=2) THEN
  UPDATE MyTable SET x=x+1 WHERE y=2;
ELSE
  INSERT INTO MyTable (x,y) VALUES (1,2);
END IF;
Things to notice here: I am using 1=1 to return a TRUE value that makes the IF work. Also, PAY ATTENTION!, this doesn't work as a simple query. I spent the better part of an hour trying to find the syntax error while executing this directly. Flow control statements like IF or WHILE are only valid in "programs", the MySQL term for stored procedures and functions.
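So, as a minimal sketch (the procedure name is my own), the IF version has to live inside a stored procedure:
DELIMITER //
CREATE PROCEDURE UpsertMyTable()
BEGIN
  IF (SELECT 1=1 FROM MyTable WHERE y=2) THEN
    UPDATE MyTable SET x=x+1 WHERE y=2;
  ELSE
    INSERT INTO MyTable (x,y) VALUES (1,2);
  END IF;
END //
DELIMITER ;
CALL UpsertMyTable();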
The algorithm works perfectly well and is better than Sift3; however, it's slightly more complex. You might want to start with Sift3 in order to understand where it came from.
Update November 8 2022: I found a bug in the algorithm relating to maxDistance and I've updated the code. If you didn't use maxDistance, you are unaffected. Basically, the fix is to compare temporaryDistance>maxDistance (before it was >=) and to move the calculation of the temporary distance to after c1 and c2 are updated to their minimum value when a token was not found (otherwise the temporary distance might become larger than the final distance).
Update 28 Mar 2015: I've changed the algorithm significantly. The transpositions are now computed differently and the cost of a transposition in the final result is 1, rather than 0.5. Also, while I think a value of 1 is better conceptually, I noticed that Sift4 approximates Levenshtein a little better when the cost of a transposition is either 2 or a function depending on the offset difference between c2 and c1, especially when maxOffset grows. This can be now changed via the new options function transpositionCostEvaluator. The problem I am having now is more false positives when the letters/tokens of the two strings are the same, but their positions are jumbled differently. With small maxOffset values, like 5 or 10, the result is much better than Sift3, however when maxOffset grows, lots of matches can be found and the cost of transpositions becomes very important.
Update 27 Mar 2015: Thanks to Emanuele Bastianelli who discovered a bug that appeared in an edge case, I've updated the algorithms. Now, at the end of the while loop there is an extra check to prevent the algorithm exiting prematurely, before computing remaining tokens.
Intro
A really long time ago I wrote the third version of Sift, the string distance algorithm. It so happens that I am going to give a small presentation, here in Ispra, about this algorithm, so I had the opportunity to review it. I found some inconsistencies and I actually did some research in the field that gave me more ideas. So before giving the presentation I thought of publishing what I think is the fourth version. What's new:
33% more accurate
three different variants: simple, common and general
new concepts added
support for custom value and matching functions, different tokenizer functions, etc.
actually tested with a (slightly more) serious test
more robust, working better for large values of maxOffset
Before I get into the details, I am publishing the algorithm here for the moment, no Codeplex or PasteBin or GitHub or whatever. Also, it is written in Javascript for now, with the C# and T-SQL versions pending. Of course, it would be great if, as before, the community of people using the algorithm implemented it in various programming languages; however, I am a bit apprehensive, because more often than not people came up with their own improvements or interpretations when translating the algorithm into another language. But support is always welcome!
New concepts in Sift4
I created a test that used random strings, but also a huge list of commonly used English phrases, as well as mutations on these strings, adding or removing small bits and so on. I then implemented Sift3, Levenshtein and the new algorithm and computed the error distance between the Levenshtein distance and the two Sift variants. This let me see how the error evolves when changing the algorithm and its parameters. One thing I noticed is that when increasing the maxOffset value to large values like 15 or 20, the accuracy of Sift3 was going down. Also, as pointed out by one commenter on the Sift3 post, there are cases when Sift3(a,b) is different from Sift3(b,a). There are other edge cases, but this one in particular grated on me.
After implementing Sift4, I can now tell you that the simple version is slightly better than Sift3 for small maxOffset values like 5, and it gets better as the value increases. The common version is a bit more complex, but the error decreases by 33% and stays low for large maxOffset values. The extended or general version receives an options object that can change almost everything, but most important is the tokenizer function. Imagine that you want to compute the distance based not on letters, but on n-grams (groups of n characters). Or that you want to compare strings by the words in the text, maybe even by their synonyms. This can all be achieved just by changing the tokenizer function. The other parameters define what it means for two tokens to match, what the value of their match is, etc.
One of the new concepts implemented is taken from the Jaro distance. Jaro seems a lot like Sift in the way that it considers two characters to match if they are in close proximity. Also, if "the streams cross", like 'ab' vs 'ba', one considers them transpositions and removes some of their value from the distance. Actually, if I look at the implementation, it might be that I have independently discovered the Jaro distance. I will research this further. I don't know if the transposition calculation is the most optimal. At the moment it uses an array of all matches found until a point, clearing it of values as the cursors move along the string. The difference between the simple and the common versions of Sift4 is that the simple version is not computing the transpositions at all and has no concept of maxDistance. In that respect it is a slightly fixed up Sift3.
Another new concept added is the one of local substring. Imagine that the Largest Common Subsequence that Sift is actually trying to find in order to determine the distance is made of substrings, separated by non matching characters. Each of these substrings can be used to improve the distance function. For example one could argue that 'abcdex' is closer to 'abcde' than 'abcxde', because even if the largest common subsequence is 5, the largest common substring is 5 for the first string and only 3 for the second. The extended version of the algorithm allows for changing the value of each substring individually.
Well, here they are, the three versions. The extended version has some examples at the end for possible parameters.
The code
Simplest Sift4:
// Sift4 - simplest version
// online algorithm to compute the distance between two strings in O(n)
// maxOffset is the number of characters to search for matching letters
function sift4(s1, s2, maxOffset) {
if (!s1 || !s1.length) {
if (!s2) {
return 0;
}
return s2.length;
}
if (!s2 || !s2.length) {
return s1.length;
}
var l1 = s1.length;
var l2 = s2.length;
var c1 = 0; //cursor for string 1
var c2 = 0; //cursor for string 2
var lcss = 0; //largest common subsequence
var local_cs = 0; //local common substring
while ((c1 < l1) && (c2 < l2)) {
if (s1.charAt(c1) == s2.charAt(c2)) {
local_cs++;
} else {
lcss += local_cs;
local_cs = 0;
if (c1 != c2) {
c1 = c2 = Math.max(c1, c2); //using max to bypass the need to compute transpositions ('ab' vs 'ba')
}
for (var i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++) {
if ((c1 + i < l1) && (s1.charAt(c1 + i) == s2.charAt(c2))) {
c1 += i;
local_cs++;
break;
}
if ((c2 + i < l2) && (s1.charAt(c1) == s2.charAt(c2 + i))) {
c2 += i;
local_cs++;
break;
}
}
}
c1++;
c2++;
}
lcss += local_cs;
return Math.round(Math.max(l1, l2) - lcss);
}
Common Sift4:
// Sift4 - common version
// online algorithm to compute the distance between two strings in O(n)
// maxOffset is the number of characters to search for matching letters
// maxDistance is the distance at which the algorithm should stop computing the value and just exit (the strings are too different anyway)
function sift4(s1, s2, maxOffset, maxDistance) {
if (!s1 || !s1.length) {
if (!s2) {
return 0;
}
return s2.length;
}
if (!s2 || !s2.length) {
return s1.length;
}
var l1 = s1.length;
var l2 = s2.length;
var c1 = 0; //cursor for string 1
var c2 = 0; //cursor for string 2
var lcss = 0; //largest common subsequence
var local_cs = 0; //local common substring
var trans = 0; //number of transpositions ('ab' vs 'ba')
var offset_arr = []; //offset pair array, for computing the transpositions
while ((c1 < l1) && (c2 < l2)) {
if (s1.charAt(c1) == s2.charAt(c2)) {
local_cs++;
var isTrans = false;
//see if current match is a transposition
var i = 0;
while (i < offset_arr.length) {
var ofs = offset_arr[i];
if (c1 <= ofs.c1 || c2 <= ofs.c2) {
// when two matches cross, the one considered a transposition is the one with the largest difference in offsets
isTrans = Math.abs(c2 - c1) >= Math.abs(ofs.c2 - ofs.c1);
if (isTrans) {
trans++;
} else {
if (!ofs.trans) {
ofs.trans = true;
trans++;
}
}
break;
} else {
if (c1 > ofs.c2 && c2 > ofs.c1) {
offset_arr.splice(i, 1);
} else {
i++;
}
}
}
offset_arr.push({
c1: c1,
c2: c2,
trans: isTrans
});
} else {
lcss += local_cs;
local_cs = 0;
if (c1 != c2) {
c1 = c2 = Math.min(c1, c2); //using min allows the computation of transpositions
}
if (maxDistance) {
var temporaryDistance = Math.max(c1, c2) - lcss + trans;
if (temporaryDistance > maxDistance)
return temporaryDistance;
}
//if matching characters are found, remove 1 from both cursors (they get incremented at the end of the loop)
//so that we can have only one code block handling matches
for (var i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++) {
if ((c1 + i < l1) && (s1.charAt(c1 + i) == s2.charAt(c2))) {
c1 += i - 1;
c2--;
break;
}
if ((c2 + i < l2) && (s1.charAt(c1) == s2.charAt(c2 + i))) {
c1--;
c2 += i - 1;
break;
}
}
}
c1++;
c2++;
// this covers the case where the last match is on the last token in list, so that it can compute transpositions correctly
if ((c1 >= l1) || (c2 >= l2)) {
lcss += local_cs;
local_cs = 0;
c1 = c2 = Math.min(c1, c2);
}
}
lcss += local_cs;
return Math.max(l1, l2) - lcss + trans; //add the cost of transpositions to the final result
}
Extended/General Sift4:
// Sift4 - extended version
// online algorithm to compute the distance between two strings in O(n)
// maxOffset is the number of positions to search for matching tokens
// options: the options for the function, allowing for customization of the scope and algorithm:
// maxDistance: the distance at which the algorithm should stop computing the value and just exit (the strings are too different anyway)
// tokenizer: a function to transform strings into vectors of tokens
// tokenMatcher: a function to determine if two tokens are matching (equal)
// matchingEvaluator: a function to determine the way a token match should be added to the local_cs. For example a fuzzy match could be implemented.
// localLengthEvaluator: a function to determine the way the local_cs value is added to the lcss. For example longer continuous substrings could be awarded.
// transpositionCostEvaluator: a function to determine the value of an individual transposition. For example longer transpositions should have a higher cost.
// transpositionsEvaluator: a function to determine the way the total cost of transpositions affects the final result
// the options can and should be implemented at a class level, but this is the demo algorithm
function sift4(s1, s2, maxOffset, options) {
options = extend(options, {
maxDistance: null,
tokenizer: function (s) {
return s ? s.split('') : [];
},
tokenMatcher: function (t1, t2) {
return t1 == t2;
},
matchingEvaluator: function (t1, t2) {
return 1;
},
localLengthEvaluator: function (local_cs) {
return local_cs;
},
transpositionCostEvaluator: function (c1, c2) {
return 1;
},
transpositionsEvaluator: function (lcss, trans) {
return lcss - trans;
}
});
var t1 = options.tokenizer(s1);
var t2 = options.tokenizer(s2);
var l1 = t1.length;
var l2 = t2.length;
if (l1 == 0)
return l2;
if (l2 == 0)
return l1;
var c1 = 0; //cursor for string 1
var c2 = 0; //cursor for string 2
var lcss = 0; //largest common subsequence
var local_cs = 0; //local common substring
var trans = 0; //number of transpositions ('ab' vs 'ba')
var offset_arr = []; //offset pair array, for computing the transpositions
while ((c1 < l1) && (c2 < l2)) {
if (options.tokenMatcher(t1[c1], t2[c2])) {
local_cs += options.matchingEvaluator(t1[c1], t2[c2]);
var isTrans = false;
//see if current match is a transposition
var i = 0;
while (i < offset_arr.length) {
var ofs = offset_arr[i];
if (c1 <= ofs.c1 || c2 <= ofs.c2) {
// when two matches cross, the one considered a transposition is the one with the largest difference in offsets
isTrans = Math.abs(c2 - c1) >= Math.abs(ofs.c2 - ofs.c1);
if (isTrans) {
trans += options.transpositionCostEvaluator(c1, c2);
} else {
if (!ofs.trans) {
ofs.trans = true;
trans += options.transpositionCostEvaluator(ofs.c1, ofs.c2);
}
}
break;
} else {
if (c1 > ofs.c2 && c2 > ofs.c1) {
offset_arr.splice(i, 1);
} else {
i++;
}
}
}
offset_arr.push({
c1: c1,
c2: c2,
trans: isTrans
});
} else {
lcss += options.localLengthEvaluator(local_cs);
local_cs = 0;
if (c1 != c2) {
c1 = c2 = Math.min(c1, c2); //using min allows the computation of transpositions
}
if (options.maxDistance) {
var temporaryDistance = options.localLengthEvaluator(Math.max(c1, c2)) - options.transpositionsEvaluator(lcss, trans);
if (temporaryDistance > options.maxDistance)
return Math.round(temporaryDistance);
}
//if matching tokens are found, remove 1 from both cursors (they get incremented at the end of the loop)
//so that we can have only one code block handling matches
for (var i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++) {
if ((c1 + i < l1) && options.tokenMatcher(t1[c1 + i], t2[c2])) {
c1 += i - 1;
c2--;
break;
}
if ((c2 + i < l2) && options.tokenMatcher(t1[c1], t2[c2 + i])) {
c1--;
c2 += i - 1;
break;
}
}
}
c1++;
c2++;
// this covers the case where the last match is on the last token in list, so that it can compute transpositions correctly
if ((c1 >= l1) || (c2 >= l2)) {
lcss += options.localLengthEvaluator(local_cs);
local_cs = 0;
c1 = c2 = Math.min(c1, c2);
}
}
lcss += options.localLengthEvaluator(local_cs);
return Math.round(options.localLengthEvaluator(Math.max(l1, l2)) - options.transpositionsEvaluator(lcss, trans)); //add the cost of found transpositions
}
function extend(obj, def) {
var result = {};
for (var prop in def) {
if (!obj || !obj.hasOwnProperty(prop)) {
result[prop] = def[prop];
} else {
result[prop] = obj[prop];
}
}
return result;
}
// possible values for the options
// tokenizers:
function nGramTokenizer(s, n) { //tokenizer:function(s) { return nGramTokenizer(s,2); }
var result = [];
if (!s)
return result;
for (var i = 0; i <= s.length - n; i++) {
result.push(s.substr(i, n));
}
return result;
}
function wordSplitTokenizer(s) { //tokenizer:wordSplitTokenizer
if (!s)
return [];
return s.split(/\s+/);
}
function characterFrequencyTokenizer(s) { //tokenizer:characterFrequencyTokenizer (letters only)
var result = [];
for (var i = 0; i <= 25; i++) {
var val = 0;
if (s) {
for (var j = 0; j < s.length; j++) {
var code = s.charCodeAt(j);
if (code == i + 65 || code == i + 97)
val++;
}
}
result.push(val);
}
return result;
}
//tokenMatchers:
function sift4TokenMatcher(t1, t2) { //tokenMatcher:sift4TokenMatcher
var similarity = 1 - sift4(t1, t2, 5) / Math.max(t1.length, t2.length);
return similarity > 0.7;
}
//matchingEvaluators:
function sift4MatchingEvaluator(t1, t2) { //matchingEvaluator:sift4MatchingEvaluator
var similarity = 1 - sift4(t1, t2, 5) / Math.max(t1.length, t2.length);
return similarity;
}
//localLengthEvaluators:
function rewardLengthEvaluator(l) {
if (l < 1)
return l; //0 -> 0
return l - 1 / (l + 1); //1 -> 0.5, 2-> 0.66, 9 -> 0.9
}
function rewardLengthEvaluator2(l) {
return Math.pow(l, 1.5); // 0 -> 0, 1 -> 1, 2 -> 2.83, 10 -> 31.62
}
//transpositionCostEvaluators:
function longerTranspositionsAreMoreCostly(c1, c2) {
return Math.abs(c2 - c1) / 9 + 1;
}
As always, I will be most happy to know if you used my algorithm and how it performed, as well as receive any suggestion that you might have.
Options explained
Here is some explanation for the options of the general algorithm.
It no longer searches for characters, but for tokens. That is why the default tokenizer function splits the values into characters, so that the algorithm works on an array of one-character-long tokens. Other options are possible, like splitting the strings by whitespace so that the comparisons are done on words, or transforming a string into an array of strings N characters long, the so-called N-grams. The tokenizer can be anything, like the characterFrequencyTokenizer, which turns each word into an array of 26 values representing how many times each letter a-z appears in the word.
The tokenMatcher function returns true if two tokens match. They can be fuzzy matched; for example, the sift4TokenMatcher example function uses Sift inside Sift to determine the character distance between two tokens and returns true if they match more than 70%.
The matchingEvaluator is a function that returns the value that will be added to the "common substring" length value when two tokens match. The default is 1, but one can use some other metric, like the similarity, for example. Of course, the common substring length has lost its meaning when these functions change, but the variable local_cs is still used.
The localLengthEvaluator takes the length value of the local common substring and returns a value that will be added to the longest common subsequence value. Usually it returns the same value it is given, but some functions could reward longer substrings.
FAQ
Q: Can you make Sift4 work case-insensitively? A: Just turn the strings to lower or upper case before you compare them. Since this algorithm is more general, the concept of 'case' might not apply. Or implement a case-insensitive tokenMatcher.
Q: Can you make Sift4 compare strings based on their meaning, like using synonyms? A: Use a tokenizer function that splits the strings into words, then replaces them with the most common of their synonyms. A more complex solution would be to analyze the strings beforehand and turn them into some ordered list of synonyms or equivalent expressions, then use Sift4 with a word tokenizer (one is provided in the Extended algorithm source).
Q: I need an implementation for this programming language, can you help? A: I can, but I might not have the time. Ask anyway, maybe I can be persuaded :)
Q: I have been using Sift3 until now, how do I upgrade to Sift4? A: The best way I can think of is to implement Sift4 Simplest, as it needs only the Sift3 code and some minor changes. Since you never needed tokens before, I doubt you need them now. But if you do, I can help, see the above question.
Q: How can I reward you for this fantastic piece of software engineering? A: While I did this for free and I don't expect to make any money out of it and while this algorithm is completely free to use and change as you see fit, I don't mind having a beer every now and then ;)
Q: Your algorithm really sucks because... reasons. A: It may. I would be glad to discuss the reasons, though, and try to fix any problem you encounter.
Q: I compared Sift4 with another algorithm that is much more exact and there are differences. A: Of course; they are different algorithms. This is a fuzzy distance calculator, it doesn't give you the exact value. There are still edge cases. But the idea of Sift is to be fast and relatively accurate, rather than very accurate. If you need more accuracy, try combining Sift with Levenshtein, for example computing Levenshtein only where Sift says the strings are above a certain similarity.
Q: I want to make maxOffset dependent on the length of the strings compared. Can you do that? A: That is a perfect example why maxOffset should be a parameter of the function rather than a member of the class. Since this implementation is so far Javascript only, just compute the maxOffset that is convenient to you before you compare.
Q: I want to vary the weight of matches based on the position of the match, for example matches at the beginning of the string could be more valuable than those at the end. A: The position of the match is indeed not sent to the functions that can be specified in the options object of the Sift4 Extended, but that can be trivially changed in the code. I don't think this particular request is very common, though, and I prefer to keep it out of the published implementation to make the code easier to understand.
Q: I found a bug! A: Let me know it and I will try and fix it.
Q: If you need to compare large lists of strings, it is better to precompute some things, like specific hashes or suffix trees, etc. This will speed up the comparison tremendously! A: Sift is what is called an online algorithm. It does not precompute anything, it just gets the two strings and the parameters for its functioning and returns the distance. You are correct in what you are saying, but that kind of solution is not in the scope of Sift, at least not version 4.
Q: What are the edge cases for Sift? A: Probably there are several, but I didn't really spot them. One of them is that both letters at a position might match letters at other positions, but only one of the matches will count. Example: 'abxx' and 'bayy'. The algorithm will look at position 0, find no match, then try to find the closest match for each letter. Starting with position 0 in the first string, it will find 'a' matched at position 1 in the second. It will increase both cursors and lcss will be increased as well. The next check will be 'b', the character at position 1 in the first string, against position 2 in the second string. No match, therefore both cursors will be reset to 1 and the search starts again. The 'b' match is lost and the distance is 3 instead of 2. Also, I think there might be some situations where the cursors are not equal and the bigger of them reaches the end of its string, thus terminating the algorithm, although there could have been more matches. Incidentally, I tried to fix both these issues and the error from Levenshtein was not really affected, but I am not 100% sure of the implementation.
Q: The algorithm continues to be asymmetric, Sift4(s1,s2) can be different from Sift4(s2,s1). A: Yes. This is one of the artifacts of the linear nature of the algorithm. There is a function that is symmetric and that is Math.min(Sift4(a,b),Sift4(b,a)), however it is twice as slow, obviously.
Implementations in other languages
You can find a Go implementation here, written by Jason W. Hutchinson. There is also a Swift implementation here. A Perl 6 (now called Raku) implementation can be found here.
You know how many times you need to get the results of a query as a list of strings, say a CSV, and there is nothing in Microsoft SQL Server to help you? What you would like is an aggregate function, similar to SUM, that returns the concatenated value of all strings in the SELECT query.
From SQL Server 2017 on, there is such a function called STRING_AGG and you use it just like one might expect.
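For example, using the same sample values as the pre-2017 solution below, something like this should do:
DECLARE @T TABLE(S NVARCHAR(MAX))
INSERT INTO @T VALUES('Filet'),('T-bone'),('Sausage'),('Würstel')
SELECT STRING_AGG(S, ', ') FROM @T -- Filet, T-bone, Sausage, Würstel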
If you have an older version of SQL Server... upgrade :) But if you can't, here is a solution:
Before SQL Server 2008 that was indeed the case: you needed to use variables and SELECT @S=@S+[Value] FROM .... But SQL Server 2008 added more XML support and thus the data() XML PATH method. Take note that this method adds a space between atomic values. So, without further ado, here is the code to concatenate several values into a comma separated string:
DECLARE @T TABLE(S NVARCHAR(MAX))
INSERT INTO @T VALUES('Filet'),('T-bone'),('Sausage'),('Würstel') -- enough with the fruity examples!
SELECT CONVERT(NVARCHAR(MAX),(SELECT S+',' AS 'data()' FROM @T t FOR XML PATH('')))
Result: Filet, T-bone, Sausage, Würstel, - of course, remove the last comma.
Update: What if I have a table with multiple columns and I just want to aggregate one?
The solution is to use the select with WHERE clauses on the GROUP BY columns. Here is an example:
SELECT col1, col2,
  CONVERT(NVARCHAR(MAX),(
    SELECT value+',' AS 'data()'
    FROM @T t2
    WHERE t1.col1 = t2.col1 AND t1.col2 = t2.col2
    FOR XML PATH('')))
FROM @T t1
GROUP BY col1, col2
I have been working on a REST API lately and, while using Entity Framework or some other similar framework to abstract the database is certainly possible, I wanted to control every aspect of the implementation. I know, reinventing wheels, but this is how one learns. One of the most annoying bits was trying to translate some complex object from JSON to the (two dimensional) database relational tables. This post will explore my attempts and the solutions I have found.
My first attempt was straightforward: just send all types as DataTables, with some extra property to define identity and parent entity. This relies on the Microsoft SQL Server mechanism that allows sending table variables to stored procedures. But this approach has several downsides. One of them is that in order to send a DataTable to SQL you need... DataTables. As I have pointed out in several blog posts, the DataTable object is slow, and sometimes downright buggy. Even if I didn't care about performance that much, in order for SQL to receive the content of the DataTable one must create corresponding User Defined Types on the database side. Working with UDTs is very difficult for several reasons: you cannot alter a UDT (unless employing some SQL voodoo that changes system tables), you can only drop it and recreate it. This does not work if you use the UDT anywhere, so a lot of renaming needs to be done. Even if you automate the process, it's still very annoying. Then the UDT definition has to be an exact duplicate of the DataTable definition. Move some columns around and it fails. Debugging is also made difficult by the fact that the SQL profiler does not see the content of table variables when sending them to the server.
Long story short, I was looking for alternatives and I found XML. Now, you might think that this leads to a simple, maybe even obvious, solution. But it's not that easy. Imagine that you send a list of objects in an XML. Each object is represented by an XML element and each property by a child element. In order to get the value of a property you need to iterate through all the nodes, for each node find the properties, for each property find the one element that defines it, then get the attribute value or content of the property, all while making sure you select everything into a table. It's not that easy.
The solution I found, which simplifies the SQL code (and hopefully brings some much needed performance to the table), is to serialize the objects in a way that makes selection simple enough. Here is an example: I have a Configuration object with an Id and a Name that also has a property called Servers, containing Server objects having an Id and a Url. Here is the SQL needed to read the values from XML produced by the DataContractSerializer:
;WITH XMLNAMESPACES(DEFAULT 'http://schemas.datacontract.org/2004/07/SerializationTest')
SELECT
  T.Item.value('(Id/text())[1]','INT') as Id,
  T.Item.value('(Name/text())[1]','NVARCHAR(100)') as Name
FROM @Xml.nodes('//Configuration') as T(Item)
;WITH XMLNAMESPACES(DEFAULT 'http://schemas.datacontract.org/2004/07/SerializationTest')
SELECT
  T.Item.value('(Id/text())[1]','INT') as Id,
  T.Item.value('(Url/text())[1]','NVARCHAR(100)') as Url
FROM @Xml.nodes('//Configuration/Servers/Server') as T(Item)
This works, but look at that code. In my case, the situation was worse: the object I was using was a wrapper which implemented IDictionary<string,object> and, even though it did implement ISerializable, both XmlSerializer and DataContractSerializer use the dictionary as their data, so in the end I get ugly key elements and value elements that are even harder to get to and, I suppose, more inefficient to parse. Therefore I found the solution in IXmlSerializable, (yet) another serialization interface, used exclusively by the XML serializer classes. If every simple value were saved as an attribute and every complex object as an element, then this could be the SQL code:
;WITH XMLNAMESPACES(DEFAULT 'http://schemas.datacontract.org/2004/07/SerializationTest')
SELECT
  T.Item.value('@Id','INT') as Id,
  T.Item.value('@Name','NVARCHAR(100)') as Name
FROM @Xml.nodes('//Configuration/Config') as T(Item)
;WITH XMLNAMESPACES(DEFAULT 'http://schemas.datacontract.org/2004/07/SerializationTest')
SELECT
  T.Item.value('@Id','INT') as Id,
  T.Item.value('@Url','NVARCHAR(100)') as Url
FROM @Xml.nodes('//Configuration/Config/Servers/List/Server') as T(Item)
Much easier to read and hopefully to parse.
I am not going to write here about the actual implementation of IXmlSerializable. There are plenty of tutorials on the Internet about that. It's not pretty, :) but not too difficult, either.
What was the purpose of this exercise? Now I can send a complex object to SQL in a single query, making inserts and updates simple and not requiring a call for each instance of each type of complex object. Now, is it fast? I have no idea. Certainly, if performance is needed, the UDT/DataTable approach is probably faster. However, you would have to define a type for each type that you send as a DataTable to a stored procedure. An alternative could be a binary serializer and a CLR SQL function that translates it into tables. However, in my project I need to easily implement very custom API methods and to control every aspect, including tracing and profiling the various SQL calls. I believe the customized IXmlSerializable/XML in SQL approach is a reasonable one.
I seem to remember that I blogged about this before, but I can't find the post anymore; probably it was just an intention I never followed through on. This is simply a warning on how T-SQL converts FLOAT values to string. Here are some Transact-SQL queries and their results:
DECLARE @aFloat FLOAT = 1234.123456789 -- assumed declarations; the originals did not survive
DECLARE @aDecimal DECIMAL(18,9) = 1234.123456789
DECLARE @aNumeric NUMERIC(18,9) = 1234.123456789
DECLARE @aString NVARCHAR(20) = '1234.123456789'

SELECT @aFloat,@aDecimal,@aNumeric,@aString
-- result: 1234.123456789 1234.123456789 1234.123456789 1234.123456789
SELECT CAST(@aFloat as NVARCHAR(20)),CAST(@aDecimal as NVARCHAR(20)),CAST(@aNumeric as NVARCHAR(20)),CAST(@aString as NVARCHAR(20))
-- result: 1234.12 1234.123456789 1234.123456789 1234.123456789
Wait! What happened there? The FLOAT was the only numeric format that lost precision, keeping only 2 decimals (it is actually a loss of significant digits: 12345.123456789 would be converted to 12345.1). The solution is either to convert to DECIMAL or NUMERIC before converting to NVARCHAR, or to use the STR function, which receives the total length and the number of decimals as parameters. Like this:
SELECT CAST(@aFloat as NVARCHAR(20)),
       CAST(CAST(@aFloat as DECIMAL(18,9)) as NVARCHAR(20)),
       STR(@aFloat,18,9)
-- result: 1234.12 1234.123456789 1234.123456789
The first conversion to DECIMAL and the STR function later on are equivalent.
I have looked for SQL options to somehow set the default precision used when converting a float to string, but I could not find anything useful. Neither did settings like
SET ARITHIGNORE OFF
SET NUMERIC_ROUNDABORT ON
SET ARITHABORT ON
have any effect on the queries above. No error and the same result.
You don't want to know what happens with 123456789.123456789!
SELECT @aFloat,
       CAST(@aFloat as NVARCHAR(20)),
       CAST(CAST(@aFloat as DECIMAL(30,10)) as NVARCHAR(30)),
       STR(@aFloat,30,10)
-- result: 123456789.123457 1.23457e+008 123456789.1234567900 123456789.1234567900
Not only are the digits cut off even when selecting the actual value, but the scientific notation rears its ugly head in the string conversion. And look at the beautiful STR function returning ugly extra zeroes! The same issue appears when trying to use the XML functions: the resulting XML has really ugly strings for the float values.
Bottom line: as much as I hate it, you probably should not use FLOAT when trying to display values. Ever.
Update 2015 August 28: I've replaced the function master.sys.fn_varbintohexstr with CONVERT, with the extra parameter 2, which translates a binary field into a hexadecimal string with no leading 0x. In addition to being ugly to use, fn_varbintohexstr is very slow.
Sometimes you need to create a unique identifier for a bunch of values so that you can use it as an ID in the database. The immediately obvious choice is the CHECKSUM and BINARY_CHECKSUM functions. But beware: the purpose of these functions is to detect changes in a string, not to uniquely identify it. It might seem strange, but the two concepts are very different. The change-detection functionality only needs to generate very different values on small changes. Uniqueness requires a value as distinctive as possible for any string. That is why, when you use a checksum, you will get a lot of colliding values for (very) different strings.
Enter HASHBYTES, another function, whose purpose is to create a cryptographic hash of a string. It is mainly used for password hashing, but it fits our purpose nicely. There are some caveats, though. First, CHECKSUM takes a variable number of parameters while HASHBYTES only accepts one, so we must take care of the cumbersome concatenation of multiple values ourselves. Unfortunately SQL functions do not support a variable number of parameters, which is truly a shame, so we can't hack around it. Also, the value that HASHBYTES returns is a VARBINARY. We could cast it to NVARCHAR, but it turns into a weird string of Chinese characters. In order to turn it into a proper string, we need to use the CONVERT function with a style parameter of 2 (hex string without the leading 0x).
So let's compare the two usages. Suppose we have this nice table that contains company data: company name, contact first name, contact last name, phone, email, yearly value. We need to create a unique ID based on these values. First CHECKSUM:
SELECT CHECKSUM(companyName, firstName, lastName, phone, email, yearlyValue) FROM OurTable
So easy! Just add the columns, no matter how many or what type they have, and get a value as a result. You can even use * to select all columns in a row. You also have the advantage of getting the same checksum for differently capitalized strings. If you don't want this behaviour, use BINARY_CHECKSUM, which works even better.
Second HASHBYTES:
SELECT CONVERT(VARCHAR(MAX),HASHBYTES('SHA1',companyName+'|'+firstName+'|'+lastName+'|'+phone+'|'+email+'|'+CAST(yearlyValue as NVARCHAR(100))),2) as id, *
FROM OurTable
Ugly! You need to create a string from different types, using ugly casts. Also, this works more like BINARY_CHECKSUM. If you want to get the same case-insensitive functionality as CHECKSUM you need to use LOWER(LTRIM(RTRIM(value))). Horrid! However, it works.
WARNING: using CAST to NVARCHAR from a FLOAT loses precision. You should use STR instead!
A middle solution is to use XCHECKSUM. What is that, you ask? A placeholder that can be replaced with some regular expression search and replace, of course :)
Update: I've created a query that generates the script to update the value of a column called 'ValuesHash', for tables that have it, with the hash of all columns that are not in a list of excluded names, are not primary or foreign keys, and are not computed, rowguidcol or filestream columns. Imagine the scenario where you have something like this:
Table A:
Id: primary identity key
Data1: some data
Data2: some data
CreateTime: the creation time
ValuesHash: a VARBINARY(50) column - only 20 are required normally, but let's make sure :)
Table B:
Id: primary identity key
AId: foreign key to A
Data1: some data
Data2: some data
ModifyTime: the modification time
ValuesHash: a VARBINARY(50) column - only 20 are required normally, but let's make sure :)
Table C:
Id: primary identity key
AId: foreign key to A
Data1: some data
Data2: some data
The query below will update ValuesHash for A and B (because C doesn't have the ValuesHash column) with a hash constructed from the Data columns. The Id columns will be ignored for being primary keys (and for being in the list of columns to ignore), the AId columns will be ignored for being foreign keys, and ValuesHash, CreateTime and ModifyTime will be ignored for being in the list of custom columns.
WARNING: each column's value is truncated to 4000 characters, and the concatenated string is then truncated again to 4000 characters before running HASHBYTES (which only accepts a maximum of 8000 bytes). This hash will help in determining unique records, but it is not 100% reliable.
SELECT * FROM (
  SELECT t.name,
    'UPDATE [' + t.name + '] SET ValuesHash = HASHBYTES(''SHA1'',SUBSTRING(' +
    STUFF(
      (SELECT '+ ''|''+ ISNULL(' +
         CASE
           WHEN tp.name IN ('float', 'real') THEN 'STR(' + c.name + ',30,30)'
           WHEN tp.name IN ('binary', 'varbinary') THEN 'CONVERT(NVARCHAR(4000),' + c.name + ',2)'
           ELSE 'CONVERT(NVARCHAR(4000),' + c.name + ')'
         END + ','''')'
       FROM sys.all_columns c
       INNER JOIN sys.types tp
         ON c.system_type_id = tp.system_type_id AND c.user_type_id = tp.user_type_id
       LEFT JOIN sys.index_columns ic
         ON ic.object_id = t.object_id AND ic.column_id = c.column_id
       LEFT JOIN sys.indexes i
         ON ic.object_id = i.object_id AND ic.index_id = i.index_id
       LEFT JOIN sys.foreign_key_columns fc
         ON fc.parent_object_id = t.object_id AND c.column_id = fc.parent_column_id
       WHERE t.object_id = c.object_id
         AND ISNULL(c.is_identity, 0) = 0
         AND ISNULL(c.is_computed, 0) = 0
         AND ISNULL(c.is_filestream, 0) = 0
         AND ISNULL(c.is_rowguidcol, 0) = 0
         AND ISNULL(i.is_primary_key, 0) = 0
         AND fc.parent_column_id IS NULL
         AND c.name NOT IN ('Id', 'CreateTime', 'AcquireTime', 'IntermediateCreateTime', 'IntermediateModifyTime', 'IntermediateDeleteTime', 'ValuesHash')
       ORDER BY SIGN(c.max_length) DESC, c.max_length, LOWER(c.name)
       FOR XML PATH(''), TYPE).value('.', 'nvarchar(max)')
    , 1, 7, '') + ',0,4000)) WHERE ValuesHash IS NULL' AS computed
  FROM sys.tables t
  INNER JOIN sys.all_columns c ON t.object_id = c.object_id
  WHERE c.name = 'ValuesHash'
) x
WHERE computed IS NOT NULL
ORDER BY name
Change it to suit your needs. It is by no means perfect, but it's a start for whatever you need.
Update:
A new FORMAT function was introduced in SQL Server 2012, working somewhat like the .NET ToString method. Using that function is slightly more precise:
SELECT * FROM (
  SELECT t.name,
    'UPDATE [' + t.name + '] SET ValuesHash = HASHBYTES(''SHA1'',SUBSTRING(' +
    STUFF(
      (SELECT '+ ''|''+ ISNULL(' +
         CASE
           WHEN tp.name IN ('float', 'real') THEN 'FORMAT(' + c.name + ',''R'')'
           WHEN tp.name IN ('decimal') THEN 'FORMAT(' + c.name + ',''G'')'
           WHEN tp.name IN ('datetime', 'datetime2') THEN 'FORMAT(' + c.name + ',''O'')'
           WHEN tp.name IN ('binary', 'varbinary') THEN 'CONVERT(NVARCHAR(4000),' + c.name + ',2)'
           ELSE 'CONVERT(NVARCHAR(4000),' + c.name + ')'
         END + ','''')'
       FROM sys.all_columns c
       INNER JOIN sys.types tp
         ON c.system_type_id = tp.system_type_id AND c.user_type_id = tp.user_type_id
       LEFT JOIN sys.index_columns ic
         ON ic.object_id = t.object_id AND ic.column_id = c.column_id
       LEFT JOIN sys.indexes i
         ON ic.object_id = i.object_id AND ic.index_id = i.index_id
       LEFT JOIN sys.foreign_key_columns fc
         ON fc.parent_object_id = t.object_id AND c.column_id = fc.parent_column_id
       WHERE t.object_id = c.object_id
         AND ISNULL(c.is_identity, 0) = 0
         AND ISNULL(c.is_computed, 0) = 0
         AND ISNULL(c.is_filestream, 0) = 0
         AND ISNULL(c.is_rowguidcol, 0) = 0
         AND ISNULL(i.is_primary_key, 0) = 0
         AND fc.parent_column_id IS NULL
         AND c.name NOT IN ('Id', 'CreateTime', 'AcquireTime', 'IntermediateCreateTime', 'IntermediateModifyTime', 'IntermediateDeleteTime', 'ValuesHash')
       ORDER BY SIGN(c.max_length) DESC, c.max_length, LOWER(c.name)
       FOR XML PATH(''), TYPE).value('.', 'nvarchar(max)')
    , 1, 7, '') + ',0,4000)) WHERE ValuesHash IS NULL' AS computed
  FROM sys.tables t
  INNER JOIN sys.all_columns c ON t.object_id = c.object_id
  WHERE c.name = 'ValuesHash'
) x
WHERE computed IS NOT NULL
ORDER BY name
This is one of those WTF moments. After more than a decade of working in software development I learn something this basic about T-SQL (or rather, any flavour based on SQL-92). What would you think happens when running this script?
IF '' = '   ' SELECT 'WTF?!' -- empty string compared to a bunch of spaces
IF '  ' = '      ' SELECT 'WTF?!' -- bunch of spaces compared to another bunch of spaces of different length
IF 'x' = 'x   ' SELECT 'WTF?!' -- 'x' compared to 'x' followed by a bunch of spaces
IF 'x' = '   x' SELECT 'WTF?!' -- 'x' compared to 'x' preceded by a bunch of spaces
There will be three WTF rows returned, for the first three queries. You don't believe me? Try it yourself. The reason is explained here: INF: How SQL Server Compares Strings with Trailing Spaces. Short story shorter: in order to compare two strings of different lengths, SQL first right-pads the shorter one with spaces.
So what can you do to fix it? Easy enough, use LEN, right? Nope. Read the definition carefully: Returns the number of characters of the specified string expression, excluding trailing blanks. A possible but weird solution is to use DATALENGTH: a string is empty only if it has a DATALENGTH of 0. In the case of NVARCHAR you can even divide the resulting number by 2 in order to get the true length of the string. WTF, right?
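A quick illustration of the difference, assuming NVARCHAR values (two bytes per character):
SELECT
  LEN('x ') AS LenIgnoresTrailing,  -- 1: the trailing blank is not counted
  LEN('   ') AS LenOfSpaces,        -- 0: all the blanks are trailing
  DATALENGTH(N'') AS EmptyBytes,    -- 0: truly empty
  DATALENGTH(N'   ') AS SpacesBytes -- 6: three NVARCHAR characters, two bytes each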
Well, it's pretty obvious, but I wanted to post it here as well. You see, we had this query that was supposed to filter some values based on a parameter. The query was done like this: SELECT * FROM table WHERE value=(CASE WHEN @filter IS NULL THEN value ELSE @filter END). Can you spot the problem? Indeed, if value is NULL, then value is NOT equal to value. Not only is this incorrect, it is also bad from the standpoint of the SQL execution plan. It is much faster to do SELECT * FROM table WHERE (@filter IS NULL OR value=@filter). If, for whatever reason, you need the first syntax, you have to do it like this: SELECT * FROM table WHERE ISNULL(value,@impossibleValueThatIsNotNull)=COALESCE(@filter, value, @impossibleValueThatIsNotNull). Yeah, a bit of a show off, but when the "no filter" value is NULL, it's better to use ISNULL and COALESCE wherever possible.
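Here is a side-by-side sketch of the three variants, on a hypothetical table SomeTable with a nullable INT column value:
DECLARE @filter INT = NULL -- NULL means "no filter"
DECLARE @impossibleValueThatIsNotNull INT = -2147483648 -- a value the column can never hold
-- wrong: rows where value is NULL disappear even when @filter is NULL
SELECT * FROM SomeTable WHERE value = (CASE WHEN @filter IS NULL THEN value ELSE @filter END)
-- better, both for correctness and for the execution plan
SELECT * FROM SomeTable WHERE (@filter IS NULL OR value = @filter)
-- if the first shape is really needed, map NULLs to an impossible value on both sides
SELECT * FROM SomeTable WHERE ISNULL(value, @impossibleValueThatIsNotNull) = COALESCE(@filter, value, @impossibleValueThatIsNotNull)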
I made a function in T-SQL that parsed some string and returned a decimal. It all worked fine until one of my colleagues wanted to use it on the test server. And there it was, a beautiful error message: 'decimal' is not a recognized built-in function name. I isolated the line, executed it, same error. It was as if the server did not understand decimals anymore. The line in question was a simple SELECT TRY_CONVERT(DECIMAL(18,6),'10.33'). If I used CONVERT, though, it all worked fine. Another hint was that the function worked perfectly fine on the same server, in another database. The problem was that for that particular database the compatibility level was set to SQL Server 2008, not 2012. We changed it and everything worked after that. The setting in question is found in the Properties of the database, under Options, Compatibility level.
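The same setting can be checked and changed from T-SQL; a quick sketch, with a placeholder database name:
-- 100 = SQL Server 2008, 110 = SQL Server 2012; TRY_CONVERT needs at least 110
SELECT name, compatibility_level FROM sys.databases WHERE name = 'MyDatabase'
ALTER DATABASE MyDatabase SET COMPATIBILITY_LEVEL = 110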