It is a bit embarrassing not to know this at my level of software development, but I was stunned to see that other people, even more experienced than I, had the same gap in their knowledge. Apparently Microsoft SQL Server converts empty or whitespace strings to default values when using CONVERT or CAST. So CONVERT(INT,''), equivalent to CAST('' AS INT), equals 0. A DATETIME conversion leads to a value of 1900-01-01. And so on. That means a good practice for data conversion, when you don't know what data you may be getting, is to always turn whitespace into NULL before using CONVERT or CAST. Also, in related news, a newline is NOT whitespace in T-SQL, so LTRIM(CHAR(10)) and LTRIM(CHAR(13)) are not empty strings!

Bottom line: instead of CONVERT(<type>,<unknown string value>) use the cumbersome CONVERT(<type>,CASE WHEN LTRIM(RTRIM(<unknown string value>))!='' THEN <unknown string value> END). Same with CAST.

Here is a table of conversions for some values converted to FLOAT:

Value                    | Normal CONVERT   | Cumbersome CONVERT | TRY_CONVERT
NULL                     | NULL             | NULL               | NULL
'' (empty string)        | 0                | NULL               | 0
' ' (whitespace)         | 0                | NULL               | 0
whitespace and newlines  | Conversion error | Conversion error   | NULL
'123'                    | 123              | 123                | 123

You might think this is not such a big deal, but Microsoft SQL Server 2012 introduced TRY_CONVERT and the similar TRY_CAST, which return NULL if there is a conversion error. This means that for an incorrect string value these functions return NULL for almost everything except the empty string, for which they return the default value of the chosen type, resulting in inconsistent behavior.
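
Here is a quick sanity check you can run yourself (TRY_CONVERT needs SQL Server 2012 or newer):
SELECT CONVERT(FLOAT, '')                                           AS NormalEmpty,     -- 0
       TRY_CONVERT(FLOAT, '')                                       AS TryEmpty,        -- 0, not NULL!
       TRY_CONVERT(FLOAT, CHAR(10))                                 AS TryNewline,      -- NULL
       CONVERT(FLOAT, CASE WHEN LTRIM(RTRIM('')) != '' THEN '' END) AS CumbersomeEmpty  -- NULL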

This is something I have been hitting my head against since the beginning of my programming career: just find the best match in a table for each row in another table, through a single query.

There are solutions, but they are all very inefficient. To demonstrate the issue I will start with a simple structure: tables A and B, having the same columns: id, x and y. I want to get, for each point in table A defined by the (x,y) coordinates, the closest point in table B. I only need one and it doesn't need to be exclusive (several points in A might be closest to the same point in B). It doesn't even have to be exactly one row in B for each row in A, in case two points in B are at the exact same distance from a point in A. The structure is created here:
CREATE TABLE A(id INT PRIMARY KEY IDENTITY(1,1), x FLOAT, y FLOAT)
INSERT INTO A (x,y) VALUES(10,20),(20,30),(20,10),(30,20),(30,20),(10,30)

CREATE TABLE B(id INT PRIMARY KEY IDENTITY(1,1), x FLOAT, y FLOAT)
INSERT INTO B (x,y) VALUES(11,20),(20,31),(21,10),(31,21),(30,20),(11,30)

Finding the distance from each point in A to the closest point in B is trivial:
SELECT a.id,
       a.x,
       a.y,
       MIN((a.x - b.x) * (a.x - b.x) + (a.y - b.y) * (a.y - b.y)) AS dist
FROM a
CROSS JOIN b
GROUP BY a.id, a.x, a.y
To get the id of the closest B point, not so easy.

The first naive solution would be to just find the row in B that corresponds to each row in A using nested selects, like this:
SELECT *
FROM a
JOIN b
  ON b.id = (SELECT TOP 1 b.id
             FROM b
             ORDER BY (a.x - b.x) * (a.x - b.x) + (a.y - b.y) * (a.y - b.y) ASC)

Looking at the execution plan we see what is going on: 86% of the query is spent on "Top N Sort".
Let's explore some other solutions so we can compare them all in the same execution plan at the end.

Another solution is to take the result of the query that computes the distance and join again on the distance. That means we compare each row in A with each row in B twice: once for the computation of the MIN function and once for the join:
SELECT j.*,
       b2.*
FROM (SELECT a.id,
             a.x,
             a.y,
             MIN((a.x - b.x) * (a.x - b.x) + (a.y - b.y) * (a.y - b.y)) AS m
      FROM a
      CROSS JOIN b
      GROUP BY a.id, a.x, a.y) j
INNER JOIN b b2
        ON j.m = (j.x - b2.x) * (j.x - b2.x) + (j.y - b2.y) * (j.y - b2.y)

Something that does essentially the same thing, but looks a little better, is joining table A with B (as b1), then joining again with B (as b2) on the condition that the b2 point is closer than the b1 point, and finally keeping only the rows where no such b2 exists (in other words, b1 is the closest):
SELECT a.*,
       b1.*
FROM a
CROSS JOIN b b1
LEFT JOIN b b2
       ON (a.x - b1.x) * (a.x - b1.x) + (a.y - b1.y) * (a.y - b1.y) >
          (a.x - b2.x) * (a.x - b2.x) + (a.y - b2.y) * (a.y - b2.y)
WHERE b2.id IS NULL

None of these solutions scans B only once for each row in A. Their relative cost in the combined execution plan is 75%, 11% and 14%, respectively. In other words, finding the minimum distance and then joining with the B table again on the points that are at exactly that distance is the best solution so far. However, given some assumptions and a weird structure, we can get to something that runs in half that time:
SELECT id      AS aId,
       x,
       y,
       m % 100 AS bId
FROM (SELECT a.id,
             a.x,
             a.y,
             MIN(CAST((a.x - b.x) * (a.x - b.x) + (a.y - b.y) * (a.y - b.y) AS BIGINT) * 100 + b.id) AS m
      FROM a
      CROSS JOIN b
      GROUP BY a.id, a.x, a.y) j

These are the assumptions that must be true in order for this to work:
  • The function value can be converted to a BIGINT without problems (if the distance between points had been less than 1, the conversion would have lost precision)
  • The maximum ID in table B is under a certain value (in this case 100)
  • The converted function multiplied by this maximum number doesn't cause an overflow
Basically I am mathematically creating a container for both the value of the function and the id of the point in B, computing the minimum, then extracting the id back from the value. For example, if the minimum squared distance is 2 and the closest point has id 5, the container value is 2 * 100 + 5 = 205, and 205 % 100 recovers the 5. Neat.

Another solution, the one that makes the most intuitive sense, is using a feature introduced in SQL Server 2005: RANK. We rank the points in B relative to each point in A, based on our function, then we keep only the first by selecting on the rank being 1. Unfortunately, this doesn't work as expected. First of all, you cannot use RANK in the WHERE clause, so you must select the rank first, then select from that selection to add the condition. This might mean horrid temporary data tables if tables A and B are huge. Also, after running the query, it turns out to be slower than the one that joins on the minimum distance. Here it is:
SELECT aId,
       bId
FROM (SELECT a.id AS aId,
             a.x,
             a.y,
             b.id AS bId,
             RANK() OVER(PARTITION BY a.id
                         ORDER BY (a.x - b.x) * (a.x - b.x) + (a.y - b.y) * (a.y - b.y) ASC) AS rnk
      FROM a
      CROSS JOIN b) x
WHERE rnk = 1

Comparing all the solutions so far, except the first naive one with the nested selects, we get these relative costs:
  1. Mathematical container of function value and id: 14%
  2. Selection of the minimum distance to each point and then joining with table B for the second time to look for the point that is at that distance: 21%
  3. Joining twice on the same table with the condition that one is better than the other and that the better one doesn't actually exist: 29%
  4. Using RANK: 36%, most surprisingly the worst solution

The final solution, with some extra computation added in order to get rid of the constants and some of the assumptions, thus becomes:
DECLARE @MaxId BIGINT

SELECT @MaxId = ISNULL(MAX(id) + 1, 1)
FROM B;

WITH q AS (SELECT A.id,
                  A.x,
                  A.y,
                  MIN(CAST(POWER(A.x - B.x, 2) + POWER(A.y - B.y, 2) AS BIGINT) * @MaxId + B.id) AS m
           FROM A
           CROSS JOIN B
           GROUP BY A.id, A.x, A.y)
SELECT id AS aId,
       x,
       y,
       m % @MaxId AS bId
FROM q;


I am still looking and there is now a question on StackOverflow that attempts to get the answer from the community, so far with limited results.

The preferred method to display anything in Transact-SQL is PRINT. You can print a string, a variable, an expression. However, as anyone soon finds out, the messages all get cached in a buffer and displayed only after the entire query ends. So if you have several long running queries in a single batch and you want to get real time messages from them, PRINT doesn't work. A quick search directs you to another MS SQL directive: RAISERROR (note the creative spelling that makes one think more of hearing Katy Perry sing RROR than of a proper error raising directive). Also note that Microsoft recommends using a new construct called THROW, introduced in SQL 2012; however, it only looks like a lamer version of RAISERROR. Both send a message to the client instantly, but the problem they have is that they do not, as PRINT does, accept an expression. So if you want to print something like 'The computed key from the query is: '+@Key you are out of luck, as you need to declare a new NVARCHAR variable, fill it with the message value, then use it in RAISERROR.

But there is a better solution. RAISERROR not only throws something at the client, it also flushes the message cache. So something like this works: PRINT 'The computed key from the query is: '+@Key; RAISERROR('',0,1) WITH NOWAIT;.
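
To see the difference, here is a small script you can try; the WAITFOR DELAY just stands in for a long running statement:
PRINT 'This message is buffered'
RAISERROR('This flushes the buffer, so both messages appear now',0,1) WITH NOWAIT
WAITFOR DELAY '00:00:10' -- simulates a long running query
PRINT 'This appears only when the batch ends'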

Of course, being the dev that I am, I wanted to encapsulate this into something reusable and also get rid of the need to use plus signs and conversions to NVARCHAR, so I created this procedure that almost works like PRINT should have:
CREATE PROCEDURE Write (@P1  NVARCHAR(max)=NULL,
                        @P2  NVARCHAR(max)=NULL,
                        @P3  NVARCHAR(max)=NULL,
                        @P4  NVARCHAR(max)=NULL,
                        @P5  NVARCHAR(max)=NULL,
                        @P6  NVARCHAR(max)=NULL,
                        @P7  NVARCHAR(max)=NULL,
                        @P8  NVARCHAR(max)=NULL,
                        @P9  NVARCHAR(max)=NULL,
                        @P10 NVARCHAR(max)=NULL)
AS
PRINT ISNULL(@P1, '') + ISNULL(@P2, '')
    + ISNULL(@P3, '') + ISNULL(@P4, '')
    + ISNULL(@P5, '') + ISNULL(@P6, '')
    + ISNULL(@P7, '') + ISNULL(@P8, '')
    + ISNULL(@P9, '') + ISNULL(@P10, '')

RAISERROR('',0,1) WITH NOWAIT -- flush the message buffer immediately

And you use it as DECLARE @now DATETIME = GetDate(); EXEC Write 'The date today is ',@now. Nice, huh? Of course what you would have liked to do is EXEC Write 'The date today is ',GetDate(), but apparently stored procedures do not accept functions as parameters, and functions do not accept PRINT inside them.

I had this database table containing ranges (a start value and an end value). The challenge was creating a query that overlaps and transposes those ranges so I can say how many ranges cover any point in the total interval of values. As an example, "SELECT * FROM Ranges" would result in a table like:
Start | End
10    | 20
10    | 30
25    | 35
20    | 40
and I am looking for something like this:
Value | Count
0     | 0
1     | 0
...   | ...
10    | 2
11    | 2
...   | ...
24    | 2
25    | 3
26    | 3

A naive implementation would get the minimum Start (or start with 0, as I did) and the maximum End, create an in-memory or temporary table of values from min to max using an ugly WHILE block, then join it with the Ranges table, something like:
SELECT v.Val, COUNT(1) AS Nr
FROM #Values v
INNER JOIN Ranges r
        ON r.Start <= v.Val AND r.[End] >= v.Val
GROUP BY v.Val

This kind of works, but for large ranges it becomes difficult. It takes a long time to create the Values table and to perform the join, and for extreme cases, like the one I had with values from 0 to 6 billion, it becomes impossible. The bottleneck here is the Values table, which is pretty much a horror to create and maintain. But what if you don't need all the values?

Before I tell you the solution I found, be warned that you have to properly define what a range is. Is the range 10-20 actually 10-19? In my case it was, and that is why there are some subtractions of 1 and strict less-than comparisons rather than less-than-or-equal conditions.

The solution is this:
SELECT DISTINCT Val
INTO #Values
FROM (
SELECT 0 as Val
UNION ALL
SELECT Start FROM Ranges
UNION ALL
SELECT [End]-1 FROM Ranges
) x
ORDER BY Val

The idea is that after you compute the ranges count for each of the start and end values, you know that between one value and the next the count of ranges remains the same. The join is significantly faster, there is no ugly WHILE block and you don't need a 6 billion value table. It's easier to plot on a chart as well, with a variation like this:
SELECT v.Val,
       COUNT(r.Start) AS Nr
FROM #Values v
LEFT JOIN Ranges r
       ON r.Start <= v.Val AND r.[End] > v.Val
GROUP BY v.Val

The result of the query above will be:
Value | Nr
0     | 0
10    | 2
19    | 2
20    | 2
25    | 3
29    | 3
34    | 2
39    | 1
Hope this helps you.

I had this case today when I had to add a new column to a table with a gazillion rows and also create an index on that column. In my mind, having a column filled with NULL values would make the creation of the index instantaneous. I was wrong! In order to create the index, the SQL engine still scans all the rows, and for tables with a lot of rows it takes a long while. It felt really stupid: I knew the column contained only NULL values, so I didn't need any computation when creating the index, only on subsequent INSERT/UPDATE/DELETE operations. So I started to look into solutions.

Enter filtered indexes! Microsoft SQL Server 2008 introduced an option for filters on indexes. The index must not be clustered and the definition is just the same as before, only with a WHERE clause that applies the filter. This seemed to be the right solution to my scenario.

In order to test this I created a table called Test with two columns, a and b, both nvarchar(255). I filled the table with ten million rows having values for a and no values for b. Then I created an index on b; it took about 30 seconds. Then I created an index on a; it took 50 seconds. I removed the indexes and created a filtered index on b on the condition that b is not null. The operation was instantaneous. Success!
And here are the actual operations with more exact values (check out the comments for extra SQL tips on speed):
CREATE TABLE Test(a NVARCHAR(255),b NVARCHAR(255))

DECLARE @i INT = 625000 -- ten million divided by 16

BEGIN TRAN -- if not using a transaction, SQL will create a transaction per insert!

WHILE @i>0
BEGIN

SET @i=@i-1
INSERT INTO Test(a) VALUES(CAST(@i AS NVARCHAR(255)))

END

COMMIT TRAN -- 625000 rows with a filled and b empty [8 seconds]

-- insert selects are a lot faster than the while above
INSERT INTO Test(a)
SELECT a+'x1' FROM Test -- 1250000 total rows [3 seconds to create another 625000 rows]

INSERT INTO Test(a)
SELECT a+'x2' FROM Test -- 2500000 total rows [7 seconds]

INSERT INTO Test(a)
SELECT a+'x3' FROM Test -- 5000000 total rows [18 seconds]

INSERT INTO Test(a)
SELECT a+'x4' FROM Test -- 10000000 total rows [46 seconds]

CREATE INDEX IXa ON Test(a) -- [27 seconds]

DROP INDEX IXa ON Test

CREATE INDEX IXb ON Test(b) -- [25 seconds, approximately the same]

DROP INDEX IXb ON Test

CREATE INDEX IXa ON Test(a) -- there are no null values in the a column [56 seconds, double for a filtered index with a useless filter]
WHERE a IS NOT NULL

DROP INDEX IXa ON Test

CREATE INDEX IXb ON Test(b) -- [0 seconds - instant!]
WHERE b IS NOT NULL

A lot of good things can come from using filtered indexes, for example a unique index on non-null values (which was pretty much a pain in the ass to achieve otherwise), but there are also gotchas. One of the things you have to watch out for is using filtered indexes on numeric columns. In this case the SET ARITHABORT ON command must be used (or ensure in some other way that the option is on for all SQL sessions - SQL Management Studio and code both!), otherwise errors might occur or the index might be ignored. Also there seem to be some bugs that have not been addressed even in SQL 2012, like when using MERGE or when using filtered indexes on a column being or not being null.
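
For instance, here is a minimal sketch of that unique-over-non-null-values trick, reusing the Test table from above:
CREATE UNIQUE INDEX IXbUnique ON Test(b) -- duplicates are rejected only among non-null values,
WHERE b IS NOT NULL                      -- so any number of NULLs can coexist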

A more detailed article on this feature can be found here: SQL University: Advanced Indexing – Filtered Indexes.

We all know that the best way to prevent SQL injection is to use parameters, either in stored procedures or in parameterized queries. Yet on some occasions we meet code that replaces every single quote with two single quotes. The question arises: OK, it's ugly, but isn't it enough?

I have recently found out about something called "Unicode Smuggling", which uses the database against itself to bypass the protection described above. More details here: SQL Smuggling, but the basic idea is this: if the replacement scheme is implemented in the database and uses VARCHAR, or maybe the code uses some non-Unicode string, then the protection is vulnerable to an attack leveraging what are known as Unicode homoglyphs. If you feel adventurous and want to examine thoroughly the ways Unicode can be used maliciously, check out UTR#36.

Here is an example:
CREATE PROC UpdateMyTable
@newtitle NVARCHAR(100)
AS
/*
Double up any single quotes
*/
SET @newtitle = REPLACE(@newtitle, '''','''''')

DECLARE @UpdateStatement VARCHAR(MAX)

SET @UpdateStatement = 'UPDATE myTable SET title=''' + @newtitle + ''''

EXEC(@UpdateStatement)

Note the use of VARCHAR as the type of @UpdateStatement. This procedure receives a string, doubles all single quotes, then creates an SQL string that is then executed. It would be vulnerable to this:
EXEC UpdateMyTable N'ʼ;DROP TABLE myTable--'

The first character in the provided string is not a single quote, but the Unicode character U+02BC (MODIFIER LETTER APOSTROPHE). SQL Server will silently convert it into a single quote when it is stored in a VARCHAR. The injection will work.

Small demo in MS-SQL:
DECLARE @nhack NVARCHAR(100) = N'ʼ;DROP TABLE myTable--'
DECLARE @hack VARCHAR(100) = N'ʼ;DROP TABLE myTable--'
SELECT UNICODE(@nhack),UNICODE(@hack) -- Results: 700 39

More discussion of this here: Can I protect against SQL Injection by escaping single-quote and surrounding user input with single-quotes?

I've often encountered this situation: a stored procedure needs to display a list of records ordered by a dynamic parameter. In Transact SQL, on Microsoft SQL Server, one cannot do this elegantly in any way. I will list the options and tell you what the problem with each is.

First of all, let's start with an example. Assume we have a table called Test with a lot of rows and a DATETIME column that has an index on it. Let's call that column TheDate, to avoid any SQL keywords. We want to do something like this:
SELECT TOP 10 * FROM Test ORDER BY TheDate ASC

Notice that I want to get the top 10 rows, which means I only need a small part of the total. I also order directly by TheDate. In order to release a piece of code we also need to test it for performance issues. Let's look at the execution plan:


Now, let's try to order it dynamically on a string parameter which determines the type of the sort:
SELECT TOP 10 * FROM Test ORDER BY CASE WHEN @sort='ASC' THEN TheDate END ASC, TheDate DESC

As you see, I've used CASE to determine the sort order. There is no option to give a parameter as the sort order. The execution plan is this:


Surprise! The execution plan for the second query shows it is ten times slower. What actually happens is that the entire table is sorted by the CASE expression into an intermediate result, and only then are the 10 items extracted from it.

There must be a solution, you think, and here is an ingenious one:
DECLARE @intSort INT = CASE WHEN @sort='ASC' THEN 1 ELSE -1 END
SELECT TOP 10 * FROM Test ORDER BY CAST(TheDate AS FLOAT)*@intSort ASC

I transform the DATETIME value into a FLOAT and then use a mathematical expression on it, multiplying it by 1 or -1. It is the simplest expression possible under the circumstances. The execution plan is:


Bottom line, there is no exception to the rule: when you order by an expression, SQL Server does not use indexes, even if the expression is easily decompilable. Don't be misled by the apparent functional programming style of SQL syntax; it doesn't really optimize the execution plan in that way. Even if the column is an integer, it will not work: ordering by TheInteger is fundamentally faster than ordering by -TheInteger.

And now the solution, ugly as it may be (imagine the select is a large one, with joins and where conditions):
IF @sort='ASC'
BEGIN
    SELECT TOP 10 * FROM Test ORDER BY TheDate ASC
END
ELSE
BEGIN
    SELECT TOP 10 * FROM Test ORDER BY TheDate DESC
END

Yes, the dreaded duplication of code. But the execution plans are now equivalent: 50%/50%.

This post was inspired by real events, where the production SQL server went into overdrive trying to create and manage many temporary result tables from a stored procedure that wanted to avoid duplication using the CASE method.

Update: there is, of course, another option: building the SQL string dynamically in the stored procedure based on the sort parameter, then executing it. But I really dislike that method, for many reasons.
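
For completeness, a minimal sketch of that dynamic SQL option; note that @sort is whitelisted through a CASE rather than concatenated directly, to avoid injection:
DECLARE @sql NVARCHAR(MAX) = N'SELECT TOP 10 * FROM Test ORDER BY TheDate '
    + CASE WHEN @sort = 'ASC' THEN N'ASC' ELSE N'DESC' END
EXEC sp_executesql @sql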

More information from another blogger: Conditional Order By. They also explore ranking with window functions, and in one of the comments there is a reference to the SQL 2008 "grouping sets" feature, which I have not covered yet.

A little known (at least by the people I've talked to) feature of Transact SQL (the Microsoft SQL engine) is the setting of ROWCOUNT. Usually ROWCOUNT is used to get the number of rows an operation has returned or affected, in which case it is actually @@ROWCOUNT. Something like this:
DECLARE @RowsUpdated INT
UPDATE MyTable SET Value = 10 WHERE [Key]='MySetting'
SET @RowsUpdated = @@ROWCOUNT

Instead, setting ROWCOUNT tells the SQL engine to return (or affect) only a specified maximum number of rows. So let's reuse the earlier example:
SET ROWCOUNT 1 -- note: no equals sign in this syntax
UPDATE MyTable SET Value = 10 WHERE [Key]='MySetting'
SET @RowsUpdated = @@ROWCOUNT
In this case a maximum of one row will be updated, no matter how many rows exist in the table with the value 'MySetting' in the Key column. Also, @@ROWCOUNT will correctly output 1 (or 0, if no matching rows exist).

Now, you will probably think that setting ROWCOUNT is equivalent to TOP, only a lot more confusing. I had a case at work where, during a code review, a colleague saw two SELECT statements one after the other. One was getting all the values, with a filter, and another was selecting COUNT(*) with the same filter. He was understandably confused about why someone would select twice instead of just counting the rows returned by the first select (or using @@ROWCOUNT :) ). The reason was that there was a SET ROWCOUNT @RowCount which restricted the number of rows returned by the first SELECT statement.

Here comes the gotcha. Assuming that setting ROWCOUNT is equivalent to a TOP restriction in the SELECT statement (in SQL 2000 and lower you could not use a variable with the TOP restriction, which I thought was why the first solution was used), I replaced SET ROWCOUNT @RowCount with SELECT TOP (@RowCount). And suddenly no rows were getting selected. The difference is that if you set ROWCOUNT to 0, the next statement will not be restricted in any way, while TOP 0 will return 0 rows. So, as usual, be careful with assumptions.
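
A small demonstration of the difference, assuming some table MyTable:
DECLARE @RowCount INT = 0

SET ROWCOUNT @RowCount
SELECT * FROM MyTable -- returns ALL the rows: ROWCOUNT 0 means "no limit"
SET ROWCOUNT 0 -- always reset it afterwards

SELECT TOP (@RowCount) * FROM MyTable -- returns no rows at all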

There are other important differences between TOP and SET ROWCOUNT. TOP accepts both numeric and percentage values. Also, SET ROWCOUNT no longer affects UPDATE, DELETE and INSERT statements in the versions of SQL Server after 2012, so it's basically obsolete. Furthermore, the query optimizer can consider the value of the expression in a TOP clause during query optimization; because SET ROWCOUNT is used outside the statement that executes the query, its value cannot be considered in a query plan.

Update: in SQL 2012 a new option was added to the ORDER BY clause, called OFFSET and FETCH, that finally works like the LIMIT keyword in MySQL.
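
A quick sketch of the syntax (SQL Server 2012 and newer; OFFSET/FETCH requires an ORDER BY):
SELECT * FROM MyTable
ORDER BY [Key]
OFFSET 10 ROWS FETCH NEXT 5 ROWS ONLY -- skip 10 rows, return the next 5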

I had to do a very simple Microsoft SQL query in which I wanted to update some of the values in a row from a row in the same table. Actually, the query was already there, but was using two local variables to store the information, then make the update. Something like this:
DECLARE @Var1 INT
DECLARE @Var2 INT
SELECT @Var1=Column1,@Var2=Column2 FROM MyTable WHERE ID=1
UPDATE MyTable SET Column1=@Var1,Column2=@Var2 WHERE ID=2
I really hated that I was using two SQL statements and all that declaring to do a simple update, so I looked up the syntax for the UPDATE statement. It said that if I want to update a table from a source I need to use the FROM keyword, like this:
UPDATE MyTable 
SET Column1=Alias.Column1,Column2=Alias.Column2
FROM MyOtherTable AS Alias
WHERE ID=2
AND Alias.ID=1
As you can see, we use an alias to name another table or query; we then use the alias name in all the conditions for that table, and no prefix in the conditions on the table we update. Easy, no? I even tested it and it worked. So I tried this:
UPDATE MyTable 
SET Column1=Alias.Column1,Column2=Alias.Column2
FROM MyTable AS Alias
WHERE ID=2
AND Alias.ID=1
I used the same table both as the update target and as the alias, and it seemed to work. However, the number of updated rows was always 0. It is remarkable how difficult it is to find a straight answer on the net about a simple situation like this.

It turns out that even with the alias, MS SQL is confusing the two references: since the only mention of MyTable in the FROM clause is the aliased one, the update target binds to it, so both conditions end up applying to the same row and can never both be true. The solution is to use a query over your table, rather than the name of the table itself. Here is how you do it:
UPDATE MyTable 
SET Column1=Alias.Column1,Column2=Alias.Column2
FROM (SELECT * FROM MyTable) AS Alias
WHERE ID=2
AND Alias.ID=1


SQL 2005 also introduced Common Table Expressions, which can be used to clarify a query. In this case, using a CTE results in the same execution plan and makes the entire query even more convoluted:
WITH Alias(Column1,Column2)
AS (
SELECT Column1, Column2 FROM MyTable
)
UPDATE MyTable
SET Column1=Alias.Column1,Column2=Alias.Column2
FROM Alias
WHERE ID=2
AND Alias.ID=1

Even though the documentation says you can specify a CTE without declaring the column names, I couldn't get it to work in this situation, and I don't know why. I admit I only tried the CTE solution for a minute before discarding it as too verbose.

Let's start with an example:
DECLARE @SiteId INT
SELECT @SiteId=isnull(SiteId,0) FROM Orders WHERE OrderID=15
UPDATE Order_Sites SET SiteID=@SiteId


Can you spot the problem? What if SiteID in Order_Sites is not nullable? What if there is no order with OrderId 15?

That's right: when you select into a variable, you must be certain that the query returns at least one row, otherwise the variable will not be set at all.

The solution is to add another operation that sets the value correctly. Here are three possible options:
  • Set @SiteId to 0 before the select.
  • Set @SiteId to isnull(@SiteId,0) after the select and simplify the select to not contain the isnull.
  • Use the select as an argument of the isnull operation:
    SET @SiteId = isnull((SELECT SiteId FROM Orders WHERE OrderID=15), 0)
    Yes, you can do that.


Either way, always pay attention to this gotcha in using SQL.

There was this FTP surrogate program that used SQL as a filesystem. I needed to store the size of the file (which was an HTML template, stored as NTEXT) in the same row as the content. The problem is that the size of a text in a Microsoft SQL Server NTEXT column is two bytes per character, while the actual size of the content, stored web-like in UTF-8, was different, close to half of that.

I thought that there must be an easy way to compute it, trying to cast the string to TEXT and then using LEN, trying DATALENGTH, BINARY, etc. Nothing worked. In the end I wrote my own function, because the size of a string in UTF-8 is documented on the Wikipedia page of that encoding: 1 byte for ASCII characters (character code < 128), 2 bytes for codes less than 2048, 3 bytes for codes less than 65536 and 4 bytes for the rest. So here is the SQL function that computes the size in UTF-8:

CREATE FUNCTION [fn_UTF8Size]
(
    @text NVARCHAR(max)
)
RETURNS INT
WITH SCHEMABINDING
AS
BEGIN
    DECLARE @i INT = 1
    DECLARE @size INT = 0
    DECLARE @val INT

    WHILE (@i <= LEN(@text))
    BEGIN
        -- code of the current UTF-16 code unit
        SET @val = UNICODE(SUBSTRING(@text, @i, 1))

        -- number of bytes the character takes in UTF-8
        SET @size = @size +
            CASE
                WHEN @val < 128 THEN 1
                WHEN @val < 2048 THEN 2
                WHEN @val < 65536 THEN 3
                ELSE 4
            END
        SET @i = @i + 1
    END

    RETURN @size
END


A similar approach would work for any other encoding.
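
A hypothetical usage, assuming the templates live in a table called Templates with an NTEXT column called Content (NTEXT cannot be passed directly to the function, hence the cast):
SELECT Id,
       dbo.[fn_UTF8Size](CAST(Content AS NVARCHAR(MAX))) AS Utf8Size
FROM Templates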

I was debugging an application when I noticed that there was an exception thrown in one of the legacy DAL modules. A method was using regular ADO.Net to run a stored procedure and fill a DataSet, then proceeded to get an IDataReader for the set using the CreateDataReader method. On reader.Read() the exception "DataTableReader is invalid for current DataTable 'Table'" was thrown.



I've investigated the issue only to notice that the DataSet was correctly retrieved from the database, the only problem came when creating the reader. Any meaningful property threw this error. I've found a thread that discussed this problem here, but the answer was not there. Instead, I read an obscure line somewhere that said this exception is thrown when the DataSet has changes and tried that solution: before running CreateDataReader, I ran DataSet.AcceptChanges. And it worked!



The strange part comes now: I did the AcceptChanges bit in the Watch window during debug, not as a permanent change to the code. From then on, the code worked, no matter how many times I ran iisreset or restarted the browser. I've added the solution in the code for good measure, but I am still not sure if this is the solution or simply some fluke of the universe. One possible answer is the "race condition" described in this discussion, which also suggests this happens in debug mode only. Strange, innit?



Update: AcceptChanges did not solve the issue on the production server. I am still investigating, but if you know what this is about, please share :)

I needed to pass an array of IDs to a stored procedure on SQL Server 2008. This version of the server supports user defined table types and, of course, a way to access them from .Net. A comprehensive resource for sending arrays to any version of SQL Server can be found here.



Long story short, for 2008 you first define a user defined table type that has a single int column (we are talking about an array of integers here, obviously), then a stored procedure that takes a parameter of that type. A way to send the array from .Net code is detailed here. As you can see, you create an array of something called SqlMetaData, holding the information of each column as defined in the user defined type, then you use an SqlParameter of SqlDbType Structured, with TypeName set to the name of the user defined table type in SQL Server. The parameter's value will be a list of SqlDataRecord instances that have the integer values in their first column. Yes, there is an even longer story and I consider this the short one :-P
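
For reference, a minimal sketch of the SQL side of such a setup; the names IntList, GetProductNames and Products are illustrative, not the ones from the articles above:
-- a user defined table type acting as an integer array
CREATE TYPE IntList AS TABLE (Id INT)
GO
-- table valued parameters must be declared READONLY
CREATE PROCEDURE GetProductNames (@Ids IntList READONLY)
AS
SELECT p.Name
FROM Products p
INNER JOIN @Ids i
        ON i.Id = p.Id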



All nice and easy, but there is a caveat, something that is not immediately obvious from the code. The column metadata is set as a property of each record added to the SQL parameter's value list. But what if the list is empty? In that case it appears that there is a bug somewhere: the stored procedure fails, I guess because it does not receive the structure of the user defined table declared in the metadata and cannot map it to the user defined type.



A solution for this is to add a dummy SqlDataRecord with no values and then, in the stored procedure, check for NULL. A very ugly solution. The post on Erland Sommarskog's blog did not say anything about this specifically, but I did find this passage:

    There are a few peculiarities, though. This does not work:

        EXEC get_product_names NULL

    but results in this error message:

        Operand type clash: void type is incompatible with integer_list_tbltype

    It is quite logical when you think of it: NULL is a scalar value, and not a table value. But what do you think about this:

        EXEC get_product_names

    You may expect this to result in an error about missing parameters, but instead this runs and produces an empty result set!

Therefore the solution I used was to check in code whether the .Net list of integers was empty and, in that case, not send the parameter to the stored procedure at all. And it worked.

I was comparing these two MySQL queries. They looked the same, except for some extra joins and groupings, but one ran for 5 seconds, the other for 2 minutes. I removed all the extra stuff from the long lasting query to get an almost identical query to the first, only to see it run for two minutes again. The only difference left was the type of a value in the WHERE clause: instead of comparing the date with "20101012" (a string), I was comparing it with 20101012 (an integer). Apparently, the type conversion alone was invalidating any use of the index on the date column and made the query last forever. Good to know.
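
In other words, something like this (table and column names made up for illustration):
-- uses the index on the date column
SELECT * FROM orders WHERE order_date = '20101012';

-- the integer literal forces a type conversion on every row,
-- so the index on order_date is ignored
SELECT * FROM orders WHERE order_date = 20101012;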

In simple words, it seems there is none. After running a query like SELECT stuff FROM (SELECT * FROM Table1 UNION ALL SELECT * FROM Table2) x WHERE condition, where Table1 and Table2 have identical signatures, I noticed that it took 30 seconds on a combined 1 million rows. So I thought I would try to move the condition inside each of the UNIONed selects: SELECT stuff FROM (SELECT * FROM Table1 WHERE condition UNION ALL SELECT * FROM Table2 WHERE condition) x, and it took 2 seconds.

So it seems as if the server performed the union on all the rows, perhaps moving them into a temporary table, and only then filtered on the condition.

Coming from the world of Microsoft SQL Server, which carefully creates a query execution tree and routinely solves queries similar to these two in an identical and optimised fashion, I was a bit surprised. Then again, it being MySQL on Linux, maybe it was just not compiled right or didn't have some obscure option set up or something like that. Anyway, I was disappointed.