Adding comments to SQL code generated by Entity Framework Core

Tonight I went to an ADCES presentation about SQL table partitioning, a concept that allows a lot of flexibility while preserving the same basic table interface one would use for a simpler and less scalable application. The talk was very professionally held by Bogdan Sahlean and you should have been there to see it :)
He talked about how one can create filegroups on which a table can be split into as many partitions as needed. He then demonstrated the concept of partition switching, which means swapping two tables without overhead, just via metadata, and, used in the context of partitions, the possibility to create a staging table, do stuff on it, then just swap it with a partition with no downtime. The SQL scripts used in the demo can be found on Sahlean's blog. This technology has existed since SQL Server 2005, so it's not terribly new, and features with similar but more limited functionality have existed since SQL Server 2000. Basically, the data in a table can be organized in separate buckets, and one can even put each partition on a different drive for extra speed.
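To give an idea of what this looks like in practice, here is a minimal sketch; the object names, boundary values and filegroup choice are mine, not from the talk:
-- a partition function splits a range of values into buckets
CREATE PARTITION FUNCTION pfOrdersByYear (DATETIME2)
AS RANGE RIGHT FOR VALUES ('2014-01-01', '2015-01-01');
-- a partition scheme maps each bucket to a filegroup (here everything on PRIMARY, but it could be one filegroup per partition)
CREATE PARTITION SCHEME psOrdersByYear
AS PARTITION pfOrdersByYear ALL TO ([PRIMARY]);
-- the table is created on the scheme; the partitioning column must be part of the clustered key
CREATE TABLE Orders
(
Id INT IDENTITY(1,1) NOT NULL,
OrderDate DATETIME2 NOT NULL,
CONSTRAINT PK_Orders PRIMARY KEY (Id, OrderDate)
) ON psOrdersByYear (OrderDate);
-- partition switching: swap a prepared staging table into partition 2, a metadata-only operation
-- (the staging table must live on the same filegroup and match the table structure and constraints)
ALTER TABLE OrdersStaging SWITCH TO Orders PARTITION 2;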
Things I've found interesting, in no particular order:
TABLESAMPLE (sample_number [ PERCENT | ROWS ] ) [ REPEATABLE (repeat_seed) ]
SELECT * FROM Sales.SalesOrderDetail
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float) / CAST (0x7fffffff AS int)
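The first snippet is the syntax of the TABLESAMPLE clause, which samples whole data pages rather than individual rows; the second is the documented trick for sampling individual rows at random. A possible use of TABLESAMPLE on the same AdventureWorks table, as I understand it:
-- roughly 10 percent of the table's pages; REPEATABLE makes the sample stable between runs
SELECT * FROM Sales.SalesOrderDetail TABLESAMPLE (10 PERCENT) REPEATABLE (42);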
I ran into an interesting case today, when we needed to manipulate data from tens of thousands of people daily. Assuming we used table rows for the information, we would get a table in which rows are constantly added, updated and deleted. The issue is with the space allocated in table pages.
SQL works like this: if it needs space, it allocates a "page", which can hold multiple records. When you delete records the space is not reclaimed; it remains as it was (this is called ghosting). The exception is when all records in a page are deleted, in which case the page is reused as an empty page. When you update a record with more data than it held before (as when you have a variable length column), the page is split, with the rest of the records on the page moved to a new page.
In a heap table (no clustered index) the space inside pages is reused for new records or for updated records that don't fit in their allocated space. However, if you use a clustered index, like a primary key, the space is not reused, since there needs to be a correlation between the value of the column and its position in the page. And here lies the problem: you may end up with a lot of pages with very few records in them. A typical page is 8 kilobytes, so a table with a few integers per record could hold hundreds of records on a single page.
Fragmentation can be within a page, as described above (also called internal fragmentation), but also external, between pages, when recycled pages are used for data that is out of order. To read a large swathe of records the disk might have to work hard, jumping from page to page to get what is logically a continuous blob of data. It is disk input/output that kills a database.
OK, back to our case. A possible solution was to store all the data for a user in a "blob", a VARBINARY column. For reads or changes only the disk space occupied by the blob would be touched, with C# code handling everything else. It's what is called trading CPU for IO, which is generally good. However, this NoSql-like idea smelled bad to me. We are supposed to trust our databases, not work against them. The solution I chose is monitoring index fragmentation and occasionally rebuilding or reorganizing the clustered index, as sketched below. I am willing to bet that reading and writing the data equivalent of several table pages is going to be more expensive than selecting the changes I want to make. Also, rebuilding the index will end up storing all the data per user in the same space anyway.
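For reference, this is roughly what that monitoring looks like; a sketch with placeholder table and index names, using the standard fragmentation DMV:
-- check fragmentation for the indexes of a (hypothetical) dbo.UserData table
SELECT i.name, ips.avg_fragmentation_in_percent, ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.UserData'), NULL, NULL, 'LIMITED') ips
INNER JOIN sys.indexes i
ON i.object_id = ips.object_id AND i.index_id = ips.index_id;
-- then, depending on the level of fragmentation, either reorganize (lightweight, online)
ALTER INDEX PK_UserData ON dbo.UserData REORGANIZE;
-- or rebuild, which compacts the pages and reclaims the wasted space
ALTER INDEX PK_UserData ON dbo.UserData REBUILD;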
However, this case made me think. Here is a situation in which the solution might have been (and, in a similar case implemented by someone else, actually was) to micromanage the way the database works. It made me question using a clustered index/primary key on a table.
These articles helped me understand more:
The new Datetime2 data type introduced in Microsoft SQL Server 2008 has several advantages over the old Datetime type. One of them is a precision of 100 nanoseconds rather than coarse milliseconds; another is that it has a larger date range. It has disadvantages, too, like the difficulty of translating it into numerical values.
There was a classic hack of CONVERTing/CASTing a Datetime into a Float in order to get a numerical value that you could manipulate (like converting it to an integer to get the date without the time, which is now accomplished by converting it to Date, another type introduced in SQL Server 2008), but it won't work directly on a Datetime2 value.
There are many reasons why one would need to translate a datetime into a numerical value, but I won't get into that here. Here is how to convert a Datetime2 value into a Float.
First solution:
DECLARE @Time DATETIME2 = SYSDATETIME()
SELECT DATEDIFF(SECOND,{d '1970-01-01'}, @Time)
+DATEPART(NANOSECOND,@Time)/1.0E+9
Second solution:
DECLARE @Time DATETIME2 = SYSDATETIME()
SELECT DATEDIFF(DAY, {d '1900-01-01'}, @Time)
+DATEPART(HOUR,@Time)/24.0
+DATEPART(MINUTE,@Time)/(24.0*60)
+DATEPART(SECOND,@Time)/(24.0*60*60)
+DATEPART(NANOSECOND,@Time)/(24.0*60*60*1.0E+9)
Final solution:
DECLARE @Time DATETIME2 = SYSDATETIME()
SELECT (
DATEDIFF(SECOND,{d '1970-01-01'}, @Time)
+DATEPART(NANOSECOND,@Time)/1.0E+9
)/86400.0 + 25567 -- 25567 = the number of days between 1900-01-01 and 1970-01-01
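As a sanity check (mine, not part of the original post), for values that fit in a plain Datetime the formula should agree, give or take rounding, with the classic CAST-to-FLOAT hack:
DECLARE @Now DATETIME = GETDATE()
SELECT CAST(@Now AS FLOAT) AS ClassicHack,
(DATEDIFF(SECOND,{d '1970-01-01'}, CAST(@Now AS DATETIME2))
+DATEPART(NANOSECOND, CAST(@Now AS DATETIME2))/1.0E+9
)/86400.0 + 25567 AS FromDatetime2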
As a software developer - and by that I mean someone writing programs in C#, Javascript and so on, and occasionally using databases when something needs to be stored somewhere - I have an instinctual fear of the Arrow Anti-pattern. Therefore I really dislike stuff like NOT EXISTS(SELECT 1 FROM Something). However, recent tests have convinced me that this is the best solution for testing for existing records. I am not going to reinvent the wheel; here are some wonderful links regarding this, after which I will summarize:
Let's say you want to insert in a table all records from another source that do not already exist in the table. You have several options, but the most commonly used are:
SELECT *
FROM SourceTable
LEFT JOIN DestinationTable
ON SomeCondition
WHERE DestinationTable.Id IS NULL
and
SELECT *
FROM SourceTable
WHERE NOT EXISTS(SELECT 1 FROM DestinationTable WHERE SomeCondition)
Personally I prefer the first version, for readability reasons as well as having listened to the mantra "Never do selects in selects" all my life. However, it becomes apparent that the second version is a lot more efficient. The simple reason is that for the first example Microsoft SQL Server will first join the two tables in memory, then filter. If you have multiple combinations of records that satisfy the condition, this results in huge memory and CPU usage, especially if you have no indexes defined and, sometimes, because you have some indexes defined. The second option uses NOT EXISTS, one of the few constructs guaranteed to short-circuit: a record is invalidated at the first match.
Other options involve using the EXCEPT or INTERSECT operations in SQL, but they are not really helping. Intersecting ids, for example, then inner joining with SourceTable works, but it is somewhere in between the two solutions above and it looks like crap as well. Join hints don't help either.
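For completeness, here is a sketch of the insert-only-the-missing-records pattern with NOT EXISTS; the table and column names are placeholders:
INSERT INTO DestinationTable (Id, Value)
SELECT s.Id, s.Value
FROM SourceTable s
WHERE NOT EXISTS(SELECT 1 FROM DestinationTable d WHERE d.Id = s.Id)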
Another scenario: inserting records from a source table while capturing the generated identity values together with values coming from the source query. The setup:
CREATE TABLE MyTable
(
Id INT PRIMARY KEY IDENTITY(1, 1),
Value NVARCHAR(100)
)
CREATE TABLE AnotherTable
(
Value NVARCHAR(100),
AnotherValue NVARCHAR(100),
SomeConditionIsTrue BIT
)
go
CREATE TABLE #ids
(
Id INT ,
AnotherValue NVARCHAR(100)
)
INSERT INTO MyTable (Value)
OUTPUT inserted.Id INTO #ids (id)
SELECT Value
FROM AnotherTable
WHERE SomeConditionIsTrue = 1
-- Do something with the inserted Ids
-- The naive attempt below does NOT work: the OUTPUT clause of an INSERT
-- cannot reference columns from the source query (here, AnotherTable.AnotherValue)
INSERT INTO MyTable (Value)
OUTPUT inserted.Id,AnotherTable.AnotherValue INTO #ids (id,AnotherValue)
SELECT Value
FROM AnotherTable
WHERE SomeConditionIsTrue = 1
-- The workaround: a MERGE with an always-false match condition;
-- its OUTPUT clause CAN reference columns from the source
MERGE INTO MyTable USING (
SELECT Value , AnotherValue
FROM AnotherTable
WHERE SomeConditionIsTrue = 1
) t ON 1=0 --FALSE
WHEN NOT MATCHED THEN
INSERT (Value) VALUES (t.Value)
OUTPUT Inserted.Id, t.AnotherValue INTO #ids (Id, AnotherValue);
Another confusing one, from Change Data Capture: calling the cdc.fn_cdc_get_all_changes_<capture instance> function can greet you with this error:
Msg 313, Level 16, State 3, Line 20 An insufficient number of arguments were supplied for the procedure or function cdc.fn_cdc_get_all_changes_ ... .
Since this is an error message everyone associates with missing a parameter for a function, one might assume that the documentation is wrong and one must add another parameter. You try adding a bogus one and you get:
Msg 8144, Level 16, State 3, Line 9 Procedure or function cdc.fn_cdc_get_all_changes_dbo_<table name> has too many arguments specified.
This only adds to the confusion. One hint is that if we use one less parameter than the documentation says, the error is slightly different:
Msg 313, Level 16, State 3, Line 9 An insufficient number of arguments were supplied for the procedure or function cdc.fn_cdc_get_all_changes_dbo_<table name>.
In this error message, the tracked table name is specified in the function name, as opposed to the other where ... is used instead. What is going on?
The documentation holds the answer: if the specified LSN range does not fall within the change tracking timeline for the capture instance, the function returns error 208 ("An insufficient number of arguments were supplied for the procedure or function cdc.fn_cdc_get_all_changes."). That, of course, is the explanation for this weird behaviour, but why and when does it happen?
The issue here is the inability to do RAISERROR from within a function, which prevents SQL Server from bubbling up a meaningful error message. If one looks at the source of cdc.fn_cdc_get_all_changes_dbo_<table name>, one sees that the error is thrown from another function, a system one, called [sys].[fn_cdc_check_parameters]. Doing a select on it we get the same original error, which is now slightly humorous, since it comes from a completely different function than the one in the message. Since it is a system function this time, there is no source code for it.
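So the practical takeaway, given the documentation quote above, is to make sure the LSN range you pass falls inside the capture instance's timeline. A minimal sketch of what that might look like (dbo_MyTable stands in for the real capture instance name):
-- get the valid LSN range for the capture instance before calling the function
DECLARE @from_lsn BINARY(10) = sys.fn_cdc_get_min_lsn('dbo_MyTable');
DECLARE @to_lsn BINARY(10) = sys.fn_cdc_get_max_lsn();
SELECT *
FROM cdc.fn_cdc_get_all_changes_dbo_MyTable(@from_lsn, @to_lsn, N'all');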
Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
I had set TransactionOptions.Timeout to TransactionManager.MaximumTimeout and SqlCommand.CommandTimeout to 0 (meaning no timeout) and I still got the exception. Apparently, the problem was SqlConnection.ConnectTimeout, which is a read-only property with a default value of 15 seconds. The value can be changed via the connection string, by adding something like Connect Timeout=36000 (10 hours), and many articles on the Internet suggest doing just that. However, that is just really ugly. A better solution is to set the value of the timeout programmatically, and this is how to do it:
// keep the original connection string and only override the connect timeout
var timeout = TimeSpan.FromHours(10);
SqlConnectionStringBuilder csb = new SqlConnectionStringBuilder(connectionString);
csb.ConnectTimeout = (int)timeout.TotalSeconds;
connectionString = csb.ConnectionString;
Update: after another very useful comment from NULLable, I tried several new ideas:
Update 2: I kind of understood the Relational Interval Tree implementation, but I couldn't find a way for it to help me. The code there creates a computed column of the same type as the IP columns then makes a BETWEEN comparison and/or a join or an apply with two table functions. I can't imagine how it could help me since the original query is basically just two BETWEEN conditions. But still a very interesting article.
I wanted to have a database of all RIPE records, in order to quickly determine the Internet Service Provider for an IP. We are discussing IPv4 only, so the structure of the table in the database looked like this:
CREATE TABLE [dbo].[RipeDb](
[Id] [int] IDENTITY(1,1) NOT NULL,
[StartIp] [bigint] NULL,
[EndIp] [bigint] NULL,
[NetName] [nvarchar](450) NULL,
[StartTime] [datetime2](7) NULL,
[EndTime] [datetime2](7) NULL,
[ParentId] [int] NULL)
As you can see, I translate IPs into BIGINT so that I can quickly sort and select stuff. I also added a ParentId column that represents the parent ISP, as you have some huge chunks of IPs, split and sold to other ISPs, which in turn sell bits of the ranges they own to others, and so on. The data I receive, though, is a simple text file with no hierarchical relations.
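For context, this is roughly how a dotted-quad IPv4 address maps to a BIGINT; a sketch, not the actual import code (PARSENAME simply splits the string on dots):
DECLARE @ip VARCHAR(15) = '192.168.1.10';
SELECT CONVERT(BIGINT, PARSENAME(@ip, 4)) * 16777216
+ CONVERT(BIGINT, PARSENAME(@ip, 3)) * 65536
+ CONVERT(BIGINT, PARSENAME(@ip, 2)) * 256
+ CONVERT(BIGINT, PARSENAME(@ip, 1)) AS IpAsBigInt;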
The task, therefore, is to take a table like described above, with more than four million records, and for each of them find their parent, if any.
The simplest idea is to join the table with itself like this:
SELECT rp.Id as ParentId,
r.Id
FROM RipeDb r
INNER JOIN RipeDb rp
ON rp.StartIp <= r.StartIp
AND rp.EndIp >= r.EndIp
AND rp.EndIp - rp.StartIp > r.EndIp - r.StartIp
This gets all ancestors for each record, so we need to use a RANK() OVER() in an inner select in order to select only the parent, but that's beyond the scope of the article.
Since we have conditions on the StartIp and EndIp columns, we need an index on them. But which?
Through trial and error, more than anything else, I realised that the best solution was a clustered index on StartIp,EndIp. That is why the first column (Id) is not marked as PRIMARY KEY in the definition of the table above; it has to look like this:
[Id] [int] PRIMARY KEY NONCLUSTERED IDENTITY(1,1) NOT NULL
Yes, primary keys don't have to be clustered.
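For reference, the clustered index in question would be created like this (the index name is my own choice):
CREATE CLUSTERED INDEX IX_RipeDb_StartIp_EndIp ON [dbo].[RipeDb] (StartIp, EndIp);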
But now we hit a snag: the process is EXTREMELY slow. On my computer the query would have finished in a few days (as opposed to twice as long with a nonclustered index). What the hell is going on?
I tried several things:
At this point I kind of gave up. Days of work trying to figure out why this was going so slowly led to a simple explanation: 4 million records squared means 16 thousand billion comparisons. No matter how ingenious SQL Server is, this will be slow. "But, Siderite, I have tables as large as this and joining them is really fast!" you will say. True, with equality conditions the joins are orders of magnitude faster. Probably there is room for improvement either in the way I used the indexes or in the way they are implemented. If you have any ideas, please let me know.
So did I solve the problem? Yes, of course, by not relying on an SQL join. Think about how the ranges are arranged. If we order the IP ranges on their start and end values, you get something like this:
For each range, the following is either a direct child or a sibling. I created a stored procedure that called itself recursively, which should have worked, but then it reached the maximum level of recursion in SQL (32 - a value that one cannot change!) and so I had to do everything myself. How? With a cursor. Here is the final code:
-- @ParentIds holds the candidate ancestor ranges for the row being processed
DECLARE @ParentIds TABLE (Id INT,StartIp BIGINT, EndIp BIGINT)
DECLARE @ParentId INT
DECLARE @Id INT
DECLARE @StartIp BIGINT
DECLARE @EndIp BIGINT
DECLARE @OldParentId INT
DECLARE @i INT=0
DECLARE @c INT
DECLARE curs CURSOR LOCAL FAST_FORWARD FOR
SELECT r.Id, r.StartIp, r.EndIp, r.ParentId
FROM RipeDb r
WHERE r.EndTime IS NULL
ORDER BY StartIp ASC, EndIp DESC
OPEN curs
FETCH NEXT FROM curs
INTO @Id, @StartIp, @EndIp, @OldParentId
WHILE @@FETCH_STATUS=0
BEGIN
-- ranges that end before the current range starts can no longer be ancestors
DELETE FROM @ParentIds WHERE EndIp<@StartIp
-- pick the parent from the candidate ancestors that remain (NULL if there is none)
SET @ParentId=NULL
SELECT TOP 1 @ParentId=Id FROM @ParentIds
ORDER BY Id DESC
SELECT @c=COUNT(1) FROM @ParentIds
IF (@i % 1000=0)
BEGIN
PRINT CONVERT(NVARCHAR(100),SysUtcDatetime())+' Updated parent id for ' + CONVERT(NVARCHAR(100),@i) +' rows. ' + CONVERT(NVARCHAR(100),@c) +' parents in temp table.'
RAISERROR ('', 0, 1) WITH NOWAIT
END
SET @i=@i+1
IF (ISNULL(@OldParentId,-1) != ISNULL(@ParentId,-1))
BEGIN
UPDATE RipeDb SET ParentId=@ParentId WHERE Id=@Id
END
-- the current range becomes a candidate ancestor for the ranges that follow
INSERT INTO @ParentIds VALUES(@Id,@StartIp,@EndIp)
FETCH NEXT FROM curs
INTO @Id, @StartIp, @EndIp, @OldParentId
END
CLOSE curs
DEALLOCATE curs
I will follow with the explanation of the algorithm, for people hitting the exact issue that I had, but let me first write the conclusion of this blog post: even if SQL is awesome at sorting and indexing, that doesn't mean it is the only solution. In my case, the SQL indexes proved to be a golden hammer that wasted days of my work.
So, the logic here is really simple, which makes this entire endeavour educational, but really frustrating to me: order the ranges by start IP ascending and end IP descending, walk them once with the cursor, keep every range already visited as a candidate ancestor, throw away the candidates that end before the current range starts, and pick the parent from whatever candidates remain.
It's that deceptively simple and the query now ends in 15 minutes instead of days.
Another thing that might be interesting: after the original import is done, only a few new records should be added to the table at a time. In that case, the first join-and-update approach might actually work faster! The next thing I will do is count how many items I need to update and pick one method or the other based on that.
Hope that helps someone.
A curious thing about ADO.NET: the name of a SqlParameter is not fully escaped when the sp_executesql call is sent to the server. Consider this bit of code, where the parameter name itself contains a quote:
using (var conn = new SqlConnection("Server=localhost;Database=Test;UID=sa;Trusted_Connection=True;"))
{
conn.Open();
using (var comm = new SqlCommand())
{
var paramName = "a'";
comm.Connection = conn;
comm.CommandText = "SELECT 1";
comm.Parameters.Add(new SqlParameter(paramName, SqlDbType.NVarChar, 100)
{
Value="text"
});
comm.ExecuteNonQuery();
}
conn.Close();
}
SQL Profiler records this:
exec sp_executesql N'SELECT 1',N'@a'' nvarchar(100)',@a'=N'text'
In this example the name of the parameter is properly handled in the string defining the name and type of the parameters, but it is NOT escaped during the parameter value declaration. A small change of the code, with paramName="a='';DELETE FROM SomeTable --", results in an interesting query string in the SQL Profiler:
exec sp_executesql N'SELECT 1',N'@a='''';DELETE FROM SomeTable -- nvarchar(100)',@a='';DELETE FROM SomeTable --=N'text'
Strangely enough, when inspecting the SomeTable table, the values are still there, even though copying the text into SQL Management Studio and running it actually deletes them. A similar construction using stored procedures leads to a completely legal SQL that is recorded by SQL Profiler, but it doesn't really do anything:
using (var conn = new SqlConnection("Server=localhost;Database=Test;UID=sa;Trusted_Connection=True;"))
{
conn.Open();
using (var comm = new SqlCommand())
{
var paramName = "a='';DELETE FROM SomeTable --";
comm.Connection = conn;
comm.CommandText = "DoTest";
comm.CommandType = CommandType.StoredProcedure;
comm.Parameters.Add(new SqlParameter(paramName, SqlDbType.NVarChar, 100)
{
Value="text"
});
int k = 0;
using (var reader = comm.ExecuteReader())
{
while (reader.Read()) k++;
}
Console.WriteLine(k);
}
conn.Close();
}
The resulting SQL:
exec DoTest @a='';DELETE FROM SomeTable --=N'text'
The script below frees the various caches and then shrinks the tempdb data file:
USE [master];
GO
CHECKPOINT;
DBCC DROPCLEANBUFFERS;
DBCC FREEPROCCACHE;
DBCC FREESYSTEMCACHE ('ALL');
DBCC FREESESSIONCACHE;
GO
USE [tempdb]
GO
DBCC SHRINKFILE (tempdev, 3000); --- New file size in MB
As you can see, you need to know not only the name of the database, but also the logical name of the database file that you want to shrink. It is not even a string, it is like a keyword in the DBCC SHRINKFILE command. Even if it does work, one would benefit from encapsulating it into a stored procedure. Here is the final code:
CREATE PROC ShrinkDatabase(@DbName NVARCHAR(100),@SizeMB INT)
AS
BEGIN
DECLARE @filename NVARCHAR(255)
-- find the logical name of the data file (type 0) of the target database
DECLARE @sql NVARCHAR(Max) = 'SELECT @filename = dbf.name FROM ['+REPLACE(@DbName,'''','''''')+'].sys.database_files dbf WHERE dbf.[type]=0'
EXEC sp_executesql @sql,N'@filename NVARCHAR(255) OUTPUT',@filename OUTPUT
SET @sql='USE [master];
CHECKPOINT;
DBCC DROPCLEANBUFFERS;
DBCC FREEPROCCACHE;
DBCC FREESYSTEMCACHE (''ALL'');
DBCC FREESESSIONCACHE;'
EXEC sp_executesql @sql
SET @sql='USE ['+REPLACE(@DbName,'''','''''')+'];
DBCC SHRINKFILE ('+REPLACE(@filename,'''','''''')+', '+CONVERT(NVARCHAR(100),@SizeMb)+');'
EXEC sp_executesql @sql
END
EXEC master.dbo.ShrinkDatabase 'tempdb',3000
Take note that you cannot use this to "shrink up" the database. If the value you set is larger than the current size, the file will remain the same size, as will the setting for its initial size. Also take note of the fact that this stored procedure only shrinks the data file, not the log file (dbf.[type]=0).