Here is the scenario:

  • you have two tables: a source of recent data and an SQL Server destination table with the same structure that has to be updated with that data
  • they are both on the same server (we assume that if the data source is external you have copied it into a local temporary table)
  • the destination table is in active use
  • the source table may be in active use as well
  • the tables are large, locking them and waiting for the merge to finish is not an option
  • the difference between the tables is usually small. Not so small as to be irrelevant, but small relative to the size of the tables
  • we don't care about full data consistency (any intermediate state between the current and the updated one is acceptable)

There are some quirks in SQL Server that affect our task. The main one is lock escalation: if you try to modify 5000 rows or more of a table in a single statement, the row locks are escalated to a table lock, making the table unusable during the merge. Also, for operations involving a lot of data, the database log will grow to accommodate that size, and the space can only be reused after the changes are committed. This takes a lot of disk space, and if anything goes wrong during the operation, the whole thing is rolled back, with all of the data in the log having to be undone, which also takes a lot of time and blocks the database.

The first idea is to find rows that are different, new or missing and just sync 5000 of them at a time. One can do this with a MERGE or with classic INSERT, UPDATE, DELETE operations. And this works, but it has a major flaw: after the first 5000 rows have been found and synchronized, the next operation will scan through them again anyway to find the next 5000 rows, and again, and again, and again. Every batch has to rescan all the rows that were already synchronized, so each one takes longer than the last and the total execution time balloons with every new batch of rows.

What we actually need is to take a list of changes and apply them in batches, kind of like an artificial transaction log. The following stored procedure takes two parameters: the full schema name of the source table and the full schema name of the destination table. It creates a log of changes, takes 4900 of them at a time and applies them to the destination table. The only restrictions are that the destination table has to have key columns (as written, the procedure uses the IDENTITY column as the key) and that the source and destination tables have the same columns. For performance reasons, it's best that the source table also has the same key or at least an index on the same columns. The usage would look like EXEC usp_MergeTables 'SomeSchemaIncludingDbo.SourceTable','MaybeSomeOtherSchema.DestinationTable'
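
If you prefer to kick it off from .NET rather than from a query window, a minimal sketch using Microsoft.Data.SqlClient could look like this (the connection string is a placeholder and the procedure is the one defined below):

using Microsoft.Data.SqlClient;

// Hypothetical caller for the procedure below; the connection string and table names are placeholders.
using var conn = new SqlConnection("Server=.;Database=MyDb;Integrated Security=true;TrustServerCertificate=true");
conn.Open();
using var cmd = new SqlCommand("usp_MergeTables", conn)
{
    CommandType = System.Data.CommandType.StoredProcedure,
    CommandTimeout = 0 // a large merge can take a while
};
cmd.Parameters.AddWithValue("@SourceTable", "SomeSchemaIncludingDbo.SourceTable");
cmd.Parameters.AddWithValue("@DestinationTable", "MaybeSomeOtherSchema.DestinationTable");
cmd.ExecuteNonQuery();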

I will explain what it does after the code:

CREATE OR ALTER PROC usp_MergeTables(@SourceTable NVARCHAR(256),@DestinationTable NVARCHAR(256))
AS
BEGIN

SELECT CAST(s.name as NVARCHAR(Max)) as schemaName,CAST(t.name as NVARCHAR(Max)) as tableName,CAST(c.name as NVARCHAR(Max)) as columnName,c.is_identity,c.is_computed, 
CASE 
  WHEN tp.name IN ( 'varchar', 'char', 'varbinary' ) THEN tp.name + 
                  CASE WHEN c.max_length = -1 THEN '(max)'       
                         ELSE '(' + CAST(c.max_length AS VARCHAR(4)) + ')' END       
  --types that store Unicode characters, where max_length is in bytes and must be halved 
  WHEN tp.name IN ( 'nvarchar', 'nchar' ) THEN tp.name + 
                   CASE WHEN c.max_length = -1 THEN '(max)'       
                         ELSE '(' + CAST(c.max_length / 2 AS VARCHAR(4)) + ')'       
                                                              END
   --types with a datetime precision 
  WHEN tp.name IN ( 'time', 'datetime2', 'datetimeoffset' ) THEN tp.name + 
                                            '(' + CAST(c.scale AS VARCHAR(4)) + ')' 
  --types with a precision/scale 
  WHEN tp.name IN ( 'numeric', 'decimal' ) 
  THEN tp.name + '(' + CAST(c.precision AS VARCHAR(4)) + ',' +
                                             CAST(c.scale AS VARCHAR(4)) + ')'
  --timestamp should be reported as rowversion 
  WHEN tp.name = 'timestamp' THEN 'rowversion' 
  --and the rest. Note, float is declared with a bit length, but is 
  --represented as either float or real in types  
  ELSE tp.name 
END as typeName
INTO #tgtColumns
FROM sys.schemas s
INNER JOIN sys.tables t
ON s.schema_id=t.schema_id
INNER JOIN sys.all_columns c
ON t.object_id=c.object_id
INNER JOIN sys.types tp
ON c.system_type_id=tp.system_type_id
AND c.user_type_id=tp.user_type_id
WHERE REPLACE(REPLACE(@DestinationTable,']',''),'[','')=s.name+'.'+t.name

DECLARE @operSql NVARCHAR(Max)
DECLARE @delSql NVARCHAR(Max)
DECLARE @updSql NVARCHAR(Max)
DECLARE @insSql NVARCHAR(Max)

SELECT @operSql = CONCAT(N'DROP TABLE IF EXISTS #oper

SELECT *
   INTO #oper
FROM (
SELECT ',STRING_AGG(CONCAT(N'ISNULL(src.[',c.columnName,N'],tgt.[',c.columnName,N']) as [',c.columnName,N'],'),N', '),
N'  CASE
    WHEN ',STRING_AGG(CONCAT(N'src.[',c.columnName,N'] IS NULL'),N' AND '),N' THEN ''DEL''
    WHEN ',STRING_AGG(CONCAT(N'tgt.[',c.columnName,N'] IS NULL'),N' AND '),N' THEN ''INS''
    WHEN (
    	SELECT * FROM ',@SourceTable,N' R
    	WHERE ',STRING_AGG(CONCAT('R.[',c.columnName,N'] = src.[',c.columnName,N']'),N' AND '),N'
    	FOR XML PATH(''Row''), ELEMENTS XSINIL
      ) <> (
    	SELECT * FROM ',@DestinationTable,N' R
    	WHERE ',STRING_AGG(CONCAT('R.[',c.columnName,N'] = tgt.[',c.columnName,N']'),N' AND '),N'
    	FOR XML PATH(''Row''), ELEMENTS XSINIL
      ) THEN ''UPD''
   END as __oper__
   FROM ',@SourceTable,N' src (NOLOCK)
   FULL OUTER JOIN ',@DestinationTable,N' tgt (NOLOCK)
   ON ',STRING_AGG(CONCAT('src.[',c.columnName,N'] = tgt.[',c.columnName,N']'),N' AND '),'
) x
WHERE __oper__ IS NOT NULL

CREATE INDEX temp_id ON #oper(',STRING_AGG(CONCAT('[',c.columnName,N']'),N', '),N')
CREATE INDEX temp_oper ON #oper(__oper__)')
FROM #tgtColumns c
WHERE c.is_identity=1

SELECT @delSql = CONCAT(N'

DECLARE @batch TABLE(',STRING_AGG(CONCAT('[',c.columnName,N'] ',c.typeName),N', '),N',
                   PRIMARY KEY(',STRING_AGG(CONCAT('[',c.columnName,N']'),N', '),N'))

DECLARE @ROWS INT = 1

WHILE (@ROWS>0)
BEGIN

  DELETE TOP (4900) tgt
    OUTPUT ',STRING_AGG(CONCAT('deleted.[',c.columnName,N']'),N', '),N'
	INTO @batch
  FROM ',@DestinationTable,N' tgt (READPAST)
  INNER JOIN #oper src
    ON ',STRING_AGG(CONCAT('src.[',c.columnName,N'] = tgt.[',c.columnName,N']'),N' AND '),N'
  WHERE src.__oper__=''DEL''

  SET @ROWS=@@ROWCOUNT

  DELETE o
  FROM #oper o
  INNER JOIN  @batch b
  ON ',STRING_AGG(CONCAT('o.[',c.columnName,N'] = b.[',c.columnName,N']'),N' AND '),N'
  
  DELETE FROM @batch

  IF (@ROWS=0)
    SELECT @ROWS=COUNT(*) FROM #oper WHERE __oper__=''DEL''

END')
FROM #tgtColumns c
WHERE c.is_identity=1


SELECT @updSql = CONCAT(N'

SET @ROWS = 1

WHILE (@ROWS>0)
BEGIN

  UPDATE tgt
  SET ',(SELECT STRING_AGG(CONCAT('[',c.columnName,N'] = src.[',c.columnName,N']'),N', ')
FROM #tgtColumns c
WHERE c.is_identity=0 AND c.is_computed=0),N' OUTPUT ',STRING_AGG(CONCAT('inserted.[',c.columnName,N']'),N', '),N'
	INTO @batch
  FROM ',@DestinationTable,N' tgt (READPAST)
  INNER JOIN ( 
    SELECT TOP (4900) s.*
    FROM #oper o
	INNER JOIN ',@SourceTable,N' s
	ON ',STRING_AGG(CONCAT('s.[',c.columnName,N'] = o.[',c.columnName,N']'),N' AND '),N'
    WHERE __oper__=''UPD''
  ) src
    ON ',STRING_AGG(CONCAT('src.[',c.columnName,N'] = tgt.[',c.columnName,N']'),N' AND '),N'

  SET @ROWS=@@ROWCOUNT

  DELETE o
  FROM #oper o
  INNER JOIN  @batch b
  ON ',STRING_AGG(CONCAT('o.[',c.columnName,N'] = b.[',c.columnName,N']'),N' AND '),N'
  
  DELETE FROM @batch

  IF (@ROWS=0)
    SELECT @ROWS=COUNT(*) FROM #oper WHERE __oper__=''UPD''

END')
FROM #tgtColumns c
WHERE c.is_identity=1


SELECT @insSql = CONCAT(N'

SET IDENTITY_INSERT ',@DestinationTable,N' ON

SET @ROWS = 1

WHILE (@ROWS>0)
BEGIN

  INSERT INTO ',@DestinationTable,N'(',(SELECT STRING_AGG(CONCAT('[',c.columnName,N']'),N', ')
FROM #tgtColumns c
WHERE c.is_computed=0),N') OUTPUT ',STRING_AGG(CONCAT('inserted.[',c.columnName,N']'),N', '),N'
    INTO @batch
  SELECT TOP (4900) ',(SELECT STRING_AGG(CONCAT('s.[',c.columnName,N']'),N', ')
FROM #tgtColumns c
WHERE c.is_computed=0),N'
  FROM #oper o
  INNER JOIN ',@SourceTable,N' s
  ON ',STRING_AGG(CONCAT('s.[',c.columnName,N'] = o.[',c.columnName,N']'),N' AND '),N'
  WHERE __oper__=''INS''

    SET @ROWS=@@ROWCOUNT
	
  DELETE o
  FROM #oper o
  INNER JOIN  @batch b
  ON ',STRING_AGG(CONCAT('o.[',c.columnName,N'] = b.[',c.columnName,N']'),N' AND '),N'
  
  DELETE FROM @batch

  IF (@ROWS=0)
    SELECT @ROWS=COUNT(*) FROM #oper WHERE __oper__=''INS''

END

SET IDENTITY_INSERT ',@DestinationTable,N' OFF

DROP TABLE #oper
')
FROM #tgtColumns c
WHERE c.is_identity=1

DROP TABLE #tgtColumns

--PRINT @operSql
--PRINT @delSql
--PRINT @updSql
--PRINT @insSql

DECLARE @sql NVARCHAR(Max) = CONCAT(@operSql, @delSql, @updSql, @insSql)
EXEC sp_sqlexec @sql

END

OK, this is a large thing, but the principles used are simple:

  • first we create a temporary table containing information about the columns of the destination table: schema, table, column name, whether the column is an identity (which this procedure uses as its stand-in for the primary key) or computed, and the type name. The schema and table name are not used, but may be useful for debugging. Note that this SP doesn't check that the tables have the same number and type of columns, or that the destination has a key column at all. That's on you to ensure.
  • using the column information we create four strings that will contain the SQL for the following operations:
    • create an "operations table"
    • delete rows that are not needed
    • update rows that need to be updated
    • insert rows that are missing
  • there are four strings mostly for debugging purposes, to keep each of them under the size that SQL Server Management Studio can PRINT at a time, but they are concatenated and executed as one.
  • implementation details:
    • we use FOR XML PATH('Row'), ELEMENTS XSINIL to generate a string with all the data in each row, so we don't have to compare rows column by column. We could have made this work with per-column comparisons, but the generated code would have been bulky and ugly once NULL handling was added. ELEMENTS XSINIL ensures that there is a difference between an empty value and NULL.
    • a FULL OUTER JOIN is used to find (based on the primary key columns of the destination table) if rows need to be either deleted, updated or inserted. That operation is specified in the __oper__ column
    • the operations table is thus created, containing the primary key values and the required operation, with two indexes: one on the primary keys and one on the operation. These indexes are not really that relevant, so one could choose to remove them.
    • a @batch table variable is used with a PRIMARY KEY on the primary key columns of the destination table, for performance reasons
    • the batching is done via a DELETE... OUTPUT operation. We delete the 4900 rows we process and we output them in the @batch table
    • for each batch we either delete destination rows with the same primary keys (for 'DEL'), update them from the source (for 'UPD') or insert the missing source rows (for 'INS')
    • if some rows were affected by the operation, we continue with the next batch of the same kind. If not, we look in the operations table to see whether there are still rows to be processed, because due to the lock avoidance hints an iteration can affect zero rows even while rows of that kind remain
  • improvements to avoid locking:
    • when we generate the operations table we use NOLOCK, which reads uncommitted values ignoring locks. This may not be what you want, but if the source table is locked for whatever reason, this ensures the merge operation is not blocked
    • when we process the batches we use READPAST, which skips locked destination rows. This ensures that the rows that can be updated will be, while the others are picked up later, meaning that if you have one locked row, the merge operation will go around it and keep retrying until it is unlocked and can be updated or deleted.

If you want to see what the generated SQL looks like, uncomment the PRINT lines and comment the EXEC one.

Now, I just wrote this stored procedure, so I may have missed some cases. Let me know if you find a situation where this doesn't work as expected. 

Hope it helps!

What is the structure of a table created from another via SELECT * INTO [Second] FROM [First] ?

A simple question, indeed, a basic one, but one that I have never asked myself until today. I honestly believed that the result is a generic table containing the same column names and types and their values and nothing else.

The answer, though, is strange. If the original column is NOT NULL, the resulting column will also be NOT NULL. If the original column has IDENTITY, the column in the second table will also have IDENTITY. And, as you know, you can't add or remove IDENTITY from existing columns without dropping and adding them back. The DEFAULT value, though, is not transferred.

The Microsoft reference page says: 

The format of new_table is determined by evaluating the expressions in the select list. The columns in new_table are created in the order specified by the select list. Each column in new_table has the same name, data type, nullability, and value as the corresponding expression in the select list.

...

When an existing identity column is selected into a new table, the new column inherits the IDENTITY property, unless one of the following conditions is true:

  • The SELECT statement contains a join.
  • Multiple SELECT statements are joined by using UNION.
  • The identity column is listed more than one time in the select list.
  • The identity column is part of an expression.
  • The identity column is from a remote data source.

...

Indexes, constraints, and triggers defined in the source table are not transferred to the new table, nor can they be specified in the SELECT...INTO statement. If these objects are required, you can create them after executing the SELECT...INTO statement.

...

Specifying an ORDER BY clause does not guarantee the rows are inserted in the specified order.

...

When a computed column is included in the select list, the corresponding column in the new table is not a computed column. The values in the new column are the values that were computed at the time SELECT...INTO was executed.

So, my assumptions were wrong, but still, what do you do when you want to achieve exactly that: create a new table with the same columns in terms of name, data type and value only, without any of the constraints? It's a bit ugly, but I have not found a better alternative: just UNION ALL with a SELECT that has the same number of (nullable) columns and no rows, like this:

SELECT *
  INTO [Second]
FROM [First]
  UNION ALL
(SELECT NULL, NULL, NULL, NULL WHERE 1=0)

Assuming that the First table had four columns, this will result in a table having the same column names, types and values as the original table, but all columns will be nullable and with no identity.

Hope it helps!


  I picked up the book and I went "Oh, no! The writer of this book is called qntm. He must be one of those YA Twitter writers!" No, he isn't! Actually, after reading this book, I am determined to read more from Sam Hughes. There Is No Antimemetics Division is a wonderful book, blending non-Lovecraftian cosmic horror with the spirit of something like The Andromeda Strain in a sort of Fringe-like story, but much better.

  A collection of chapters that contain almost separate short stories, the book slowly constructs this amazing world in which "antimemetic" entities exist. Don't worry about it, a few minutes after you've read my review you will have forgotten about them. The idea is cool and fresh enough, but the amount of attention to detail that the author poured into it raises it to a different level. And then more ideas keep piling up.

  The book starts from quite frightening situations that boggle the mind and challenge the way we see reality, then continues to consistently up the ante to ridiculous scales. Yes, some things are slightly contradictory and hard to accept, but imagine this happening in a world that your brain screams couldn't possibly exist, while the clinical, scientific and technical viewpoint of the main characters convinces you it just as well might.

  I've already put another book from this author on my to read list. I loved this book and I wholeheartedly recommend it.


  I like Sam Neill. He is a good actor and he played in some movies I really like. Funny enough, he doesn't mention those! Instead he focuses on the ones that meant more to him and that I mostly haven't watched.

  Did I Ever Tell You This? is a large collection of small anecdotes from the author's life that he decided he needed to write down when he was diagnosed with cancer. They are funny, heartwarming, but strangely impersonal, like stories one tells at a wine table, meant to entertain, not share, offend or expose. For that reason alone it was hard for me to finish the book.

  Imagine being at a party with friends, having fun, Sam bloody Neill being there telling everyone how he most certainly did NOT fuck every woman in London. It would be great, right? Only he keeps talking and talking and talking. Very little about the dramas in his life, the marriages, the children, he just goes on and on about funny things that happened to him, when he was working with people that he thinks are all great, women, and people of color and Aboriginals and all wonderful actors and human beings. It gets old fast! That's this book.

  Now, I like the guy and he comes off well in the book. The problem is that I don't feel like I know him any better now than before. He's an average Kiwi, happy to have been chosen to join a great cast of film people and trying to make good with what he got. Humble like. Kind. Funny. Doesn't feel like a real person at all!

  Anyway, the book was fine. It was just overly long and not hitting hard enough.


  Nick Cutter went to the store, bought the largest bag of horror tropes and then poured them all into The Deep. Imagine a cross between Event Horizon, It and The Thing, with every other horror cliché you could think of sprinkled in, and you get this book. Unfortunately, it doesn't work. Do you feel the horror? YEAH! But does it actually have any impact? No. It gets so horrid so fast that your mind just goes numb and asks itself why it is reading the story at all.

  The Deep has it all: horrible parents, child abuse, loving couple torn apart by child abduction, child fears, parental murder, psychopathy, body horror, supernatural horror, cosmic horror, claustrophobic horror, animal cruelty, interdimensional evil, gore, hopelessness, losing your mind, nightmares, global pandemic and, king of them all, "let's separate!" horror. Well, I am being a bit mean, because by that point nothing really mattered, but you get my drift.

  I guess there are two types of horror as far as I am concerned: intriguing and numbing. The first one is always built on hope, hope that some character has a chance, that if only they made the best choices and had a bit of luck, they could pull through. Maybe add some irony, something ridiculous that gives that person an edge when it matters most. The second one is just pointless witnessing of the suffering of another when they have no chance in hell of pulling through. The Deep veers really fast and really soon towards the second category. The horror is strong, but without a reason to exist. And boy does the guy fail to make the right choices!

  Yet, if you watched Event Horizon and thought it was great, like I did, maybe you will love this book, too. Personally I think this felt more experimental than, err... deep.


  Yes, I confess, I only expedited the reading of Mickey 7 because there is a Mickey 17 movie adaptation with a pretty big budget and cool cast. I already see your eye roll for yet another review about an unreleased movie and not the actual book, but I promise I am writing about what I've read, so stick around :)

  This is a book akin to The Martian or maybe more to the Bobiverse books, with which it also shares some plot elements: first person narration, light action, a reasonable and emotionally stable protagonist, one capable of being replicated after he dies or, as is the case in this story, when people thought he died. I had fun with it, read it really fast and it served as a great palate cleanser after a really annoying book I slogged through before.

  It's not a masterpiece of literature, but it's good and fun. Edward Ashton is a decent writer and if I had an issue with his craft, it is with people being too consistent in their behavior. They are all neatly placed into their nice little boxes and they never get out of them, even in the face of traumatic deaths (of others or their own). The book also kind of drags, focusing too much on trivialities and less on the interesting aspects of the characters. However, this is the setup book, a first in a series as is tradition, so maybe the next volume, Antimatter Blues, will be better. I intend to read it, too. Maybe not immediately, though. Let's see how I feel about the movie.

  Talking about the movie, I think it's clear they are going to change the plot significantly, hence the different title. And I get it. The story is about colonizing an alien planet after years of relativistic travel, with a lot of internal monologues, flashbacks and shortened stories of other colonies from books that the protagonist reads, but also gruesome deaths and insectile alien life. Hard to cram all that into a movie and keep within the budget.


  A People's Future of the United States: Speculative Fiction from 25 Extraordinary Writers is one of those books. 25 different short stories about how liberals, gays, women, non-neuro-normative people and people of color are abused or openly hunted by possible future American systems. And the regimes described are quite awful indeed, yet the overall feeling one gets from reading the book is just eye-rolling disbelief. I only read it because I like how Victor LaValle writes; the problem is he just edited this collection and wrote none of the stories inside, and it was a grind to get through.

  I blame the first story. It was so beautiful and reasonable, where a librarian shelters people from both sides of a divided America and they come together in discussing a single book they had all read. It set a tone that I couldn't wait to get more of. And then the second story was the most strident, hysterical, yet at the same time bland piece of literature I've ever read! And it was followed by a lot of similar stories, where everybody who was gay, of color, autistic, female and sometimes all of the above was openly and legally terrorized by a system run by the evil conservative Whites. The tonal shift was so abrupt I felt literary whiplash!

  Maybe when your subject is systematic abuse of minorities and you're also part of one or more of those minorities, it's pretty hard for your story to get openly rejected. That's the only reasonable explanation I could find for the wide variation in quality of these stories. There were some that were really good. Unfortunately, only a few of them, and most of the others could only kindly be called mediocre.

  I just couldn't believe it! The same people who were afraid of an ever more improbable future (The Handmaid's Tale feeling heavenly in comparison) in which non-normalcy is illegal, were describing with glee how the wheel was turning. For example, there was one where a genetic time travelling bomb backfired and all of America was now diverse. That kind of laughable diversity, with no national identity, just a bunch of people all being different from each other, looking different, yet having a united community and all of the American problems magically gone.

  I couldn't shake a quote from my head: slaves don't dream of freedom, but of becoming masters. The same people that were afraid of intolerance wrote short stories about us vs them, where "them" were caricaturesque inhuman villains and deserved everything coming to them.

  For sure there is some emotional catharsis to imagine a terrible future in which everything you hate is openly evil, thus giving you permission for all the fear and hate and extreme fantasies. Imagine Death Wish, but now the hero is a gay woman in a wheelchair killing fraternity bros. How is that not a movie already? Barf!


I have been maintaining a Chrome browser extension I wrote for more than a year now and I always get the occasional user asking me if I can make it work with Firefox. And until now I said no, because I use the "world" feature in manifest.json. But I was wrong.

You see, Google is aggressively pushing for manifest version 3, discontinuing support for browser extensions using v2. So my extension was written from the beginning with version 3. In order to load the scripts and CSS files that I needed to run in the context of the page, I used the world:"MAIN" feature. When I tried it with v2, Chrome immediately told me "The 'world' property is restricted to extensions with 'manifest_version' set to 3 or higher." (Or higher. lol) So when I looked for how to use world in Firefox manifest v3 I got a lot of documentation about Xray Vision and how they absolutely refuse to implement the world feature.

I googled, I scoured the web, I read all of Stack Overflow, I even tried LLMs like ChatGPT or Gemini, with hilarious results, like Gemini accusing me of wanting to circumvent the security of the browser. No. There is no simple way of doing the same thing in Manifest v3 on Firefox. Case closed, right? I mean, who even uses Firefox? Less than 2.5% of people on the Internet. And it's a shit browser.

But no, here comes a guy and tells me that he made it work. HOW?! By now you probably guessed it, I left enough hints after all. You just have to use manifest version 2! 'world' is only restricted for version 3 in Chrome browsers!!!

Although now that I am writing this post I see stuff like "Manifest V3 updates landed in Firefox 128", so maybe this is something that only started working recently (the article is from July 2024). Hmm..

I guess there is a silver lining to the fact I refused users for a year now, because I had a year in which I didn't have to deal with the buggy debugger (heh!) in Firefox.

Anyway, if you have the same problem, that's your solution: make a v2 manifest for Firefox and a v3 manifest for Chromium browsers. Or use it in v3 as well, because apparently Firefox changed their minds. And here I was, writing this post with such glee that I would help people who had the same problem as me. Ugh!

NO! I will not make it work for Safari! Eat a dick!

  Just a few days ago I was writing on how important it is to tell Entity Framework what SQL type to use in order to avoid costly conversions. In fact, it wasn't so much an EF issue as it was an SQL one. Converting even character types to character types or changing collation was surprisingly expensive.  In this post I will show you how important it is to choose the right type for your querying columns, especially the primary key. 

  First imagine this scenario: you get some data from an outside source, rows and rows of it, and you have to store them and query them in SQL. The value that uniquely identifies a row is a small string, maybe 50 characters long. How do you proceed?

  My first naive solution was the most obvious one: just create a table that has a column for each value in the rows and put the primary key on the identifying one. But this leads to immediate performance losses:

  • by default, a primary key is a clustered index - text is not sequential, so inserts land all over the index and the database engine has to split pages and physically move data in order to keep the rows in the order of their identifiers
  • a primary key is a unique index - meaning text will have to get compared to other text in order to determine uniqueness, which is slow
  • by default, SQL Server collations are case insensitive - meaning that every text comparison has to account for capitalization and accent rules, which is slower than a plain binary comparison
  • 50 characters is a lot - even without Unicode support, it's 50 bytes, which is 12 times more than an integer, meaning the primary key index will be large and slow

  "But!", you will undoubtedly say, if you put the primary key on some other column, you will still have to create a unique index on the identifier. Isn't this just pushing the problem farther down the road? The size and speed limitations will be the same. And primary keys are clustered only by default, but they can be declared as not clustered. And SQL doesn't need to be case insensitive, all you have to do is change the collation of the column to be binary and it will be compared faster. Wouldn't that solve the problem?

  No. In fact, my final solution, which worked five times faster, did not have an index on the identifier column AT ALL. Incidentally, I did end up changing the collation, but only because the idiots sending me the data were treating it as case sensitive.

  Without further ado, here is what I did:

  • an INT column with IDENTITY(1,1) as the primary key - which ensures a fast insertion due to the sequential nature of the value, fast query speed and low usage of disk space for the index
  • an INT column holding the checksum of the identifier - which when indexed, is fast to query and doesn't use a lot of disk space for the index

   So how do I query on the identifier? Simple: I calculate the checksum of the string and then I look it up in the database, which uses the index to locate the few rows that have the same checksum, then finds the right one by going through them. I query on the checksum column AND the text identifier. And there is an added bonus: I only need to do this once. If I need the record from the DB again, I query it directly through the integer primary key.
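
  To make this concrete, here is a minimal sketch of the idea with hypothetical names (an Item entity, a stable FNV-1a hash computed in code); it is not my actual implementation, just its shape, and it assumes EF Core:

using System.Linq;
using Microsoft.EntityFrameworkCore;

// Hypothetical entity: integer IDENTITY primary key, an indexed INT checksum
// and the external string identifier, which itself has no index at all.
[Index(nameof(IdentifierChecksum))]
public class Item
{
    public int Id { get; set; }                 // INT IDENTITY(1,1) PRIMARY KEY
    public int IdentifierChecksum { get; set; } // plain (non-unique) index
    public string Identifier { get; set; }      // the ~50 character external identifier
    // ... the rest of the columns
}

public static class ItemLookup
{
    // A stable 32 bit FNV-1a hash computed in code, so it never changes between
    // runtimes or process runs (unlike string.GetHashCode).
    public static int Checksum(string text)
    {
        unchecked
        {
            uint hash = 2166136261;
            foreach (char c in text)
            {
                hash ^= c;
                hash *= 16777619;
            }
            return (int)hash;
        }
    }

    public static Item FindByIdentifier(DbContext context, string identifier)
    {
        var cs = Checksum(identifier);
        // The index narrows the search to the few rows sharing the checksum,
        // then the string comparison picks the right one.
        return context.Set<Item>()
            .FirstOrDefault(i => i.IdentifierChecksum == cs && i.Identifier == identifier);
    }
}

  The occasional hash collision just means comparing a couple of extra strings instead of one, while the index entry itself stays one INT per row.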

  Entity Framework also has its automatic memory cache, the change tracker, so when I query for the same entity again through a business model - as good separation of concerns practice would dictate - it gets it really fast from memory. And that cache identifies an entity by just the integer key, which means double the benefits!

  The eagle-eyed reader will have noticed that I am not using a unique index on the identifier, so technically I could create multiple rows with the same one. However, my application always looks for the existing record first. But if you really worry about data consistency, the index on the checksum column can be replaced with a unique index on the checksum and identifier columns. It will take more space, but it will be just as fast.

  Another thing that you may have noticed is that I compute the checksum in code, rather than use the database provided functions that achieve the same thing. At first glance, the database route is an instant win: just create a persisted computed column that calculates the CHECKSUM or BINARY_CHECKSUM of the identifier column. However, this would be awkward to query, since you would have to craft a stored procedure or a custom SQL command to take the identifier and filter on its checksum. In my case I just calculate a checksum myself - and not with the lazy string.GetHashCode function, which may be subject to change and is already different between 32 and 64 bit systems.

  Of course, if you want your text columns to be case and/or accent insensitive, you will have to store the hash code of the lowercase and unaccented string or use an implementation that is case and accent insensitive. This may not be trivial.
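
  If you do need that, a rough sketch of how one might normalize the string before hashing (my assumption, not the code I actually used):

using System.Globalization;
using System.Text;

public static class TextNormalizer
{
    // Lowercases the string and strips combining accent marks before hashing,
    // so that, for example, "Crème" and "creme" produce the same checksum.
    // Rough sketch only; fully culture-correct folding is more involved.
    public static string Fold(string text)
    {
        var decomposed = text.ToLowerInvariant().Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);
        foreach (var c in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        }
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}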

  Further tests showed that just using a non-clustered index on the identifier column, even a unique one, was only slightly slower, maybe by 5%. However, the space taken by indexes increased by 20%. So I can understand why you might find the checksum part a bit off-putting and skip it.

  Hope this helps!

  P.S. Why did this solution provide such a huge performance gain? Obviously the SQL team would have implemented some sort of checksum for their text indexes; it should work natively and be faster than any implementation I could make. Well, I don't know the answer. In fact, this all could be some quirk of Entity Framework, and pure SQL queries might not be optimizable to such a degree. I will attempt to test that using SQL commands only. But meanwhile, all the points I made above are valid, and with a little more work you get a lot more control over how the system works.

  What do you remember from the Terminator movies? It's the Skynet killer robot, obviously, the people who seem to always be related somehow, and a hero that needs saving for the sake of their work in the future, but running for their lives in the present. In Terminator Zero you get all of these, to the point that they feel a little overdone. But the animation is good and the story is interesting, adding some logical elements that I've only seen in Terminator: The Sarah Connor Chronicles, which I liked a lot and wanted more of. I loved that they set the action in a clearly different timeline than our own and also tried to make it clear the ridiculous cycle of trying to fix the past from the future.

  Unfortunately, they've decided to add children to the mix. And I mean children that need a nanny, not 24 year old Claire Danes. Most of the time it's the children and their very Japanese emotions filling the screen, while their father, a mysterious tech mogul, keeps saying cryptic things almost until the end of the series for no good reason. The Terminator is thankfully not in the shape of Arnie and the human fighter from the future is a woman. It is also set in Japan. The series ends with a promise rather than with closure, although I doubt they will make a second season.

  It's eight episodes of 20 minutes each, but I think the story was a little too simple for 160 minutes and it could have easily been a more concise two hour animation film. What's the difference, really, between a series you release all at once and a feature film anyway?

  While I applaud stories told in animation - readers of this blog may already know that I believe that's how you do and say brave things today, especially in sci-fi and horror - being a Terminator story meant it was locked into a preestablished framework and couldn't be too creative. Just consider taking some pages out of Screamers, for example, and you understand what I mean. I would rather watch seasons and seasons of Terminator anime than hope for something decent in live action anymore. The thing is that live action is already very far advanced in special effects, but those also cost a lot of money, meaning that you either underdeliver on viewer expectations or have to make a whole lot of money to break even. Animation is not like that and it's also a lot more flexible.

  All in all I liked the show and I recommend it, but don't expect too much.

  I've built an application and, like any lazy dev out there, I focused on the business logic, the project structure, the readability, comments, the dependency injection, the unit tests, you know... the code. My preference is to start from top to bottom, so I create more and more detailed implementations of interfaces while going down to the metal. The bottom of this chain is the repository, that class which handles database access, and I've spent little time trying to understand or optimize that code. I mean, it's DB access, you read or you write stuff, how difficult can it be?

  When it was time to actually test it, the performance of the application was unexpectedly bad. I profiled it and I was getting reasonable percentages for different types of code, but it was all taking too long. And suddenly my colleague says "well, I tried a few things and now it works twice as fast". Excuse me?! You did WHAT?! I have been trying a few things too, and managed to do diddly squat! Give me that PR to see what you did! And... it was nothing I could see.

  He didn't change the code, he just added or altered the attributes decorating the properties of models. That pissed me off, because I had previously gone to the generated SQL with the SQL Profiler and it was all OK. So I executed my code and his code and recorded the SQL that came out:

  • was it the lazy loading? Nope. The number of instructions and their order was exactly the same
  • was it the explicit declaration of the names of indexes and foreign keys? Nope. Removing those didn't affect performance.
  • was it the ChangeTracker.LazyLoadingEnabled=false thing? Nope, I wasn't using child entities in a way that could be affected.
  • was there some other structure of the generated SQL? No. It was exactly the same SQL! Just my code was using thousands of CPU units and his was using none.
  • was it magic? Probably, because it made no sense whatsoever! Except...

Entity Framework generates simple SQL queries, but it doesn't execute them as you and I would. It constructs a string, then uses sp_executesql to run it. Something like this:

exec sp_executesql N'SELECT TOP(1) [p].[ID], [p].[TXT], [p].[LUP_TS]
FROM [sch].[table] AS [p]
WHERE [p].[ID] = @__p_0',N'@__p_0 nvarchar(64)',@__p_0='xxxx'

Do you see it? I didn't, until I started to compare the same SQL in the two versions. It was the type of the parameters! Note that the aptly named parameter @__p_0 is an NVARCHAR. The actual column in the database was VARCHAR! Because NVARCHAR has higher data type precedence, SQL Server implicitly converts the column side on every comparison, which can also ruin an otherwise efficient index seek. The waste of resources was staggering!

How do you declare the exact database type of your columns? Multiple ways. In my case there were three different problems:

  • no Unicode(false) attribute on the string columns - meaning EF expected the columns to be NVARCHAR
  • no TypeName parameter in the Column attribute where the columns were NTEXT - meaning EF expected them to be NVARCHAR(MAX)
    • I guess one could skip the Unicode thing and instead just specify the type name, but I haven't tested it
  • using MaxLength instead of StringLength - because even if their descriptions are very similar and MaxLength sounds like it applies in more cases, it's StringLength that EF wanted here (a hedged example of the fixed model follows below).

From 40-50ms per processing loop, it dropped to 21ms just by fixing these.
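
To illustrate the fixes, here is a hedged sketch of what the corrected model could look like for the query above (the class and property names are made up; the column names and length are taken from the generated SQL):

using System;
using System.ComponentModel.DataAnnotations;
using System.ComponentModel.DataAnnotations.Schema;
using Microsoft.EntityFrameworkCore; // for the [Unicode] attribute (EF Core 6+)

// Hedged sketch only: illustrative names, not my actual model.
[Table("table", Schema = "sch")]
public class Record
{
    [Key]
    [Column("ID")]
    [Unicode(false)]   // the column is VARCHAR, so don't let EF assume NVARCHAR
    [StringLength(64)] // StringLength, not MaxLength, is what did the trick for me
    public string Id { get; set; }

    [Column("TXT", TypeName = "ntext")] // explicit type where the column really is NTEXT
    public string Text { get; set; }

    [Column("LUP_TS")]
    public DateTime LastUpdate { get; set; }
}

With attributes like these the parameter should come out as varchar(64), matching the column, and the implicit conversion disappears.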

Long story short: parametrized SQL executed with sp_executesql hides a possible performance issue if the columns that you compare or extract have slightly different types than those of the parameters.

Go figure. I hate Entity Framework!


  In The Memory Police, Yōko Ogawa describes a small Japanese island ruled by "the memory police", an organization with apparently total power and no opposition, whose entire purpose is to make sure the things that "are disappeared" are physically destroyed and to arrest anyone on the island who is able to remember them. A very interesting metaphor for the things that only hold value if we remember and fight for them.

  Unfortunately, in this book no one fights for anything! I expected some sort of revelation on how this magical police can make concepts disappear from people's minds so thoroughly that they can't even put them back in their memory when holding the objects in their hands. Or some sort of solution to said problem. Some sort of misguided attempt at a revolution. Something! But these are Japanese people; if things are supposed to disappear, they go along with it until they are all gone.

  Was the author trying to convey the same frustration that I felt while reading the book? People so ritualistic and conformist that they basically amount to non-playing characters, running the same routine until someone turns the game off? Because this frustration was only compounded by the ethereal quality of internal monologues that noticed things happening and... then did nothing at all.

  I can't say the book was not decently written and the idea was intriguing, but if you expect the story to go anywhere, well, it doesn't.

Intro

  This post is about the System.InvalidOperationException: This SqlTransaction has completed; it is no longer usable. exception, which may be because you shared your SqlConnection, or you tried to SaveChanges twice, or any of the other issues that you can google for. I was not so lucky. I spent a day and a half trying to understand what was going on and only with the help of another dev did I get close to the issue.

TL;DR;

I used a column with identity generation, but it wasn't also a primary key and EF sucks.

Details

  Imagine my scenario first: I wanted to use a database to assign a unique integer to a string. I was first searching for the entry in the DB and, if not found, I would just insert a new one. The SQL Server IDENTITY(1,1) setting would ensure I got a new unique value for the inserted row. So the table would look like this:

CREATE TABLE STR_ID (
  STR NVARCHAR(64) PRIMARY KEY,
  ID INT IDENTITY(1,1)
)

Nothing fancy about this. Now for the C# part, using Entity Framework Core 6.

I created an entity class for it:

[Table("STR_ID")]
public class StrId {

  [Column("STR")]
  [Key]
  public string Text { get; set; }

  [Column("ID")]
  [DatabaseGenerated(DatabaseGeneratedOption.Identity)]
  public int Id { get; set; }

}

And then I proceeded to test it in the following way:

  • create a DbContext instance
  • search for a value by STR/Text in the proper DbSet
  • if it doesn't exist, insert a new row and SaveChanges
  • retrieve the generated id
  • dispose the context

I also ran this 20 times in parallel (well, as Tasks - a minor distinction, but it was using the thread pool).
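
For reference, a minimal sketch of that test; MyDbContext stands in for my real context, which actually came from dependency injection:

using System.Linq;
using System.Threading.Tasks;

// Twenty concurrent "find or insert" attempts against the StrId table above.
var tasks = Enumerable.Range(0, 20).Select(i => Task.Run(() =>
{
    using var context = new MyDbContext();
    var text = "value-" + i;
    var entry = context.Set<StrId>().FirstOrDefault(x => x.Text == text);
    if (entry == null)
    {
        entry = new StrId { Text = text };
        context.Set<StrId>().Add(entry);
        context.SaveChanges(); // this is where the zombie transaction exception popped up
    }
    var id = entry.Id; // the identity value generated by the database
}));

await Task.WhenAll(tasks);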

The result was underwhelming. It would fail EVERY TIME, with either an exception about deadlocks or 

System.InvalidOperationException: This SqlTransaction has completed; it is no longer usable.
   at Microsoft.Data.SqlClient.SqlTransaction.ZombieCheck()
   at Microsoft.Data.SqlClient.SqlTransaction.Commit()

I did what every sane developer would do in this situation and bought myself a shotgun (we all know it's the most effective against zombies) then googled for other people having this issue. I mean, it would be common, right? You do some EF stuff in parallel and you get some errors.

No. This is happening in a parallelism scenario, but that's not the cause. Also, it's not really about transactions. EF wraps SaveChanges operations in a transaction and that is where the error surfaces, but what completes the transaction prematurely lives inside the Microsoft code and no, it's not your code!

I tried everything I could think of. I disabled the EF transaction and made my own, using every IsolationLevel; I tried EnableRetryOnFailure, with hilarious results (I was monitoring the values inserted in the database with NOLOCK and they were going up to 20, then back to 10, then 15, then back to 1, and it took ages retrying operations that apparently had dependencies on each other, only for almost all of them to fail after a long time). I even disabled connection pooling, which probably works, but would have made everything slow.

Solution

While I can't say what EXACTLY caused the problem (I would have to look into the Microsoft code and I don't feel like it now), the solution was ridiculously simple: just make the IDENTITY column a primary key instead:

CREATE TABLE STR_ID (
  ID INT PRIMARY KEY IDENTITY(1,1),
  STR NVARCHAR(64)
)

-- because this is what I am searching for
CREATE UNIQUE INDEX IX_STR_ID_STR ON STR_ID(STR)

And the entity class becomes:

[Table("STR_ID")]
public class StrId {

  [Column("ID")]
  [Key]
  public int Id { get; set; }

  [Column("STR")]
  public string Text { get; set; }

}

I was about to use IsolationLevel.ReadUncommitted for the select, or just set AutoTransactionsEnabled to false (which also would have solved the problem), when the other guy suggested I use this solution. And I refused! It was dumb! Why the hell would that work? You dummy! And of course it worked. Why? Dunno! The magical thinking in the design of EF strikes again and I am the dummy.

Conclusion

What happened is probably related to deadlocks, more specifically multiple threads trying to read/write/read again from a table and getting in each other's way. It probably has something to do with how IDENTITY columns need to lock the entire table, even if no one reads that row! But what it certainly is, is a bug: from the database's point of view, a primary key identity column and a unique indexed identity column behave identically, and yet Entity Framework handles them very differently.

So, in conclusion:

  • yay! finally a technical post
  • this had nothing to do with how DbContexts get disposed (since in my actual scenario I was getting this from dependency injection and so I lost hours ruling that out)
  • the error about transactions was misleading, since the issue was whatever closed the transaction inside the Microsoft code, not anything you did
  • the advice of some of the AggregateExceptions up the stream (An exception has been raised that is likely due to a transient failure. Consider enabling transient error resiliency by adding 'EnableRetryOnFailure' to the 'UseSqlServer' call.) was even more misleading
  • the EF support for IDENTITY columns - well, it needs it, because otherwise how would it know not to attempt to save values in those columns - is also misleading, because having it doesn't mean it's good support
  • while parallel access to the DB made the problem visible, it has little to do with parallelism 
  • EF knows how to handle PRIMARY KEYs so that's the solution
  • EF sucks!

I really hope this saves time for people in the same situation!


  The Stories of My Life is the autobiography of James Patterson, said to be "the most popular storyteller of our time", of whom I honestly had not heard before. It is written in a bunch of very short and out-of-order chapters, a la Mrs. Bridge, in which he incessantly repeats his advice to outline everything. Very ironic. I liked the character more than the book.

  You see, James Patterson is a type of person that you can't help but admire: he is good at sports, he is good at school, he is good with women, he is good with business and he is a famous writer. And all of this not because anyone handed anything to him, but through hard work and dedication. This guy is the American Dream made flesh.

  He meets famous writers, actors, sports people, business people, several presidents of the United States and so on, he becomes the CEO of the advertising firm he basically interned at and all of this while being nice to people, loving and caring about family and friends and feeling pretty good about himself. And all of this without cocaine!

  So I liked the main character, very inspiring, despite the times having changed so much as to make such a person impossible nowadays, but I can't say I liked the book. The shuffled nature of the stories doesn't really help. It's clear the guy had the outline of the story he wanted to tell, so why write it this way? It didn't improve anything. Is it to clarify that life is a string of scenes and their order and the narrative we tell to ourselves are not that important compared to doing the right thing at the present time? Perhaps. But then it's inevitable that the reader is going to try to unshuffle the scenes into something comprehensible.

  And then there is the ever-present question with an autobiography: how real is all of this? I've read some that sound real and others that feel like the prose version of "Biggest & the Best" by Clawfinger. Are there things in the artificial gaps the author creates between these anecdotes that he doesn't feel like sharing, or maybe isn't even aware of? Are the stories in the book overblown to inflate the author's ego? Well, I don't think so. The book actually feels right. Maybe it's not at all accurate - after all, that's what a writer's job is, to make things up - but it felt honest.

  What it didn't feel was personal. You see, Patterson is a good writer, he writes with humor and wit, but I didn't feel he was writing about himself, but about this character called Jim Patterson. While honest, it also felt overpolished, the edges smoothed off, and personal is what an autobiography should feel like, something perhaps even more important than being written well.

  Bottom line: really inspiring, felt real, but also impersonal enough to not merit the full mark. I liked it.


  Fortune's Fool is something that feels like Game of Thrones, but set in a 16th century type of world inspired by Spanish and Italian history, focused on an ex-princess turned warrior. A lot of intrigue, world building, betrayal and feudal machinations. I didn't feel like going through with it, though.

  It's not that I didn't like Angela Boord's writing, I just didn't feel like going through the motions with the female ingénue, betrayed by unscrupulous men, forced to see the world as it is and harden, then get betrayed again in a somewhat cathartic situation that will bring closure to her teenage trauma.

  Bottom line: I might pick it up later, if I feel like reading about feudal intrigue and cruelty, but at the moment I choose not to.