Huge SQL performance gains from using INTs

Published Sep 3, 2024

Posted in
programming
sql
picture

Just a few days ago I was writing on how important it is to tell Entity Framework what SQL type to use in order to avoid costly conversions. In fact, it wasn't so much an EF issue as it was an SQL one. Converting even character types to character types or changing collation was surprisingly expensive. In this post I will show you how important it is to choose the right type for your querying columns, especially the primary key.

First imagine this scenario: you get some data from an outside source, rows and rows of it, and you have to store them and query them in SQL. The value that uniquely identifies a row is a small string, maybe 50 characters long. How do you proceed?

My first naive solution was the most obvious one: just create a table that has a column for each value in the rows and put the primary key on the identifying one. But this leads to immediate performance losses:

by default, a primary key is a clustered index - text is not sequential, so at every insert the database engine will physically move huge swaths of data in order to place the rows in the alphabetical order of their identifiers
a primary key is a unique index - meaning text will have to get compared to other text in order to determine uniqueness, which is slow
by default, SQL is case insensitive - meaning that all text comparisons will have to be made taking into account capitalization and accents
50 characters is a lot - even without Unicode support, it's 50 bytes, which is 12 times more than an integer, meaning the primary key index will be large; and slow

"But!", you will undoubtedly say, if you put the primary key on some other column, you will still have to create a unique index on the identifier. Isn't this just pushing the problem farther down the road? The size and speed limitations will be the same. And primary keys are clustered only by default, but they can be declared as not clustered. And SQL doesn't need to be case insensitive, all you have to do is change the collation of the column to be binary and it will be compared faster. Wouldn't that solve the problem?

No. In fact, my final solution which worked five times faster, did not have an index on the identifier column AT ALL. Incidentally, I did end up changing the collation, but only because the idiots sending me the data were doing it case sensitive.

Without further ado, here is what I did:

an INT column with IDENTITY(1,1) as the primary key - which ensures a fast insertion due to the sequential nature of the value, fast query speed and low usage of disk space for the index
an INT column holding the checksum of the identifier - which when indexed, is fast to query and doesn't use a lot of disk space for the index

So how do I query on the identifier? Simple: I calculate the checksum of the string and then I look it up in the database - which uses the index to locate the few strings that have the same checksum, then just finds the right one by enumerating through them. I query on the checksum column AND the text identifier. And there is an added bonus: I only need to do this once. If I need the record from the DB again, I query it directly through the integer primary key.

Entity Framework has this automatic memory cache so when I am querying on the database entity using a business model - as good separation of concerns practice would dictate - it gets it really fast from memory. Because the memory cache also uses just the int to identify an entity, which means double the benefits!

The eagle eyed reader will have noticed that I am not using a unique index on the identifier, so technically I could create multiple rows with the same one. However, my application is always looking for the existing record first. But if you really worry about data consistency, the index on the checksum column can be replaced with a unique index on the checksum and identifier column. It will take more space, but it will be just as fast.

Another thing that you may have noticed is that I use a code checksum, not the database provided functions to achieve the same. At first glance, it's an instant win: just create a persisted computed column that calculates the checksum or binary checksum of the identifier column. However, this would be weird when having to query, since you would have to craft a stored procedure or a custom SQL command to get the identifier and query on its checksum. In my case I just calculate a checksum - and not use the lazy string.GethashCode function which may be subject to change and it's already different between 32 and 64 bit systems.

Of course, if you want your text columns to be case and/or accent insensitive, you will have to store the hash code of the lowercase and unaccented string or use an implementation that is case and accent insensitive. This may not be trivial.

Further tests showed that just using a non clustered index on the identifier column, even a unique one, was just slightly slower, maybe 5%. However, the space taken by indexes increased by 20%. So I might understand why you would find it a bit off putting and skip the checksum part.

Hope this helps!

P.S. Why did this solution provide such a huge performance gain? Obviously the SQL team would have implemented a sort of checksum for their text index, this should have been working natively and faster than any possible implementation I could make. Well, I don't know the answer. In fact, this all could be some quirk of Entity Framework and the SQL queries would not be optimizable to such a degree. I will attempt to test that using purely SQL commands. But meanwhile, all the points I made above are valid and with a little more work you can have a lot more control on how the system works.

EF Core 6 - correct types halving the execution time!

Published Aug 30, 2024

and has 0 comments

I've built an application and, like any lazy dev out there, I focused on the business logic, the project structure, the readability, comments, the dependency injection, the unit tests, you know... the code. My preference is to start from top to bottom, so I create more and more detailed implementations of interfaces while going down to the metal. The bottom of this chain is the repository, that class which handles database access, and I've spent little to understand or optimize that code. I mean, it's DB access, you read or you write stuff, how difficult can it be?

When it was time to actually test it, the performance of the application was unexpectedly bad. I profiled it and I was getting reasonable percentages for different types of code, but it was all taking too long. And suddenly my colleague says "well, I tried a few things and now it works twice as fast". Excuse me?! You did WHAT?! I have been trying a few things too, and managed to do diddly squat! Give me that PR to see what you did! And... it was nothing I could see.

He didn't change the code, he just added or altered the attributes decorating the properties of models. That pissed me off, because I had previously gone to the generated SQL with the SQL Profiler and it was all OK. So I executed my code and his code and recorded the SQL that came out:

was it the lazy loading? Nope. The number of instructions and their order was exactly the same
was it the explicit declaration of the names of indexes and foreign keys? Nope. Removing those didn't affect performance.
was it the ChangeTracker.LazyLoadingEnabled=false thing? Nope, I wasn't using child entities in a way that could be affected.
was there some other structure of the generated SQL? No. It was exactly the same SQL! Just my code was using thousands of CPU units and his was using none.
was it magic? Probably, because it made no sense whatsoever! Except...

Entity Framework generates simple SQL queries, but it doesn't execute them as you and I would. It constructs a string, then uses sp_executesql to run it. Something like this:

exec sp_executesql N'SELECT TOP(1) [p].[ID], [p].[TXT], [p].[LUP_TS]

FROM [sch].[table] AS [p]

WHERE [p].[ID] = @__p_0',N'@__p_0 nvarchar(64)',@__p_0='xxxx'

Do you see it? I didn't until I started to compare the same SQL in the two versions. And it was the type of the parameters! Note that the aptly named parameter @__p_0 is an NVARCHAR. The actual column in the database was VARCHAR! Meaning that the code above was unnecessarily always converting values in order to compare them. The waste of resources was staggering!

How do you declare the exact database type of your columns? Multiple ways. In my case there were three different problems:

no Unicode(false) attribute on the string columns - meaning EF expected the columns to be NVARCHAR
no Typename parameter in the Column attribute where the columns were NTEXT - meaning EF expected them to be NVARCHAR(Max)
- I guess one could skip the Unicode thing and instead just specify the type name, but I haven't tested it
using MaxLength instead of StringLength - because even if their descriptions are very similar and MaxLength sounds like applying in more cases, it's StringLength that EF wants.

From 40-50ms per processing loop, it dropped to 21ms just by fixing these.

Long story short: parametrized SQL executed with sp_executesql hides a possible performance issue if the columns that you compare or extract have slightly different types than the one of the parameters.

Go figure. I hate Entity Framework!