and has 0 comments

C# 3.0 introduced Object Initializer Syntax which was a game changer for code that created new objects or new collections. Here is a contrived example:

var obj = new ComplexObject
{
    // object initializer
    AnItem = new Item("text1"),
    AnotherItem = new Item("text2"),
    // collection initializer
    RestOfItems = new List<Item>
    {
        new Item("text3"),
        new Item("text4"),
        new Item("text5")
    },
    // indexer initializer
    [0]=new Item("text6"),
    [1]=new Item("text7")
};

Before this syntax was available, the same code would have looked like this:

var obj = new ComplexObject();
obj.AnItem = new Item("text1");
obj.AnotherItem = new Item("text2");
obj.RestOfItems = new List<Item>();
obj.RestOfItems.Add(new Item("text3"));
obj.RestOfItems.Add(new Item("text4"));
obj.RestOfItems.Add(new Item("text5"));
obj[0] = new Item("text6");
obj[2] = new Item("text7");

It's not like the number of lines has changed, but both the writability and readability of the code increase with the new syntax. At least that's why I think. However, outside these very simple scenarios, the feature feels like it's encumbering us or that it is missing something. Imagine you want to only add items to a list based on some condition. You might get a code like this:

var list = new List<Item>
{
    new Item("text1")
};
if (condition) list.Add(new Item("text2"));

We use the initializer for one item, but not for the other. We might as well use Add for both items, then, or use some cumbersome syntax that hurts more than it helps:

var list = new[]
{
    new Item("text1"),
    condition?new Item("text2"):null
}
.Where(i => i != null)
.ToList();

It's such an ugly syntax that Visual Studio doesn't know how to indent it properly. What to do? Software patterns to the rescue! 

Seriously now, people who know me know that I scoff at the general concept of software patterns, but the patterns themselves are useful and in this case, even the conceptual framework that I often deride is useful here. Because we are trying to initialize an object or a collection, which means we are attempting to build it. So why not use a Builder pattern? Here are two versions of the same code, one with extension methods (which can be used everywhere, but might pollute the member list for common objects) and another with an actual builder object created specifically for our purposes (which may simplify usage):

// extension methods
var list = new List<Item>()
    .Adding(new Item("text1"))
    .ConditionalAdding(condition, new Item("text2"));
...
public static class ItemListExtensions
{
    public static List<T> Adding<T>(this List<T> list, T item)
    {
        list.Add(item);
        return list;
    }
    public static List<T> ConditionalAdding<T>(this List<T> list, bool condition, T item)
    {
        if (condition)
        {
            list.Add(item);
        }
        return list;
    }
}

// builder object
var list = new ItemListBuilder()
    .Adding("text1")
    .ConditionalAdding(condition, "text2")
    .Build();
...
public class ItemListBuilder
{
    private readonly List<Item> list;

    public ItemListBuilder()
    {
        list = new List<Item>();
    }

    public ItemListBuilder Adding(string text)
    {
        list.Add(new Item(text));
        return this;
    }

    public ItemListBuilder ConditionalAdding(bool condition, string text)
    {
        if (condition)
        {
            list.Add(new Item(text));
        }
        return this;
    }

    public List<Item> Build()
    {
        return list.ToList();
    }
}

Of course, for just a simple collection with some conditions this might feel like overkill, but try to compare the two versions of the code: the one that uses initializer syntax and then the Add method and the one that declares what it wants to do, step by step. Also note that in the case of the builder object I've taken the liberty of creating methods that only take string parameters then build the list of Item, thus simplifying the syntax and clarifying intent.

I had this situation where I had to map an object to another object by copying some properties into collections and values of some type to other types and so on. The original code was building the output using a top-down approach:

public Output BuildOutput(Input input) {
  var output=new Output();
  BuildFirstPart(output, input);
  BuildSecondPart(output, input);
  ...
  return output;
}

public BuildFirstPart(Output output, Input input) {
  var firstSection = BuildFirstSection(input);
  output.FirstPart=new List<Part> {
    new Part(firstSection)
  };
  if (condition) {
    var secondSection=BuildSeconfSection(input);
    output.FirstPart.Add(new Part(secondSection));
  }
}

And so on and so on. I believe that in this case a fluent design makes the code a lot more readable:

var output = new Output {
  FirstPart = new List<Part>()
    .Adding(BuildFirstSection(input))
    .ConditionalAdding(condition, BuildSecondSection(input),
  SecondPart = ...
};

The "build section" methods would also be inlined and replaced with fluent design methods. In this way the structure of "the output" is clearly shown in a method that declares what it will build and populates the various members of the Output class with simple calculations, the only other methods that the builder needs. A human will understand at a glance what thing it will build, see its structure as a tree of code and be able to go to individual methods to see or change the specific calculation that provides a value.

The point of my post is that sometimes the very out-of-the-box features that help us a lot most of the time end up complicating and obfuscating our code in specific situations. If the code starts to smell, to become unreadable, to make you feel bad for writing it, then stop, think of a better solution, then implement it so that it is the best version for your particular case. Use tools when they are useful and discard them when other solutions might prove more effective.

and has 1 comment

Intro

  Some of the most visited posts on this blog relate to dependency injection in .NET. As you may know, dependency injection has been baked in in ASP.Net almost since the beginning, but it culminated with the MVC framework and the .Net Core rewrite. Dependency injection has been separated into packages from where it can be used everywhere. However, probably because they thought it was such a core concept or maybe because it is code that came along since the days of UnityContainer, the entire mechanism is sealed, internalized and without any hooks on which to add custom code. Which, in my view, is crazy, since dependency injection serves, amongst other things, the purpose of one point of change for class instantiations.

  Now, to be fair, I am not an expert in the design patterns used in dependency injection in the .NET code. There might be some weird way in which you can extend the code that I am unaware of. In that case, please illuminate me. But as far as I went in the code, this is the simplest way I found to insert my own hook into the resolution process. If you just want the code, skip to the end.

Using DI

  First of all, a recap on how to use dependency injection (from scratch) in a console application:

// you need the nuget packages Microsoft.Extensions.DependencyInjection 
// and Microsoft.Extensions.DependencyInjection.Abstractions
using Microsoft.Extensions.DependencyInjection;
...

// create a service collection
var services = new ServiceCollection();
// add the mappings between interface and implementation
services.AddSingleton<ITest, Test>();
// build the provider
var provider = services.BuildServiceProvider();

// get the instance of a service
var test = provider.GetService<ITest>();

  Note that this is a very simplified scenario. For more details, please check Creating a console app with Dependency Injection in .NET Core.

Recommended pattern for DI

  Second of all, a recap of the recommended way of using dependency injection (both from Microsoft and myself) which is... constructor injection. It serves two purposes:

  1. It declares all the dependencies of an object in the constructor. You can rest assured that all you would ever need for that thing to work is there.
  2. When the constructor starts to fill a page you get a strong hint that your class may be doing too many things and you should split it up.

  But then again, there is the "Learn the rules. Master the rules. Break the rules" concept. I've familiarized myself with it before writing this post so that now I can safely break the second part and not master anything before I break stuff. I am talking now about property injection, which is generally (for good reason) frowned upon, but which one may want to use in scenarios adjacent to the functionality of the class, like logging. One of the things that always bothered me is having to declare a logger in every constructor ever, even if in itself a logger does nothing to the functionality of the class.

  So I've had this idea that I would use constructor dependency injection EVERYWHERE, except logging. I would create an ILogger<T> property which would be automatically injected with the correct implementation at resolution time. Only there is a problem: Microsoft's dependency injection does not support property injection or resolution hooks (as far as I could find). So I thought of a solution.

How does it work?

  Third of all, a small recap on how ServiceProvider really works.

  When one does services.BuildServiceProvider() they actually call an extension method that does new ServiceProvider(services, someServiceProviderOptions). Only that constructor is internal, so you can't use it yourself. Then, inside the provider class, the GetService method is using a ConcurrentDictionary of service accessors to get your service. In case the service accessor is not there, the method from the field _createServiceAccessor is going to be used. So my solution: replace the field value with a wrapper that will also execute our own code.

The solution

  Before I show you the code, mind that this applies to .NET 7.0. I guess it will work in most .NET Core versions, but they could change the internal field name or functionality in which case this might break.

  Finally, here is the code:

public static class ServiceProviderExtensions
{
    /// <summary>
    /// Adds a custom handler to be executed after service provider resolves a service
    /// </summary>
    /// <param name="provider">The service provider</param>
    /// <param name="handler">An action receiving the service provider, 
    /// the registered type of the service 
    /// and the actual instance of the service</param>
    /// <returns>the same ServiceProvider</returns>
    public static ServiceProvider AddCustomResolveHandler(this ServiceProvider provider,
                 Action<IServiceProvider, Type, object> handler)
    {
        var field = typeof(ServiceProvider).GetField("_createServiceAccessor",
                        BindingFlags.Instance | BindingFlags.NonPublic);
        var accessor = (Delegate)field.GetValue(provider);
        var newAccessor = (Type type) =>
        {
            Func<object, object> newFunc = (object scope) =>
            {
                var resolver = (Delegate)accessor.DynamicInvoke(new[] { type });
                var resolved = resolver.DynamicInvoke(new[] { scope });
                handler(provider, type, resolved);
                return resolved;
            };
            return newFunc;
        };
        field.SetValue(provider, newAccessor);
        return provider;
    }
}

  As you can see, we take the original accessor delegate and we replace it with a version that runs our own handler immediately after the service has been instantiated.

Populating a Logger property

  And we can use it like this to do property injection now:

static void Main(string[] args)
{
    var services = new ServiceCollection();
    services.AddSingleton<ITest, Test>();
    var provider = services.BuildServiceProvider();
    provider.AddCustomResolveHandler(PopulateLogger);

    var test = (Test)provider.GetService<ITest>();
    Assert.IsNotNull(test.Logger);
}

private static void PopulateLogger(IServiceProvider provider, 
                                    Type type, object service)
{
    if (service is null) return;
    var propInfo = service.GetType().GetProperty("Logger",
                    BindingFlags.Instance|BindingFlags.Public);
    if (propInfo is null) return;
    var expectedType = typeof(ILogger<>).MakeGenericType(service.GetType());
    if (propInfo.PropertyType != expectedType) return;
    var logger = provider.GetService(expectedType);
    propInfo.SetValue(service, logger);
}

  See how I've added the PopulateLogger handler in which I am looking for a property like 

public ILogger<Test> Logger { get; private set; }

  (where the generic type of ILogger is the same as the class) and populate it.

Populating any decorated property

  Of course, this is kind of ugly. If you want to enable property injection, why not use an attribute that makes your intention clear and requires less reflection? Fine. Let's do it like this:

// Add handler
provider.AddCustomResolveHandler(InjectProperties);
...

// the handler populates all properties that are decorated with [Inject]
private static void InjectProperties(IServiceProvider provider, Type type, object service)
{
    if (service is null) return;
    var propInfos = service.GetType()
        .GetProperties(BindingFlags.Instance | BindingFlags.Public)
        .Where(p => p.GetCustomAttribute<InjectAttribute>() != null)
        .ToList();
    foreach (var propInfo in propInfos)
    {
        var instance = provider.GetService(propInfo.PropertyType);
        propInfo.SetValue(service, instance);
    }
}
...

// the attribute class
[AttributeUsage(AttributeTargets.Property, AllowMultiple = false, Inherited = true)]
public class InjectAttribute : Attribute {}

Conclusion

I have demonstrated how to add a custom handler to be executed after any service instance is resolved by the default Microsoft ServiceProvider class, which in turn enables property injection, one point of change to all classes, etc. I once wrote code to wrap any class into a proxy that would trace all property and method calls with their parameters automatically. You can plug that in with the code above, if you so choose.

Be warned that this solution is using reflection to change the functionality of the .NET 7.0 ServiceProvider class and, if the code there changes for some reason, you might need to adapt it to the latest functionality.

If you know of a more elegant way of doing this, please let me know.

Hope it helps!

Bonus

But what about people who really, really, really hate reflection and don't want to use it? What about situations where you have a dependency injection framework running for you, but you have no access to the service provider builder code? Isn't there any solution?

Yes. And No. (sorry, couldn't help myself)

The issue is that ServiceProvider, ServiceCollection and all that jazz are pretty closed up. There is no solution I know of that solved this issue. However... there is one particular point in the dependency injection setup which can be hijacked and that is... the adding of the service descriptors themselves!

You see, when you do ServiceCollection.AddSingleton<Something,Something>, what gets called is yet another extension method, the ServiceCollection itself is nothing but a list of ServiceDescriptor. The Add* extensions methods come from ServiceCollectionServiceExtensions class, which contains a lot of methods that all defer to just three different effects:

  • adding a ServiceDescriptor on a type (so associating an type with a concrete type) with a specific lifetime (transient, scoped or singleton)
  • adding a ServiceDescriptor on an instance (so associating a type with a specific instance of a class), by default singleton
  • adding a ServiceDescriptor on a factory method (so associating a type with a constructor method)

If you think about it, the first two can be translated into the third. In order to instantiate a type using a service provider you do ActivatorUtilities.CreateInstance(provider, type) and a factory method that returns a specific instance of a class is trivial.

So, the solution: just copy paste the contents of ServiceCollectionServiceExtensions and make all of the methods end up in the Add method using a service factory method descriptor. Now instead of using the extensions from Microsoft, you use your class, with the same effect. Next step: replace the provider factory method with a wrapper that also executes stuff.

Since this is a bonus, I let you implement everything except the Add method, which I will provide here:

// original code
private static IServiceCollection Add(
    IServiceCollection collection,
    Type serviceType,
    Func<IServiceProvider, object> implementationFactory,
    ServiceLifetime lifetime)
{
    var descriptor = new ServiceDescriptor(serviceType, implementationFactory, lifetime);
    collection.Add(descriptor);
    return collection;
}

//updated code
private static IServiceCollection Add(
    IServiceCollection collection,
    Type serviceType,
    Func<IServiceProvider, object> implementationFactory,
    ServiceLifetime lifetime)
{
    Func<IServiceProvider, object> factory = (sp)=> {
        var instance = implementationFactory(sp);
        // no stack overflow, please
        if (instance is IDependencyResolver) return instance;
        // look for a registered instance of IDependencyResolver (our own interface)
        var resolver=sp.GetService<IDependencyResolver>();
        // intercept the resolution and replace it with our own 
        return resolver?.Resolve(sp, serviceType, instance) ?? instance;
    };
    var descriptor = new ServiceDescriptor(serviceType, factory, lifetime);
    collection.Add(descriptor);
    return collection;
}

All you have to do is (create the interface and then) register your own implementation of IDependencyResolver and do whatever you want to do in the Resolve method, including the logger instantiation, the inject attribute handling or the wrapping of objects, as above. All without reflection.

The kick here is that you have to make sure you don't use the default Add* methods when you register your services, or this won't work. 

There you have it, bonus content not found on dev.to ;)

and has 0 comments

  So I was happily minding my own business after a production release only for everything to go BOOM! Apparently, maybe because of something we did, but maybe not, the memory of the production servers was running out. Exception looked something like:

System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
 at System.Reflection.Emit.TypeBuilder.SetMethodIL(RuntimeModulemodule, Int32tk ,BooleanisInitLocals, Byte[] body, Int32 bodyLength, 
    Byte[] LocalSig ,Int32sigLength, Int32maxStackSize, ExceptionHandler[] exceptions, Int32numExceptions ,Int32[] tokenFixups, Int32numTokenFixups)
 at System.Reflection.Emit.TypeBuilder.CreateTypeNoLock() 
 at System.Reflection.Emit.TypeBuilder.CreateType()
 at System.Xml.Serialization.XmlSerializationReaderILGen.GenerateEnd(String []methods, XmlMapping[] xmlMappings, Type[] types) 
 at System.Xml.Serialization.TempAssembly.GenerateRefEmitAssembly(XmlMapping []xmlMappings, Type[] types, StringdefaultNamespace ,Evidenceevidence)
 at System.Xml.Serialization.TempAssembly..ctor(XmlMapping []xmlMappings, Type[] types, StringdefaultNamespace ,Stringlocation, Evidenceevidence)
 at System.Xml.Serialization.XmlSerializer.GenerateTempAssembly(XmlMappingxmlMapping, Typetype ,StringdefaultNamespace, Stringlocation, Evidence evidence)
 at System.Xml.Serialization.XmlSerializer..ctor(Typetype, XmlAttributeOverrides overrides, Type[] extraTypes, 
     XmlRootAttributeroot, StringdefaultNamespace, Stringlocation, Evidence evidence)
 at System.Xml.Serialization.XmlSerializer..ctor(Typetype, XmlAttributeOverrides overrides) 

At first I thought there was something else eating away the memory, but the exception was repeatedly thrown at this specific point. And I did what every senior dev does: googled it! And I found this answer: "When an XmlSerializer is created, an assembly is dynamically generated and loaded into the AppDomain. These assemblies cannot be garbage collected until their AppDomain is unloaded, which in your case is never." It also referenced a Microsoft KB886385 from 2007 which, of course, didn't exist at that URL anymore, but I found it archived by some nice people.

What was going on? I would tell you, but Gergely Kalapos explains things much better in his article How the evil System.Xml.Serialization.XmlSerializer class can bring down a server with 32Gb ram. He also explains what commands he used to debug the issue, which is great!

But since we already know links tend to vanish over time (so much for stuff on the Internet living forever), here is the gist of it all:

  • XmlSerializer generates dynamic code (as dynamic assemblies) in its constructors
  • the most used constructors of the class have a caching mechanism in place:
    • XmlSerializer.XmlSerializer(Type)
    • XmlSerializer.XmlSerializer(Type, String)
  • but the others do not, so every time you use one of those you create, load and never unload another dynamic assembly

I know this is an old class in an old framework, but some of us still work in companies that are firmly rooted in the middle ages. Also since I plan to maintain my blog online until I die, it will live on the Internet for the duration.

Hope it helps!

  There is a common task in Excel that seems should have a very simple solution. Alas, when googling for it you get all these inexplainable crappy "tutorial" sites that either show you something completely different or something that you cannot actually do because you don't have the latest version of Office. Well, enough of this!

  The task I am talking about is just selecting a range of values and concatenating them using a specified separator, what in a programming language like C# is string.Join or in JavaScript you get the array join function. I find it very useful when, for example, I copy a result from SQL and I want to generate an INSERT or UPDATE query. And the only out of the box solution is available for Office 365 alone: TEXTJOIN.

  You use it like =TEXTJOIN(", ", FALSE, A2:A8) or =TEXTJOIN(", ", FALSE, "The", "Lazy", "Fox"), where the parameters are:

  • a delimiter
  • a boolean to determine if empty cells are ignored
  • a series or text values or a range of cells

  But, you can have this working in whatever version of Excel you want by just using a User Defined Function (UDF), one specified in this lovely and totally underrated Stack Overflow answer: MS Excel - Concat with a delimiter.

  Long story short:

  • open the Excel sheet that you want to work on 
  • press Alt-F11 which will open the VBA interface
  • insert a new module
  • paste the code from the SO answer (also copy pasted here, for good measure)
  • press Alt-Q to leave
  • if you want to save the Excel with the function in it, you need to save it as a format that supports macros, like .xlsm

And look at the code. I mean, it's ugly, but it's easy to understand. What other things could you implement that would just simplify your work and allow Excel files to be smarter, without having to code an entire Excel add-in? I mean, I could just create my own GenerateSqlInsert function that would handle column names, NULL values, etc. 

Here is the TEXTJOIN mimicking UDF to insert in a module:

Function TEXTJOIN(delim As String, skipblank As Boolean, arr)
    Dim d As Long
    Dim c As Long
    Dim arr2()
    Dim t As Long, y As Long
    t = -1
    y = -1
    If TypeName(arr) = "Range" Then
        arr2 = arr.Value
    Else
        arr2 = arr
    End If
    On Error Resume Next
    t = UBound(arr2, 2)
    y = UBound(arr2, 1)
    On Error GoTo 0

    If t >= 0 And y >= 0 Then
        For c = LBound(arr2, 1) To UBound(arr2, 1)
            For d = LBound(arr2, 1) To UBound(arr2, 2)
                If arr2(c, d) <> "" Or Not skipblank Then
                    TEXTJOIN = TEXTJOIN & arr2(c, d) & delim
                End If
            Next d
        Next c
    Else
        For c = LBound(arr2) To UBound(arr2)
            If arr2(c) <> "" Or Not skipblank Then
                TEXTJOIN = TEXTJOIN & arr2(c) & delim
            End If
        Next c
    End If
    TEXTJOIN = Left(TEXTJOIN, Len(TEXTJOIN) - Len(delim))
End Function

Hope it helps!

  I haven't been working on the Sift string distance algorithm for a while, but then I was reminded of it because someone wanted it to use it to suggest corrections to user input. Something like Google's: "Did you mean...?" or like an autocomplete application. And it got me thinking of ways to use Sift for bulk searching. I am still thinking about it, but in the meanwhile, this can be achieved using the Sift4 algorithm, with up to 40% improvement in speed to the naïve comparison with each item in the list.

  Testing this solution, I've realized that the maxDistance parameter did not work correctly. I apologize. The code is now fixed on the algorithm's blog post, so go and get it.

  So what is this solution for mass search? We can use two pieces of knowledge about the problem space:

  • the minimum possible distance between two string of length l1 and l2 will always abs(l1-l2)
    • it's very easy to understand the intuition behind it: one cannot generate a string of size 5 from a string of size 3 without at least adding two new letters, so the minimum distance would be 2
  • as we advance through the list of strings, we have a best distance value that we keep updating
    • this molds very well on the maxDistance option of Sift4

  Thus armed, we can find the best matches for our string from a list using the following steps:

  1. set a bestDistance variable to a very large value
  2. set a matches variable to an empty list
  3. for each of the strings in the list:
    1. compare the minimum distance between the search string and the string in the list (abs(l1-l2)) to bestDistance
      1. if the minimum distance is larger than bestDistance, ignore the string and move to the next
    2. use Sift4 to get the distance between the search string and the string in the list, using bestDistance as the maxDistance parameter
      1. if the algorithm reaches a temporary distance that is larger than bestDistance, it will break early and report the temporary distance, which we will ignore
    3. if distance<bestDistance, then clear the matches list and add the string to it, updating bestDistance to distance
    4. if distance=bestDistance, then add the string to the list of matches

  When using the common Sift4 version, which doesn't compute transpositions, the list of matches is retrieved 40% faster on average than simply searching through the list of strings and updating the distance. (about 15% faster with transpositions) Considering that Sift4 is already a lot faster than Levenshtein, this method will allow searching through hundreds of thousands of strings really fast. The gained time can be used to further refine the matches list using a slower, but more precise algorithm, like Levenshtein, only on a lot smaller set of possible matches.

  Here is a sample written in JavaScript, where we search a random string in the list of English words:

search = getRandomString(); // this is the search string
let matches=[];             // the list of found matches
let bestDistance=1000000;   // the smaller distance to our search found so far
const maxOffset=5;          // a common value for searching similar strings
const l = search.length;    // the length of the search string
for (let word of english) {
    const minDist=Math.abs(l-word.length); // minimum possible distance
    if (minDist>bestDistance) continue;    // if too large, just exit
    const dist=sift4(search,word,maxOffset,bestDistance);
    if (dist<bestDistance) {
        matches = [word];                  // new array with a single item
        bestDistance=dist;
        if (bestDistance==0) break;        // if an exact match, we can exit (optional)
    } else if (dist==bestDistance) {
        matches.push(word);                // add the match to the list
    }
}

  There are further optimizations that can be added, beyond the scope of this post:

  • words can be grouped by length and the minimum distance check can be done on entire buckets of strings of the same lengths
  • words can be sorted, and when a string is rejected as a match, reject all string with the same prefix
    • this requires an update of the Sift algorithm to return the offset at which it stopped (to which the maxOffset must be added)

  I am still thinking of performance improvements. The transposition table gives more control over the precision of the search, but it's rather inefficient and resource consuming, not to mention adding code complexity, making the algorithm harder to read. If I can't find a way to simplify and improve the speed of using transpositions I might give up entirely on the concept. Also, some sort of data structure could be created - regardless of how much time and space is required, assuming that the list of strings to search is large and constant and the number of searches will be very big.

  Let me know what you think in the comments!

  Today I had a very interesting discussion with a colleague who optimized my work in Microsoft's SQL Server by replacing a table variable with a temporary table. Which is annoying, since I've done the opposite plenty of time, thinking that I am choosing the best solution. After all, temporary tables have the overhead of being stored into tempdb, on the disk. What could possibly be wrong with using a table variables? I believe this table explains it all:

First of all, the storage is the same. How? Well, table variables start off in memory, but if they go above a limit they get saved to tempdb! Another interesting bit is the indexes. While you can create primary keys on table variables, you can't use other indexes - that's OK, though, because you would hardly need very complex variable tables. But then there is the parallelism: none for table variables! As you will see, that's rather important. At least table variables don't cause recompilations. And last, but certainly not least, perhaps the most important difference: statistics! You don't have statistics on table variables.

Let's consider my scenario: I was executing a stored procedure and storing the selected values in a table variable. This SP had the single reason to filter the ids of records that I would then have to extract - joining them with a lot of other tables - and could return 200, 800 or several hundred thousand rows.

With a table variable this means :

  1. when inserting potentially hundreds of thousands of rows I would have no parallelism (slow!) and it would probably save it to tempdb anyway (slow!)
  2. when joining other tables with it, not having statistics, it would just treat it like a short list of values, which it potentially wasn't, and looping through it : Table Spool (slow!)
  3. various profiling tools would show the same or even less physical reads and the same SQL server execution time, but the CPU time would be larger than execution time (hidden slow!)

This situation has been improved considerably in SQL Server 2019, to the point that in most cases table variables and temporary tables show the same performance, but versions previous to that would show this to a larger degree.

And then there are hacks. For my example, there is reason why parallelism DOES occur:

So are temporary tables always better? No. There are several advantages of table variables:

  1. they get cleared automatically at the end of their scope
  2. result in fewer recompilations of stored procedures
  3. less locking and resources, since they don't have transaction logs

For many simple situations, like where you want to generate some small quantity of data and then work with that, table variables are best. However, as soon as the data size or scenario complexity increases, temporary tables become better.

As always, don't believe me, test! In SQL everything "depends", you can't rely on fixed rules like "X is always better" so profile your particular scenarios and see which solution is better.

Hope it helps!

  I had this situation where I was trying to optimize a query. And after some investigation I've stumbled upon something strange: querying on the primary key was generating a lot of reads. I was joining my table with a temporary table of 10 ids and there were 630 reads! How come?

  At first I thought it was because the way indexes work. The primary key was comprised of RowId and RowDate and, even if I knew theoretically searching by RowId should use the primary key, the evidence was against me: when querying by RowId and RowDate I would get the expected 10 reads.

  I created two queries, one with and one without RowDate. I then compared their execution plans. They were identical! Only one took a lot longer, specifically in the Index Seek (which used correctly the primary key). When I looked at the properties for that plan element, I saw something strange:

Actual Partitions Accessed 1..63!

I then realized that the table was partitioned on the RowDate column. In this case, RowDate takes precedence to any indexed column! You might think of partitioning a table like forcefully adding the partition columns to every index in the table, including the primary key. In fact, a partitioned table acts like a number of separate tables with the same definition (columns, indexes, etc.), just different data. The indexes work on each separate partition. When you partition a table, you also partition its indexes.

In truth, I would have expected the query execution plan to show the partition split as a separate step. I understand it's hard to conceptualize it without creating as many execution paths as there are partitions, but still, there should be an indication in the shape of the plan that makes it clear you are querying on multiple partitions.

Once RowDate was used, the SQL engine would choose the one partition of my row, then use the primary key index to seek it. Instead of 63*10 reads, just 10 reads, the number of the rows in the id table.

So be careful when you use table partitioning to ALWAYS use the partition columns in the queries for the table, else you will get as many parallel searches as there are partitions, regardless of the indexes you created, as they are also partitioned.

Hope that helps!

  This is a very basic tutorial on how to access Microsoft SQL Server data via SQL queries. Since these are generic concepts, they will be applicable in most other SQL variants out there. My hope is that it will provide the necessary tools to quickly "get into it" without having to read (or understand) too much. Where you go from there is on you.

  There are a lot of basic concepts about SQL, this post will be pretty long.

Table of contents

Connecting to a database

  Let's start with tooling. To access a database you will need SQL Server Management Studio, in my case version 2022, but I will not do anything complicated with it here, therefore any version will do just fine. I will assume you have it installed already as installation is beyond the scope of the blog post. Starting it will prompt for a connection:

  To connect to the local computer, the server will be either . or (local) or the computer name. You can of course connect to any server and you can specify the "instance" and the port number as well. An instance is a specific named installation of SQL server which allows one to have multiple installations (and even versions) of SQL Server. In fact, each instance has its own port, so specifying the port number will ignore the name of the instance. The default port is usually 1433.

  Example of connection server strings: Computer1\SQLEXPRESS, sql.corporate.com,1433, (local), .

  The image here is from a connection to the local machine using Windows Authentication (your windows user). You can connect using SQL Server Authentication, which means providing a username and a password, or using one of the more modern Azure Active Directory methods.

  I will also assume that the connection parameters are known to you, so let's go to the next step.

  Once connected, the Object Explorer window will display the connection you've opened.

  Expanding the Databases node will show the available databases.

  Expanding a database node we get the objects that are part of the database, the most important being:

  • Tables - where the actual data resides
  • Views - abstractions over more complex queries that behave like tables as much as possible, but with some restrictions
  • Stored Procedures - SQL code that can be executed with parameters and may return data results
  • Functions - SQL code that can be executed and returns a value (which can be scalar, like a number of string, or a table type, etc.) 

  In essence they are the equivalent of data stores and code that is executed to use those stores. Views, SPs and functions will not be explained in this post, but feel free to read about them afterwards.

  If one expands a table node, the child nodes will contains various things, the most important of which are:

  • Columns - the names and types of each column in the table
  • Indexes - data structures designed to increase performance to various ways of accessing the data in the table
  • Constraints and Keys - logical restrictions and relationships between tables

  Tables are kind of like Excel sheets, they have rows (data records) and columns (record properties). The power of SQL is a way to declare what you want from tabular representations of data and get the results quickly and efficiently.

  Last thing I want to show from the graphical interface is right clicking on a table node, which shows multiple options, including generating simple operations on the table, the CRUD (Create, Read, Update, Delete) operations mostly, which in SQL are called INSERT, SELECT, UPDATE and DELETE respectively.

  The keywords are traditionally written in all caps, I am not shouting at you. Depending on your preferences and of course the coding standards that apply to your project you can capitalize SQL code however you like. SQL is case insensitive.

Anyway, whatever you are going to choose to "script" it's going to open a so called query window and show you a text with the query. You then have the option of executing it. Normally no one uses the UI to generate scripts except for getting the column names in order for SELECT or INSERT operations. Most of the time you will just right click on a database and choose New Query or select a database and press Ctrl-N, with the same result.

Getting data from tables

Finally we get to doing something. The operation to read data from SQL is called SELECT. One can specify the columns to be returned or just use * to get them all. It is good practice to always specify the column names in production code, even if you intend to select all columns, as the output of the query will not change if we add more columns in the future. However, we will not be discussing software projects, just how to get or change the data using SQL server, so let's get to it.

The simplest select query is: SELECT * FROM MyTable, which will return all columns of all records of the table. Note that MyTable is the name of a table and the least specific way of accessing that table. The same query can be written as: SELECT * FROM [MyDatabase].[dbo].[MyTable], specifying the database name, the schema name (default one is dbo, but your database can use multiple ones) and only then the table name.

The square bracket syntax is usually not required, but might be needed in special cases, like when a column has the same name as a keyword or if an object has spaces or commas in it (never a good idea, but a distinct possibility), for example: SELECT [Stupid,column] FROM [Stupid table name with spaces]. Here we are selecting a badly named column from a badly named table. Removing the square brackets would result in a syntax error.

In the example above we selected stuff from table CasesSince100 and we got tabular results for every record and the columns defined in the table. But that is not really useful. What we want to do when getting data is:

  • getting data from specific columns
  • formatting the data for our purposes
  • filtering the data on conditions
  • grouping the data
  • ordering the results

So here is a more complex query:

-- everything after two dashes in a line is a comment, ignored by the engine
/* there is also
   a multiline comment syntax */
SELECT TOP 10                            -- just the first 10 records
    c.Entity as Country,                 -- Entity will be returned with the name Country
    CAST(c.[Date] as Date) as [Date],    -- Unfortunate naming, as Date is also a type
    c.cases as Cases                     -- capitalized alias
FROM CasesSince100 c                     -- source for the data, aliased as 'c'
WHERE c.Code='ROU'                       -- conditions to filter by
    AND c.[Date]>'2020-03-01'
ORDER BY c.[Date] DESC                   -- ordering in descending order

  The query above will return at most 10 rows, only for Romania, for dates larger than March 2020, but ordered from the newest to oldest. Data returned will be the country name, the date (which was originally a DATETIME and now is cast to a timeless DATE type) and the number of cases.

  Note that I have aliased all columns, so the resulting table has columns named as the aliases. I've also aliased the table name as 'c', which helps in several ways. First of all, Intellisense works better and faster when specifying the table name. All you have to do is type c. and the list of columns will pop up and be filtered as you type. The second reason will become apparent when I am talking about updating and deleting. For the moment just remember that it's a good idea to alias your tables.

  You can alias a table by specifying a name to call it by next to its own name and optionally using 'as', like SELECT ltn.* FROM Schema.LongTableName as ltn. It helps differentiating between ambiguous names (like if two joined tables have columns with the same name), simplifying the code for long named tables and helping with code completion. Even when aliased, the table name can be used and one can specify or ignore the name of the table if the column names are unambiguous.

Of course these are trivial examples. The power of SQL is that you can get information from multiple sources, aggregate them and structure your database for quick access. More advanced concepts are JOINs and indexes, and I hope you will read until I get there, but for now let's just go through the very basics.

Here is another query that groups and aggregates data:

SELECT TOP 10                            -- top 10 results
    c.Entity as Country,                 -- country name
    SUM(CAST(c.cases as INT)) as Cases   -- cases is text, so we transform it to int
FROM CasesSince100 c
WHERE YEAR([Date])=2020                  -- condition applies a function to the date
GROUP BY c.Entity                        -- groups by country
HAVING SUM(CAST(c.cases as INT))<1000000 -- this is filtering on grouped values
ORDER BY SUM(CAST(c.cases as INT)) DESC  -- order on sum of cases

This query will show us the top 10 countries and the total sum of cases in year 2020, but only for countries where that total is less than a million. There is a lot to unpack here:

  • cases column is declared as NVARCHAR(150) meaning Unicode strings of varied length, but at most 150 characters, so we need to cast it to INT (integer) to be able to apply summing to it
  • there are two different ways of filtering: WHERE, which applies to the data before grouping, then HAVING, which applies to data after grouping
  • filtering, grouping, ordering all work with unaliased columns, so even if Entity is returned as Country, I cannot do WHERE Country='Romania'
  • grouping allows to get a row for each combination of the columns the grouping is done and compute some sort of aggregation (in the case above, a sum of cases per country)

Here are the results:

Let me rewrite this in a way that is more readable using what is called a subquery, in other words a query from which I will query once again:

SELECT TOP 10
    Country,
	SUM(Cases) as Cases
FROM (
    SELECT
        c.Entity as Country,
        CAST(c.cases as INT) as Cases,
	    YEAR([Date]) as [Year]
FROM CasesSince100 c
) x
WHERE [Year]=2020
GROUP BY Country
HAVING SUM(Cases)<1000000
ORDER BY Cases DESC

Note that I still have to use SUM(Cases) in the HAVING clause. I could have grouped it in another subquery and selected again and so on. In order to select from a subquery, you need to name it (in our case, we named it x). Also I selected Country from x, which I could have also written as x.Country. As I said before, table names (aliased or not) are optional if the column name if unambiguous. Also you may notice that I've given a name to the summed column. I could have skipped that, but that would mean the resulting columns would have had no name and the query itself would have been difficult to use in code (extracted column values would have had to be retrieved by index and not by name, which is never recommended).

If you think about it, the order of the clauses in a SELECT operation has a major flaw: you are supposed to write SELECT, then specify what columns you want and only then specify where you want the columns to be read from. This makes code completion problematic, which is why the in code query language for .NET (LInQ) puts the selection at the end. But even so there is a trick:

  • SELECT * and then complete the query
  • go back and replace the * with the column names you want to extract (you will now have Intellisense code completion)
  • the alias of the tables will now come in handy, but even without aliases one can press Ctrl-Space and get a list of possible values to select

Defining tables and inserting data

Before we start inserting information, let's create a table:

CREATE TABLE Food(
    Id INT IDENTITY(1,1) PRIMARY KEY,
    FoodName NVARCHAR(100),
    Quantity INT
)

One important concept in SQL is the primary key. It is a good idea in most cases that your tables have a primary key which identifies each record uniquely and also makes them easy to reference. Let me give you an example. Let's assume that we would put no Id column in our Food table and then we would accidentally add cheese twice. How would you reference the first record as opposed to the second? How would you delete the second one?

A primary key is actually just a special case of a unique index, clustered by default. We will get to indexes later, so don't worry about that yet. Enough to remember that it is fastest (most efficient) to find records by the primary key than any other column combination and the way records are uniquely identified. 

The IDENTITY(1,1) notation tells SQL Server that we will not insert values in that column and instead let it put values starting with 1, then increasing with 1 each time. That functionality will become clear when we INSERT data in the table:

INSERT INTO Food(FoodName,Quantity)
VALUES('Bread',1),('Cheese',1),('Pork',2),('Chilly',10)

Selecting from our Food table now gets us these results:

As you can see, we've inserted four records, by only specifying two out of three columns - we skipped Id. Yet SQL has filled the column with values from 1 to 4, starting with 1 and incrementing each time with 1.

The VALUES syntax is specifying inline data, but we could, in fact, insert into a table the results of a query, something like this:

INSERT INTO Food(FoodName,Quantity)
SELECT [Name],Quantity
FROM Store
WHERE [Type]='Food'

There is another syntax for insert that is useful with what are called temporary tables, tables created for the purpose of your session (lifetime of the query window) and that will automatically disappear once the session is over. It looks like this:

SELECT FoodName,Quantity
INTO #temp
FROM Food

This will create a table (temporary because of the # sign in front of it) that will have just FoodName and Quantity as columns, then proceed on saving the data there. This table will not have a primary key nor any types of indexes and it will work as a simple dump of the data selected. You can add indexes later or alter the table in any way you want, it works just like a regular table. While a convenient syntax (you don't have to write a CREATE TABLE query or think of the type of columns) it has a limited usefulness and I recommend not using it in application code.

Just as one creates a table, there are DROP TABLE and ALTER TABLE statements that delete or change the structure of the table, but we won't go into that.

Changing existing data

So now we have some data in a table that we have defined. We will see how the alias syntax I discussed in the SELECT section will come in handy. In short, I propose you use just two basic syntax forms for all CRUD operations: one for INSERT and one for SELECT, UPDATE and DELETE.

But how can you use the same syntax for statements that are so different, I hear you ask? Let me give you some example of similar code doing just that before I dive in what each operation does.

SELECT *
FROM Food f
WHERE f.Id=4

UPDATE f
SET f.Quantity=9
FROM Food f
WHERE f.Id=4

DELETE FROM f
FROM Food f
WHERE f.Id=4

The last two lines of all operations are exactly the same. These are simple queries, but imagine you have a complex one to craft. The first thing you want to see is that you are updating or deleting the right thing, therefore it makes sense to start with a SELECT query instead, then change it to a DELETE or UPDATE when satisfied. You see I UPDATE and DELETE using the alias I gave the table.

When first learning UPDATE and DELETE statements, one usually gets to this syntax:

UPDATE Food     -- using the table name is cumbersome if in a complex query
SET Quantity=9  -- unless using Food.Quantity and Food.Id
WHERE Id=4      -- you don't get easy Intellisense

DELETE          -- this seems a lot easier to remember
FROM Food       -- but it only works with one table in a simple query
WHERE Id=4

I've outlined some of the reasons I don't use this syntax in the comments, but the most important reason why one shouldn't use them except for very simplistic cases is that you are trying to create a query to destructively change the data in the database and there is no fool proof way to duplicate the same logic in a SELECT query to verify what you are going to change. I've seen people (read that as: I was dumb enough to do it myself) who created an entire different SELECT statement to verify what they would do, then realize to their horror the statements were not equivalent and they had updated or deleted the wrong thing!

OK, let's look at UPDATE and DELETE a little closer.

One of the useful clauses for these statements is, just like with SELECT, the TOP clause, which instructs SQL to affect just a finite number of rows. However, because TOP has been added later for write operations, you need to encase the value (or variable) in parentheses. For SELECT you can skip the parentheses for constant values (you still need them for variables)

DELETE TOP (10) FROM MyTable

Another interesting clause, that frankly I have not used a lot, but is essential in some specific cases, is OUTPUT. One can delete or update some rows and at the same time get the rows they have changed. The reason being that first of all in a DELETE statement the rows will be gone, so you won't be able to SELECT them again. But even in an UPDATE operation, the rows chosen to be updated by a query may not be the same if you execute them again. 

SQL does not guarantee the order of rows unless specifically using ORDER BY. So if you execute SELECT TOP 10 * FROM MyTable twice, you may get two different results. Moreover, between the time you UPDATE some rows and you SELECT them in another query, things may change because of other processes running at the same time on the same data.

So let's say we have some for of Invoices and Items tables that reference each other. You want to delete one invoice and all the items associated with it. There is no way of telling SQL to DELETE from multiple tables at the same time, so you DELETE the invoice, OUTPUT its Id, then delete the items for that Id.

CREATE TABLE #deleted(Id INT) -- temporary table, but explicitly created

DELETE FROM Invoice 
OUTPUT Deleted.Id    -- here Deleted is a keyword
INTO #deleted        -- the Id from the deleted rows will be stored here
WHERE Id=2           -- and can be even be restored from there

DELETE 
FROM Item
WHERE Id IN (
  SELECT Id FROM #deleted
)  -- a subquery used in a DELETE statement

-- same thing can be written as:
DELETE FROM i
FROM Item i
INNER JOIN #deleted d  -- I will get to JOINs soon
ON i.Id=d.Id

I have been informed that the INTO syntax is confusing and indeed it is:

  • SELECTing INTO will create a new table with results and throw an exception if the table already exists. The table will have the names and types of the selected values, which may be what one wants for a quick data dump, but it may also cause issues. For example the following query would throw an exception:
    SELECT 'Blog' as [Name]
    INTO #temp
    
    INSERT INTO #temp([Name]) -- String or binary data would be truncated error
    VALUES('Siderite')
    ​

    because the Name column of the new temporary table would be VARCHAR(4), just like 'Blog' and 'Siderite' would be too long

  • UPDATEing or DELETEing with OUTPUT INTO will require an existing table with the same number and types of columns as the columns specified in the OUTPUT clause and will throw an exception if it doesn't exist

One can use derived values in UPDATE statements, not just constants. One can reference the columns already existing or use any type of function that would be allowed in a similar SELECT statement. For example, here is a query to get the tax value of each row and the equivalent update to store it into a separate column:

SELECT
    i.Price, 
    i.TaxPercent, 
    i.Price*(i.TaxPercent/100) as Tax  -- best practice: SELECT first
FROM Item i

UPDATE i
SET Tax = i.Price*(i.TaxPercent/100)   -- UPDATE next
FROM Item i

So here we first do a SELECT, to see if the values we have and calculate are correct and, if satisfied, we UPDATE using the same logic. Always SELECT before you change data, so you know you are changing the right thing.

There is another trick to help you work safely, one that works on small volumes of data, which involves transactions. Transactions are atomic operations (all or nothing) which are defined by starting them with BEGIN TRANSACTION and are finalized with either COMMIT TRANSACTION (save the changes to the database) or ROLLBACK TRANSACTION (revert changes to the database). Transactions are an advanced concept also, so read about it yourself, but remember one can do the following:

  • open a new query window
  • execute BEGIN TRANSACTION
  • do almost anything in the query window
  • if satisfied with the result execute COMMIT TRANSACTION
  • if any issue with what you've done execute ROLLBACK TRANSACTION to undo the changes

Note that this only applies for stuff you do in that query window. Also, all of these operations are being saved in the log of the database, so this works only with small amounts of data. Attempting to do this with large amounts of data will practically duplicate it on disk and take a long time to execute and revert.

The NULL value

We need a quick primer on what NULL is. NULL is a placeholder for a value that was not set or is considered unknown. It's a non-value. It is similar to null in C# or JavaScript, but with some significant differences applicable to SQL only. For example, a NULL value (an oxymoron for sure) will never be equal to (or not equal to) or less than or greater than anything. One might expect to get all the values in a table in these two queries: SELECT * FROM MyTable WHERE Value>5 and SELECT * FROM MyTable WHERE Value<=5. But if any rows will have NULL for a Value, then they will not appear in any of the query results. That applies to the negation operator NOT as well: SELECT * FROM MyTable WHERE NOT (Value>5).

This behavior can be changed by using SET ANSI_NULLS OFF, but I am yet to see a database that has ever been set up like this.

To check if a value is or is not NULL, one uses the IS and IS NOT syntax :)

SELECT *
FROM MyTable
WHERE MyValue IS NOT NULL

The NULL concept will be used a lot in the next chapter.

Combining data from multiple sources

We finally go to JOIN operations. In most scenarios, you have a database containing multiple table, with intricate connections between them. Invoices that have items, customers, the employee that processed it, dates, departments, store quantities, etc., all referencing something. Integrating data from multiple tables is a complex subject, but I will touch just the most common and important parts:

  • INNER JOIN
  • OUTER JOIN
  • EXISTS
  • UNION / UNION ALL

Let's write a query that displays the name of employees and their department. I will show the CREATE TABLE statements, too, in order to see where we get the data from:

CREATE TABLE Employee (
  EmployeeId INT,          -- Best practice: descriptive column names
  FirstName NVARCHAR(100),
  LastName NVARCHAR(100),
  DepartmentId INT)        -- Best practice: use same name for the same thing

CREATE TABLE Department (
  DepartmentId INT,        -- same thing here
  DepartmentName NVARCHAR(100)
)

SELECT
    CONCAT(FirstName,' ',LastName) as Employee,
    DepartmentName
FROM Employee e
INNER JOIN Department d
ON e.DepartmentId=d.DepartmentId

Here it is: INNER JOIN, a clause that combines the data from two tables based ON a condition or series of conditions. For each row of Employee we are looking for the corresponding row of Department. In this example, one employee belongs to only one department, but a department can hold multiple employees. It's what we call a "one to many relationship". One can have "one to one" or "many to many" relationships as well. That is very important when trying to gauge performance (and number of returned rows).

Our query will only find at most one department for each employee, so for 10 employees we will get at most 10 rows of data. Why do I say "at most"? Because the DepartmentId for some employees might not have a corresponding department row in the Department table. INNER JOIN will not generate records if there is no match. But what if I want to see all employees, regardless if their department exists or not? Then we use an OUTER JOIN:

SELECT
    CONCAT(FirstName,' ',LastName) as Employee,
    DepartmentName
FROM Employee e
LEFT OUTER JOIN Department d
ON e.DepartmentId=d.DepartmentId

This will generate results for each Employee and their Department, but show a NULL (without value) result if the department does not exist. In this case LEFT is used to define that there will be rows for each record in the left table (Employee). We could have used RIGHT, in which case we would have rows for each department and NULL values for departments that have no employees. There is also the FULL OUTER JOIN option, in which case we will get both departments with NULL employees if none are attached and employees with NULL departments in case the department does not exist (or the employee is not assigned - DepartmentId is NULL)

Note that the keywords INNER and OUTER are completely optional. JOIN is the same thing as INNER JOIN and LEFT JOIN is the same as LEFT OUTER JOIN. I find that specifying them makes the code more readable, but that's a personal choice.

The OUTER JOINs are sometimes used in a non intuitive way to find records that have no match in another table. Here is a query that shows employees that are not assigned to a department:

SELECT
    CONCAT(FirstName,' ',LastName) as Employee
FROM Employee e
LEFT OUTER JOIN Department d
ON e.DepartmentId=d.DepartmentId
WHERE d.DepartmentId IS NULL

Until now, we talked about the WHERE clause as a filter that is applied first (before grouping) so one might intuitively have assumed that the WHERE clauses are applied immediately on the tables we get the data from. If that were the case, then this query would never return anything, because every Department will have a DepartmentId. Instead, what happens here is the tables are LEFT JOINed, then the WHERE clause applies next. In the case of unassigned employees, the department id or name will be NULL, so that is what we are filtering on.

So what happens above is:

  • the Employee table is LEFT JOINed with the Department table
  • for each employee (left) there will be rows that contain the values of the Employee table rows and the values of any matched Department table rows
  • in the case there is no match, NULL values will be returned for the Department table for all columns
  • when we filter by Department.DepartmentId being NULL we don't mean any Department that doesn't have an Id (which is impossible) but any Employee row with no matching Department row, which will have a NULL value where the Department.DepartmentId value would have been in case of a match.
  • not matching can happen for two reasons: Employee.DepartmentId is NULL (meaning the employee has not been assigned to a department) or the value stored there has no associated Department (the department may have been removed for some reason)

Also, note that if we are joining tables on some condition we have to be extra careful with NULL values. Here is how one would join two tables on VARCHAR columns being equal even when NULL:

SELECT *
FROM Table1 t1
INNER JOIN Table2 t2
ON (t1.Value IS NULL AND t2.Value IS NULL) OR t1.Value=t2.Value

SELECT *
FROM Table1 t1
INNER JOIN Table2 t2
ON ISNULL(t1.Value,'')=ISNULL(t2.Value,'')

The second syntax seems promising, doesn't it? It is more readable for sure. Unfortunately, it introduces some assumptions and also decreases the performance of the query (we will talk about performance later on). The assumption is that if Value is an empty string, then it's the same as having no value (being NULL). One could use something like ISNULL(Value,'--NULL--') but now it starts looking worse.

There are other ways of joining two tables (or queries, or table variables, or table functions, etc.), for example by using the IN or the EXISTS/NOT EXISTS clauses or subqueries. Here are some examples:

SELECT *
FROM Table1
WHERE MyValue IN (SELECT MyValue FROM Table2)

SELECT *
FROM Table1
WHERE MyValue = (SELECT TOP 1 MyValue FROM Table2 WHERE Table1.MyValue=Table2.MyValue)

SELECT *
FROM Table1
WHERE NOT EXISTS(SELECT * FROM Table2 WHERE Table1.MyValue=Table2.MyValue)

These are less readable, usually have terrible performance and may not return what you expect them to return.

When I was learning SQL, I thought using a JOIN would be optimal on all cases and subqueries in the WHERE clause were all bad, no exception. That is, in fact, false. There is a specific case where it is better to use a subquery in WHERE instead of JOIN, and that is when trying to find records that have at least one match. It is better to use EXISTS because it is short-circuiting logic which leads to better performance.

Here is an example with different syntax for achieving the same goal:

SELECT DISTINCT d.DepartmentId
FROM Department d
INNER JOIN Employee e
ON e.DepartmentId=d.DepartmentId

SELECT d.DepartmentId
FROM Department d
WHERE EXISTS(SELECT * FROM Employee e WHERE e.DepartmentId=d.DepartmentId)

Here, the search for departments with employees will return the same thing, but in the first situation it will get all employees for all departments, then list the department ids that had employees, while in the second query the department will be returned the moment just one employee that matches is found.

There is another way of combining data from two sources and that is to UNION two or multiple result sets. It is the equivalent of taking rows from multiple sources of the same type and showing them together in the same result set.

Here is a dummy example:

SELECT 1 as Id
UNION
SELECT 2
UNION
SELECT 2

And we execute it and...

What happened? Shouldn't there have been three values? Somehow, when copy pasting the silly example, you added two identical values. UNION will add only distinct values to the result set. using UNION ALL will show all three values.

SELECT 1 as Id
UNION ALL
SELECT 2
UNION ALL
SELECT 2

SELECT DISTINCT Id FROM (
  SELECT 1 as Id
  UNION ALL
  SELECT 2
  UNION ALL
  SELECT 2
) x

The first query will return 1,2,2 and the second will be the equivalent of the UNION one, returning 1 and 2. Note the DISTINCT keyword.

My recommendation is to never use UNION and instead use UNION ALL everywhere, unless it makes some kind of sense for a very specific scenario, because the operation to DISTINCT values is expensive, especially for many and/or large columns. When results are supposed to be different anyway, UNION and UNION ALL will return the same output, but UNION is going to perform one more pointless distinct operation.

After learning about JOIN, my request to start with SELECT queries and only them modify them to be UPDATE or DELETE begins to make more sense. Take a look at this query:

UPDATE d
SET ToFindManager=1
--SELECT *
FROM Department d
LEFT OUTER JOIN Employee e
ON d.DepartmentId=e.DepartmentId
AND e.[Role]='Manager'
WHERE e.EmployeeId IS NULL

This will set ToFindManager in departments that have no corresponding manager. But if you select the text from SELECT * on and then execute, you will get the results that you are going to update. Same query, executing by selecting different sections of it will either verify or perform the operation.

Indexes and relationships. Performance.

We have seen how to define tables, how to insert, select, update and delete records from them. We've also seen how to integrate data from multiple sources to get what we want. The SQL engine will take our queries, try to understand what we meant, optimize the execution, then give us the results. However, with large enough data, no amount of query optimization will help if the relationships between tables are not properly defined and tables are not prepared for the kind of queries we will execute.

This requires an introduction to indexes, which is a rather advanced idea, both in terms of how to create, use, debug and profile, but also as a computer science concept. I will try to stick to the basics here, and you go and get more in depth from here.

What is an index? It's a separate data structure that will allow quick access to specific parts of the original data. A table of contents in a blog post is an index. It allows you to quickly jump to the section of the post without having to read it all. There are many types of indexes and they are used in different ways.

We've talked about the primary key: (unless specified differently) it's a CLUSTERED, UNIQUE index. It can be on a single column or a combination of columns. Normally, the primary key will be the preferred way to find or join records on, as it physically rearranges the table records in order and insures only one record has a particular primary key.

The difference between CLUSTERED and NONCLUSTERED indexes is that a table can have only one clustered index, which will determine the physical order of record data on the disk. As an example, let's consider a simple table with a single integer column called X. If there is a clustered index on X, then when inserting new values, data will be moved around on the disk to account for this:

CREATE TABLE Test(X INT PRIMARY KEY)

INSERT INTO Test VALUES (10),(1),(20)

INSERT INTO Test VALUES (2),(3)

DELETE FROM Test WHERE X=1

After inserting 10,1 and 20, data on the disk will be in the order of X: a 1, followed by a 10, then a 20. When we insert values 2 and 3, 10 and 20 will have to be moved so that 2 and 3 are inserted. Then, after deleting 1, all data will be moved so that the final physical order of the data (the actual file on the disk holding the database data) will be 2,3,10,20. This will help optimize not only finding the rows, but also efficiently reading them from disk (disk access is the most expensive operation for a database). 

Note: deletion is working a little differently in reality, but in theory this is how it would work.

Nonclustered indexes, on the other hand, keep their own order and reference the records from the original data. For such a simple example as above, the result would be almost identical, but imagine you have the Employee table and you create a nonclustered index on LastName. This means that behind the scenes, a data structure that looks like a table is created, which is ordered by LastName and contains another column for EmployeeId (which is the primary key, the identifier of an employee). When you do SELECT * FROM Employee ORDER BY LastName, the index will be used to first get a list of ids, then select the values from them.

A UNIQUE index also insures that no two records will have the same combination of values as defined therein. In the case of the primary key, there cannot be two records with the same id. But one can imagine something like:

CREATE UNIQUE INDEX IX_Employee_Name ON Employee(FirstName,LastName)

INSERT INTO Employee (FirstName,LastName)
VALUES('Siderite','Blog')

IX_Employee_Name is a nonclustered unique index on FirstName and LastName. If you execute the insert, it will work the first time, but fail the second time:

There is another type of index-like structure called a foreign key. It should be used to define logical relationships between tables. For the Department table, DepartmentId should be a primary key, but in the Employee table, DepartmentId should be defined as a foreign key connecting to the column in the Department table.

Important note: a foreign key defines the relationship, but doesn't index the column. A separate index should be added on the Employee.DepartmentId column for performance reasons.

I don't want to get into foreign keys here. Suffice to say that once this relationship is defined, some things can be achieved automatically, like deleting corresponding Item records by the engine when deleting Invoices. Also the performance of JOIN queries increases.

Indexes can be used not only on equality, but also other more complex cases: numerical ranges, prefixes, etc. It is important to understand how they are structured, so you know when to use them.

Let's consider the IX_Employee_Name index. The index is practically creating a tree structure on the concatenation of the first and last name of the employee and stores the primary key columns for the table for reference. It will work great for increasing performance of a query like SELECT * FROM Employee ORDER BY FirstName or SELECT * FROM Employee WHERE FirstName LIKE 'Sid%'. However it will not work for LastName queries or contains queries like SELECT * FROM Employee ORDER BY LastName or SELECT * FROM Employee WHERE FirstName LIKE '%derit%'.

That's important because sometimes simpler queries will take more resources than more complicated ones. Here is a dumb example:

CREATE INDEX IX_Employee_Dumb ON Employee(
    FirstName,
    DepartmentId,
    LastName
)

SELECT *
FROM Employee e
WHERE e.FirstName='Siderite'
  AND e.LastName='Blog'

SELECT *
FROM Employee e
WHERE e.FirstName='Siderite'
  AND e.LastName='Blog'
  AND e.DepartmentId=1

The index we create is called IX_Employee_Dumb and it creates a data structure to help find rows by FirstName, DepartmentId and LastName in that order. 

For some reason, in our employee table there are a lot of people called Siderite, but with different departments and last names. The first query will use the index to find all Siderite employees (fast), then look into each and check if LastName is 'Blog' (slow). The second query will directly find the Siderite Blog employee from department with id 1 (fast), because it uses all columns in the index. As you can see, the order of columns in the index is important, because without the DepartmentId in the WHERE clause, only the first part of the index, for FirstName, can be used. In the last query, because we specify all columns, the entire index can be used to efficiently locate the matching rows. 

Note 2022-09-06: Partitioning a table (advanced concept) takes precedence to indexes. I had a situation where a table was partitioned on column RowDate into 63 partitions. The primary key was RowId, but when you SELECTed on RowId, there were 63 index seeks performed. If queried on RowId AND RowDate, it went to the containing partition and did only one index seek inside it. So careful with partitioning. It only provides a benefit if you query on the columns you use to partition on.

One more way of optimizing queries is using the INCLUDE clause. Imagine that Employee is a table with a lot of columns. On the disk, each record is taking a lot of space. Now, we want to optimize the way we get just FirstName and LastName when searching in a department:

SELECT FirstName,LastName
FROM Employee
WHERE DepartmentId=@departmentId

That @ syntax is used for variables and parameters. As a general rule, any values you send to an SQL query should be parameterized. So don't do in C# var sql = "SELECT * FROM MyTable WHERE Id="+id, instead do var sql="SELECT * FROM MyTable WHERE Id=@id" and add an @id parameter when running the query.

So, in the query above SQL will do the following:

  • use an index for DepartmentId if any (fast)
  • find the EmployeeId
  • read the (large) records of each employee from the table (slow)
  • extract and return the first and last name for each

But add this index and there is no need to even go to the table:

CREATE INDEX IX_Employee_DepWithNames
  ON Employee(DepartmentId)
  INCLUDE(FirstName,LastName)

What this will do is add the values of FirstName and LastName to the data inside the index and, if only selecting values from the include list, return them from the index directly, without having to read records from the initial table.

Note that DepartmentId is used to locate rows (in WHERE and JOIN ON clauses) while FirstName and LastName are the columns one SELECTs.

Indexes are a very complex concept and I invite you to examine it at length. It might even be fun.

When indexes are bad

Before I close, let me tell you where indexes are NOT recommended.

One might think that adding an index for each type of query would be a good thing and in some scenarios it might, but as usual in database work, it depends. What performance you gain for finding records in SELECT, UPDATE and DELETE statements, you lose with INSERT, UPDATE and DELETE data changes.

As I explained before, indexes are basically hidden tables themselves. Slight differences, but the data they contain is similar, organized in columns. Whenever you change or add data, these indexes will have to be updated, too. It's like writing in multiple tables at the same time and it affects not only the execution time, but also the disk space.

In my opinion, the index and table structure of a database depends the most on if you intend to read a lot from it or write a lot to it. And of course, everybody will scowl and say: "I want both! High performance read and write". My recommendation is to separate the two cases as much as possible.

  • You want to insert a lot of data and often? Use large tables with many columns and no indexes, not even primary keys sometimes.
  • You want to update a lot of data and often? Use the same tables to insert the modifications you want to perform.
  • You want to read a lot of data and often? Use small read only tables, well defined, normalized data, clear relationships between tables, a lot of indexes
  • Have a background process to get inserts and updates and translate them into read only records

Writing data and reading data, from the SQL engine perspective, are very very different things. They might as well be different software and indeed some companies use one technology to insert data (like NoSQL databases) and another to read it.

Conclusion

I hope the post hasn't been too long and that it will help you when beginning with SQL. Please leave any feedback that you might have, the purpose of this blog is to help people and every perspective helps.

SQL is a very interesting idea and has changed the way people think of data access. However, it has become so complex that most people are still confused even after years of working with it. Every year new features are being added and new ideas are put forward. Yet there are a few concepts, a foundation if you will, that will get you most of the way there. This is what I have tried to distil here. Hope I succeeded.

  I was attempting to optimize an SQL process that was cleaning records from a big table. There are a multitude of ways of doing this, but the pattern that I had adopted for the last similar tasks were to delete rows in batches using the TOP (@rowCount) syntax. And it had all worked fine until then, but now my "optimization" increased the run time from 6 minutes to 2 hours! Humbled (or more like humiliated) I started to analyze what was going on.

  First thing I did was to SET STATISTICS IO ON. Then I ran the cleaning task again. And lo and behold, there was a row reporting accessing an object that was not part of the query itself. What was going on? At first I thought that I was using a VIEW somewhere, one that I had thought was a table, but no, there was no reference to that object anywhere. But when I looked for that object is was a view!

  The VIEW in question was a view with SCHEMABINDING, to which several indexes were then created. That explained it all. If you ever attempted to create an index on a view you probably got the error "Cannot create index on view, because the view is not schema bound" and then you investigated what that entailed (and probably gave up because of all the restrictions) but in that first moment when you thought "all I have to do is add WITH SCHEMABINDING and I can index my views!" it seemed like a good idea. It might even be a good idea for several scenarios, but what it also does is create a reverse dependency on the object you are using. Moreover, if you look more carefully at the Microsoft documentation it says: "The query optimizer may use indexed views to speed up the query execution. The view does not have to be referenced in the query for the optimizer to consider that view for a substitution." So you may find yourself querying a table and instead the engine queries a view instead!

  You see, what happens is that every time when you delete 4900 rows from a table that is used by a view that has indexes on it is those indexes are being recreated, so not only your table is affected, but potentially everything that is being called in the view as well. If it's a complicated view that integrates data from multiple sources, it will be run after every batch delete and indexed. Again. And again. And again again. It also prohibits you from some operations, like TRUNCATE TABLE, where you get a funny message saying it's referenced by a view and that is why you can't truncate it. What?!

  Now, I deleted the VIEW and ran the same code. It was faster, but it still took ages because finding the records to delete was a much longer operation than the deletion itself. This post is about this reverse dependency that an indexed view introduces.

  So what is the solution? What if you have the view, you need the view and you also need it indexed? You can disable the indexes before your operation, then enable them again. I believe this will solve most issues, even if it's not a trivial operation. Just remember that in cleaning operations, you need some indexes to find the records to delete as well.

  That's it. I hope it helps. Get out of here!

  When we connect to SQL we usually copy/paste some connection string and change the values we need and rarely consider what we could change in it. That is mostly because of the arcane looking syntax and the rarely read documentation for it. You want to connect to a database, give it the server, instance and credentials and be done with it. However, there are some parameters that, when set, can save us a lot of grief later on.

  Application Name is something that identifies the code executing SQL commands to SQL Server, which can then be seen in profilers and DMVs or used in SQL queries. It has a maximum length of 128 characters. Let's consider the often met situation when your application is large enough to be segregated into different domains, each having their own data access layer, business rules and user interface. In this case, each domain can have its own connection string and it makes sense to specify a different Application Name for each. Later on, one can use SQL Profiler, for example, and filter on the specific area of interest.

 

  The application name can also be seen in some queries to SQL Server's Dynamic Management Views (quite normal, considering DMVs are used by SQL Profiler) like sys.dm_exec_sessions. Inside your own queries you can also get the value of the application name by simply calling APP_NAME(). For example, running SELECT APP_NAME(); in SQL Management Studio returns a nice "Microsoft SQL Server Management Studio - Query" value. In SQL Server Profiler the column is ApplicationName while in DMVs like sys.dm_exec_sessions the column is program_name.

  Example connection string: Server=localhost;Database=MyDatabase;User Id=Siderite;Password=P4ssword; Application Name=Greatest App Ever

  Hope it helps!

  Interesting SQL table hint I found today: READPAST. It instructs SQL queries to ignore locked rows. This comes with advantages and disadvantages. For one it avoids deadlocks when trying to read or write an already locked row, but it also provides the wrong results. Just as NOLOCK, it works around the transaction mechanism, and while NOLOCK will allow dirty reads of information partially changed in transactions that have not been committed, READPAST ignores its existence completely.

  There is one scenario where I think this works best: batched DELETE operations. You want to delete a lot of rows from a table, but without locking it. If you just do a delete for the entire table with some condition you will get these issues:

  • the operation will be slow, especially if you are deleting on a clustered index which moves data around in the table
  • if the number of deleted rows is too large (usually 5000 or more) then the operation will lock the entire table, not just the deleted rows
  • if there are many rows to be deleted, the operation will take a long while, increasing the possibility of deadlocks

  While there are several solutions for this, like partitioning the table and then truncating the partitions or soft deletes or designing your database to separate read and write operations, one type of implementation change that is small in scope and large is result is batched deletes. Basically, you run a flow like this:

  1. SELECT a small number of rows to be deleted (again, mind the 5000 limit that causes table locks, perhaps even use ROWLOCK hint)
  2. DELETE rows selected and their dependencies (DELETE TOP x should work as well for steps 1 and 2, but I understand in some cases this syntax automatically causes a table lock and maybe also use ROWLOCK hint)
  3. if the number of selected rows is larger than 0, go back to step 1

  This allows SQL to lock individual rows and, if your business logic is sound, no rows should be deleted while something is trying to read or write them. However, this is not always the case, especially in high stress cases with many concurrent reads and writes. But here, if you use READPAST, then locked rows will be ignored and the next loops will have the chance to delete them.

  But there is a catch. Let's take an example:

  1. Table has 2 rows: A and B
  2. Transaction 1 locks row A
  3. In a batched delete scenario, Transaction 2 gets the rows with READPAST and so only gets B
  4. Transaction 2 deletes row B and commits, and continues the loop
  5. Transaction 3 gets the rows with READPAST and gets no rows (A is still locked)
  6. Transaction 3 deletes nothing and exists the loop
  7. Transaction 1 unlocks row A
  8. Table now has 1 row: A, which should have been deleted, but it's not

  There is a way to solve this: SELECT with NOLOCK and DELETE with READPAST

  • this will allow to always select even locked and uncommitted rows
  • this will only delete rows that are not locked
  • this will never deadlock, but will loop forever as long as some rows remain locked

  One more gotcha is that READPAST allows for a NOWAIT syntax, which says to immediately ignore locked rows, without waiting for a number of seconds (specified by LOCK_TIMEOUT) to see if it unlocks. Since you are doing a loop, it would be wise to wait, so that it doesn't go into a rapid loop while some rows are locked. Barring that, you might want to use READPAST NOWAIT and then add a WAITFOR DELAY '00:00:00.010' at the end of the loop to add 10 millisecond delay, but if you have a lot of rows to delete, it might make this too slow.

  Enough of this, lets see some code example:

DECLARE @batchSize INT = 1000
DECLARE @nrRows INT = 1

CREATE TABLE #temp (Id INT PRIMARY KEY)

WHILE (@nrRows>0)
BEGIN

  BEGIN TRAN

	INSERT INTO #temp
    SELECT TOP (@batchSize) Id
    FROM MyTable WITH (NOLOCK)
    WHERE Condition=1

    SET @nrRows = @@ROWCOUNT

    DELETE FROM mt 
    FROM MyTable mt WITH (READPAST NOWAIT)
    INNER JOIN #temp t
    ON mt.Id=t.Id

	WAITFOR DELAY '00:00:00.010'

  COMMIT TRAN

END

DROP TABLE #temp

Now the scenario goes like this:

  1. Table has 2 rows: A and B
  2. Transaction 1 locks row A
  3. Transaction 2 gets the rows with NOLOCK and so only gets A and B
  4. Transaction 2 deletes rows A and B with READPAST, but only B is deleted
  5. loop continues (2 rows selected)
  6. Transaction 3 gets the rows with NOLOCK and gets one row 
  7. Transaction 3 deletes with READPAST with no effect (A is still locked)
  8. loop continues (1 rows selected)
  9. Transaction 1 unlocks row A
  10. Transaction 4 gets the rows with NOLOCK and gets row A (not locked)
  11. Transaction 4 deleted with READPAST and deletes row A
  12. loop continues (1 rows selected), but next transaction selects nothing, so loop ends (0 rows selected)
  13. Table now has no rows and no deadlock occurred

Hope this helps.

 So I got assigned this bug where date 1900-01-01 was displayed on the screen so, as I am lazy, I started to look into the code without reproducing the issue. The SQL stored procedure looked fine, it was returning:

SELECT
  CASE SpecialCase=1 THEN ''
  ELSE SomeDate
  END as DateFilteredBySpecialCase

Then the value was being passed around through various application layers, but it wasn't transformed into anything, then it was displayed. So where did this magical value come from? I was expecting some kind of ISNULL(SomeDate,'1900-01-01') or some change in the mapping code or maybe SomeDate was 1900-01-01 in some records, but I couldn't find anything like that.

Well, at second glance, the selected column has to have a returning type, so what is it? The Microsoft documentation explains:

Returns the highest precedence type from the set of types in result_expressions and the optional else_result_expression. For more information, see Data Type Precedence.

If you follow that link you will see that strings are at the very bottom, while dates are close to the top. In other words, a CASE statement that returns strings and dates will always have the return type a date!

SELECT CAST('' as DATETIME) -- selects 1900-01-01

Just a quickie. Hope it helps.

and has 0 comments

  Tracing and logging always seem simple, an afterthought, something to do when you've finished your code. Only then you realize that you would want to have it while you are testing your code or when an unexpected issue occurs in production. And all you have to work with is an exception, something that tells you something went wrong, but without any context. Here is a post that attempts to create a simple method to enhance exceptions without actually needing to switch logging level to Trace or anything like that and without great performance losses.

  Note that this is a proof of concept, not production ready code.

  First of all, here is an example of usage:

public string Execute4(DateTime now, string str, double dbl)
{
    using var _ = TraceContext.TraceMethod(new { now, str, dbl });
    throw new InvalidOperationException("Invalid operation");
}

  Obviously, the exception is something that would occur in a different way in real life. The magic, though, happens in the first line. I am using (heh!) the new C# 8.0 syntax for top level using statements so that there is no extra indentation and, I might say, one of the few situations where I would want to use this syntax. In fact, this post started from me thinking of a good place to use it without confusing any reader of the code.

  Also, TraceContext is a static class. That might be OK, since it is a very special class and not part of the business logic. With the new Roslyn source generators, one could insert lines like this automatically, without having to write them by hand. That's another topic altogether, though.

  So, what is going on there? Since there is no metadata information about the names of the currently executing method (without huge performance issues), I am creating an anonymous object that has properties with the same names and values as the arguments of the method. This is the only thing that might differ from one place to another. Then, in TraceMethod I return an IDisposable which will be disposed at the end of the method. Thus, I am generating a context for the entire method run which will be cleared automatically at the end.

  Now for the TraceContext class:

/// <summary>
/// Enhances exceptions with information about their calling context
/// </summary>
public static class TraceContext
{
    static ConcurrentStack<MetaData> _stack = new();

    /// <summary>
    /// Bind to FirstChanceException, which occurs when an exception is thrown in managed code,
    /// before the runtime searches the call stack for an exception handler in the application domain.
    /// </summary>
    static TraceContext()
    {
        AppDomain.CurrentDomain.FirstChanceException += EnhanceException;
    }

    /// <summary>
    /// Add to the exception dictionary information about caller, arguments, source file and line number raising the exception
    /// </summary>
    /// <param name="sender"></param>
    /// <param name="e"></param>
    private static void EnhanceException(object? sender, FirstChanceExceptionEventArgs e)
    {
        if (!_stack.TryPeek(out var metadata)) return;
        var dict = e.Exception.Data;
        if (dict.IsReadOnly) return;
        dict[nameof(metadata.Arguments)] = Serialize(metadata.Arguments);
        dict[nameof(metadata.MemberName)] = metadata.MemberName;
        dict[nameof(metadata.SourceFilePath)] = metadata.SourceFilePath;
        dict[nameof(metadata.SourceLineNumber)] = metadata.SourceLineNumber;
    }

    /// <summary>
    /// Serialize the name and value of arguments received.
    /// </summary>
    /// <param name="arguments">It is assumed this is an anonymous object</param>
    /// <returns></returns>
    private static string? Serialize(object arguments)
    {
        if (arguments == null) return null;
        var fields = arguments.GetType().GetProperties();
        var result = new Dictionary<string, object>();
        foreach (var field in fields)
        {
            var name = field.Name;
            var value = field.GetValue(arguments);
            result[name] = SafeSerialize(value);
        }
        return JsonSerializer.Serialize(result);
    }

    /// <summary>
    /// This would require most effort, as one would like to serialize different types differently and skip some.
    /// </summary>
    /// <param name="value"></param>
    /// <returns></returns>
    private static string SafeSerialize(object? value)
    {
        // naive implementation
        try
        {
            return JsonSerializer.Serialize(value).Trim('\"');
        }
        catch (Exception ex1)
        {
            try
            {
                return value?.ToString() ?? "";
            }
            catch (Exception ex2)
            {
                return "Serialization error: " + ex1.Message + "/" + ex2.Message;
            }
        }
    }

    /// <summary>
    /// Prepare to enhance any thrown exception with the calling context information
    /// </summary>
    /// <param name="args"></param>
    /// <param name="memberName"></param>
    /// <param name="sourceFilePath"></param>
    /// <param name="sourceLineNumber"></param>
    /// <returns></returns>
    public static IDisposable TraceMethod(object args,
                                            [CallerMemberName] string memberName = "",
                                            [CallerFilePath] string sourceFilePath = "",
                                            [CallerLineNumber] int sourceLineNumber = 0)
    {
        _stack.Push(new MetaData(args, memberName, sourceFilePath, sourceLineNumber));
        return new DisposableWrapper(() =>
        {
            _stack.TryPop(out var _);
        });
    }

    /// <summary>
    /// Just a wrapper over a method which will be called on Dipose
    /// </summary>
    public class DisposableWrapper : IDisposable
    {
        private readonly Action _action;

        public DisposableWrapper(Action action)
        {
            _action = action;
        }

        public void Dispose()
        {
            _action();
        }
    }

    /// <summary>
    /// Holds information about the calling context
    /// </summary>
    public class MetaData
    {
        public object Arguments { get; }
        public string MemberName { get; }
        public string SourceFilePath { get; }
        public int SourceLineNumber { get; }

        public MetaData(object args, string memberName, string sourceFilePath, int sourceLineNumber)
        {
            Arguments = args;
            MemberName = memberName;
            SourceFilePath = sourceFilePath;
            SourceLineNumber = sourceLineNumber;
        }
    }
}

Every call to TraceMethod adds a new MetaData object to a stack and every time the method ends, the stack will pop an item. The static constructor of TraceMethod will have subscribed to the FirstChangeException event of the current application domain and, whenever an exception is thrown (caught or otherwise), its Data dictionary is getting enhanced with:

  • name of the method called
  • source file name
  • source file line number where the exception was thrown.
  • serialized arguments (remember Exceptions need to be serializable, including whatever you put in the Data dictionary, so that is why we serialize it all)

(I have written another post about how .NET uses code attributes to get the first three items of information during build time) 

This way, you get information which would normally be "traced" (detailed logging which is usually detrimental to performance) in any thrown exception, but without filling some trace log or having to change production configuration and reproduce the problem again. Assuming your application does not throw exceptions all over the place, this adds very little complexity to the executed code.

Moreover, this will enhance exception with the source code file name and line number even in Release mode!

I am sure there are some issues with code that might fail and it is not caught in a try/catch and of course the serialization code is where people should put a lot of effort, since different types get to be serialized for inspection differently (think async methods and the like). And more methods should be added so that people trace whatever they like in thrown exceptions. Yet, as I said, this is a POC, so I hope it gets you inspired.

 T-SQL Querying is a very good overview of SQL Server queries, indexing, best practices, optimization and troubleshooting. I can't imagine someone can just read it and be done with it, as it is full of useful references, so it's good to keep it on the table. Also, it's relatively short, so one can peruse it in a day and then keep using it while doing SQL work.

What I didn't like so much was the inconsistent level of knowledge needed for the various chapters. It starts with a tedious explanations of types of queries and what JOINs are and what ORDER BY is and so on, then moves on to the actual interesting stuff. Also, what the hell is that title and cover? :) You'd think it's a gardening book.

Another great thing about it is that it is available free online, from its publishers: Packt.

The need

  I don't know about you, but I've been living with ad blockers from the moment they arrived. Occasionally I get access to a new machine and experience the Internet without an ad blocker and I can't believe how bad it is. A long time ago I had a job where I was going by bike. After two years of not using public transport, I got in a bus and had to get out immediately. It smelled so bad! How had I ever used that before? It's the same thing.

  However, most of the companies we take for granted as pillars of the web need to bombard us with ads and generally push or skew our perceptions in order to make money, so they find ways to obfuscate their web sites, lock them in "apps" that you have no control of or plain manipulate the design of the things we use to write code so that it makes this more difficult.

Continuous war

  Let me give you an example of this arms race. You open your favorite web site and there it is, a garish, blinking, offending, annoying ad that is completely useless. Luckily, you have an ad blocker so you select the ad, see that it's enclosed in a div with class "annoyingAd" and you make a blocking rule for it. Your web site is clean again. But the site owner realizes people are not clicking on the ad anymore, so he dynamically changes the class name to something different every time you open the page. Now, you could try to decipher the JavaScript code that populates the div and write a script to get the correct class, but it would only work for this web site and you would have to know how to code in JavaScript and it would take a lot of effort you don't want to spend. But then you realize that above the horrid thing there is a title "Annoying ad you can't get rid of", so you write a simple thing to get rid of the div that contains a span with that content. Yay!

  At this point you already have some issues. The normal way people block ads is to create a quasi CSS rule for an element. Yet CSS doesn't easily let's you select elements based on the inner text or to select parents of elements with certain characteristics. In part it's a question of performance, but at the same time there is a matter of people who want to obfuscate your web site taking part in the decision process of what should go in CSS. So here, to get the element with a certain content we had to use something that expands normal CSS, like the jQuery syntax or some extra JavaScript. This is, needless to say, already applicable to a low number of people. But suspend your disbelief for a moment.

  Maybe your ad blocker is providing you with custom rules that you can make based on content, or you write a little script or even the ad blocker people write the code for you. So the site owner catches up and he does something: instead of having a span with the title, he puts many little spans, containing just a few letters, some of them hidden visually and filled with garbage, others visible. The title is now something like "Ann"+"xxx"+"oying"+"xxx"+" ad", where all "xxx" texts appear as part of the domain object model (the page's DOM) but they are somehow not visible to the naked eye. Now the inner text of the container is "Annxxxoyingxxx ad", with random letters instead of xxx. Beat that!

  And so it goes. You need to spend knowledge and effort to escalate this war that you might not even win. Facebook is the king of obfuscation, where even the items shared by people are mixed and remixed so that you cannot select them. So what's the solution?

Solution

  At first I wanted to go in the same direction, fight the same war. Let's create a tool that deobfuscates the DOM! Maybe using AI! Something that would, at the end, give me the simplest DOM possible that would create the visual output of the current page and, when I change one element in this simple DOM, it would apply the changes to the corresponding obfuscated DOM. And that IS a solution, if not THE solution, but it is incredibly hard to implement.

  There is another option, though, something that would progressively enhance the existing DOM with information that one could use in a CSS rule. Imagine a small script that, added to any page, would add attributes to elements like this: visibleText="Annoying ad" containingText="Annxxxoingxxx ad" innerText="" positionInPage="78%,30%-middle-right" positionInViewport="78%,5%-top-right". Now you can use a CSS rule for it, because CSS has syntax for attributes equal to, containing, starting or ending with something. This would somewhat slow the page, but not terribly so. One can use it as a one shot (no matter how long it takes, it only runs once) or continuous (where every time an element changes, it would recreate the attributes in it and its parents).

Feedback

  Now, I have not begun development on this yet, I've just had this idea of a domExplainer library that I could make available for everybody. I have to test how it works on difficult web sites like Facebook and try it as a general option in my browser. But I would really appreciate feedback first. What do you think? What else would you add to (or remove from) it? What else would you use it for?