In this post I will discuss the following: drawing a chart with MVVM in WPF using an ItemsControl over a Canvas, improving the performance of the resulting graph, and smoothing the lines with Bezier curves.
First, let's discuss the project that I was working on which led to this work. I wanted to write a program that manages the devices on my network. There is a router and several Wi-Fi extenders, all of them with an HTTP interface. I wanted to know when they are reachable through the network, connect to the HTTP interface and gather data or perform actions like resetting the device and so on. In order to see if they were reachable, I was pinging them every second, so I thought I would like to see the evolution of the ping roundtrip time in a visual way, hence the chart.

All of the values that I was displaying and all the commands that were available on the interface were using MVVM, the pattern developed by Microsoft for a better separation of presentation and data model. MVVM presents some difficulties, though, since most of the time directly getting the data and displaying it is more efficient and easier to do. What it does allow for is fantastic flexibility and good maintainability of the project. So, since I am a fan, I wanted to draw this chart via MVVM as well.

The MVVM chart


In order to do that I needed a viewmodel that abstracted the chart. Since I had several devices, each of them with a collection of pings containing the time of the ping and a nullable roundtrip value, it would have been way too annoying to try to chart the values directly, so on the main viewmodel I created a specific chart model. This model contained a BindingList of items of various custom types: GraphLines, GraphStarts and GraphEnds. When the ping failed I added an "end" to the model. When the ping succeeded after a fail, I would add a "start". And when the ping was continuously successful, I would add a "line" connecting the previous ping to the current one.
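A minimal sketch of what that model might look like. The type names come from the description above, the X/Y properties match the bindings in the XAML further down; everything else (the Ip property, the base class) is an assumption, and the GraphSpline type that appears in the XAML below, added later for the Bezier version, is left out:

using System.ComponentModel;

// Sketch of the chart model; names are from the post, details are assumptions.
public abstract class GraphItem
{
    public string Ip { get; set; }  // used later to pick the stroke color
    public double X { get; set; }   // relative coordinates, from 0 to 1
    public double Y { get; set; }
}

public class GraphStart : GraphItem { } // ping succeeded after a failure
public class GraphEnd : GraphItem { }   // ping failed

public class GraphLine : GraphItem      // two consecutive successful pings
{
    public double X2 { get; set; }
    public double Y2 { get; set; }
}

public class ChartModel
{
    // BindingList so the view can react to changes
    public BindingList<GraphItem> GraphItems { get; } = new BindingList<GraphItem>();
}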

So, in order to draw anything, I used a Canvas. The Canvas is a very simple container that positions things at absolute coordinates. The first thing you need to realize is that it is not a vector-based container, so when you draw something on a small canvas and you resize the window, everything remains at the same position and size. The other thing that quickly becomes apparent is that there are various ways of positioning objects on the Canvas. The attached properties Canvas.Top, Canvas.Left, Canvas.Bottom and Canvas.Right can define the position of TextBlocks or other elements, including Rectangles and Ellipses. Lines, on the other hand, whether simple Line objects or something more complex, like Paths, are positioned using Points and X,Y coordinates. This would come back to bite me on the ass later on.

WPF is very flexible. In order to add things to a Canvas, all one needs to do is declare an ItemsControl and then redefine its ItemsPanel property to be a Canvas. The way objects are represented on the Canvas can be defined via DataTemplates, in my case one for each type of item. So I created a template containing a Line for the GraphLine type, another for GraphStart, containing a Rectangle, and one for GraphEnd, containing an Ellipse. Forget the syntax right now, first I had to solve the problem of the different ways to position something on a Canvas from within an ItemsControl. You see, in order to position a Line, all you have to do is set its X1,Y1,X2,Y2 properties, but for Ellipses and Rectangles you need to set Canvas.Left and Canvas.Top. The problem with the ItemsControl is that for each of these non-primitive objects it creates a ContentPresenter to encapsulate them, therefore setting the Canvas attached properties on the inner shape did nothing. The solution is to define a style for the ContentPresenter and set the Canvas properties on it. Surprise! Then the Lines stop working! The solution was to add several Canvases: one for the lines and one for the rectangles and ellipses, as ItemsControls, and one for static text and the like, all in the same container so that they overlap. But it worked. Then I started the program and watched the chart being displayed.

<ItemsControl ItemsSource="{Binding GraphItems}" Name="GraphLines">
<ItemsControl.Resources>
<DataTemplate DataType="{x:Type local:GraphLine}">
...
</DataTemplate>
<DataTemplate DataType="{x:Type local:GraphSpline}">
...
</DataTemplate>
<DataTemplate DataType="{x:Type local:GraphStart}">
...
</DataTemplate>
<DataTemplate DataType="{x:Type local:GraphEnd}">
...
</DataTemplate>
</ItemsControl.Resources>
<ItemsControl.ItemsPanel>
<ItemsPanelTemplate>
<Canvas/>
</ItemsPanelTemplate>
</ItemsControl.ItemsPanel>
</ItemsControl>

But how did I calculate the coordinates of all of these items? As I said, the Canvas is a pretty static thing. If I resized the window, the items would remain in the same position and with the same size. Also, the viewmodel didn't have (and shouldn't have had) any idea of the actual size of the drawing Canvas. My solution was to use a MultiBinding with a custom converter. It would get two values: the first a computed double value, from 0 to 1, representing either the vertical or the horizontal position; the second the value of the dimension, the height or the width. The result would be, of course, the product of the two values. Luckily WPF has a very flexible Binding syntax, so it was no problem to combine a value from the viewmodel with the value of the ActualWidth or ActualHeight property of the Canvas object. This resulted in a very nice graph that adapted to my resizing of the window in real time without me having to do anything.

<Line Stroke="{Binding Ip, Converter={StaticResource TextToBrushConverter}}" StrokeThickness="2" >
<Line.X1>
<MultiBinding Converter="{StaticResource ResizeConverter}">
<Binding Path="X"/>
<Binding Path="ActualWidth" RelativeSource="{RelativeSource Mode=FindAncestor, AncestorType={x:Type Canvas}}"/>
</MultiBinding>
</Line.X1>
<Line.Y1>
<MultiBinding Converter="{StaticResource ResizeConverter}">
<Binding Path="Y"/>
<Binding Path="ActualHeight" RelativeSource="{RelativeSource Mode=FindAncestor, AncestorType={x:Type Canvas}}"/>
</MultiBinding>
</Line.Y1>
<Line.X2>
<MultiBinding Converter="{StaticResource ResizeConverter}">
<Binding Path="X2"/>
<Binding Path="ActualWidth" RelativeSource="{RelativeSource Mode=FindAncestor, AncestorType={x:Type Canvas}}"/>
</MultiBinding>
</Line.X2>
<Line.Y2>
<MultiBinding Converter="{StaticResource ResizeConverter}">
<Binding Path="Y2"/>
<Binding Path="ActualHeight" RelativeSource="{RelativeSource Mode=FindAncestor, AncestorType={x:Type Canvas}}"/>
</MultiBinding>
</Line.Y2>
</Line>
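The ResizeConverter itself is not shown in the post; a minimal sketch of what the version used with these bindings might look like, assuming it simply multiplies the relative value by the available dimension, as described above:

using System;
using System.Globalization;
using System.Windows.Data;

// Sketch: receives the relative value (0..1) from the viewmodel and the Canvas's
// ActualWidth or ActualHeight, and returns their product as the absolute coordinate.
public class ResizeConverter : IMultiValueConverter
{
    public object Convert(object[] values, Type targetType, object parameter, CultureInfo culture)
    {
        if (values == null || values.Length < 2
            || !(values[0] is double relative) || !(values[1] is double dimension))
        {
            return 0.0;
        }
        return relative * dimension;
    }

    public object[] ConvertBack(object value, Type[] targetTypes, object parameter, CultureInfo culture)
    {
        throw new NotSupportedException(); // the chart binding is one-way only
    }
}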

Performance


The next issue in the pipeline was performance. Clearing the GraphItems collection and adding new items to it was very slow and presented some ugly visual artifacts. For this I used the inner mechanisms of the BindingList object. First I set the RaiseListChangedEvents property to false, so that the list would not fire any events to the WPF binding mechanism. Then I cleared the list, added every newly calculated GraphItem to it, set RaiseListChangedEvents back to true and fired a ListChanged event forcefully using the (badly named) ResetBindings method.

// stop the list from notifying WPF while it is repopulated
GraphItems.RaiseListChangedEvents = false;
GraphItems.Clear();
foreach (var item in items)
{
GraphItems.Add(item);
}
GraphItems.RaiseListChangedEvents = true;
// fire a single ListChanged (Reset) event, on the UI thread
this.Dispatcher.Invoke(GraphItems.ResetBindings, DispatcherPriority.Normal);

All good, but then the overall performance of the application was abysmal. I would move to another program, then switch back and it wouldn't show up, or I would press a button and it wouldn't show as pressed, or the values of the data from the devices would sometimes not be displayed. It wasn't that it used too much CPU or memory or anything like that, it was just a very sluggish user experience.

My first idea was that the binding to the parent Canvas object to get the ActualWidth and ActualHeight values was slow. I was right. In order to test this I removed any bindings to the Canvas and instead set the values directly on the converter, via the SizeChanged event of the Canvas object. This made things faster, but also made them look weird, since I would resize the window and only see a difference after SizeChanged fired. The performance gain was significant, but not large enough. The UI was still sluggish.

void Canvas_SizeChanged(object sender, SizeChangedEventArgs e)
{
var resizeConverter = (ResizeConverter)this.Resources["ResizeConverter"];
resizeConverter.Size = e.NewSize;
}

Now, you might ask yourself, what is the purpose of using this ItemsControl and Canvas combination? It is there in order to use the MVVM pattern. Just drawing directly on the Canvas would violate that, wouldn't it? Or would it? In this case the binding of the values in the viewmodel to the chart is one-way. I only need to display stuff and nothing that happens on the chart UI changes the viewmodel. Also, since I chose to recreate all the chart items at every turn, it just means I am delegating clearing the Canvas and drawing everything to the WPF mechanism, nothing more. In fact, if I just subscribed to the GraphItems ListChanged event I would be able to draw everything myself and not really have any strong link between data model and presentation. So I did that. The side effect of this was that I didn't need the two ItemsControl/Canvas instances anymore. I only needed one Canvas and I would add items to it as I saw fit.
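A sketch of what that subscription might look like in the window's code-behind. The viewModel and Graph (the Canvas) names, as well as the sizes and colors, are assumptions; the relative coordinates are multiplied by the current Canvas size, just as the converter did before:

// Requires System.Windows.Controls, System.Windows.Shapes and System.Windows.Media.
viewModel.GraphItems.ListChanged += (sender, args) =>
{
    Graph.Children.Clear();
    double w = Graph.ActualWidth;
    double h = Graph.ActualHeight;
    foreach (var item in viewModel.GraphItems)
    {
        if (item is GraphLine line)
        {
            Graph.Children.Add(new Line
            {
                X1 = line.X * w,
                Y1 = line.Y * h,
                X2 = line.X2 * w,
                Y2 = line.Y2 * h,
                Stroke = Brushes.Gray,
                StrokeThickness = 2
            });
        }
        else if (item is GraphStart)
        {
            var rect = new Rectangle { Width = 4, Height = 4, Fill = Brushes.Green };
            Canvas.SetLeft(rect, item.X * w - 2);
            Canvas.SetTop(rect, item.Y * h - 2);
            Graph.Children.Add(rect);
        }
        else if (item is GraphEnd)
        {
            var ellipse = new Ellipse { Width = 4, Height = 4, Fill = Brushes.Red };
            Canvas.SetLeft(ellipse, item.X * w - 2);
            Canvas.SetTop(ellipse, item.Y * h - 2);
            Graph.Children.Add(ellipse);
        }
    }
};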

Of course, smart reader that you are, you realized that I need to know the type of the viewmodel in order to subscribe to the items list. The truly correct way to do it would have been to encapsulate the Canvas into a control that receives a list of items as a model and handles all the drawing itself. It makes sense: you don't want a Canvas, what you really want is a Chart component that handles everything for you. I leave that to the enterprising reader, since it is outside the scope of this post.

Another thing that I did not do, although it probably would have made sense in terms of performance, was to only add new items to the chart, somehow translate the position of the chart and remove the items that fell outside its visible portion. That sounds like a good feature for the Chart control :) Again, I leave it to the reader to try something like that.

Bezier curves instead of lines


The last thing that I want to cover is making the chart less jagged. The roundtrip ping values were all over the place, resulting in a jagged-line kind of chart. I wanted something smoother, like a continuous curvy line. So I decided to replace the Line representation with a Bezier curve one. I am not a graphics person, nor a math geek. I had no idea what a Bezier curve was, only that it helps in creating those nice looking curves that blend into each other. Each Bezier curve is defined by four points so, in my ignorance, I thought that I just had to pass four points from the list instead of the two required to form a Line. The result was hilarious, but not what I wanted.

Reading the theory we learn that... what the hell is that on Wikipedia? How can anyone understand that?!... Ugh!

So let's start with some experiments. Let's use the wonderful XamlPadX application to see some examples of that in WPF. First, let's draw a jagged three-line graphic, then try to use its four points to define a Bezier curve and see what happens.

<Page xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation" xmlns:sys="clr-namespace:System;assembly=mscorlib" xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml" >
<Canvas>
<Line X1="100" Y1="100" X2="200" Y2="300" Stroke="Gray" StrokeThickness="2"/>
<Line X1="200" Y1="300" X2="300" Y2="150" Stroke="Gray" StrokeThickness="2"/>
<Line X1="300" Y1="150" X2="400" Y2="200" Stroke="Gray" StrokeThickness="2"/>
<Path Stroke="Red" StrokeThickness="2">
<Path.Data>
<PathGeometry>
<PathGeometry.Figures>
<PathFigureCollection>
<PathFigure StartPoint="100,100">
<PathFigure.Segments>
<PathSegmentCollection>
<BezierSegment Point1="200,300" Point2="300,150" Point3="400,200" />
</PathSegmentCollection>
</PathFigure.Segments>
</PathFigure>
</PathFigureCollection>
</PathGeometry.Figures>
</PathGeometry>
</Path.Data>
</Path>
</Canvas>
</Page>



As we can see, the curve does touch the first and the fourth points and sort of approximates the line, but not very closely. The problem becomes even more obvious when we add another point and create two Bezier curves, from the first and the last four points. The two curves intersect; they are not continuous. Even if you take the points four by four, the resulting curves may continue each other, but they do so with sharp corners, the opposite of what I wanted.




Let's try the opposite: let's draw one Bezier curve and then lines that connect the first and second and then the third and fourth points. We see that the lines define tangents to the two arcs comprising the Bezier curve. That intuitively tells us something: if two Bezier curves were to seamlessly blend into each other, then the straight lines that define them would also have to be continuous. We try that in XamlPadX and yes! It works.




So, from this we learn something. First of all, the first and last points of the Bezier curve have to be the points used in a normal Line. Then, the last two points of one curve and the first two points of the next need to lie on the same straight line. So what about the second and third points? How do I choose those? Can I choose any lines to define my curves? Thinking of the chart that I am looking for, I just want the jagged edges to turn into nice little curves. I also don't want to think of other points than the points that would normally define a single line, which means I shouldn't use future data when defining the middle points of the curve that represents current data. So I just made the decision to use only horizontal lines to define curves. That means for any pair of coordinates X1,Y1 and X2,Y2 I would create four points like this: X1,Y1  X1+something,Y1  X2-something,Y2  X2,Y2. That "something" could be anything, but I decided it would be a percentage of the horizontal distance between the two points.

Final result: using a percentage, let's say 20%, I would turn the pair of coordinates into X1,Y1  X1+(X2-X1)*0.2,Y1  X1+(X2-X1)*(1-0.2),Y2  X2,Y2. Let's see how that looks on the original jagged line. Let's use 50% instead. And for some fun, let's put it to 80%, 100% and even 200%.

<Page xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation" xmlns:sys="clr-namespace:System;assembly=mscorlib" xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml" >
<Canvas>
<Line X1="100" Y1="100" X2="200" Y2="300" Stroke="Gray" StrokeThickness="2"/>
<Line X1="200" Y1="300" X2="300" Y2="150" Stroke="Gray" StrokeThickness="2"/>
<Line X1="300" Y1="150" X2="400" Y2="200" Stroke="Gray" StrokeThickness="2"/>
<Path Stroke="Red" StrokeThickness="2">
<Path.Data>
<PathGeometry>
<PathGeometry.Figures>
<PathFigureCollection>
<PathFigure StartPoint="100,100">
<PathFigure.Segments>
<PathSegmentCollection>
<BezierSegment Point1="150,100" Point2="150,300" Point3="200,300" />
</PathSegmentCollection>
</PathFigure.Segments>
</PathFigure>
<PathFigure StartPoint="200,300">
<PathFigure.Segments>
<PathSegmentCollection>
<BezierSegment Point1="250,300" Point2="250,150" Point3="300,150" />
</PathSegmentCollection>
</PathFigure.Segments>
</PathFigure>
<PathFigure StartPoint="300,150">
<PathFigure.Segments>
<PathSegmentCollection>
<BezierSegment Point1="350,150" Point2="350,200" Point3="400,200" />
</PathSegmentCollection>
</PathFigure.Segments>
</PathFigure>
</PathFigureCollection>
</PathGeometry.Figures>
</PathGeometry>
</Path.Data>
</Path>
</Canvas>
</Page>
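In code, generating such a segment from two consecutive points might look like this. A sketch only: the helper class, the method name and the PathFigure packaging are mine, the formula is the one described above:

using System.Windows;
using System.Windows.Media;

public static class BezierHelper
{
    // For a segment from (x1,y1) to (x2,y2), both control points sit on
    // horizontal lines through the end points, offset by a percentage of
    // the horizontal distance between them.
    public static PathFigure CreateSmoothSegment(double x1, double y1, double x2, double y2, double percent = 0.2)
    {
        double offset = (x2 - x1) * percent;
        var figure = new PathFigure { StartPoint = new Point(x1, y1) };
        figure.Segments.Add(new BezierSegment(
            new Point(x1 + offset, y1),  // control point 1: horizontal from the start point
            new Point(x2 - offset, y2),  // control point 2: horizontal from the end point
            new Point(x2, y2),           // end point
            true));                      // isStroked
        return figure;
    }
}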







That's it, folks. I hope you enjoyed this as much as I did and it helps you in future projects.

Whenever you want to test a REST API, Postman is a great tool. It allows configuring all aspects of a request: the method (GET, POST, etc.), the headers; it keeps previous attempts in history, manages and saves collections of requests, and it is installed as a Chrome extension, bringing it only two clicks away. It does everything! ... or does it? Short story long: no!

Reported as a problem here: Referer header is not sent when set in Postman, the issue appears to be that some headers are "protected" by Chrome, therefore unusable. Well, it is a bug in the sense that Postman should tell you that whatever you write there is completely ignored! There is a solution, which can be found as a link in the bug report, but it involves installing other crap and running Python scripts. Ugh!

Here is a list of the Chrome protected headers:
  • Accept-Charset
  • Accept-Encoding
  • Access-Control-Request-Headers
  • Access-Control-Request-Method
  • Connection
  • Content-Length
  • Cookie
  • Cookie2
  • Content-Transfer-Encoding
  • Date
  • Expect
  • Host
  • Keep-Alive
  • Origin
  • Referer
  • TE
  • Trailer
  • Transfer-Encoding
  • Upgrade
  • User-Agent
  • Via

So whenever you believe that some web site has used a magical solution to detect your sneaky attempts to access their web API or site and you are wondering how, just remember that it is most likely a Referer header that Postman (via Chrome) silently ignored.

Update: This post discusses shrinking the data file of a Microsoft SQL Server database, needed in this case because the initial size of the database was misconfigured. For shrinking the log file one must use type 1, not 0, in the query. Also, a very pertinent comment from NULLable warns of the performance issues related to shrinking database files, resulting from the fragmentation of the file.

I had this situation where the available space on the SQL database disk was less than the size of the database, in this case the tempdb database. Someone had wrongly configured the database to have an initial size of 64GB. Changing the size of the file in Microsoft SQL Server Management Studio doesn't work, because it tries to create a different file, fill it with the data and then replace the original file. No space for that. Also, it is damn slow, even if you have the space (I have no idea why). Shrink doesn't work either, because the database will not go smaller than the configured initial size. Time to do it command line style. Well, with SQL queries, but you know what I mean.

The code for it goes like this:
USE [master];
GO

CHECKPOINT;
DBCC DROPCLEANBUFFERS;
DBCC FREEPROCCACHE;
DBCC FREESYSTEMCACHE ('ALL');
DBCC FREESESSIONCACHE;
GO

USE [tempdb]
GO

DBCC SHRINKFILE (tempdev, 3000); --- New file size in MB
As you can see, you need to know not only the name of the database, but also the logical name of the database file that you want to shrink. It is not even a string, it is more like a keyword in the DBCC SHRINKFILE command. Even though it does work, one would benefit from encapsulating it into a stored procedure. Here is the final code:
CREATE PROC ShrinkDatabase(@DbName NVARCHAR(100),@SizeMB INT)
AS
BEGIN


DECLARE @filename NVARCHAR(255)

DECLARE @sql NVARCHAR(Max) = 'SELECT @filename = dbf.name FROM ['+REPLACE(@DbName,'''','''''')+'].sys.database_files dbf WHERE dbf.[type]=0'
EXEC sp_executesql @sql,N'@filename NVARCHAR(255) OUTPUT',@filename OUTPUT

SET @sql='USE [master];
CHECKPOINT;
DBCC DROPCLEANBUFFERS;
DBCC FREEPROCCACHE;
DBCC FREESYSTEMCACHE (''ALL'');
DBCC FREESESSIONCACHE;'

EXEC sp_executesql @sql

SET @sql='USE ['+REPLACE(@DbName,'''','''''')+'];
DBCC SHRINKFILE ('+REPLACE(@filename,'''','''''')+', '+CONVERT(NVARCHAR(100),@SizeMB)+');'
EXEC sp_executesql @sql

END

Create it in the master database and use it like this:
EXEC master.dbo.ShrinkDatabase 'tempdb',3000
Take note that you cannot use this to "shrink up" the database. If the value you set is larger than the current size, the file will remain the same size as well as the setting for the initial size. Also take note of the fact that this stored procedure only shrinks the data file, not the log file (dbf.[type]=0).

I had this situation where I had to match a string against several regular expressions and choose the most specific match. Let me explain it a little bit further: you have a string and several regular expressions that match that string. The requirement is to choose which one matches better. This would be useful if, for example, you want to match mobile manufacturers and so you define one configuration for Apple, another for Android and another for the rest. The regular expressions would be "Apple", "Android" and ".*" (which matches everything). So you would want to choose only "Apple" for Apple devices and not ".*".

Now, I've studied regular expressions a bit, but I don't want to implement a regular expression parser in order to compute the complexity of the expression tree. What I want is to have various Regex objects and, somehow, compare their complexity. I am not going to pretend that I understand what goes on within the Regex object; instead I will say that I noticed the complexity of a regular expression is directly proportional to the length of the pattern and, in the case of the Regex object, to the number of items in the _codes Int32 array found in the RegexCode object that is stored in the private code field of the Regex object.

So, the simplest code for that is this:
private double regexComplexity(string pattern)
{
var reg = new Regex(pattern, RegexOptions.Singleline | RegexOptions.IgnoreCase);
var codeField = reg.GetType().GetField("code", BindingFlags.Instance | BindingFlags.NonPublic);
var code = codeField.GetValue(reg);
var _codesField = code.GetType().GetField("_codes", BindingFlags.Instance | BindingFlags.NonPublic);
var codes = (int[])_codesField.GetValue(code);
return codes.Length;
}

However we must take into consideration several cases:
  • The regular expression pattern might be wrong
  • Reflection is slow

So, let's first catch all exceptions and use the length of the pattern as the complexity and then find a way to cache the complexity of a string, in case we use it more than once. Also, cache the FieldInfo objects for use at every call.

Here is the final code:
public static class RegularExpressionExtensions
{
private static ConcurrentDictionary<string, int> _dict;
private static FieldInfo _codeField;
private static FieldInfo _codesField;

static RegularExpressionExtensions()
{
_dict = new ConcurrentDictionary<string, int>();
_codeField = typeof(Regex).GetField("code", BindingFlags.Instance | BindingFlags.NonPublic);
_codesField = _codeField.FieldType.GetField("_codes", BindingFlags.Instance | BindingFlags.NonPublic);
}

public static int Complexity(this Regex reg)
{
if (reg == null) return 0;
var pattern = reg.ToString();
var complexity = _dict.GetOrAdd(pattern, p => getComplexity(reg));
return complexity;
}

private static int getComplexity(Regex reg)
{
try
{
var code = _codeField.GetValue(reg);
var codes = (int[])_codesField.GetValue(code);
return codes.Length;
}
catch
{
// as discussed above, fall back to the length of the pattern
return reg.ToString().Length;
}
}
}

Basically we encapsulate the Regex complexity logic in a static extension class that we can use on any Regex instance as reg.Complexity().
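Going back to the original problem, a possible way to use it (a sketch; the patterns are the ones from the example above and the input string is made up):

using System.Linq;
using System.Text.RegularExpressions;

// Pick the most specific pattern that matches the input, using Complexity()
// as a proxy for specificity.
var patterns = new[] { "Apple", "Android", ".*" }
    .Select(p => new Regex(p, RegexOptions.Singleline | RegexOptions.IgnoreCase))
    .ToList();

var input = "Apple iPhone";
var bestMatch = patterns
    .Where(r => r.IsMatch(input))
    .OrderByDescending(r => r.Complexity())
    .FirstOrDefault();
// expected: the "Apple" regex is chosen rather than ".*"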

Today I published a raw version of a program that solves Pixelo puzzles, a Flash version of the game generally known as Nonogram, Picross, Hanjie, etc. I was absolutely fascinated by the game, not only the concept, but also the attention to detail that Tamaii lent to it. The program is called Pixelo Solver and you can find it at Github, complete with source code.


I liked working on this because it covers several concepts that I found interesting:
  • Responding to global key combinations
  • Getting a snapshot of the screen and finding an object in it
  • Parsing a visual image for digits in a certain format
  • The algorithm for solving the puzzle in a reasonable amount of time and memory (it was not trivial)

Quick how-to: get all the files from here and copy them in a folder, then run PixeloTools.GameSolver.exe and press Ctrl-Shift-P on your puzzle in the browser.

Usage:
  1. Start PixeloTools.GameSolver.exe
  2. Open the browser to the Pixelo game
  3. Start a puzzle
  4. Press Ctrl-Shift-P

However, if the list of numbers per row or column is too long, the automated parsing will not work. In that case, you may use it offline, with a parameter that defines a file in the format:
5 1 10, 3 3 7, 2 5 4, 2 5 2, 2 1 1 1, 1 1 1 1 1, 1 1 3 1 1, 1 1 1 1 1, 1 1 1, 1 1, 1 1, 2 2 2 4, 2 1 1 1 2 4, 2 1 2 1, 7 4, 2 2 1 2 2 2, 2 1 1 1 1 1 1, 2 1 1 4 2, 1 3 4 2 1, 1 1
8 1 1, 5 2 2 1, 2 1 3, 1 1, 1 7 4 1, 3 5 1, 3 1 1 3, 4 3 4, 3 1 1 3, 3 5 1, 1 7 4 1, 1 2, 1 1 1, 2 2 2, 2 1 1 2, 2 1 2 2, 3 2 1, 3 1 1, 4 1 2 2, 9 2 3
where the first line contains the numbers for the horizontal lines (rows) and the second line the numbers for the vertical lines (columns). When the parsing fails, you still get something like this in the output of the program. You can just copy-paste the two lines into a file, edit it so that it matches the puzzle, then run:
start "" PixeloTools.GameSolver.exe myFile.txt

The file can also be in XML format, same as in the file the game loads. That's for Tamaii's benefit, mostly.

Let me know what you think in the comments below.

Relational databases have the great advantage of being able to use indexes. These are constructs that trade space and the speed of inserts and updates for query speed. This introduces a problem: what do you do when you have a high input and high output application? How can you make your database fast for both changes and queries?

The traditional solution is a write-database and a read-database. There are background services that ensure that the read-database is updated as fast as necessary with the new data, most of the time also doing extra computation and caching for the information one needs extracted. OLAP systems are a common solution, where the aggregated data required by various reports is computed and stored, ready to be read by your applications. I am not saying this is a bad idea, in fact I am saying this is the best idea for this scenario. The problem that I have with it is that you can hardly automate the process. You need to know what you want to read, you need to write the software to create the data and aggregate it into what you need.

So I decided to try to build a system that obeys the following rules:
  1. The speed of insert and update operations needs to be unhindered by indexes. Indeed, changes to the original data structures should be avoided.
  2. The select operations need to benefit from the indexing system.
  3. The system must be able to determine by itself the data structures needed to cover the first two rules.
  4. Optionally, there should be a way to encapsulate this new behaviour into a separate data structure from the original database.


In order to insert and update as fast as possible, I needed tables that only had primary keys, identity integers rather than uniqueidentifiers, as they lead to fragmentation of clustered indexes. In order to not only index the columns that are used in where and join conditions, but also encapsulate the behaviour in some other data structure, I decided to create shadow tables with the columns that I needed for querying. These tables I would then index in order to improve selects. The connection between the original insert table and the select table would be made via primary keys and the synchronization between the two types of tables would be done in the background, whenever needed. Best of all, analysis on execution plans could automatically determine the columns needed for this system and create the tables and indexes required, then suggest improvements on the observed queries.

In order to test my theory I created the following tables:
  • OriginalTable - with a primary key identity ingeniously called Id and two other columns called Latitude and Longitude, containing random decimal(18,6) values
  • LocationIndexTable - with a refId integer primary key column that links to the OriginalTable Id - without being a foreign key - and two Latitude and Longitude columns that are indexed by the same non clustered index
  • LocationIndexTable2 - with a refId integer primary key column that links to the OriginalTable Id - without being a foreign key - and a locId integer column that links to another table called LocationTable and has its own non clustered index
  • LocationTable - with a primary key identity column and Latitude and Longitude columns. I know that this looks identical to LocationIndexTable, but this is for the more complex case where there are multiple records with the same location. LocationTable holds the distinct Latitude and Longitude values
  • LocationIndexTable3 - with a refId integer column that links to the OriginalTable Id and is indexed by its own nonclustered index - without being a foreign key - and two Latitude and Longitude columns that are indexed by a clustered index

With 77,179,072 rows in the original table, I ran the queries for each case, being careful to clear the cache and memory buffers before each one. Here are the results:
  • SELECT count(1) FROM OriginalTable WHERE Latitude BETWEEN 45.5 AND 46 AND Longitude BETWEEN 8.5 AND 9 - no indexes whatsoever. Result: 30 seconds
  • SELECT count(1) FROM OriginalTable ot INNER JOIN LocationIndexTable lit ON lit.RefId=ot.Id WHERE lit.Latitude BETWEEN 45.5 AND 46 AND lit.Longitude BETWEEN 8.5 AND 9. Result: 17 seconds
  • SELECT count(1) FROM OriginalTable ot INNER JOIN LocationIndexTable2 lit2 ON lit2.RefId=ot.Id INNER JOIN LocationTable lt ON lit2.LocId=lt.Id WHERE lt.Latitude BETWEEN 45.5 AND 46 AND lt.Longitude BETWEEN 8.5 AND 9. Result: 41 seconds
  • SELECT count(1) FROM OriginalTable ot INNER JOIN LocationIndexTable3 lit ON lit.RefId=ot.Id WHERE lit.Latitude BETWEEN 45.5 AND 46 AND lit.Longitude BETWEEN 8.5 AND 9. Result: 22 seconds

Unexpectedly for me, the most comfortable solution also seems to be the fastest. Indeed, one issue is that I didn't have duplicate location data in the database, so there was no advantage in adding a new table to link locations to the original table. That means that in cases where the indexed data has many duplicates, it might be worth experimenting with a "distinct" table, although the indexing should take this case into account as well. The clustered index is slower than the nonclustered one, which was a surprise. Also, the indexing just made the query twice as fast - I had expected more. Of course, all this benchmarking was done after deleting the cache and buffers with the commands DBCC FREEPROCCACHE; DBCC DROPCLEANBUFFERS. It is interesting to see how fast the queries go without this clearing. The first unindexed query takes 3 or 4 seconds, while all the others take less than 2.

There is one more thing that needs to be addressed: moving these tables into another database. Are the indexes just as fast? They should be, but we must test interdatabase communication. This would allow moving the entire system outside a target database, truly encapsulated, really adding value to a database that remains unaffected. My tests show the following results: 28, 18, 29 and 18 seconds, respectively. Yes, you saw that right: the joins when the index tables are in another database on the same server seem faster! Just to make sure, I reran the original tests on the same database and got approximately the same results: 29, 19, 43 and 24 seconds, respectively. The only thing I didn't try (and I don't intend to) is to create primary-foreign key associations, as this means modifying the original tables, something I wish to avoid.

So, after all this I believe my idea has been validated. A lot more work has to be done in order to automate this system, but at least a no-effort manual solution is possible. Let's recap:
  1. The speed of row inserts and updates remains unchanged: the only index is the original primary key identity integer key that should always exist in a table anyway.
  2. The speed of selects is improved by creating tables that have an integer primary key that links to the original table, and only the columns used in queries, over which indexes are created.
  3. Queries can be used to determine the columns needed to index. Even better, computed columns can be used instead of the original columns, which adds a little more performance.
  4. Encapsulation is achieved by not only creating other tables for reading, but also moving these tables into another database, after which, unexpectedly, the performance is even better.


The end result is a system similar to indexing, but which takes space in another database and which is updated on demand, not when the system deems it necessary. Any SQL statements that worked before will continue to work unaffected, but faster alternatives for selects will become available. Overall, I am very satisfied, although I had expected better performance improvements than observed.

I was trying to do a simple thing: configure a daily rolling log for log4net, meaning that I wanted log files to be created daily, with the date in the file name. The log was already configured and working with a normal FileAppender, so all I had to do was find the correct configuration. There are several answers on the Internet regarding this. I immediately went to the trusty StackOverflow, read the first answers, copy-pasted lazily, and it seemed to work. But it did not. So, warning: the first answer on StackOverflow about this is wrong.

So, my logs had to be named something like Application_20150420.log. That means several things:
  • The name of the appender class has to be set to log4net.Appender.RollingFileAppender
  • The names of the log files need to start with Application_ and end in .log
  • The names of the log files need to contain the date in the correct format
  • The logger needs to know that the files need to be created daily

This is the working configuration:
<appender name="FileAppender" type="log4net.Appender.RollingFileAppender">
<filter type="log4net.Filter.LevelRangeFilter">
<acceptOnMatch value="true" />
<levelMin value="DEBUG" />
</filter>

<file type="log4net.Util.PatternString" value="c:\logfiles\Application_.log" />
<appendToFile value="true" />
<rollingStyle value="Date" />
<datePattern value="yyyyMMdd" />
<preserveLogFileNameExtension value="true"/>
<staticLogFileName value="false" />

<lockingModel type="log4net.Appender.FileAppender+MinimalLock"/>
<layout type="log4net.Layout.PatternLayout">
<conversionPattern value="%date %-5level %logger - %message%newline"/>
</layout>
</appender>

As you can see, it is a little counter-intuitive. You do not need to specify the date in the file name, it will be added automatically, and you absolutely need to use preserveLogFileNameExtension, otherwise your files would look like Application_.log20140420

A blog reader asked me to help him get rid of the ugly effect of a large background image loading progressively. I thought of several solutions, each more complicated than the last, but in the end settled on one that seems to be working well and doesn't require complicated libraries or difficult implementation: using the img onload event.

Let's assume that the background image is on the body element of the page. The solution involves setting a style on the body to hide it (style="display:none"), then adding, as a child of the body, an image that is also hidden and that, when it finishes loading, shows the body element. Here is the initial code:

<style>
body {
background: url(bg.jpg) no-repeat center center fixed;
}
</style>
<body>


And after:

<style>
body {
background: url(bg.jpg) no-repeat center center fixed;
}
</style>
<body style="display:none">
<img src="bg.jpg" onload="document.body.style.display=''" style="display:none;" />


This loads the image in a hidden img element and shows the body element when the image has finished loading.

The solution might have some problems with Internet Explorer 9, as it seems the load event is not fired for images retrieved from the cache. In that case, a slightly more complex Javascript solution is needed as detailed in this blog post: How to Fix the IE9 Image Onload Bug. Also, in Internet Explorer 5-7 the load event fires for animated GIFs at every loop. I am sure you know it's a bad idea to have an animated GIF as a page background, though :)

Warning: While this hides the effect of slow loading background images, it also hides the page until the image is loaded. This makes the page appear blank until then. More complex solutions would show some simple html content while the page is loading rather than hiding the entire page, but this post is about the simplest solution for the question asked.

A more comprehensive analysis of image preloading, complete with a very nice Javascript code that covers a lot of cases, can be found at Preloading images using javascript, the right way and without frameworks

I am starting a new blog series called Blog Question, due to the successful incorporation of a blog chat that works, is free, and does exactly what it should do: Chatango. All except letting me know in real time when a question has been posted on the chat :( Because of that, many times I don't realize that someone is asking me things and so I fail to answer. As a solution, I will try to answer questions in blog posts, after I do my research. The new label associated with these posts is 'question'.

First off, some assumptions. I will assume that the person who said "I'm working on this project of address validation. Using crf models is my concern." was talking about Conditional Random Fields and that he meant postal addresses. If you are reading this, please let me know if that is correct. Also, since I am a .NET developer, I will use concepts related to .NET.

I knew nothing about CRFs before writing this post, so bear with me. The Wikipedia article about them is hard to understand by anyone without mathematical (specifically probabilities and statistics) training. However, the first paragraph is pretty clear: Conditional random fields (CRFs) are a class of statistical modelling method often applied in pattern recognition and machine learning, where they are used for structured prediction. Whereas an ordinary classifier predicts a label for a single sample without regard to "neighboring" samples, a CRF can take context into account. In other words, it involves a process that classifies data by taking neighboring samples into account.

A blog post that clarified the concept much better was Introduction to Conditional Random Fields. It describes how one uses so-called feature functions to extract a score from a data sample, then aggregates the scores using weights. It also explains how those weights can be automatically computed (machine learning).

In the context of postal address parsing, one would create an interface for feature functions, implement a few of them based on domain specific knowledge, like "if it's an English or American address, the word before St. is a street name", then compute the weighting of the features by training the system using a manually tagged series of addresses. I guess the feature functions can ignore the neighboring words and also do stuff like "If this regular expression matches the address, then this fragment is a street name".
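To make the idea more concrete, here is how the feature-function-plus-weights part might look in C#. This is purely illustrative and not an actual CRF implementation (a real CRF scores whole label sequences); all names are mine:

using System;
using System.Collections.Generic;

// A feature function scores a proposed label for a token, given its neighbors.
public interface IFeatureFunction
{
    double Score(string[] tokens, int position, string proposedLabel);
}

// Domain knowledge: "the word before St. is a street name"
public class StreetBeforeStFeature : IFeatureFunction
{
    public double Score(string[] tokens, int position, string proposedLabel)
    {
        bool nextIsSt = position + 1 < tokens.Length
            && tokens[position + 1].Equals("St.", StringComparison.OrdinalIgnoreCase);
        return proposedLabel == "StreetName" && nextIsSt ? 1.0 : 0.0;
    }
}

public class WeightedScorer
{
    private readonly IList<IFeatureFunction> _features;
    private readonly IList<double> _weights; // normally learned from manually tagged addresses

    public WeightedScorer(IList<IFeatureFunction> features, IList<double> weights)
    {
        _features = features;
        _weights = weights;
    }

    // Aggregate score for assigning a label to one token
    public double Score(string[] tokens, int position, string label)
    {
        double total = 0;
        for (int i = 0; i < _features.Count; i++)
        {
            total += _weights[i] * _features[i].Score(tokens, position, label);
        }
        return total;
    }
}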

I find this concept really interesting (thanks for pointing it out to me) since it successfully combines feature extraction as defined by an expert with machine learning. Actually, the expert part is not that relevant, since the automated weighting will just give a score close to 0 to all stupid or buggy feature functions.

Of course, one doesn't need to do it from scratch, other people have done it in the past. One blog post that discusses this, and also applies more probabilistic methods specifically to postal addresses, can be found here: Probabilistic Postal Address Elementalization. Of Hidden Markov Models, Maximum-Entropy Markov Models, Transformation-Based Learning and Conditional Random Fields, she found that the Maximum-Entropy Markov model and the Conditional Random Field taggers consistently had the highest overall accuracy of the group. Both consistently had accuracies over 98%, even on partial addresses. The code for this is public at GitHub, but it's written in Java.

When looking around for this post, I found a lot of references to a specific piece of software called the Stanford Named Entity Recognizer, also written in Java, but which has a .NET port. I haven't used the software, but it seems to be a very thorough implementation of a Named Entity Recognizer. Named Entity Recognition (NER) labels sequences of words in a text which are the names of things, such as person and company names, or gene and protein names. It comes with well-engineered feature extractors for Named Entity Recognition, and many options for defining feature extractors. Included with the download are good named entity recognizers for English, particularly for the 3 classes (PERSON, ORGANIZATION, LOCATION). Perhaps this would also come in handy.

This is as far as I am willing to go without discussing existing code or actually writing some. For more details, contact me and we can work on it.

More random stuff:
The primary advantage of CRFs over hidden Markov models is their conditional nature, resulting in the relaxation of the independence assumptions required by HMMs in order to ensure tractable inference. Additionally, CRFs avoid the label bias problem, a weakness exhibited by maximum entropy Markov models (MEMMs) and other conditional Markov models based on directed graphical models. CRFs outperform both MEMMs and HMMs on a number of real-world sequence labeling tasks. - from Conditional Random Fields: An Introduction

Tutorial on Conditional Random Fields for Sequence Prediction

CRFsuite - Documentation

Extracting named entities in C# using the Stanford NLP Parser

Tutorial: Conditional Random Field (CRF)

Just to remember this for future work. I wanted to replace GetDate() default column values with SysUtcDatetime(). This is the script used:
-- declare a string that will hold the actual SQL executed
DECLARE @SQL NVARCHAR(Max) = ''
SELECT @SQL=@SQL+
N'ALTER TABLE ['+t.name+'] DROP CONSTRAINT ['+o.name+'];
ALTER TABLE ['
+t.name+'] ADD DEFAULT SYSUTCDATETIME() FOR ['+c.name+'];
'
-- drop the default value constraint, then add another with SYSUTCDATETIME() as default value
FROM sys.all_columns c -- get the name of the columns
INNER JOIN sys.tables t -- get the name of the tables containing the columns
ON c.object_id=t.object_id
INNER JOIN sys.default_constraints o -- we are only interested in default value constraints
ON c.default_object_id=o.object_id
WHERE o.definition='(getdate())' -- only interested in the columns with getdate() as default value

-- execute generated SQL
EXEC sp_executesql @SQL

Recently I created a framework for translating JSON requests from a REST API to entities sent to the database. For simple objects, it was very easy, just create an SQL parameter for each property. However, for complex objects - having other objects as properties - this was not a solution. So I used a DataContractSerializer to transform the object to XML, send it as an XML SQL parameter and get the values from it in the stored procedures. Then I noticed date time inconsistencies between the two approaches. What was going on?

Let's start with the code. The DateTime object created from the JSON is a date and time value with a timezone, like 16:00 UTC+1. That is 15:00 in universal time. Once you send it as a parameter to a stored procedure, the value received by the stored procedure is 16:00 (the server has the same timezone). In SQL Server, the DATETIME and DATETIME2 types don't store timezone information. However, when sent through XML, the value looks like this: 2015-03-09T16:00:00.0000000+01:00. Using SELECT [Time] = T.Item.value('@Time','DATETIME2') FROM @Xml.nodes('//Location/SDO') as T(Item), the value returned is 15:00! You only get 16:00+01 if you translate to DATETIMEOFFSET.

So let's recap: When you send a DateTime with timezone offset as an SQL parameter, the value reaching the SQL server is the local time. When you extract a textual value with timezone offset from an XML into a DATETIME, using the .value method, the value you get back is basically the Universal Time.

Solutions? Well, if you are going to use DateTime, you might as well consider that servers and devices are not always in the same timezone. Always translating values to universal time might be a good idea. Another solution is to extract from XML to a DATETIMEOFFSET type, which holds both the datetime and the timezone offset. Converting that value to DATETIME or DATETIME2 removes the timezone (Warning: it does NOT give the local time, unless you are in the same zone as the timezone in the datetimeoffset value).
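As a trivial sketch of the first suggestion, on the .NET side (nothing more than standard framework calls):

// Normalize to universal time (or keep the offset explicitly) before the
// value leaves the application, so the database always stores UTC.
DateTime local = DateTime.Now;                  // e.g. 16:00 in a UTC+1 timezone
DateTime utc = local.ToUniversalTime();         // 15:00, DateTimeKind.Utc

DateTimeOffset withOffset = DateTimeOffset.Now; // 16:00 +01:00, offset preserved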

I was trying to solve another problem: operations creating, altering or dropping databases cannot be executed inside a .NET transaction. Therefore I created a new TransactionScope inside the main one, using the TransactionScopeOption.Suppress option, but then I started getting this weird exception: The transaction associated with the current connection has completed but has not been disposed. The transaction must be disposed before the connection can be used to execute SQL statements. But I did Complete and Dispose all my transaction scopes, so what was going on? Long story short, this is a really confusing message for a simple problem: your transaction probably timed out. Create the transaction scope with a large TimeSpan as the Timeout and it will get through. If you can, use a try/catch/finally block in which you Dispose the transaction (just remember that a using construct is basically a try/finally block). In my case, I was conditionally creating this new TransactionScope, so I wasn't using using. The transaction would fail and bleed into the other transaction, where the confusing exception was being thrown.
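A sketch of the resulting structure, assuming the timeout value and the placement of the DDL are up to you:

using System;
using System.Transactions;

// An explicit, generous timeout on the scope avoids the misleading
// "completed but has not been disposed" exception caused by a silent timeout.
using (var outer = new TransactionScope(TransactionScopeOption.Required, TimeSpan.FromMinutes(30)))
{
    // ... normal transactional work ...

    // DDL such as CREATE/ALTER/DROP DATABASE cannot run in the .NET transaction,
    // so suppress the ambient transaction for that part
    using (var inner = new TransactionScope(TransactionScopeOption.Suppress))
    {
        // ... run the DDL here ...
        inner.Complete();
    }

    outer.Complete();
}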

I have another SQL quirk for you today, particularly in Microsoft's SQL Server. It concerns the AVG function when the SUM of the averaged values would result in an overflow for that type: AVG fails with the same error. Let's imagine for a moment that we didn't have an AVG function. In that case, our code would use SUM and COUNT to average values. For the integer column x, the naive SUM(x)/COUNT(*) would fail when the sum goes over the INT maximum value (I am not sure that SUM should fail, either, but that is another discussion). So our solution would be something like CONVERT(INT,SUM(CONVERT(BIGINT,x)))/COUNT(*). Obviously, this is the solution for the AVG issue as well: just average the values converted to the biggest type available. So it seems that in their flagship SQL server, Microsoft implemented the naive version, hmm... To be fair, this behaviour is documented in the T-SQL function page: If the sum exceeds the maximum value for the data type of the return value an error will be returned.

As I see it, it is a bug, as this should have been handled internally, and one is logged in Microsoft Connect here: AVG function causes Arithmetic overflow error. Read the funny comment from some Microsoft employee who says there is a solution proposed, but they don't have time to implement it in SQL Server 2008, so probably they will do that in a future version. I'm trying on SQL Server 2014; it's 7 years later, Microsoft!!

Just for kicks, I tried the same in MySQL and it works there. More than that, it doesn't fail for SUM, either, as it automatically returns a BIGINT value, pushing the issue to an eventual insert or update in an INT column.

I am relatively new to the entire NuGet ecosystem. What I expected was for things to just work. You know... Microsoft. However, the web of interdependencies seems to be too much even for them. The problems that appear when updating MVC versions, .NET Framework versions, etc., are as annoying as they are unclear. One example: I was trying to publish a project that worked perfectly on my system. I moved it to the server machine, and weird things began to happen. The most annoying of them all is that the errors occur at runtime instead of at compile time. Such an error was "Could not load file or assembly System.Web.WebPages, Version=3.0.0.0...". The project reference for WebPages was 2.0.0.0. If I removed it and tried to add a reference, only versions 1.0.0.0 and 2.0.0.0 were available. Meanwhile Razor was complaining that it couldn't find the 3.0.0.0 version of WebPages.

Long story short: don't try to resolve only the library dependencies, but also the framework dependencies. In this case, System.Web.WebPages 3.0.0.0 is only available for the .NET framework 4.5.1. The project was configured as 4.5. Updating the MVC framework after the change to .NET 4.5.1 solved it.

Steps:
  • Change the project version to 4.5.1 (or whatever the newest usable .NET Framework version is)
  • go to the NuGet Package Manager Console in Visual Studio
  • Run command Update-Package -reinstall Microsoft.AspNet.Mvc

This, of course, is not a panacea for all problems, but just remember that the .NET framework is important in these scenarios.

I met this situation where I wanted to implement an HTTP interceptor, a piece of JavaScript code that would do something on unauthorized access to my API. The way to do this is irrelevant (and differs between jQuery and AngularJS sites), but there is a problem that affects every situation, and that is when you access the API from a different domain than the API's. You see, the browser needs the API server to authorize CORS for every Ajax request that accesses that server from another domain. You might think you did that already in your API, but let me ask you: when there is a problem, like unauthorized access, are you sending CORS headers with your error response? Because if you do not, everything you send, including the HTTP status code, will not be parsed by the browser and any interception will just show a code of 0. The situation is a little confounded by the fact that the browser does announce that you have a CORS access problem, but also displays the status message, which in this case would be "Unauthorized access". This might make you think you can access that message or the status code. Well, you cannot.

The solution is to send the CORS headers: Access-Control-Allow-Origin, Access-Control-Allow-Methods, Access-Control-Allow-Headers. Just set their value to "*" and send them with your error response and now your JavaScript can read the error code and message.
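For ASP.NET Web API, for example, one way to make sure those headers are present even on error responses is a message handler. A sketch only, to be adjusted to your own stack; the handler name is mine and "*" is used purely for illustration:

using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Appends the CORS headers to every response, including errors, so the
// browser lets the JavaScript interceptor read the status code and message.
public class CorsOnErrorsHandler : DelegatingHandler
{
    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        var response = await base.SendAsync(request, cancellationToken);
        if (!response.Headers.Contains("Access-Control-Allow-Origin"))
        {
            response.Headers.Add("Access-Control-Allow-Origin", "*");
            response.Headers.Add("Access-Control-Allow-Methods", "*");
            response.Headers.Add("Access-Control-Allow-Headers", "*");
        }
        return response;
    }
}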