Regular expressions in Java

Published Feb 14, 2017

Posted in
Java
programming

Ugh! I will probably be working in Java for a while. Or I will kill myself, one of the two. So far I hate Eclipse, I can't even write code and I have to blog simple stuff like how to do regular expressions. Well, take it as a learning experience! Exiting my comfort zone! 2017! Ha, ha haaaaa [sob! sob!]

In Java you use Pattern to do regex:

Pattern p = Pattern.compile("a*b");
Matcher m = p.matcher("aaaaab");
boolean b = m.matches();

So let's do some equivalent code:

.NET:

var regDate=new Regex(@"^(?<year>(?:19|20)?\d{2})-(?<month>\d{1,2})-(?<day>\d{1,2})$",RegexOptions.IgnoreCase);
var match=regDate.Match("2017-02-14");
if (match.Success) {
  var year=match.Groups["year"];
  var month=match.Groups["month"];
  var day=match.Groups["day"];
  // do something with year, month, day
}

Java:

Pattern regDate=Pattern.compile("^(?<year>(?:19|20)?\\d{2})-(?<month>\\d{1,2})-(?<day>\\d{1,2})$", Pattern.CASE_INSENSITIVE);
Matcher matcher=regDate.matcher("2017-02-14");
if (matcher.find()) {
  String year=matcher.group("year");
  String month=matcher.group("month");
  String day=matcher.group("day");
  // do something with year, month, day
}

Notes:

The first thing to note is that there is no verbatim literal support in Java (the @"string" format from .NET) and there is no "var" (one always has to specify the type of the variable, even if it's fucking obvious). Second, the regular expression object Pattern doesn't do things directly, instead it creates a Matcher object that then does operations. The two bits of code above are not completely equivalent, as the Success property in a .NET Match object holds the success of the already performed operation, while .find() in the Java Matcher object actually performs the match.

Interestingly, it seems that Pattern is automatically compiling the regular expression, something that .NET must be directed to do. I don't know if the same term means the same thing for the two frameworks, though.

Another important thing is that it is more efficient to reuse matchers rather than recreate them. So when you want to use the matcher on another string, use matcher.reset("newstring").

And lastly, the string class itself has quick and dirty regular expression methods like .matches, replaceFirst and .replaceAll. The matches method only returns a bool if the string is a perfect match (equivalent to a Pattern match with ^ at the beginning and $ at the end).

Javascript debounce function

Published Jan 25, 2017

and has 0 comments

Often we need to attach functions on Javascript events, but we need them to not be executed too often. Mouse move or scroll events can fire several times a second and executing some heavy computation directly would make everything slow and unresponsive. That's why we use a method called debounce, that takes your desired function and returns another function that will only get executed so many times in a time interval.

Long story short, here is the code:

function debounce(fn, wait) {
    var timeout=null;
    var c=function(){ clearTimeout(timeout); timeout=null; };
    var t=function(fn){ timeout=setTimeout(fn,wait); };
    return function() {
        var context=this;
        var args=arguments;
        var f=function(){ fn.apply(context,args); };
        if (!timeout) {
            t(c);
            f();
        } else {
            c();
            t(f);
        }
    }
}

with a more compressed ES6 version here.

It has reached a certain kind of symmetry, so I like it this way. Let me explain how I got to it.

First of all, there is often a similar function used for debouncing. Everyone remembers it and codes it off the top of the head, but it is partly wrong. It looks like this:

function debounce(fn, wait) {
    var timeout=null;
    return function() {
        var context=this;
        var args=arguments;
        var f=function(){ fn.apply(context,args); };
        clearTimeout(timeout);
        timeout=setTimeout(f,wait);
    };
}

It seems OK, right? Just extend the time until the function gets executed. Well, the problem is that the first time you call the function you will have to wait before you see any result. If the function is called more often than the wait period, your code will never get executed. That is why Google shows this page as *the* debounce reference: JavaScript Debounce Function. And it works, but good luck trying to understand its flow so completely that you can code it from memory. My problem was with the callNow variable, as well as the rarity of cases when I would need to not call the function immediately the first time, thus making the immediate variable redundant.

So I started writing my own code. And it looked like the "casual" debounce function, with an if block added. If the timeout is already set, then just reset it; that's the expected behavior. When isn't this the expected behavior? When calling it the first time or after a long period of inactivity. In other words when the timeout is not set. And the code looked like this:

function debounce(fn, wait) {
    var timeout=null;
    return function() {
        var context=this;
        var args=arguments;
        var f=function(){ fn.apply(context,args); };
        if (timeout) {
            clearTimeout(timeout);
            timeout=setTimeout(f,wait);
        } else {
            timeout=setTimeout(function() {
                clearTimeout(timeout);
                timeout=null;
            });
            f();
        }
    };
}

The breakthrough came with the idea to use the timeout anyway, but with an empty function, meaning that the first time it is called, the function will execute your code immediately, but also "occupy" the timeout with an empty function. Next time it is called, the timeout is set, so it will be cleared and reset with a timeout using your initial code. If the interval elapses, then the timeout simply gets cleared anyway and next time the call of the function will be immediate. If we abstract the clearing of timeout and the setting of timeout in the functions c and t, respectively, we get the code you saw at the beginning of the post. Note that many people using setTimeout/clearTimeout are in the scenario in which they set the timeout immediately after they clear it. This is not always the case. clearTimeout is a function that just stops a timer, it does not change the value of the timeout variable. That's why, in the cases when you want to just clear the timer, I recommend also setting the timeout variable to null or 0.

For the people wanting to look cool, try this version:

function debounce(fn, wait) {
    var timeout=null;
    var c=function(){ clearTimeout(timeout); timeout=null; };
    var t=function(fn){ timeout=setTimeout(fn,wait); };
    return function() {
        var context=this;
        var args=arguments;
        var f=function(){ fn.apply(context,args); };
        timeout
            ? c()||t(f)
            : t(c)||f();
    }
}

Now, doesn't this look sharp? The symmetry is now obvious. Based on the timeout, you either clear it immediately and time out the function or you time out the clearing and execute the function immediately.

Update 26 Apr 2017: Here is an ES6 version of the function:

function debounce(fn, wait) {
    let timeout=null;
    const c=()=>{ clearTimeout(timeout); timeout=null; };
    const t=fn=>{ timeout=setTimeout(fn,wait); };
    return ()=>{
        const context=this;
        const args=arguments;
        let f=()=>{ fn.apply(context,args); };
        timeout
            ? c()||t(f)
            : t(c)||f();
    }
}

Need help solving a code design problem: API validation

Published Jan 3, 2017

and has 2 comments

I am working on this issue that I can't seem to properly be able to solve. No matter what I do I can't shake the feeling of code smell from it. It involves a chain of validations and business operations that I want to separate. The simple scenario is this: I have a web API that receives a data transfer object and needs to perform some action, but then it gets murkier and murkier.

Simply code

The first implementation is pretty obvious. I get the DTO, perform checks on it, access the database to get more info, perform more checks, do more work until everything is done. It's easy to do, clear imperative linear programming, it does the job. However I get a bad smell from the code because I have mashed together validation and business logic. I would also like to reuse the validation in other areas of the project.

Manager and Validator

So the natural improvement is to create two flows instead of one, each with their own concern: a validator will perform checks so I know an operation is valid, then the manager performs the operation. But this soon hits a snag. Let say I want to validate that a record exists from the id that I send as a property in the DTO. If I do this check in the validator, then it accesses the database. It stinks! I should probably get the record with the manager, but then I have to validate that the record was returned and that the data in it is valid in the context of my DTO. What do I do?

Manager and Validator classes

OK, so I can split the validation and operational flows in methods that will live in their respective classes. In this scenario a validator checks the record id is not malformed, a manager attempts to get the record, then the validator executes another method to check the record exists and conforms to the expectations, and so on. Problem solved, right? But no. That means I have to pass the record to the validating method, or any number of records that I need to perform the validation. For example I want to check a large chain of records that need to be validated based on conditions in the DTO. I could get the list of all records, pass them to the validator and then check if they exist, but what if the database is very large?

Manager methods that help validation and Validator

Wait, I can be smart about it. I know that essentially I want to get a list of records then validate something. Can't I just create a specific method in the manager that gets only the records involved in validation? Just do manager.GetRelatedRecords(dto.Id) to get a list of records from the database, then pass them as parameters to the validator methods. Fixed it, right? Nope. What if there is a breaking condition that is related to validation? The original linear method would just go through the steps and exit when finished. It wouldn't get a bunch of unnecessary records then fail at the first later on. And there is even a worse problem: what if while I was busy getting the records and preparing for validation the records have changed? I am basically introducing a short lived cache, with all the problems arising from it, like concurrency and invalidation.

Transactions

Hey, I can just new TransactionScope the shit out of it, right? Do steps of Validate, GetMoreData and repeat all inside a transaction. Anything fails, just rollback, what is the big deal? Well, concurrent access being slowed or deadlocks would be problematic, but more than that I am terrified of the result. I started with a linear flow in a single method and now I have two classes, each with a lot of methods which get more and more parameters as I move on: first I validate the DTO values, then a record in relation to the DTO, then two records in relation to each other and the DTO and so on. All these methods are being executed inside a transaction using a local ad-hoc memory cache to hold all the data I presume I will need. The manager methods are either using a big list of parameters themselves or repeatedly instantiating the same providers for the data they get. It's a lot more complex than the original, working design. It's also duplicated flow, as it needs to at least partially go through the data to populate what I need, then it goes through the same flow during the validation.

Context

Ha! I know! Instead of passing an increasing number of parameters to validation methods, create a single context object in which I just populate properties as I go. No more mega methods. Anything can be solved by adding another layer of indirection, correct? Well, now I have three classes, not two, with the context class practically constituting a replacement of the local scope in the original method. The validation methods are now completely obscure unless you look inside them: they all receive the DTO and the context class. Who knows which properties in the context are being used and how?

Interfaces and Flow

Ok, maybe I rushed into things. Sometimes several layers of indirection are required to solve a single problem. In this case I will create an interface with the properties and methods used by every validation method. So the name validation will look like ValidateName(MyDTO dto, INameContext nameContext), the graph validation will look like ValidateGraph(MyDTO dto, IGraphStructure graph) and my context class will implement INameContext and IGraphStructure, even when the two interfaces have common members. I will do the same with the manager class, which will implement the interfaces required to populate the specific context interfaces.

I am not kidding!

This isn't one of those "how the junior and the senior write the same code" jokes. I actually don't know what to do. The most "elegant" solution turned a working method from an API class into three classes, transactions, a dozen interfaces and, best of all, still keeping the entire logical flow in the original method. You know how when a code block becomes too complex it needs to be split into several short methods, right? Well I feel like in this case I somehow managed to split each line in the original block into several methods. I can see the Medium article title now: "If your method has one line of code, it's too long!".

Unit Tests

Any software developer worth his salt will write unit tests, especially for a code such as this. That means any large validation method will need to be split just in order to not increase unit test size exponentially. Can you imagine writing unit tests for the methods that use the huge blob-like context class?

Cry for help

So help me out here, is there an general, elegant but simple solution to API separation of concerns in relation to validation?

Deserializing base type property with JSON.Net and having it work in WCF as well

Published Dec 24, 2016

Posted in
.NET
programming
C#
github

and has 0 comments

Sometimes we want to serialize and deserialize objects that are dynamic in nature, like having different types of content based on a property. However, the strong typed nature of C# and the limitations of serializers make this frustrating.

In this post I will be discussing several things:

How to deserialize a JSON property that can have a value of multiple types that are not declared
How to serialize and deserialize a property declared as a base type in WCF
How to serialize it back using JSON.Net

The code can all be downloaded from GitHub. The first phase of the project can be found here.

Well, the scenario was this: I had a string in the database that was in JSON format. I would deserialize it using JSON.Net and use the resulting object. The object had a property that could be one of several types, but no special notation to describe its concrete type (like the $type notation for JSON.Net or the __type notation for DataContractSerializer). The only way I knew what type it was supposed to be was a type integer value.

My solution was to declare the property as JToken, meaning anything goes there, then deserialize it manually when I have both the JToken value and the integer type value. However, passing the object through a WCF service, which uses DataContractSerializer and which would throw the exception System.Runtime.Serialization.InvalidDataContractException: Type 'Newtonsoft.Json.Linq.JToken' is a recursive collection data contract which is not supported. Consider modifying the definition of collection 'Newtonsoft.Json.Linq.JToken' to remove references to itself..

I thought I could use two properties that pointed at the same data: one for JSON.Net, which would use JToken, and one for WCF, which would use a base type. It should have been easy, but it was not. People have tried this on the Internet and declared it impossible. Well, it's not!

Let's work with some real data. Database JSON:

{
  "MyStuff": [
    {
      "Configuration": {
        "Wheels": 2
      },
      "ConfigurationType": 1
    },
    {
      "Configuration": {
        "Legs": 4
      },
      "ConfigurationType": 2
    }
  ]
}

Here is the code that works:

[DataContract]
[JsonObject(MemberSerialization.OptOut)]
public class Stuff
{
    [DataMember]
    public List<Thing> MyStuff;
}

[DataContract]
[KnownType(typeof(Vehicle))]
[KnownType(typeof(Animal))]
public class Configuration
{
}

[DataContract]
public class Vehicle : Configuration
{
    [DataMember]
    public int Wheels;
}

[DataContract]
public class Animal : Configuration
{
    [DataMember]
    public int Legs;
}

[DataContract]
public class Thing
{
    private Configuration _configuration;
    private JToken _jtoken;

    [DataMember(Name = "ConfigurationType")]
    public int ConfigurationType;

    [IgnoreDataMember]
    public JToken Configuration
    {
        get
        {
            return _jtoken;
        }
        set
        {
            _jtoken = value;
            _configuration = null;
        }
    }

    [JsonIgnore]
    [DataMember(Name = "Configuration")]
    public Configuration ConfigurationObject
    {
        get
        {
            if (_configuration == null)
            {
                switch (ConfigurationType)
                {
                    case 1: _configuration = Configuration.ToObject<Vehicle>(); break;
                    case 2: _configuration = Configuration.ToObject<Animal>(); break;
                }
            }
            return _configuration;
        }
        set
        {
            _configuration = value;
            _jtoken = JRaw.FromObject(value);
        }
    }
}

public class ThingContractResolver : DefaultContractResolver
{
    protected override JsonProperty CreateProperty(System.Reflection.MemberInfo member,
            Newtonsoft.Json.MemberSerialization memberSerialization)
    {
        var prop = base.CreateProperty(member, memberSerialization);
        if (member.Name == "Configuration") prop.Ignored = false;
        return prop;
    }
}

So now this is what happens:

DataContractSerializer ignores the JToken property, and doesn't throw an exception
Instead it takes ConfigurationObject and serializes/deserializes it as "Configuration", thanks to the KnownTypeAttributes decorating the Configuration class and the DataMemberAttribute that sets the property's serialization name (in WCF only)
JSON.Net ignores DataContract as much as possible (JsonObject(MemberSerialization.OptOut)) and both properties, but then un-ignores the Configuration property when we use the ThingContractResolver
DataContractSerializer will add a property called __type to the resulting JSON, so that it knows how to deserialize it.
In order for this to work, the JSON deserialization of the original text needs to be done with JSON.Net (no __type property) and the JSON coming to the WCF service to be correctly decorated with __type when it comes to the server to be saved in the database

If you need to remove the __type property from the JSON, try the solution here: JSON.NET how to remove nodes. I didn't actually need to do it.

Of course, it would have been simpler to just replace the default serializer of the WCF service to Json.Net, but I didn't want to introduce other problems to an already complex system.

A More General Solution

To summarize, it would have been grand if I could have used IgnoreDataMemberAttribute and JSON.Net would just ignore it. To do that, I could use another ContractResolver:

public class JsonWCFContractResolver : DefaultContractResolver
{
    protected override JsonProperty CreateProperty(System.Reflection.MemberInfo member,
        Newtonsoft.Json.MemberSerialization memberSerialization)
    {
        var prop = base.CreateProperty(member, memberSerialization);
        var hasIgnoreDataMember = member.IsDefined(typeof(IgnoreDataMemberAttribute), false);
        var hasJsonIgnore = member.IsDefined(typeof(JsonIgnoreAttribute), false);
        if (hasIgnoreDataMember && !hasJsonIgnore)
        {
            prop.Ignored = false;
        }
        return prop;
    }
}

This class now automatically un-ignores properties declared with IgnoreDataMember that are also not declared with JsonIgnore. You might want to create your own custom attribute so that you don't modify JSON.Net's default behavior.

Now, while I thought it would be easy to implement a JsonConverter that works just like the default one, only it uses this contract resolver by default, it was actually a pain in the ass, mainly because there is no default JsonConverter. Instead, I declared the default contract resolver in a static constructor like this:

JsonConvert.DefaultSettings = () => new JsonSerializerSettings
{
    ContractResolver=new JsonWCFContractResolver()
};

Now the JSON.Net operations don't need an explicitly declared contract resolver. Updated source code on GitHub

Links that helped

Configure JSON.NET to ignore DataContract/DataMember attributes
replace WCF built-in JavascriptSerializer with Newtonsoft Json.Net json serializer
Cannot return JToken from WCF Web Service
Serialize and deserialize part of JSON object as string with WCF

Well organized ways of doing software development

Published Dec 10, 2016

Posted in
programming
process

and has 0 comments

I am not the manager type: I do software because I like programming. Yet lately I found myself increasingly be the go-to guy for development processes. And I surprisingly knew of what people were asking of me. And it's true, no matter how crappy your jobs are and how little you get to do something truly impressive or challenging, you always gain ... operational experience. I've been working in almost every software environment possible: chaotic, amateurish, corporate, agile, state and everything in between. Therefore I decided to write a blog post about what I thought worked in the development process, rather than talking about code.

First things first

The first thing I need to say is that process, no matter which one you choose, will never work without someone to manage it and perhaps even enforce it. Even a team that agrees unanimously on how they are going to organize the work will slack off eventually. Process is never pleasant to define and follow to the letter, but it is absolutely necessary. In Scrum, for example, they defined the role of Scrum Master exactly for this reason. Even knowing fully that Scrum helps the team, developers and managers alike would cut corners and mess it all up.

The second thing I feel important to put out there before I describe a working process is that doing process for the sake of process is just stupid. The right way to think about it is like playing a game. The best way to play it may be determined by analyzing its rules and planning the actions that will benefit the player most, but no one will ever play the game if it's not fun. Therefore the first thing when determining your process is to make it work for your exact team. The second is to follow the plan as well as possible, because you know and your team knows that is the best way to achieve your goals.

A process

From my personal experience, the best way to work in a team was when we were using Scrum. I have no experience with Extreme Programming or Kanban, though, so I can't tell you which one is "best". As noted above, the best is determined by your team composition, not by objective parameters. For me Scrum was best for two main reasons.

Planning was done within the team, meaning that we both estimated and discussed the issues of the following sprint together. There was no one out of the loop, unless they chose to not pay attention, and it was always clear what we were meant to do. As with every well implemented Scrum process, we had all the development steps for our tasks done during the same sprint: planning, development, unit testing, code review, testing, demo, refactoring and documenting. At the end of the sprint we were confident we did a good job (or aborted the sprint) and that we had delivered something, no matter how trivial, and showed it working. This might not seem so important, but it is, for morale reasons. At the end of the sprint the team was seeing the fruits of their labor. So this is one reason: planned things together, made them work, showed them working and saw the results of our labor and got feedback for it.

The second reason is specific to Scrum because Scrum sprints are timeboxed. You either do what you planned to do in that time or you abort the sprint. This might sound like some pedantic implementation of a process that doesn't take into account the realities of software development, but it's not. Leaving aside the important point that one cannot judge results of work that changed direction since planning, timeboxing makes it difficult and painful to change things from the original planning. That's the main reason for it, in my opinion, and forces the team to work together from beginning to end. The product owner will not laze out and give you a half thought feature to implement because they know it will bite them in the ass if the sprint aborts midway. The developers will not postpone things because they feel they have other things more important to implement. The managers attached to the project will get that nice report they all dream about: starting with clear goals and deadlines and finishing with demonstrable results.

Personally, I also like timeboxing because I get some stuff to do and when I get it done I can do whatever the hell I want. There is no pressure to work withing a fixed daily schedule and encourages business owners to think outside the box, leave people working in their own ways as long as they produce the required results. I liked the philosophy of Scrum so much that I even considered using it for myself, meaning a Scrum of one person, working for my own backlog. It worked for a while quite well, but then I cut corners and slowly abandoned it. As I said before, you need someone to account for the proper implementation of the process. I didn't have it in me to be that person for myself. Not schizoid enough, I guess. Or my other personalities just suck.

Tooling

Process can be improved significantly by tooling. The friction caused by having to follows rules (any rules) can be lessened a lot by computer programs that by definition do just that: follow rules. There is a software developer credo that says "automate everything that you can", meaning you let the computer do as much as possible for you, the rest being what you use your time for. Do the same for your development team. Use the proper tools for the issue at hand.

A short list of tools that I liked working with:

Microsoft Visual Studio - great IDE for .NET and web development
JetBrains ReSharper - falling in and out of love with it on a regular basis, it feels like magic in the hands of a developer. It also helps with a lot of the administrative part of development like company code practices, code consistency, overall project analysis and refactoring.
Atlassian JIRA - "The #1 software development tool used by agile teams". It helps with tasks, management, bugs, reporting. It's web based and blends naturally with collaborative work for teams of any size. It allows for both project management and bug tracking, which is great.
Atlassian Confluence - an online document repository. It works like a wiki, with easy to use interface for creating documents and a lot of addons for all kinds of things.
Smartbear Code Collaborator - it makes code reviews easy to create, track and integrate.
Version control system - this is not a product, it's a class of software. I didn't want to recommend one because people are very attached and vocal about their source control software. I worked with Perforce, SVN, Git, etc. Some are better than others, but it's a long discussion not suitable for this post. Having source control, though, is necessary for any development project.
Some chat system - one might think this is trivial, but it's not. People need to be able to communicate instantly without shouting over their desks. Enough said.
Jenkins - source automation server. It manages stuff like deployment, but also helps with Continuous Integration and Delivery. I liked it because I worked with it, didn't have to make it work myself. It's written in Java, after all :)
Microsoft Exchange - used not only for emails, but also for planning of meetings between people.
Notepad - not kidding. With all those wonderful collaborative tools I described above it is easy to forget that individuals too need to plan their work. I found it very helpful to always split my work into little things to do which I write in a text file, then follow the plan. It helps a lot when someone interrupts you from your work and then you need to know where you left off.

This is by no means a comprehensive "all you ever need" list, just some of the tools that I found really useful. However, just paying for them and installing them is not enough. They need to work together. And that's where you absolutely need a role, a person who will take care of all the software, bind them together like the ring of Sauron, and make them seamless for your team which, after all, has people with other stuff to concern themselves with.

My experience

At the beginning at the sprint we would have a discussion on stories that we would like to address in it. Stories are descriptions of features that need to be implemented. They start as business cases "As [some role], I want [something] in order to achieve [some goal]" and together with the product owner (the person in your team that is closest to the client and represents their interests) the whole team discusses overall details and splits it into tasks of various general complexities. About here the team has a rough idea of what will be part of the sprint. Documents are created with the story specifications (using Confluence, for example).

Tasks are technical in nature, so only the technical part of the team, like developers and testers, continue to discuss them. The important part here is to determine the implementation details and estimate complexity in actual time. At the end of this meeting, stories are split into tasks that have people attached to them and are estimated in hours. Now the team can say with some semblance of certainty what tasks can and which cannot be part of the sprint, taking into consideration the skill of the people in the team and their availability in the coming sprint.

Now, with the sprint set, you start work. Developers are writing technical briefs on the desired implementation, eventually the changes, comments, difficulties encountered, etc. (also using Confluence). For each feature brief you will have a technical brief, split into the tasks above. Why waste time with writing documentation while you are developing? Because whenever someone comes and asks what that feature is about, you send them a link. Because whenever you want to move the project to another team, they only have to go to one place and see what has been planned, developed and what is the status of work. The testing team also writes documents on how to proceed with testing. Testing documents will be reviewed by developers and only when that is done, the testing can proceed. Testing documents may seem a bit much, but with this any tester can just do any of the work. With a clear plan, you can take a random person and ask them to test your software. I am not trying to minimize the importance of professional testing skills, but some times you actually want people that are not associated with the development to do the testing. They will note any friction in using your software exactly because they are not close to you and have never used your product before. Also, an important use of testing documents is creating unit tests, which is a developer task.

When a piece of code has been submitted to source control, it must be reviewed (Code Collab) and the deployment framework (Jenkins) will create a task for code review (Code Collab) automate a deploy and run all unit tests and static analysis. If any of them fail, a task will be created in the task manager (Jira). When testing finds bugs, they create items in the bug tracker (Jira). Developers will go over the tasks in Jira and fix what needs fixing. Email will be sent to notify people of new tasks, comments or changes in task status. Meetings to smooth things over and discuss things that are not clear will be created in the meeting software (Exchange), where conflicts will be clear to see and solve. BTW, the room where the meeting is to take place also needs to be tracked, so that meetings don't overlap in time and space.

Wait, you will say, but you are talking more of writing documents and going to meetings and less of software writing. Surely this is a complete waste of time!

No one reasonably expects to work the entire time writing code or testing or whatever their main responsibility is. Some companies assume developers will write code only four hours out of eight. The rest is reserved for planning, meetings, writing documentation and other administrative tasks. Will you write twice as much if you code for eight hours a day? Sure. Will anyone remember what you wrote in two weeks time? No. So the issue is not if you are writing, reading documents and going to meetings instead of coding, but if you are doing all that instead of trying to figure out what the code wanted to do, what it actually does and who the hell wrote that crap (Oh, it was me, shit!). Frankly, in the long run, having a clear picture of what you want and what the entire team did wins big time.

Also please take note on how software integrates with everything to make things better for you. At my current place of work they use the same software, but it is not integrated. People create code reviews if they feel like it, bugs are not related to commits, documentation is vague and out of date, unit tests are run if someone remembers to run them and then no one does because some of them fail and no one remembers why or even wants to take a look. Tooling is nothing if it doesn't help you, if it doesn't work well together. Automate everything!

What I described above worked very well. Were people complaining about too many meetings? Yes. Was the Scrum Master stopping every second discussion in daily meetings because they were meandering out of scope? Sure. Was everybody hating said Scrum Master for interrupting them when saying important stuff then making them listen to all that testing and design crap? Of course. Ignore for a moment you will never get three people together and not have one complain. The complaints were sometimes spot on. As the person responsible for the process you must continuously adapt it to needs. Everybody in an agile team needs to be agile. Sometimes you need to remove or shorten a step, then restore them to their original and then back again. You will never have all needs satisfied at the same time, but you need to take care of the ones that are most important (not necessarily more urgent).

For example, I experienced this process for a web software project that had one hundred thousand customers that created web sites for millions of people that would visit them. There were over seventy people working on it. The product was solid and in constant use for years and years. Having a clear understanding of what is going on in it was very important. Your project might be different. A small tool that you use internally doesn't need as much testing, since bugs will appear with use. It does need documentation, so people know how to use it. A game that you expect people to play en masse for a year then forget about it also has some other requirements.

For me, the question for all cases is the same: "What would you rather be doing?". The process needs to take into account the options and choose the best ones. Would you rather have a shitty product that sells quickly and then is thrown away? Frankly, some people reply wholeheartedly yes, but that's because they would rather work less and earn more. But would you be rather reading code written by someone else and trying to constantly merge your development style with what you think they wanted to do there or reading a document that explains clearly what has been done and why? Would you rather expect people to code review when they feel like it or whenever something is put in source control? Answer this question for all members of the team and you will clearly see what needs to be done. The hard part will be to sell it to your boss, as always.

Daily meeting may not be that important if all the tasks take more than three days, but then you have to ask yourself why do tasks take so long. Could you have split development more finely? Aftermath or postmortem meetings are also essential. In them you discuss what went well and what went wrong in the previous sprint. It gives you the important feedback you need in order to adapt the process to your team and to always, always improve.

Development

I mentioned Continuous Integration above. It's the part that takes every bit of code committed, deploys that changelist, and runs all unit tests to see if anything fails. At the end you either have all green or the person that committed the code is notified that they broke something. Not only it gives more confidence for the code written, but it also promotes unit testing, which in turn promotes a modular way of writing code. When you know you will write unit tests for all written code you will plan that code differently. You will not make methods depend on the current time, you will take all dependencies and plug them in your class without hardcoding them inside. Modularity promotes code that is easy to understand by itself, easy to read, easy to maintain and modify.

Documenting work before it starts makes it easy to find any issues before code writing even began, it makes for easy development of tasks that are implemented one after the other. Reviewing test plans makes developers see their work from a different perspective and have those nice "oh, I didn't think of that" moments.

And code review. If you need to cut as much as possible from process, remove everything else before you abandon code review. Having people read your code and shit all over it may seem cruel and unusual punishment, but it helps a lot. You cannot claim to have easy to understand code if no one reads it. And then there are those moments when you are too tired to write good code. "Ah, it works just the same" you think, while you vomit some substandard code and claim the task completed. Later on, when you are all rested, you will look at the code and wonder "What was I thinking?". If someone else writes something exactly like that you will scold them fiercely for being lazy and not elegant. People don't crap all over you because they hate you, but because they see problems in your code. You will take that constructive criticism and become a better developer, a better team player and a better worker for it.

Conclusions

In this post I tried to show you less how things ought to be, but how things can be. It must be your choice to change things and how to do it. Personally I enjoyed a lot this kind of organization that I felt was freeing me from all the burdens of daily work except the actual software writing that I love doing. From the managerial standpoint it also makes sense to never waste someone's skills for anything else than what they are good at. I was amazed to see how people not only not work like this, but a lot of them don't even knew this was possible and some even outright reject it.

In the way of criticism I've heard a lot of "People before process" and "We want to work here" and "We are here to have fun" and other affirmations that only show how people don't really get it. I've stated many a time above that the process needs to adapt to people; yes, it is very important and it must be enforced, but with the needs of the team put always first. This process speeds up, makes more clear and improves the work, it doesn't stop or delay it. As for fun, working with clear goals in mind, knowing you have the approval of everyone in the team and most wrinkles in the design have been smoothed out before you even began writing code is nothing but fun. It's like a beautiful woman with an instruction manual!

However, just because process can help you, doesn't mean it actually does. It must be implemented properly. It is extremely easy to fall into the "Process over people" trap, for example, and then the criticism above is not only right, but pertinent. You need to do something about it. It doesn't help that process improves productivity ten fold if all the people in the team are malcontent and work a hundred times slower just because you give them no alternative.

Another pitfall is to conceive the most beautiful process in the world, make all agile teams be productive and happy, then fail to synchronize them. In other words you will have several teams that all work as one, but the group of teams is itself not a team. That's why we had something called the Scrum of Scrums, when representatives from each team would meet and discuss goals and results.

Now ask yourself, what would you rather be doing?

Dependency Injection, Inversion of Control, Testability and other nice things

Published Dec 3, 2016

and has 0 comments

I am mentally preparing for giving a talk about dependency injection and inversion of control and how are they important, so I intend to clarify my thoughts on the blog first. This has been spurred by seeing how so many talented and even experienced programmers don't really understand the concepts and why they should use them. I also intend to briefly explore these concepts in the context of programming languages other than C#.

And yes, I know I've started an ASP.Net MVC exploration series and stopped midway, and I truly intend to continue it, it's just that this is more urgent.

Head on intro

So, instead of going to the definitions, let me give you some examples, instead.

public class MyClass {
  public IEnumerable<string> GetData() {
    var provider=new StringDataProvider();
    var data=provider.GetStringsNewerThan(DateTime.Now-TimeSpan.FromHours(1));
    return data;
  }
}

In this piece of code I create a class that has a method that gets some text. That's why I use a StringDataProvider, because I want to be provided with string data. I named my class so that it describes as best as possible what it intends to do, yet that descriptiveness is getting lost up the chain when my method is called just GetData. It is called so because it is the data that I need in the context of MyClass, which may not care, for example, that it is in string format. Maybe MyClass just displays enumerations of objects. Another issue with this is that it hides the date and time parameter that I pass in the method. I am getting string data, but not all of it, just for the last hour. Functionally, this will work fine: task complete, you can move to the next. Yet it has some nagging issues.

Dependency Injection

Let me show you the same piece of code, written with dependency injection in mind:

public class MyClass {
  private IDataProvider _dataProvider;
  private IDateTimeProvider _dateTimeProvider;

  public void MyClass(IDataProvider dataProvider, IDateTimeProvider dateTimeProvider) {
    this._dataProvider=dataProvider;
    this._dateTimeProvider=dateTimeProvider;
  }

  public IEnumerable<string> GetData() {
    var oneHourBefore=_dateTimeProvider.Now-TimeSpan.FromHours(1);
    var data=_dataProvider.GetDataNewerThan(oneHourBefore);
    return data;
  }
}

A lot more code, but it solves several issues while introducing so many benefits that I wonder why people don't code like this from the get go.

Let's analyse this for a bit. First of all I introduce a constructor to MyClass, one that accepts and caches two parameters. They are not class types, but interfaces, which declare the intention for any class implementing them. The method then does the same thing as in the original example, using the providers it cached. Now, when I write the code of the class I don't actually need to have any provider implementation. I just declare what I need and worry about it later. I also don't need to inject real providers, I can mock them so that I can test my class as standalone. Note that the previous implementation of the class would have returned different data based on the system time and I had no way to control that behavior. The best benefit, for me, is that now the class is really descriptive. It almost reads like English: "Hi, folks, I am a class that needs someone to give me some data and the time of day and I will give you some processed data in return!". The rule of thumb is that for each method, external factors that may influence its behavior must be abstracted away. In our case if the date time provider provides the same time and the data provider the same data, the effect of the method is always the same.

Note that the interface I used was not IStringDataProvider, but IDataProvider. I don't really care, in my class, that the data is a bunch of strings. There is something called the Single Responsibility Principle, which says that a class or a method or some sort of unit of computation should try to only have one responsibility. If you change that code, it should only affect one area. Now, real life is a little different and classes do many things in many directions, yet they can implement any number of interfaces. The interfaces themselves can declare only one responsibility, which is why this is so nice. I don't actually have to have a class that is only a data provider, but in the context of my class, I only need that part and I am clearly declaring my intent in the code.

This here is called dependency injection, which is a fancy expression for saying "my code receives all third party instances as parameters". It is also in line with the Single Responsibility Principle, as now your class doesn't have to carry the responsibility of knowing how to instantiate the classes it needs. It makes the code more modular, easier to test, more legible and more maintainable.

But there is a problem. While before I was using something like new MyClass().GetData(), now I have to push the instantiation of the providers somewhere up the stream and do maybe something like this:

var dataProvider=new StringDataProvider();
var dateTimeProvider=new DateTimeProvider();
var myClass=new MyClass(dataProvider,dateTimeProvider);
myClass.GetData();

The apparent gains were all for naught! I just pushed the same ugly code somewhere else. But here is where Inversion of Control comes in. What if you never need to instantiate anything again? What it you never actually had to write any new Something() code?

Inversion of Control

Inversion of Control actually takes over the responsibility of creating instances from you. With it, you might get this code instead:

public interface IMyClass {
  IEnumerable<string> GetData();
}

public class MyClass:IMyClass {
  private IDataProvider _dataProvider;
  private IDateTimeProvider _dateTimeProvider;

  public void MyClass(IDataProvider dataProvider, IDateTimeProvider dateTimeProvider) {
    this._dataProvider=dataProvider;
    this._dateTimeProvider=dateTimeProvider;
  }

  public IEnumerable<string> GetData() {
    var oneHourBefore=_dateTimeProvider.Now-TimeSpan.FromHours(1);
    var data=_dataProvider.GetDataNewerThan(oneHourBefore);
    return data;
  }
}

Note that I created an interface for MyClass to implement, one that declares my GetData method. Now, to use it, I could write something like this:

var myClass=Dependency.Get<IMyClass>();
myClass.GetData();

Wow! What happened here? I just used a magical class called Dependency that gets me an instance of IMyClass. And I really don't care how it does it. It can discover implementations by itself or maybe I am manually binding interfaces to implementations when the application starts (for example Dependency.Bind<IMyClass,MyClass>();). When it needs to create a new MyClass it automatically sees that it needs two other interfaces as parameters, so it gets implementations for those first and continues up the chain. It is called a dependency chain and the container will go through it all to simply "Get" you what you need. There are many inversion of control frameworks out there, but the concept is so simple that one can make their own easily.

And I get another benefit: if I want to display some other type of data, all I have to do is instruct the dependency container that I want another implementation for the interface. I can even think about versioning: take a class that I know does the job and compare it with a new implementation of the same interface. I can tell it to use different versions based on the client used. And all of this in exactly one place: the dependency container bindings. You may want to plug different implementations provided by third parties and all they have to care about is respecting the contract in your interface.

Solution structure

This way of writing code forces some changes in the structure of your projects. If all you have is written in a single project, you don't care, but if you want to split your work in several libraries, you have to take into account that interfaces need to be referenced by almost everything, including third party modules that you want to plug. That means the interfaces need their own library. Yet in order to declare the interfaces, you need access to all the data objects that their members need, so your Interfaces project needs to reference all the projects with data objects in them. And that means that your logic will be separated from your data objects in order to avoid circular dependencies. The only project that will probably need to go deeper will be the unit and integration test project.

Bottom line: in order to implement this painlessly, you need an Entities library, containing data objects, then an Interfaces library, containing the interfaces you need and, maybe, the dependency container mechanism, if you don't put it in yet another library. All the logic needs to be in other projects. And that brings us to a nice side effect: the only connection between logic modules is done via abstractions like interfaces and simple data containers. You can now substitute one library with another without actually caring about the rest. The unit tests will work just the same, the application will function just the same and functionality can be both encapsulated and programatically described.

There is a drawback to this. Whenever you need to see how some method is implemented and you navigate to definition, you will often reach the interface declaration, which tells you nothing. You then need to find classes that implement the interface or to search for uses of the interface method to find implementations. Even so, I would say that this is an IDE problem, not a dependency injection issue.

Other points of view

Now, the intro above describes what I understand by dependency injection and inversion of control. The official definition of Dependency Injection claims it is a subset of Inversion of Control, not a separate thing.

For example, Martin Fowler says that when he and his fellow software pattern creators thought of it, they called it Inversion of Control, but they decided that it was too broad a term, so they moved to calling it Dependency Injection. That seems strange to me, since I can describe situations where dependencies are injected, or at least passed around, but they are manually instantiated, or situations where the creation of instances is out of the control of the developer, but no dependencies are passed around. He seems to see both as one thing. On the other hand, the pattern where dependencies are injected by constructor, property setters or weird implementation of yet another set of interfaces (which he calls Dependency Injection) is different from Service Locator, where you specifically ask for a type of service.

Wikipedia says that Dependency Injection is a software pattern which implements Inversion of Control to resolve dependencies, while it calls Inversion of Control a design principle (so, not a pattern?) in which custom-written portions of a computer program receive the flow of control from a generic framework. It even goes so far as to say Dependency Injection is a specific type of Inversion of Control. Anyway, the pages there seem to follow the general definitions that Martin Fowler does, which pits Dependency Injection versus Service Locator.

On StackOverflow a very well viewed answer sees dependency injection as "giving an object its instance variables". I tend to agree. I also liked another answer below that said "DI is very much like the classic avoiding of hardcoded constants in the code." It makes one think of a variable as an abstraction for values of a certain type. Same page holds another interesting view: "Dependency Injection and dependency Injection Containers are different things: Dependency Injection is a method for writing better code, a DI Container is a tool to help injecting dependencies. You don't need a container to do dependency injection. However a container can help you."

Another StackOverflow question has tons of answers explaining how Dependency Injection is a particular case of Inversion of Control. They all seem to have read Fowler before answering, though.

A CodeProject article explains how Dependency Injection is just a flavor of Inversion of Control, others being Service Locator, Events, Delegates, etc.

Composition over inheritance, convention over configuration

An interesting side effect of this drastic decoupling of code is that it promotes composition over inheritance. Let's face it: inheritance was supposed to solve all of humanity's problems and it failed. You either have an endless chain of classes inheriting from each other from which you usually use only one or two or you get misguided attempts to allow inheritance from multiple sources which complicates understanding of what does what. Instead interfaces have become more widespread, as declarations of intent, while composition has provided more of what inheritance started off as promising. And what is dependency injection if not a sort of composition? In the intro example we compose a date time provider and a data provider into a time aware data provider, all the time while the actors in this composition need to know nothing else than the contracts each part must abide by. Do that same thing with other implementations and you get a different result. I will go as far as to say that inheritance defines what classes are, while composition defines what classes do, which is what matters in the end.

Another interesting effect is the wider adoption of convention over configuration. For example you can find the default implementation of an interface as the class that implements it and has the same name minus the preceding "I". Rather than explicitly tell the framework that we want to use the Manager class each time someone needs an IManager implementation, it can figure it out for itself by naming alone. This would never work if the responsibility of getting class instances resided with each method using them.

Real life examples

Simple Injector

If you look on the Internet, one of the first dependency injection frameworks you find for .Net is Simple Injector, which works on every flavor of .Net including Mono and Core. It's as easy to use as installing the NuGet package and doing something like this:

// 1. Create a new Simple Injector container
var container = new Container();

// 2. Configure the container (register)
container.Register<IUserRepository, SqlUserRepository>(Lifestyle.Transient);
container.Register<ILogger, MailLogger>(Lifestyle.Singleton);

// 3. Optionally verify the container's configuration.
container.Verify();

// 4. Get the implementation by type
IUserService service = container.GetInstance<IUserService>();

ASP.Net Core

ASP.Net Core has dependency injection built in. You configure your bindings in ConfigureServices:

public void ConfigureServices(IServiceCollection svcs)
{
  svcs.AddSingleton(_config);
 
  if (_env.IsDevelopment())
  {
    svcs.AddTransient<IMailService, LoggingMailService>();
  }
  else
  {
    svcs.AddTransient<IMailService, MailService>();
  }
 
  svcs.AddDbContext<WilderContext>(ServiceLifetime.Scoped);
 
  // ...
}

then you use any of the registered classes and interfaces as constructor parameters for controllers or even using them as method parameters (see FromServicesAttribute)

Managed Extensibility Framework

MEF is a big beast of a framework, but it can simplify a lot of work you would have to do to glue things together, especially in extensibility scenarios. Typically one would use attributes to declare which interface something "exports" and then use other attributes to "import" implementations in properties and values. All you need to do is put them in the same place. Something like this:

[Export(typeof(ICalculator))]
class SimpleCalculator : ICalculator {
  //...
}

class Program {

  [Import(typeof(ICalculator))]
  public ICalculator calculator;

  // do something with calculator
}

Of course, in order for this to work seamlessly you need stuff like this, as well:

private Program()
{
    //An aggregate catalog that combines multiple catalogs
    var catalog = new AggregateCatalog();
    //Adds all the parts found in the same assembly as the Program class
    catalog.Catalogs.Add(new AssemblyCatalog(typeof(Program).Assembly));
    catalog.Catalogs.Add(new DirectoryCatalog("C:\\Users\\SomeUser\\Documents\\Visual Studio 2010\\Projects\\SimpleCalculator3\\SimpleCalculator3\\Extensions"));


    //Create the CompositionContainer with the parts in the catalog
    _container = new CompositionContainer(catalog);

    //Fill the imports of this object
    try
    {
        this._container.ComposeParts(this);
    }
    catch (CompositionException compositionException)
    {
        Console.WriteLine(compositionException.ToString());
    }
}

Dependency Injection in other languages

Admit it, C# is great, but it is not by far the most used computer language. That place is reserved, at least for now, for Javascript. Not only is it untyped and dynamic, but Javascript isn't even a class inheritance language. It uses the so called prototype inheritance, which uses an instance of an object attached to a type to provide default values for the instance of said type. I know, it sounds confusing and it is, but what is important is that it has no concept of interfaces or reflection. So while it is trivial to create a dictionary of instances (or functions that create instances) of objects which you could then use to get what you need by using a string key (something like var manager=Dependency.Get('IManager');, for example) it is difficult to imagine how one could go through the entire chain of dependencies to create objects that need other objects.

And yet this is done, by AngularJs, RequireJs or any number of modern Javascript frameworks. The secret? Using regular expressions to determine the parameters needed for a constructor function after turning it to string. It's complicated and beyond the scope of this blog post, but take a look at this StackOverflow question and its answers to understand how it's done.

Let me show you an example from AngularJs:

angular.module('myModule', [])
  .directive('directiveName', ['depService', function(depService) {
    // ...
  }])

In this case the key/type of the service is explicit using an array notation that says "this is the list of parameters that the dependency injector needs to give to the function", but this might be have been written just as the function:

angular.module('myModule', [])
  .directive('directiveName', function(depService) {
    // ...
  })

In this case Angular would use the regular expression approach on the function string.

What about other languages? Java is very much like C# and the concepts there are similar. Even if all are flavors of C, C++ is very different, yet Dependency Injection can be achieved. I am not a C++ developer, so I can't tell you much about that, but take a look at this StackOverflow question and answers; it is claimed that there is no one method, but many that can be used to do dependency injection in C++.

In fact, the only languages I can think of that can't do dependency injection are silly ones like SQL. Since you cannot (reasonably) define your own types or pass functions along, the concept makes no sense. Even so, one can imagine creating dummy stored procedures that other stored procedures would use in order to be tested. There is no reason why you wouldn't use dependency injection if the language allows for it.

Testability

I mentioned briefly unit testing. Dependency Injection works hand in hand with automated testing. Given that the practice creates modules of software that give reproducible results for the same inputs and account for all the inputs, testing becomes a breeze. Let me give you some examples using Moq, a mocking library for .Net:

var dateTimeMock=new Mock<IDateTimeProvider>();
dateTimeMock
  .Setup(m=>m.Now)
  .Returns(new DateTime(2016,12,03));

var dataMock=new Mock<IDataProvider>();
dataMock
  .Setup(m=>m.GetDataNewerThan(It.IsAny<DateTime>()))
  .Returns(new[] { "test","data" });

var testClass=new MyClass(dateTimeMock.Object, dataMock.Object);

var result=testClass.GetData();
AssertDeepEqual(result,new[] { "test","data" });

First of all, I take care of all dependencies. I create a "mock" for each of them and I "set up" the methods or property setters/getters that interest me. I don't really need to set up the date time mock for Now, since the data from the data provider is always the same no matter the parameter, but it's there for you to see how it's done. Second, I instantiate the class I want to test using the Object property of my mocks, which returns an object that implements the type given as a generic parameter in the constructor. Third I assert that the side effects of my call are the ones I expect. The mocks need to be as dumb as possible. If you feel you need to write code to define your mocks you are probably doing something wrong.

The type of the tests, for people who are not familiar with this concept, is usually a fully positive one - that is give full valid data and expect the correct result - followed by many negative ones, where the correct data is made incorrect in all possible ways and it is tested that the method fails. If there are many types of combinations of data that would be considered valid, you need a test for as many of them.

Note that the test is instantiating the test class directly, using the constructor. We are not testing the injector here, but the actual class.

Conclusions

What I appreciate most with Dependency Injection is that it forces you to write code that has clear boundaries defined by interfaces. Once this is achieved, you can go write your own stuff and not care about what other people do with theirs. You can test your modules without even caring if the rest of the project even exists. It allows to refactor code in steps and with a lot more confidence since you are covered by unit tests.

While some people work on fire-and-forget projects, like small games or utilities, and they don't care about maintainability, one of the most touted reasons for using unit tests and dependency injection, these practices bring so many other benefits that are almost impossible to get otherwise.

The entire point of this is reducing the complexity of dependencies, which include not only the modules in your application, but also the support frame for them, like people working on them. While some managers might not see the wisdom of reducing friction between software components, surely they can see the positive value of reducing friction between people.

There was one other topic that I wanted to touch, but it is both vast and I have not enough experience with it, however it feels very attractive to me: refactoring old code in order to use dependency injection. Best practices, how to make it safe enough and fast enough to make managers approve it and so on. Perhaps another post later on. I was thinking of a combination of static analysis and automated methods, like replacing all usages of "new" with a single point of instantiation, warning about static methods and properties, automatically replacing known bad practices like DateTime.Now and so on. It might be interesting, right?

I hope I wasn't too confusing and I appreciate any feedback you have. I will be working on a presentation file with similar content, so any help will go into doing a better job explaining it to others.

The perils of giving your data objects methods

Published Nov 23, 2016

Posted in
.NET
programming

and has 0 comments

A colleague of mine hit a strange bug today. It so happened that we use a bastardized dependency injection method that takes into account the WCF session before returning an implementation of an interface. In a piece of code the injection failed and we couldn't see why for a while. Let me give you a simplified version:

var someManager=Package.Get<IManager>();
var someDTOs=Cache.GetDatabaseObjects().Select(x=>x.Pack());

public class DataObject {
    public string Data {get;set;}
    public DataObjectDTO Pack() {
        var anotherManager=Package.Get<IAnother>();
        return new DataObjectDTO {
           Data=anotherManager.Process(Data)
        };
    }
}

Package.Get will attempt to find a session object and if not it will use another mechanism, but if it finds one, it will only use it if it is not expired or invalid, else throwing an exception. This code failed in the Pack method, when trying to get an instance of IAnother. Please take a few moments to reflect on why (and no, it's not that between calls the session expired).

Show explanation

The expression being assigned to '[something] must be constant when using integers inside strings

Published Nov 21, 2016

Posted in
.NET
programming
C#

and has 0 comments

I've stumbled upon a very funny exception today. Basically I was creating a constant string from adding some other constant strings to each other. And it worked. The moment I added an integer, though, I got The expression being assigned to 'Program.x2' must be constant. The code that generated this error is simple:

const string x2 = "string" + 2;

Note that

const string x2 = "string" + "2";

is perfectly valid. Got the same result when using VS2010 and VS2015, so it's not a compiler bug, it's intended behavior.

So, what's going on? Well, my code transforms behind the scenes into

const string x2 = "string" + 2.ToString();

which is not constant because of ToString!

The only way to solve it was to declare the numeric constant as string as well.

Getting random rows from a table in T-SQL: TABLESAMPLE [instructional post, but not a recommended method]

Published Nov 20, 2016

Posted in
database
programming

and has 0 comments

This clause is so obscure that I couldn't even find the Microsoft reference page for it for a few minutes, so no wonder I didn't know about it. Introduced in SQL Server 2005, the TABLESAMPLE clause limits the number of rows returned from a table in the FROM clause to a sample number or PERCENT of rows.

Usage:

TABLESAMPLE (sample_number [ PERCENT | ROWS ] ) [ REPEATABLE (repeat_seed) ]

REPEATABLE is used to set the seed of the random number generator so one can get the same result if running the query again.

It sounds great at the beginning, until you start seeing the limitations:

it cannot be applied to derived tables, tables from linked servers, and tables derived from table-valued functions, rowset functions, or OPENXML
the number of rows returned is approximate. 10 ROWS doesn't necessarily return 10 records. In fact, the functionality underneath transforms 10 into a percentage, first
a join of two tables is likely to return a match for each row in both tables; however, if TABLESAMPLE is specified for either of the two tables, some rows returned from the unsampled table are unlikely to have a matching row in the sampled table.
it isn't even that random!

Funny enough, even the reference page recommends a different way of getting a random sample of rows from a table:

SELECT * FROM Sales.SalesOrderDetail
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float) / CAST (0x7fffffff AS int)

Even if probably not really usable, at least I've learned something new about SQL.

Update:
More about getting random samples from a table here, where it explains why ORDER BY NEWID() is not the way to do it and gives hints of what really happens in the background when we invoke TABLESAMPLE.
Another interesting article on the subject, focused more on the statistical probability, can be found here, where it also shows how TABLESAMPLE's cluster sampling may fail in spectacular ways.

Controlling JSON serialization in .Net Core Web API (Serialize enum values as strings, not integers)

Published Oct 29, 2016

Posted in
.NET
programming
Core
C#

and has 16 comments

.Net Core Web API uses Newtonsoft's Json.NET to do JSON serialization and for other cases where you wanted to control Json.NET options you would do something like

JsonConvert.DefaultSettings = (() =>
{
    var settings = new JsonSerializerSettings();
    // do something with settings
    return settings;
});

, but in this case it doesn't work. The way to do it is to use the fluent interface method and hook yourself in the ConfigureServices(IServiceCollection services) method, after the call to .AddMvc(), like this:

services
    .AddMvc()
    .AddJsonOptions(options =>
    {
        var settings=options.SerializerSettings;
        // do something with settings
    });

In my particular case I wanted to serialize enums as strings, not as integers. To do that, you need to use the StringEnumConverter class. For example if you wanted to serialize the Gender property of a person as a string you could have defined the entity like this:

public class Person
{
    public string Name { get; set; }
    [JsonConverter(typeof(StringEnumConverter))]
    public GenderEnum Gender { get; set; }
}

In order to do this globally, add the converter to the settings converter list:

services
    .AddMvc()
    .AddJsonOptions(options =>
    {
        options.SerializerSettings.Converters.Add(new StringEnumConverter {
            CamelCaseText = true
        });
    });

Note that in this case, I also instructed the converter to use camel case. The result of the serialization ends up as:

{"name":"James Carpenter","age":51,"gender":"male"}

Beware LINQ OrderBy in performance sensitive cases

Published Oct 22, 2016

Posted in
.NET
programming
C#

and has 0 comments

I was doing this silly HackerRank algorithm challenge and I got the solution correctly, but it would always time out on test 7. I wracked my brain on all sorts of different ideas but to no avail. I was ready to throw in the towel and check out other people solutions, only they were all in C++ and seemed pretty similar to my own. And then I've made a code change and the test passed. I had replaced LINQ's OrderBy with Array.Sort.

Intrigued, I started investigating. The idea was creating a sorted integer array from a space delimited string of values. I had used Console.ReadLine().Split(' ').Select(s=>int.Parse(s)).OrderBy(v=>v); and it consumed above 7% of the total CPU of the test. Now I was using var arr=Console.ReadLine().Split(' ').Select(s=>int.Parse(s)).ToArray(); Array.Sort(arr); and the CPU usage for that piece of the code was 1.5%. So it was almost five times slower. How do the two implementations differ?

Array.Sort should be simple: an in place quicksort, the best general solution for this sort (heh heh heh) of problem. How about Enumerable.OrderBy? It returns an OrderedEnumerable which internally uses a Buffer<T> to get all the values in a container, then uses an EnumerableSorter to ... quicksort the values. Hmm...

Let's get back to Array.Sort. It's not as straightforward as it seems. First of all it "tries" a SZSort. If it works, fine, return that. This is an external native code implementation of QuickSort on several native value types. (More on that here) Then it goes to a SorterObjectArray that chooses, based on framework target, to use either an IntrospectiveSort or a DepthLimitedQuickSort. Even the implementation of this DepthLimitedQuickSort is much, much more complex than the quicksort used by OrderBy. IntrospectiveSort seems to be the one preferred for the future and is also heavily optimized, but less complex and easier to understand, perhaps. It uses quicksort, heapsort and insertionsort together.

Now, before you go all "OrderBy sucks!", read more about it. This StackOverflow list of answers seems to indicate that in case of strings, at least, the performance is similar. A lot of other interesting things there, as well. OrderBy uses a "stable" QuickSort, meaning that two items that are compared as equal will appear in their original order. Array.Sort does not guarantee that.

Anyway, the performance difference in my particular case seems to come from the native code implementation of the sort for integers, rather than algorithmic improvements, although I don't have the time right now to grab the various implementations and test them properly. However, just from the way the code reads, I would bet the IntrospectiveSort will compare favorably to the simple Quicksort implementation used in OrderBy.

Fragmentation in an SQL table with lots of inserts, updates and deletes

Published Oct 7, 2016

Posted in
database
programming

and has 0 comments

I've met an interesting case today when we needed to manipulate data from tens of thousands of people daily. Assuming we would use table rows for the information, then we get a table in which rows are constantly added, updated and deleted. The issue is with the space allocated in table pages.

SQL works like this: If it needs space it allocates some as a "page" which can contain more records. When you delete records the space is not reclaimed, it remains as is (this is called ghosting). The exception is when all records in a page are deleted, in which case the page is reused as an empty page. When you update a record with more data then it held before (like when you have a variable length column), the page is split, with the rest of the records on the page moved to a new page.

In a heap table (no clustered index) the space inside pages is reused for new records or for updated records that don't fit in their allocated space, however if you use a clustered index, like a primary key, the space is not reused, since there needs to be a correlation between the value of the column and its position in the page. And here lies the problem. You may end up with a lot of pages with very few records in them. A typical page is 8 kilobytes, so a table with a few integers in a record can hold hundreds of records on a single page.

Fragmentation can be within a page, as described above, also called internal, but also external, between pages, when the recycled pages are used for data that is out of order. To get a large swathe of records the disk might be worked hard in order to jump from page to page to get what is logically a continuous blob of data. It is disk input/output that kills a database.

OK, back to our case. A possible solution was to store all the data for a user in a "blob", a VARBINARY column. For reads or changes only the disk space occupied by the blob would be changed, with C# code handling everything. It's what is called trading CPU for IO, which is generally good. However this NoSql-like idea itself smelled badly to me. We are supposed to trust our databases, not work against them. The solution I chose is monitoring index fragmentation and occasionally issuing clustered index rebuilding or reorganizing. I am willing to bet that reading/writing the data equivalent to several pages of table is going to be more expensive than selecting the changes I want to make. Also, rebuilding the index will end up storing all the data per user in the same space anyway.

However, this case made me think. Here is a situation in which the solution might have been (and it was in a similar case implemented by someone else) to micromanage the way the database works. It made me question using a clustered index/primary key on a table.

These articles helped me understand more:

Undo a Perforce changelist

Published Oct 5, 2016

Posted in
programming
software

and has 0 comments

I had this problem with Perforce where I accidentally Reconciled my offline work with all the files in /bin and /obj folders, resulting in a huge 6000+ file changelist. OK, simple one button mistake, surely there must be some one button undoing what I just did. It appears there is not.

In order to fix this I have to follow these steps:

Change the settings of Perforce to show files even in changelists larger than 1000 items (the default value)
Select by hand in the changelist window the files from obj and bin folders and using Revert on them
Revert the few other files that were unwanted in the changelist, like .suo and .user files - note that Revert on added files doesn't delete them, it just unadds them
Create a file with paths to ignore and then use p4 set P4IGNORE=<filename> for future reconcile work

What didn't work was adding a filename or path filter when visualizing the changelist, since that is a changelist filter, not a files filter. It will show you changelists that have files that contain the pattern, but not filter the files inside the changelists themselves.

For reference, the p4ignore file I used looked like this:

p4ignore
bin/Debug
obj/Debug
*.suo
*.user

Note that I also added the p4ignore file itself, although the file was not in any Perforce repository (yet).

"But, Siderite, you should use Git (or whatever source control is the newest fad at the moment)!" Wish that could, my friend, wish that I could.

Switch any site to a dark/light scheme

Published Sep 26, 2016

Posted in
misc
programming

and has 0 comments

Update May 2020: I used this on a web site and the body was white. It may be that the bug in Chrome was solved in the meantime.

Update October 2019: a CSS media feature (prefers-color-scheme) can be used in conjunction with this. A recent development, it's a media query that allows a browser to activate CSS code based on the theme set in the operating system. You set your preference in Windows or MacOS or wherever and then sites that use prefers-color-scheme will take advantage of that. Something like this:

@media (prefers-color-scheme: dark) {
  html,img, video, object, [style*=url] {
    -webkit-filter:invert(100%) !important;
    filter:invert(100%) !important;
  }

  /* this was solving a bug in Chrome that seems to have been fixed
  body {
    background:black;
  }
  */
}

I am a very light sensitive person. Shine a light in my eyes and you limit my productivity immensely. Not to mention it makes me irritable. Therefore I often have the desire to turn cheerful black on white sites to a dark theme, where the colors are reversed. I am sure other people have the same problem so I thought of building a browser extension to enable a switching button between the two.

The first problem is that I need to interrogate all the elements in a page, including the ones that will be created later. The second is that even so I would have problems determining the dominant color of images. But there is something I can use which makes all of this unnecessary: using the invert CSS filter! Since I already use a browser extension that injects my own styles in any site - it's called Stylish and I highly recommend it - all I have to do is apply a filter on the entire site, right?

Wrong! The problem is that when you invert an entire site, all images on the site get inverted, too. That also includes videos and Flash objects. The worst offenders here are the elements that sport a background image that is declared via CSS, since you can't create a CSS selector for them. I am going to present my partial solution and maybe you can help me find a more elegant or more complete one. Here is a general dark theme stylesheet, without the elements that have a background image declared via CSS (it does include those with a background image declared inline, though):

 html,img, video, object, [style*=url] {
    -webkit-filter:invert(100%) !important;
    filter:invert(100%) !important;
  }

 /* this was solving a bug in Chrome that seems to have been fixed
 body {
   background:black;
 }
 */

What it does is invert the entire page (html), then reinvert video, img, and object elements, as well as those with "url" in the style attribute. In Chrome, at least, there seems to be a bug in the sense that the backgrounds of the direct child elements are not inverted, which means body, as the first child, needs to have the background set to black specifically. (this seems to have been solved by May 2020) The hack to invert elements with "url" in style is pretty ugly, too.

What I think of a solution is this:

inject Javascript to enumerate all elements present and future using document.createTreeWalker and Mutation Observers, check if they have a background image and if so, add a class to them
inject the CSS above with an additional rule for the class for the elements with background image

However, this doesn't completely solve the problem. One of the major issues is the inverted colors sometimes look dumb. For example a red background turns cyan, white text on light gray background turns black text on dark gray background, which makes it hard to read. I've tried various other filters, like hue-rotate or contrast, but it doesn't really help. Detecting individual color patterns doesn't really work, as the filter attribute affects an element and all of its children. The CSS above only works because the images are inverted again when the entire page has been inverted.

The good news is that most of the time you may use the CSS above as a template, then add various rules (manually) to fix small issues with colors of backgrounds. Even if I don't package this in an extension, you have the power to create your own themes for various sites. Never again will you be subjects to the tyranny of the happy bright shiny people!

Eye openers from the SQLSaturday meeting

Published Sep 25, 2016

Posted in
database
programming

and has 0 comments

I was glad to attend the 565th SQLSaturday in Bucharest yesterday and, while all presentations were cool, I wanted to share with you some specific points that I found very revealing. Without further ado, here is the list:

SQL execution plans are read from right to left - such a simple thing, but I remember when I was trying to read them from left to right and I didn't get anything. In SQL Server Management 2016 you also get a "live" version, which shows you an execution plan while it's executing. Really useful to see where the blocking operations are.
Manually control your statistics update - execution plans are calculated based on statistics, but the condition for updating the statistics is to have changes in a number of 20% of the rows plus 500 of any table. This default setting is completely arbitrary and may cause a lot of pain. Not only updating the statistics blocks your table (which means more chances that the table will be locked when it is most used), but sometimes the statistics are not useful. One example are reports which may receive a startdate/enddate range or a count or something like that which makes the number of rows affected vary immensely with different parameters. Use OPTION(RECOMPILE) for that.
Look for a difference between estimated and actual rows in a query plan, which leads to tempdb spills, which leads to unwanted IO operations - before a query, an execution plan is created or reused, based on statistics, as I was saying above. Once a plan has been chosen, though, it doesn't change during its execution. Basically what this means is that the structure of the plan remains unchanged between the estimated and actual plan. Also based on the plan, memory is requested and never changed. So if the plan asks for 10KB of memory and you need 1000KB, the rest of 990 will be stored and used from tempdb even if there is enough memory to put them in, since the memory requirements don't change from estimated to actual. The reverse is not much better, since a plan may ask for a lot of memory when it only needs little, thus making everything else on that machine have less available resources.
SQL default settings suck - there was an entire presentation about that, it is useful to think a little about it. So many settings are legacy things that make no sense, like the initial database size, the autogrow size, index fill factors, maxdop (degree of parallelism), parallelism threshold, used memory (ironically, using all of it may be hurtful as it takes it away from other processes which leads to using the swap file), etc.
Look for hard page faults - this counter is much useful than the soft page faults, which are fixable faults. A hard page fault is indicative of unnecessary IO operations, which are orders of magnitude slower than memory use.

There are a lot more things that I want to explore now, since I participated to the event. You may find the files for the presentations in the same place as the full list of talks at SQLSaturday.