Intro

  So the requirement is "Write a class that would export data into a CSV format". This would be different from "Write a CSV parser", which I think could be interesting, but not as wildly complex as this. The difference comes from the fact that a CSV parser brings a number of problems for the interviewed person to think about right away, but then it quickly dries up as a source for intelligent debate. A CSV exporter seems much simpler, because the developer controls the output, but it increases in complexity as the interview progresses.

  This post is written from the viewpoint of the interviewer.

Setup

  First of all, you start with the most basic question: Do you know what CSV is? I was going to try this question out on a guy who came interviewing for senior developer and I was excited to see how it would go. He answered he didn't know what CSV was. Bummer! I was incredulous, but then I quickly found out he didn't know much else either. CSV is a text format for exporting a number of unidimensional records. The name comes from Comma Separated Values and might at first glance appear to be a tabular data format, an idea made even more credible by Excel being able to open and export .csv files. But it is not. As the name says, it has values separated by a comma. It might even be just one record. It might be containing multiple records of different types. In some cases, the separator for value and record are not even commas or newline.

  It is important to see how the interviewee explains what CSV is, because it is a concept that looks deceivingly simple. Someone who first considers the complexity of the format before starting writing the code works very differently in a team than someone who throws themselves into the code, confident (or just unrealistically optimistic) that they would solve any problem down the line.

  Some ideas to explore, although it pays off to not bring them up yourself:

  • What data do you need to export: arrays, data tables, list of records?
  • Are the records of the same type?
  • Are there restrictions on the type of record?
  • What separators will there be used? How to escape values that contain chosen separators?
  • Do values have restrictions, like not containing separators?
  • CSV header: do we support that? What does it mean in the context of different types of input? 
  • Text encoding, unicode, non-ASCII characters
  • How to handle null values?
  • Number and date formatting
  • Is there an RFC or a specification document for the CSV export format?

Implementation

  In this particular interview I have chosen that the CSV exporter class will only support an input of IEnumerable<T> (this is .NET speak for a bunch of objects of the same type).

  Give ample opportunities for questions from the person interviewed. This is not a speed test. It is important if the candidate considers by themselves issues like:

  • are the object properties simple types? Like string, long, integer, decimal, double, float, datetime?
  • since the requirement is any T, what about objects that are arrays, or self referencing, or having complex objects as properties?
  • how to extract the values of any object (discussions about .NET reflection or Javascript object property discovery show a lot about the interviewee, especially if they start said discussions)

  Go through the code with the candidate. This shows their ability to develop software. How will they name the class, what signature will they use for export method, how they structure the code and how readable it is.

  At this stage you should have a pretty good idea if the candidate is intelligent, competent and how they handle a complex problem from requirement to implementation.

Dig deeper

  This is the time to ask the questions yourself and see how they react to new information, the knowledge that they should have asked themselves the same questions and the stress of changing their design:

  • are comma and newline the only supported separators?
  • are separators characters or strings?
  • what if an exported value is a string containing a comma?
  • do you support values containing newline?
  • if you use quotes to hold a value containing commas and newlines, what happens if values contain quotes
  • empty or null values. Any difference? How to export them? What if the object itself is null?
  • how to handle the header of the CSV, where do you get the name of the properties?
  • what if the record type is an array or IEnumerable?
  • what will be the numeric and date formatting used for export?
  • does the candidate know what text encoding is? Which one will they use and why?

  How have the answers to these questions changed the design? Did the candidate redesign the work or held tight to the original idea and tried to fix everything as it comes?

  At this point you should know how the person being interviewed responds to new information, even scope creep and, maybe most importantly, to stress. But we're not done, are we?

Bring the pain

  Bring up the concept of unit testing. If you are lucky, the candidate already brought it up. Either way, now it is time to:

  • split the code into components: the reflection code, the export code, the file system code (if any).
  • abstract components into interfaces in order to mock them in unit tests
  • collect all the specifications gathered so far in order to cover all the cases
  • ask the candidate to write one unit test

Conclusion

  A seemingly simple question will take you and the interview candidate through:

  • finding out how the other person thinks
  • specification gathering
  • code design
  • implementation
  • technical knowledge in a multitude of directions
  • development process
  • separation of concerns, unit testing
  • human interaction in a variety of circumstances
  • determining how the candidate would fit in a team

Not bad for a one line question, right?

  I am going to do this in Javascript, because it's easier to write and easier for you to test (just press F12 in this page and write it in the console), but it applies to any programming language. The issue arises when you want to execute a background job (a setTimeout, an async method that you do not wait, a Task.Run, anything that runs on another execution path than the current one) inside a for loop. You have the index variable (from 0 to 10, for example) and you want to use it as a parameter for the background job. The result is not as expected, as all the background jobs use the same value for some reason.

  Let's see a bit of code:

// just write in the console numbers from 0 to 9
// but delayed for a second
for (var i=0; i<10; i++)
{
  setTimeout(function() { console.log(i); },1000);
}

// the result: 10 values of 10 after a second!

But why? The reason is the "scope" of the i variable. In this case, classic (EcmaScript 5) code that uses var generates a value that exists everywhere in the current scope, which for ES5 is defined as the function this code is running from or the global scope if executed directly. If after this loop we write console.log(i) we get 10, because the loop has incremented i, got it to 10 - which is not less than 10, and exited the loop. The variable is still available. That explains why, a second later, all the functions executed in setTimeout will display 10: that is the current value of the same variable.

Now, we can solve this by introducing a local scope inside the for. In ES5 it looked really cumbersome:

for (var i=0; i<10; i++)
{
  (function() {
    var li = i;
    setTimeout(function() { console.log(li); },1000);
  })();
}

The result is the expected one, values from 0 to 9.

What happened here? We added an anonymous function and executed it. This generates a function scope. Inside it, we added a li variable (local i) and then set executing the timeout using that variable. For each value from 0 to 9, another scope is created with another li variable. If after this code we write console.log(li) we get an error because li is undefined in this scope. It is a bit confusing, but there are 10 li variables in 10 different scopes.

Now, EcmaScript 6 wanted to align Javascript with other common use modern languages so they introduced local scope for variables by defining them differently. We can now use let and const to define variables that are either going to get modified or remain constant. They also exist only in the scope of an execution block (between curly brackets).

We can write the same code from above like this:

for (let i=0; i<10; i++) {
  const li = i;
  setTimeout(()=>console.log(li),1000);
}

In fact, this is more complex than it has to be, but that's because another Javascript quirk. We can simplify this for the same result as:

for (let i=0; i<10; i++) {
  setTimeout(()=>console.log(i),1000);
}

Why? Because we "let" the index variable, so it only exists in the context of the loop execution block, but apparently it creates one version of the variable for each loop run. Strangely enough, though, it doesn't work if we define it as "const".

Just as an aside, this is way less confusing with for...of loops because you can declare the item as const. Don't use "var", though, or you get the same problem we started with!

const arr=[1,2,3,4,5];
for (const item of arr) setTimeout(()=>console.log(item),1000);

In other languages, like C#, variables exist in the scope of their execution block by default, but using a for loop will not generate multiple versions of the same variable, so you need to define a local variable inside the loop to avoid this issue. Here is an example in C#:

for (var i=0; i<10; i++)
{
    var li = i;
    Task.Run(() => Console.WriteLine(li));
}
Thread.Sleep(1000);

Note that in the case above we added a Thread.Sleep to make sure the app doesn't close while the tasks are running and that the values of the loop will not necessarily be written in order, but that's beside the point here. Also, var is the way variables are defined in C# when the type can be inferred by the compiler, it is not the same as the one in Javascript.

I hope you now have a better understanding of variable scope.

Intro

  This is part of a series that I plan to build on as time goes on: technical interview questions, dissected and laid bare for both interviewers and interviewees. You can also check out the previous one: Interview question: all items in table A but not in B.

  This question is a little bit more complex and abstract at the same time. The post is written more for interviewers this time, because as candidates go, you need to read the links in it if you didn't know the concepts in it. This also is not a question with a single correct answer. It comes after asking about Dependency Injection as a whole and the candidate answering correctly.

  I expect senior developers to be able to go through this successfully, it is not a test for junior developers, although depending on their previous experience juniors might be able to go through it and seniors be force to reason through it.

The test

Bonus introduction question: why use DI at all? Expected answers would be separation of concerns and testability. 

  The question has two steps.

Step 1: given the following code in a legacy application, improve it to use Dependency Injection:

public SomeClass {
  public List<Item> GetItems(int days, string filter) {
    var service = new ItemService();
    return service.GetItems()
      .Where(i => i.Time >= DateTime.Now.AddDays(-days));
  }
}

Bonus questions:

  • has the candidate worked with LINQ before?
  • what does the code do?

Now, this question is about programming knowledge as it is for attention. There are three irregularities that can attract that attention:

  • the most obvious: the service is being instantiated by calling the constructor
    • the interviewer expects at the very least for the candidate to notice it and suggest moving the instantiation of the service in the constructor of the SomeClass class and inject it instead of using new
    • there is the possibility of passing the service as a parameter, as well, but suggest that the signature of the method should remain the same to get around it. Anyway, one can discuss the idea of moving all dependencies to the constructor and/or the calling methods and get insight in the way the candidate is thinking.
  • the unexplained string filter in the signature of the method
    • the interviewer can either tell the candidate that it will become relevant later, because it will, or that this is a method that implements an interface, to which a more snarky candidate might reply that SomeClass implements nothing (bonus for attention)
  • the use of DateTime.Now
    • it is a static property that gives a different output every time so it should be taken into account for Dependency Injection or at least for unit testing

By now you have filtered out the majority of failing candidates and you are left with someone who used or at least understands DI, can read and understand code, has used or at least understood basic LINQ and you have also gauged their level of attention to detail.

If the candidate only talked about the service and they decided to create an interface for ItemService and then add it as a parameter for the constructor of SomeClass, ask them to write a unit test for the method, explain to them that testability is one of the goals of DI if you didn't cover this already

  • bonus: see if they do unit testing or at least understand the concept
  • if they do attempt to write the unit test, ask them what would happen if you would run the test in different days

The expected result of this part is that the candidate understands the need of abstracting DateTime.Now. It is interesting to note how they intend to abstract it, since they do not have access to the code and it is a static method/property to abstract.

Whether the candidate figured it out by themselves or it was explained to them, the expected answer is that DateTime.Now is abstracted by creating an IDateTimeService interface that is implemented as a wrapper over DateTime.

At this point the code should look like this:

public SomeClass {
  private IItemService _itemService;
  private IDateTimeService _dateTimeService;

  public SomeClass(IItemService itemService, IDateTimeService dateTimeService) {
    _itemService = itemService;
    _dateTimeService = dateTimeService;
  }

  public List<Item> GetItems(int days, string filter) {
    return _itemService.GetItems()
      .Where(i => i.Time >= _dateTimeService.Now.AddDays(-days));
  }
}

Also, the candidate should be asked to write a unit test, just to see they know how, for bonus points. Note if the candidate understands isolation for unit testing or does something that would work but be silly like generate the test data based on current date or duplicate the code logic in the test instead of working with static data.

Step 2: tell the candidate that the legacy code they need to fix looks a bit different:

public SomeClass {
  public List<Item> GetItems(int days, string filter) {
    var service = new ItemService(filter);
    return service.GetItems()
      .Where(i => i.Time >= DateTime.Now.AddDays(-days));
  }
}

The ItemService now receives the filter as the parameter. Ask them what to do in this case.

The expected answer is a factory injected instead of the service, which will then be used to instantiate an IItemService with a parameter. Bonus discussion about software patterns can be inserted here.

There are other valid answers here, like using the DI container itself as a factory for the service, which might provoke interesting discussions in itself, like weighing constructor injection versus service provider in dependency injection and whether hybrid solutions might be better.

Bonus question: what if you cannot control the code of ItemService in step 1 and it does not implement an interface or a base class?

  • warning, this might give a hint for the second part of the interview, so use it at the end 
  • correct answer 1: use the class as the type of the parameter and let the dependency container decide how to instantiate it
  • correct answer 2: use a wrapper over the class that implements the interface and proxies to the instance methods.

Conclusion

For me this test told me a lot more about the candidate than just their dependency injection knowledge. We got to talking, I became aware of how their minds worked and I was both pleasantly surprised when they came with alternate solutions that kind of worked and a bit irked that they went that far and didn't see the superior option. Most of the time this made me think about the differences between what I would answer and what they did and this resulted in interesting discussions that enriched not only their experience, but also mine.

Dependency injection, separation of concerns and unit testing are important concepts for any modern developer. I hope this helps devs evolve and interviewers find the best candidates... at least until all of them get to read my blog.

As with all the programmer questions, I will update the post with the answer after people comment on this. Today's question is:

Here is a question for programmers. I will wait for your comments before answering.

A blog reader asked me to help him get rid of the ugly effect of a large background image getting loaded. I thought of several solutions, all more complicated than the rest, but in the end settled on one that seems to be working well and doesn't require complicated libraries or difficult implementation: using the img onload event.

Let's assume that the background image is on the body element of the page. The solution involves setting a style on the body to hide it (style="display:none") then adding as child of the body an image that also is hidden and that, when completing loading, shows the body element. Here is the initial code:
<style>
body {
background: url(bg.jpg) no-repeat center center fixed;
}
</style>
<body>

And after:

<style>
body {
background: url(bg.jpg) no-repeat center center fixed;
}
</style>
<body style="display:none">
<img src="bg.jpg" onload="document.body.style.display=''" style="display:none;" />

This loads the image in a hidden img element and shows the body element when the image finished loading.

The solution might have some problems with Internet Explorer 9, as it seems the load event is not fired for images retrieved from the cache. In that case, a slightly more complex Javascript solution is needed as detailed in this blog post: How to Fix the IE9 Image Onload Bug. Also, in Internet Explorer 5-7 the load event fires for animated GIFs at every loop. I am sure you know it's a bad idea to have an animated GIF as a page background, though :)

Warning: While this hides the effect of slow loading background images, it also hides the page until the image is loaded. This makes the page appear blank until then. More complex solutions would show some simple html content while the page is loading rather than hiding the entire page, but this post is about the simplest solution for the question asked.

A more comprehensive analysis of image preloading, complete with a very nice Javascript code that covers a lot of cases, can be found at Preloading images using javascript, the right way and without frameworks

I am starting a new blog series called Blog Question, due to the successful incorporation of a blog chat that works, is free, and does exactly what it should do: Chatango. All except letting me know in real time when a question has been posted on the chat :( . Because of that, many times I don't realize that someone is asking me things and so I fail to answer. As a solution, I will try to answer questions in blog posts, after I do my research. The new label associated with these posts is 'question'.

First off, some assumptions. I will assume that the person who said I'm working on this project of address validation. Using crf models is my concern. was talking about Conditional Random Fields and he meant postal addresses. If you are reading this, please let me know if that is correct. Also, since I am .NET developer, I will use concepts related to .NET.

I knew nothing about CRFs before writing this posts, so bear with me. The Wikipedia article about them is hard to understand by anyone without mathematical (specifically probabilities and statistics) training. However the first paragraph is pretty clear: Conditional random fields (CRFs) are a class of statistical modelling method often applied in pattern recognition and machine learning, where they are used for structured prediction. Whereas an ordinary classifier predicts a label for a single sample without regard to "neighboring" samples, a CRF can take context into account. It involves a process that classifies data by taking into account neighboring samples.

A blog post that clarified the concept much better was Introduction to Conditional Random Fields. It describes how one uses so called feature functions to extract a score from a data sample, then aggregates scores using weights. It also explains how those weights can be automatically computed (machine learning).

In the context of postal address parsing, one would create an interface for feature functions, implement a few of them based on domain specific knowledge, like "if it's an English or American address, the word before St. is a street name", then compute the weighting of the features by training the system using a manually tagged series of addresses. I guess the feature functions can ignore the neighboring words and also do stuff like "If this regular expression matches the address, then this fragment is a street name".

I find this concept really interesting (thanks for pointing it out to me) since it successfully combines feature extraction as defined by an expert and machine learning. Actually, the expert part is not that relevant, since the automated weighing will just give a score close to 0 to all stupid or buggy feature functions.

Of course, one doesn't need to do it from scratch, other people have done it in the past. One blog post that discusses this and also uses more probabilistic methods specifically to postal addresses can be found here: Probabilistic Postal Address Elementalization. From Hidden Markov Models, Maximum-Entropy Markov Models, Transformation-Based Learning and Conditional Random Fields, she found that the Maximum-Entropy Markov model and the Conditional Random Field taggers consistently had the highest overall accuracy of the group. Both consistently had accuracies over 98%, even on partial addresses. The code for this is public at GitHub, but it's written in Java.

When looking around for this post, I found a lot of references to a specific software called the Stanford Named Entity Recognizer, also written in Java, but which has a .NET port. I haven't used the software, but it seems as it is a very thorough implementation of a Named Entity Recognizer. Named Entity Recognition (NER) labels sequences of words in a text which are the names of things, such as person and company names, or gene and protein names. It comes with well-engineered feature extractors for Named Entity Recognition, and many options for defining feature extractors. Included with the download are good named entity recognizers for English, particularly for the 3 classes (PERSON, ORGANIZATION, LOCATION). Perhaps this would also come in handy.

This is as far as I am willing to go without discussing existing code or actually writing some. For more details, contact me and we can work on it.

More random stuff:
The primary advantage of CRFs over hidden Markov models is their conditional nature, resulting in the relaxation of the independence assumptions required by HMMs in order to ensure tractable inference. Additionally, CRFs avoid the label bias problem, a weakness exhibited by maximum entropy Markov models (MEMMs) and other conditional Markov models based on directed graphical models. CRFs outperform both MEMMs and HMMs on a number of real-world sequence labeling tasks. - from Conditional Random Fields: An Introduction

Tutorial on Conditional Random Fields for Sequence Prediction

CRFsuite - Documentation

Extracting named entities in C# using the Stanford NLP Parser

Tutorial: Conditional Random Field (CRF)