The DataRow value setter is slow!

Update 19 February 2016:
I've done the test again, using another computer and .Net 4.6.1. The speed of filling the DataTableReplacement class given at the end of the article, plus copying the data into a DataTable object is 30% faster than using a DataTable directly with BeginLoadData/EndLoadData and 50% faster than using DataTable without the LoadData methods.
Now for the original post:
It was about time I wrote a smashing IT entry. Here is to the obnoxious DataTable object, something about I have written before of bugs and difficulty in handling. Until now I haven't really thought about what kind of performance issues I might face when using it. I mean, yeah, everybody says it is slow, but how slow can it be? Twice as slow? Computers are getting faster and faster, I might not need a personal research into this. I tried to make a DataTable replacement object once and it was not really compatible with anything that needed DataTables so I gave up. But in this article I will show you how a simple piece of code became 7 times faster when taking into account some DataTable issues.
But let's get to the smashing part :) I was using C# to transform the values in a column from a DataTable into columns. Something like this:
Name | Column | Value |
---|---|---|
George | Money | 100 |
George | Age | 31 |
George | Children | 1 |
Jack | Money | 150 |
Jack | Age | 26 |
Jack | Children | 0 |
Jane | Money | 300 |
Jane | Age | 33 |
Jane | Children | 2 |
and it must look like this:
Name | Money | Age | Children |
---|---|---|---|
George | 100 | 31 | 1 |
Jack | 150 | 26 | 0 |
Jane | 300 | 33 | 2 |
I have no idea how to do this in SQL, if you have any advice, please leave a comment.
Update: Here are some links about how to do it in SQL and SSIS:
Give the New PIVOT and UNPIVOT Commands in SQL Server 2005 a Whirl
Using PIVOT and UNPIVOT
Transposing rows and columns in SQL Server Integration Services
Using PIVOT, the SQL query would look like this:
SELECT *
FROM #input
PIVOT (
MAX([Value])
FOR [Column]
IN ([Money],[Age],[Children])
) as pivotTable
Anyway, the solution I had was to create the necessary table in the code behind add a row for each Name and a column for each of the distinct value of Column, then cycle through the rows of the original table and just place the values in the new table. All the values are present and already ordered so I only need to do it using row and column indexes that are easily computed.
The whole operation lasted 36 seconds. There were many rows and columns, you see. Anyway, I profiled the code, using the great JetBrains dotTrace program, and I noticed that 30 seconds from 36 were used by DataRow.set_Item(int, object)! I remembered then that the DataTable object has two BeginLoadData and EndLoadData methods that disable/enable the checks and constraints in the table. I did that and the operation went from 36 to 27 seconds.
Quite an improvement, but the bottleneck was still in the set_Item setter. So, I thought, what will happen if I don't use a DataTable at all. After all, the end result was being bound to a GridView and it, luckily, knows about object collections. But I was too lazy for that, as there was quite a complicated binding code mess waiting for refactoring. So I just used a List of object arrays instead of the DataTable, then I used DataTable.Rows.Add(object[]) from this intermediary list to the DataTable that I originally wanted to obtain. The time spent on the operation went from... no, wait
The time spent on the operation went from the 27 seconds I had obtained to 5! 5 seconds! Instead of 225.351 calls to DataRow.set_Item, I had 1533 calls to DataRowCollection.Add, from 21 seconds to 175 miliseconds!
Researching the reflected source of System.Data.dll I noticed that the DataRow indexer with an integer index was going through
DataColumn column=_columns[index]; return this[column];
How bad can it get?! I mean, really! There are sites that recommend you find the integer index of table columns and then use them as integer variables. Apparently this is NOT the best practice. Best is to use the DataColumn directly!
So avoid the DataRow setter.
Update July 18, 2013:
Someone requested code, so here is a console application with some inline classes to replace the DataTable in GridView situations:
class Program
{
static void Main(string[] args)
{
fillDataTable(false);
fillDataTable(true);
fillDataTableWriter();
Console.ReadKey();
}
private static void fillDataTable(bool loadData)
{
var dt = new DataTable();
dt.Columns.Add("cInt", typeof(int));
dt.Columns.Add("cString", typeof(string));
dt.Columns.Add("cBool", typeof(bool));
dt.Columns.Add("cDateTime", typeof(DateTime));
if (loadData) dt.BeginLoadData();
for (var i = 0; i < 100000; i++)
{
dt.Rows.Add(dt.NewRow());
}
var now = DateTime.Now;
for (var i = 0; i < 100000; i++)
{
dt.Rows[i]["cInt"] = 1;
dt.Rows[i]["cString"] = "Some string";
dt.Rows[i]["cBool"] = true;
dt.Rows[i]["cDateTime"] = now;
}
if (loadData) dt.EndLoadData();
Console.WriteLine("Filling DataTable"+(loadData?" with BeginLoadData/EndLoadData":"")+": "+(DateTime.Now - now).TotalMilliseconds);
}
private static void fillDataTableWriter()
{
var dt = new DataTableReplacement();
dt.Columns.Add("cInt", typeof(int));
dt.Columns.Add("cString", typeof(string));
dt.Columns.Add("cBool", typeof(bool));
dt.Columns.Add("cDateTime", typeof(DateTime));
for (var i = 0; i < 100000; i++)
{
dt.Rows.Add(dt.NewRow());
}
var now = DateTime.Now;
for (var i = 0; i < 100000; i++)
{
dt.Rows[i]["cInt"] = 1;
dt.Rows[i]["cString"] = "Some string";
dt.Rows[i]["cBool"] = true;
dt.Rows[i]["cDateTime"] = now;
}
var fillingTime = (DateTime.Now - now).TotalMilliseconds;
Console.WriteLine("Filling DataTableReplacement: "+fillingTime);
now = DateTime.Now;
var newDataTable = dt.ToDataTable();
var translatingTime = (DateTime.Now - now).TotalMilliseconds;
Console.WriteLine("Transforming DataTableReplacement to DataTable: " + translatingTime);
Console.WriteLine("Total filling and transforming: " + (fillingTime+translatingTime));
}
}
public class DataTableReplacement : IEnumerable<IEnumerable<object>>
{
public DataTableReplacement()
{
_columns = new DtrColumnCollection();
_rows = new DtrRowCollection();
}
private readonly DtrColumnCollection _columns;
private readonly DtrRowCollection _rows;
public DtrColumnCollection Columns
{
get { return _columns; }
}
public DtrRowCollection Rows { get { return _rows; } }
public DtrRow NewRow()
{
return new DtrRow(this);
}
public DataTable ToDataTable()
{
var dt = new DataTable();
dt.BeginLoadData();
_columns.CreateColumns(dt);
_rows.CreateRows(dt);
dt.EndLoadData();
return dt;
}
#region Implementation of IEnumerable
public IEnumerator<IEnumerable<object>> GetEnumerator()
{
foreach (var row in _rows)
{
yield return row.ToArray();
}
}
IEnumerator IEnumerable.GetEnumerator()
{
return GetEnumerator();
}
#endregion
}
public class DtrRowCollection : IEnumerable<DtrRow>
{
private readonly List<DtrRow> _rows;
public DtrRowCollection()
{
_rows = new List<DtrRow>();
}
public void Add(DtrRow newRow)
{
_rows.Add(newRow);
}
public DtrRow this[int i]
{
get { return _rows[i]; }
}
public void CreateRows(DataTable dt)
{
foreach (var dtrRow in _rows)
{
dt.Rows.Add(dtrRow.ToArray());
}
}
#region Implementation of IEnumerable
public IEnumerator<DtrRow> GetEnumerator()
{
return _rows.GetEnumerator();
}
IEnumerator IEnumerable.GetEnumerator()
{
return GetEnumerator();
}
#endregion
}
public class DtrRow
{
private readonly object[] _arr;
private readonly DataTableReplacement _dtr;
public DtrRow(DataTableReplacement dtr)
{
_dtr = dtr;
var columnCount = _dtr.Columns.Count;
_arr = new object[columnCount];
}
public object this[string columnName]
{
get
{
var index = _dtr.Columns.GetIndex(columnName);
return _arr[index];
}
set
{
var index = _dtr.Columns.GetIndex(columnName);
_arr[index] = value;
}
}
public object this[int columnIndex]
{
get
{
return _arr[columnIndex];
}
set
{
_arr[columnIndex] = value;
}
}
public object[] ToArray()
{
return _arr;
}
}
public class DtrColumnCollection
{
private readonly Dictionary<string, int> _columnIndexes;
private readonly Dictionary<string, Type> _columnTypes;
public DtrColumnCollection()
{
_columnIndexes = new Dictionary<string, int>();
_columnTypes = new Dictionary<string, Type>();
}
public int Count { get { return _columnIndexes.Count; } }
public void Add(string columnName, Type columnType)
{
var index = _columnIndexes.Count;
_columnIndexes.Add(columnName, index);
_columnTypes.Add(columnName, columnType);
}
public int GetIndex(string columnName)
{
return _columnIndexes[columnName];
}
public void CreateColumns(DataTable dt)
{
foreach (var pair in _columnTypes)
{
dt.Columns.Add(pair.Key, pair.Value);
}
}
}
As you can see, there is a DataTableReplacement class which uses three other classes instead of DataColumnCollection, DataRowCollection and DataRow. For this example alone, the DtrRowCollection could have been easily replaced with a List<DtrRow>, but I wanted to allow people to replace DataTable wherever they had written code without any change to the use code.
In the example above, on my computer, it takes 1300 milliseconds to populate the DataTable the old fashioned way, 1000 to populate it with BeginLoadData/EndLoadData, 110 seconds to populate the DataTableReplacement. It takes another 920 seconds to create a new DataTable with the same data (just in case you really need a DataTable), which brings the total time to 1030. So this is the overhead the DataTable brings for simple scenarios such as these.