On a recent project I was told by a colleague about a certain SQL query generated by entity framework, that was ridiculously out of hand. Entity Framework allows you to pretty easily create a simple Data Access to the Table Per (Sub) Type pattern.
What this means is that you may have an inheritance of both a Student and an Instructor, derived from a Person, and query to retrieve a strongly typed object. So here’s where performance & optimization comes in. There’s a couple of ways to query against this data model.
Method 1: Implicit Typing
var query = from p in Persons
where p.PersonID.Equals(_personID)
select p;
Method 2: Explicit Typing
var query = from p in Persons.OfType<Instructor>()
where p.PersonID.Equals(_personID)
select p;
They seem pretty similar, however there’s quite a significant difference in what gets generated. By using method 2, you wind up letting Entity Framework know exactly what table it’s querying against. Which means your SQL code looks something like this:
SELECT
PersonID,
Column1,
Column2,
Column3
FROM
Instructor
WHERE
PersonID = @PersonID
However when you don’t specify the type, Entity Framework constructs a SQL query intended to make SQL go and figure it out (keep in mind that there’s no automatic discriminator column – it figures out type based off of the primary key – the ID column. More on this in a minute). The generated code looks something like this:
SELECT
PersonID,
Column1 as [0x01],
Column2 as [0x02],
CASE WHEN [1x01] IS NOT NULL THEN CAST(INT, [1x01])
... many more case, casts, for every column in every table ...
FROM
Person
UNION ALL SELECT
PersonID as [1x01],
Column1 as [1x02],
Column2 as [1x03]
UNION ALL SELECT
... many more unions for every table ...
WHERE
PersonID = @PersonID
Now you can see that this query gets very complex as a product of:
a) the number of subtypes
b) the number of columns for each type
The query generated gets the Cartesian product of all columns, and looks for the one where the key isn’t null – that’s the “winner” subtype. I imagine (haven’t yet tried this) that having a nested subtype involved here (like BusinessStudent in the linked example above) would cause an even more ugly nesting of the union within another union statement.
Now back to the point of this article – performance. How bad is what we see above? In an empirical example, thanks to JetBrains’ Dottrace and nunit tests I observed averages of:
Method 1: 126ms for the query to run
Method 2: 25ms for the query to run
I had then discovered the benefits of precompiling Entity Framework view code to optimize the SQL generation. This bought me roughly 26% gain in performance for the specific empirical examples.
Method 1: 100ms
Method 2: 20ms
Now we have roughly 80ms to play with – if the code to get from Method 1 to Method 2 (we don’t know the type that we’re retrieving, however we want the optimized query of Method 2) is more than 80ms, then the performance “fix” will be worse than the problem.
So far, given the constraint of EF (for now), and the Table Per (Sub) Type pattern, the only solution that comes to mind is reflection – this would involve a stored type as a discriminator column of sorts, then reflecting on that type, and calling the generic Person.OfType<T>() method via reflection. This costs us an extra query and reflection – neither of which are cheap. A separate empirical example (not the same code as the first) brings the total cost to ~350ms, a net performance loss of 250ms.
Method 1’s performance would have to degrade (through additional columns/subtypes) by ~250ms more in order to justify rolling a custom discriminator and reflecting to grab the subtype.
This was a pretty interesting exercise in when not to make performance optimizations that you know will need to be done long-term.