IT Wisdom Misunderstood: ‘SELECT *’ Is Bad!

I was reminiscing over my last two main roles (which together account for 10 years!) and reminded myself of some of the challenges I faced in that time.  In two separate businesses I found coding styles and system-design scenarios that seemed to have been born from some nugget of wisdom which, to my mind, had been misunderstood and had led to code with huge problems.  As ever, I do not want to pretend that the lessons I learnt from these scenarios can always be applied; I imagine that in general my conclusion will be something like ‘your mileage may vary’… but I hope I can at least demonstrate that some harm came about from these misapprehensions in the given scenarios.

Today I want to discuss the idea that ‘SELECT *’ is bad; what I really mean in this context is the idea that ‘selecting all the data from a record in a database table is wasteful’… with a possible additional proviso: ‘…if you don’t need it all’. To put that into a wider context, the ideal some people have in mind is that we should only retrieve data from a database if we really need to use it.  It is probably also worth mentioning that the environment I am referring to was one in which only very simple Object Relational Mapping (ORM) was in use, meaning that selection of records was generally hand-written and mostly referred to a single table at a time; these were not companies with huge volumes of database use.

I believe that this view leaves out several key factors, which will hopefully be obvious once described:

  1. What is the scope of ‘if we need to use it’?  I have seen code where the thinking was limited to ‘this block of code’ or perhaps ‘this method’.  I have seen much code that requested different columns of data from the same database record within the same method… and even within the same call stack; for example, method A calls method B, and both select different columns from the same record of the source database!
    I believe the better way to think about this is across the complete business process; that might be processing a message from a device from end to end, or detecting that a record is in some specific state and reacting to it appropriately (also end to end).  Here ‘end-to-end’ may not mean every single bit of processing, but ‘as much as is possible to do with the information we currently have, until the next save-point’;
  2. The simplistic nugget of wisdom entirely leaves out the cost of the round-trip communication with the database.  My assertion is that this round-trip cost is non-trivial, even with the database server on the same LAN as the server doing the processing… and even if the database already has the data cached.  I also believe that in many cases the cost of the round-trip is high enough that it far exceeds the cost of handling a few data values you happen not to use in the calling code.
    To prove the point with a simplistic example, suppose my processing of some data uses 10 columns from a table in all circumstances, and an extra 2 column values (say an Int and a DateTime) in 10% of cases.  My strong belief is that it makes sense to retrieve all 12 columns all the time, even though some of that data won’t always be used.  Consider the overhead of running two queries across the network: even with the data already cached by SQL Server, the second query requires its own parsing, network overhead, and so on, compared to the cost of a few extra bytes being returned by the first query (see the first sketch after this list);
  3. Simple code is easier to understand than complex code… and all else being equal, developers should prefer simplicity unless optimisation turns out to be necessary (but see Assumed Optimisation as one cautionary tale).  Having many different queries selecting different columns of data from a table will tend to lead either to duplication of the parsing code, or to incomplete population of a shared object (which gets hugely confusing as different columns only have meaning after different selection queries have been run), or a mixture of both.
    One of the wonderful benefits of ORMs, be they ‘active record’ types or POCO-based, is that they closely match a database table, which should represent a logically useful collection of information!  I have enjoyed tools like Dapper.Net, Gentle.Net and even hand-crafted SQL mapper functions which allow development of lots of different queries with different selection criteria, but which share the processing to load individual records into the in-memory objects – in short, they all expect the same columns of data to have been queried every time (see the second sketch after this list);
  4. Similar thinking can be applied to the idea of making changes (especially UPDATEs) to database records, and can similarly lead to the creation of huge numbers of methods which update one or a very few columns at a time.  The thinking seems to go something like this: “If selecting data I don’t need is wasteful, then updating data I have not changed must be wasteful too.  It would be really complicated to write UPDATE statements that touched only the changed column values (as I would have to track which values were changing), so I will write really restricted UPDATE statements.”  The result, of course, is that you end up with many different update calls updating different columns in different scenarios, and in some cases a single update to a record is carried out over several different method calls.
    Another problem with this approach is that it makes it hard to identify ‘when’ something happened, and it will probably never be tied in with any kind of transaction management.  Indeed, it might be argued that this alone is the biggest ‘cost’ of the misapplied wisdom; small updates lead to processes applying database changes over many lines of code in many methods, and thus it becomes very difficult to isolate and protect such processing with a transaction.  I believe that batch processing should be able to follow steps that i) select most or all relevant data, ii) process it using in-memory classes and structs representing that data, and finally iii) apply the necessary changes to the database with proper transactional protection, so that all of the changes are made or none (see the third sketch after this list).
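
To make the round-trip point concrete, here is a minimal sketch of the contrast, using Dapper against a hypothetical Payment table (the class, column names and helper methods are invented for illustration):

```csharp
using System;
using System.Data.SqlClient;
using Dapper;

// The anti-pattern: two round-trips to the same row, each selecting only the
// columns the immediate caller happens to need.
static class NarrowQueries
{
    public static decimal GetAmountDue(SqlConnection conn, int paymentId) =>
        conn.QuerySingle<decimal>(
            "SELECT AmountDue FROM Payment WHERE PaymentId = @paymentId", new { paymentId });

    public static DateTime? GetLastAttemptDate(SqlConnection conn, int paymentId) =>
        conn.QuerySingleOrDefault<DateTime?>(
            "SELECT LastAttemptDate FROM Payment WHERE PaymentId = @paymentId", new { paymentId });
}

// The alternative: one round-trip returning every column the business process
// may need, even though some values are only used some of the time.
class Payment
{
    public int PaymentId { get; set; }
    public decimal AmountDue { get; set; }
    public DateTime? LastAttemptDate { get; set; }
    public int RetryCount { get; set; }
    // ...the remaining columns of the Payment table...
}

static class WideQueries
{
    public static Payment GetPayment(SqlConnection conn, int paymentId) =>
        conn.QuerySingle<Payment>(
            "SELECT * FROM Payment WHERE PaymentId = @paymentId", new { paymentId });
}
```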
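
The shared-mapping benefit from point 3 looks something like this: many different selection criteria, one POCO, one mapping (again just a sketch; the Customer class and queries are invented):

```csharp
using System.Collections.Generic;
using System.Data;
using Dapper;

class Customer
{
    public int CustomerId { get; set; }
    public string Name { get; set; }
    public string Email { get; set; }
    public bool IsActive { get; set; }
}

// Every query selects the same full set of columns, so Dapper can populate the
// same POCO regardless of which selection criteria were used.
static class CustomerQueries
{
    const string Columns = "CustomerId, Name, Email, IsActive";

    public static Customer ById(IDbConnection conn, int id) =>
        conn.QuerySingleOrDefault<Customer>(
            $"SELECT {Columns} FROM Customer WHERE CustomerId = @id", new { id });

    public static IEnumerable<Customer> ActiveByEmailDomain(IDbConnection conn, string domain) =>
        conn.Query<Customer>(
            $"SELECT {Columns} FROM Customer WHERE IsActive = 1 AND Email LIKE @pattern",
            new { pattern = "%@" + domain });
}
```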
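
And a sketch of the select / process / save shape from point 4, with the write protected by a transaction (the Account class and column names are hypothetical):

```csharp
using System;
using System.Data.SqlClient;
using Dapper;

class Account
{
    public int AccountId { get; set; }
    public decimal Balance { get; set; }
    public decimal InterestRate { get; set; }
    public DateTime? LastInterestDate { get; set; }
}

static class InterestBatch
{
    public static void ApplyInterest(string connectionString, int accountId)
    {
        using var conn = new SqlConnection(connectionString);
        conn.Open();

        // i) Select all the data the process needs, once.
        var account = conn.QuerySingle<Account>(
            "SELECT * FROM Account WHERE AccountId = @accountId", new { accountId });

        // ii) Process entirely in memory.
        account.Balance += account.Balance * account.InterestRate;
        account.LastInterestDate = DateTime.UtcNow;

        // iii) Apply all the changes in one statement inside a transaction,
        //      rather than dribbling out per-column UPDATEs across many methods.
        using var tx = conn.BeginTransaction();
        conn.Execute(
            @"UPDATE Account
                 SET Balance = @Balance, LastInterestDate = @LastInterestDate
               WHERE AccountId = @AccountId",
            account, tx);
        tx.Commit();
    }
}
```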

In my most recent role, I was able to remove literally hundreds of stored procedures and hand-crafted ‘mapper’ functions and replace them with far simpler mapper functions, or in some cases rely on Dapper.Net to do the mapping automatically for me.  While noting the warnings below, this has not presented any obvious issues to me so far!

When is ‘SELECT *’ Still Unwise?

Here are some points where I think use of ‘SELECT *’ is probably still problematic or worth more thought:

  • Where a table includes a large column type (especially a BLOB type, but I would also be cautious of even a VARCHAR field if its average size is in the hundreds of bytes)… though again I would probably err on the side of simplicity unless I had a strong reason not to;
  • Remember too that SQL Server internally stores larger objects in a separate structure from the main table they belong to; that means that if you ‘SELECT *’ from that single table it will internally join to this hidden structure, which can massively impact performance.  Honestly, I’ve never done that much work with BLOB types, but this is a strong reason to consider physically specifying that such data be stored in a different table.  For example, instead of a Product table having an Image column, I might be inclined to have a ProductImage table to hold those blobs – that naturally allows many images per Product if you like, and should also make querying the core Product details (name etc.) much quicker, since you will likely only need the images in the UI and business processing is much less likely to require them (see the first sketch after this list);
  • Where poor normalisation or table design means that a lot of unnecessary data is held in a table;
  • Reporting-type queries where larger numbers of records are being returned, often from several tables.  If a report is based on tens or hundreds of thousands of records, saving a few bytes per record may well be worthwhile; but again, within the scope of a particular report it makes sense to try to select all the data you need from the relevant table(s) at once.  I have seen some pretty horrendous report designs where the main query executed in less than a second, but follow-on queries were executed per row, leading to the execution time of the report running into minutes!
    However, be aware that a common strategy in reporting is to retrieve additional data so that a single query dataset can be used to service more than one report; how viable this is will depend on your normal query / report result-set sizes… and to emphasise the difference even more, such result-sets are often denormalised into a flat output set which contains columns from many tables, meaning that individual values can be repeated thousands of times (but this makes some of the reporting tasks easier);
  • Occasionally, a batch process may be long-running, with the result that the actual processing of any one record may be separated from the initial query by minutes or even hours.  In these scenarios, it may be best to use the initial selection only to build some sort of work-list, and to re-query data as and when it makes sense to.  In one system I worked with, payment-processing batches would run relatively expensive queries to figure out which payments were due to be taken in the batch, and then the batch itself might run for several hours.  As customers might make payments in that time, the payment status needed to be rechecked just before each payment was attempted.  Therefore, the first main query of the batch returned paymentIds only, and it was the later processing that re-queried all necessary payment details and attempted to take payment (obviously only if the payment had not been taken in the interim).  Note how the volume of work and the time the processing took simply forced duplicate querying to be undertaken across the batch; such batch processes will doubtless be worth careful consideration in any event (see the second sketch after this list).
  • It is also worth bearing in mind that my experience is primarily with a database server on the same LAN as the server(s) doing the work.  If for some reason your database is more remote than that, it will exacerbate these problems; if it is closer (e.g. on the same server), the cost of the database queries may be lessened.
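
As an illustration of that Product / ProductImage split (the classes, tables and columns here are invented), the core queries never touch the image data, and only the UI path loads it:

```csharp
using System.Collections.Generic;
using System.Data;
using Dapper;

class Product
{
    public int ProductId { get; set; }
    public string Name { get; set; }
    public decimal Price { get; set; }
}

class ProductImage
{
    public int ProductImageId { get; set; }
    public int ProductId { get; set; }
    public byte[] ImageData { get; set; }
}

static class ProductQueries
{
    // Business processing only ever needs the lightweight Product row...
    public static Product GetProduct(IDbConnection conn, int productId) =>
        conn.QuerySingle<Product>(
            "SELECT * FROM Product WHERE ProductId = @productId", new { productId });

    // ...while only the UI asks for the (potentially large) image rows.
    public static IEnumerable<ProductImage> GetImages(IDbConnection conn, int productId) =>
        conn.Query<ProductImage>(
            "SELECT * FROM ProductImage WHERE ProductId = @productId", new { productId });
}
```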
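
And a sketch of the work-list shape used by that payment batch (the table, columns and status check are invented for illustration):

```csharp
using System.Data;
using System.Linq;
using Dapper;

class DuePayment
{
    public int PaymentId { get; set; }
    public decimal AmountDue { get; set; }
    public string Status { get; set; }
}

static class PaymentBatch
{
    public static void Run(IDbConnection conn)
    {
        // The expensive up-front query returns only the ids of the work to do.
        var duePaymentIds = conn.Query<int>(
            "SELECT PaymentId FROM Payment WHERE Status = 'Due' AND DueDate <= GETUTCDATE()")
            .ToList();

        foreach (var paymentId in duePaymentIds)
        {
            // Hours may have passed, so re-query the full record just before
            // acting on it, and re-check that it is still due.
            var payment = conn.QuerySingleOrDefault<DuePayment>(
                "SELECT * FROM Payment WHERE PaymentId = @paymentId AND Status = 'Due'",
                new { paymentId });
            if (payment == null) continue; // paid or cancelled in the interim

            AttemptPayment(conn, payment);
        }
    }

    static void AttemptPayment(IDbConnection conn, DuePayment payment)
    {
        // Take the payment and update its status (details omitted).
    }
}
```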

 

A Very Quick Note on Alternatives

I just wanted to acknowledge that some Data Access / ORM solutions go to the other extreme; if I recall correctly, NHibernate can automatically fetch ‘related records’ (e.g. child records, such as the history of payments made against a mortgage, or lookup values such as a ‘Mortgage Type’ value)… depending on how this is configured it can lead to repeated queries that are invisible to the developer but act to slow things down (see the Select N+1 problem).

Also, Dapper.Net has the capability to run several queries in one batch and expand the results into different result sets… for example, I could query the ‘Customer’ table for a particular CustomerId, and query the ‘Address’ table for the same CustomerId, and have both result-sets processed in one call to give me the Customer POCO and the Address POCOs.  By approaching it in this way we can reduce the round-trips to the database still further.
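
A minimal sketch of what that looks like, assuming Dapper’s QueryMultiple and hypothetical Customer and Address classes:

```csharp
using System.Collections.Generic;
using System.Data;
using System.Linq;
using Dapper;

class Customer { public int CustomerId { get; set; } public string Name { get; set; } }
class Address  { public int AddressId  { get; set; } public int CustomerId { get; set; } public string Line1 { get; set; } }

static class CustomerLoader
{
    // Two SELECTs sent as a single batch: one round-trip to the database.
    const string Sql = @"
        SELECT * FROM Customer WHERE CustomerId = @customerId;
        SELECT * FROM Address  WHERE CustomerId = @customerId;";

    public static (Customer Customer, List<Address> Addresses) Load(IDbConnection conn, int customerId)
    {
        using var multi = conn.QueryMultiple(Sql, new { customerId });
        var customer  = multi.Read<Customer>().FirstOrDefault();
        var addresses = multi.Read<Address>().ToList();
        return (customer, addresses);
    }
}
```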

In both cases, the solutions to improve performance increase complexity (and thus should only be adopted with caution) and work by reducing round-trips to the database.  Hopefully this alone acts to confirm my argument above: smart people with advanced tools have provided the means to reduce database round-trips, so perhaps we can safely infer that those round-trips have a cost that is worth avoiding in some circumstances.