In the previous post, we hinted at a couple of examples of developer-managed caches that may not have optimised a system very wisely. Today we will look at an example of a development team that deliberately set out to optimise some process or other, and made mistakes in doing so.
A Big List Must Waste Time
A company had an internal website that was used in a call centre. One of the main screens used by the staff carried a considerable quantity of customer information, including a table of notes made by call-centre staff alongside automated notes generated by the system. Staff were expected to read this table to understand any recent contacts with the customer, especially if there were ongoing communications.
The history of the problem went something like this:
- The client and the developers realised that there was a general issue with the page getting slower to load;
- The development manager said that the table listing customer notes must be slow, and it would be quicker if it loaded fewer notes;
- The client requested the change;
- A developer made the change;
- A tester tested the change;
- The change went live;
- Page load times did not improve and database performance got worse.
Up to the last point, we had not been involved in the process at all. The problem we noticed was not a huge one: the query that populated the Notes table had always been mediocre, but now it was worse. In fact, per execution its performance was about the same as it had always been (perhaps a touch worse), but it was being run more frequently, increasing the overall impact – even though it was returning less data per execution.
It turns out that the code necessary to ‘paginate’ results in SQL does not necessarily improve performance. Firstly, asking for 25 rows out of 100 does not mean you are doing 25% of the work: in many cases the database still has to select and sort all 100 rows, and the only saving is that it sends less data back down the wire (so you are still doing perhaps 95% of the work). This particular query actually overlaid the results from two different tables, and that complexity meant that ‘top (n)’ queries still took about 98% of the resource of the original query returning all results for a customer. Secondly, paginating SQL results implies that users may want to see a second page of data; now the database has to sort the rows and return rows 26 – 50 (and guess what, it has to select and sort rows 1 – 100 all over again to do so). We could see from traces that staff regularly needed to click a button to ‘get more rows’, so the overall change slowed the system down (and it slowed staff down too).
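The hidden cost described above is easy to see for yourself. The sketch below uses a hypothetical `notes` table in SQLite (not the client's actual schema) and asks the query planner what it has to do to serve ‘page 2’ of a LIMIT/OFFSET query: with no supporting index, the engine still scans and sorts every matching row before it can discard the first 25.

```python
import sqlite3

# A minimal sketch, assuming a hypothetical "notes" table rather than
# the original system's schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE notes (id INTEGER PRIMARY KEY, customer_id INT, "
    "created TEXT, body TEXT)"
)
conn.executemany(
    "INSERT INTO notes (customer_id, created, body) VALUES (?, ?, ?)",
    [(1, f"2024-01-{d:02d}", f"note {d}") for d in range(1, 31)],
)

# Page 2 of one customer's notes. The OFFSET does not skip work:
# the engine must still find and order every matching row before it
# can throw the first 25 away.
page2 = (
    "SELECT body FROM notes WHERE customer_id = ? "
    "ORDER BY created DESC LIMIT 25 OFFSET 25"
)
plan = conn.execute("EXPLAIN QUERY PLAN " + page2, (1,)).fetchall()
for row in plan:
    print(row[-1])  # full table scan plus a temp sort for ORDER BY
```

The plan shows a scan of the whole table and a temporary B-tree built for the `ORDER BY` – that sorting work is repeated for every page requested, which is exactly why ‘get more rows’ clicks multiplied the load rather than reducing it.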
How did it all go so wrong? We believe it was because of the assumption made in point 2 above: that returning less data would save time. That assumption was not physically true of the database, and it did not factor in the human side: staff needed to see an indeterminate number of notes, and picking the ‘right number’ to show them would be very difficult. Secondly, the problem change went live because the original problem (slow page loads) was not mentioned on the ticket; everyone either assumed that ‘less data would be quicker’ (if they knew of the original problem at all) or was unaware that the page load was supposed to be quicker as a result of the change. To demonstrate this even more clearly: initial implementations of the change to the Notes control actually caused the whole page to refresh, which reran hundreds of queries and made the user wait for the screen all over again!
While we were able to improve the database performance a little through indexing and query optimisation, and we got the developers to refresh the control asynchronously (without causing all the other queries for the page to rerun), we could not persuade the clients to undo the damage done by the change. Indeed, some of the developers and clients became entrenched in the opinion that this one control needed a lot more work to make it fast, and a whole raft of changes were made to it: downloading more data but only displaying some of it, then adding a search function (because it might be cool, not because anyone had asked for it), introducing a whole new set of problems in the process.
Official performance profiling of that page was never undertaken (while we worked for that client), but we would anticipate that the slow-down was mostly down to more controls and information being added to the page, rather than any one query. The notes control could certainly account for a fair chunk of the data displayed, but other areas of the page relied on tens of queries being run in turn to get the data necessary, and everything starts to slow down the more round-trips you make to the database!
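The round-trip point can be illustrated with a small sketch (again using a hypothetical SQLite table, not the client's code). Running one query per item means one round-trip per item; batching the same lookups into a single `IN` query returns identical results in one trip. With an in-process database the saving is tiny, but over a network each separate statement pays full latency.

```python
import sqlite3

# Hypothetical "accounts" table; the names here are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INT)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [(i, i * 10) for i in range(1, 21)]
)

wanted = [3, 7, 11]

# Pattern 1: one query per id -- len(wanted) round-trips to the database.
one_by_one = [
    conn.execute("SELECT balance FROM accounts WHERE id = ?", (i,)).fetchone()[0]
    for i in wanted
]

# Pattern 2: a single query with an IN list -- one round-trip.
placeholders = ",".join("?" for _ in wanted)
batched = dict(
    conn.execute(
        f"SELECT id, balance FROM accounts WHERE id IN ({placeholders})", wanted
    )
)

# Both patterns produce the same answers; only the number of trips differs.
same_results = one_by_one == [batched[i] for i in wanted]
```

Queries that genuinely depend on each other's results cannot always be batched this simply, but pages built from ‘tens of queries run in turn’ usually contain several that can be.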