Sometimes the success or failure of a new system function ends up depending on a single feature, and if that feature is missing, the whole reason for the system existing can be undermined.
This system we are looking at today is one which can track vehicles, and one feature is the ability to plot journeys on a map in a web-browser; it’s actually very cool, but the plotting of the route was a little slow. Inside the system there is a record of significant points in the journey in a Gps table… and in addition there is a record of ‘journeys’; which is to say an estimation of when movements reflect real journeys. This is because a lot of effort has to be put into deciding if you are really moving or not because even a stationary vehicle will report fluctuating position over time. The system attempts to intelligently assess the purpose of a journey based on data points like when and where it started – and similarly when and where it ended.
At some point, it was decided to improve the route-plotting. The basic principles under which the new plotting tool would work were as follows:
- The map would be split into various pre-defined ‘tiles’. The tiles are basically the same number of pixels wide and high, and at each possible zoom value each tile will have spatial coordinates which allow drawing of whatever data needs to be drawn. As an example, for a normal map, most of the ‘map’ data is static and so this allows each and every tile to be cached for the long term;
- The application already worked this way in terms of static map tiles being served by Google, and the site would draw its own overlays to draw important site data and so on;
- The new route plotter would also use ‘overlay’ tiles, and would add a caching capability so that once the route plot was drawn for an overlay tile it would be cached and could be fulfilled from cache should it ever be needed again;
- Therefore most tiles would be fulfilled from cache and obviously this would make the system faster.
In addition to the route plotting, the same system would be used to plot overlays for mostly-static things like client site locations and so on.
Misunderstood Usage Patterns
The first error in judgement concerned the likely use-case at the level where the developers decided to apply caching. For example, whenever we used the system to view a journey plot the following would occur:
- Journey plot (generally for a full working day) would be requested;
- The map and display would resize and redraw to display the whole set of journeys covered on that day by an employee;
- Often the ‘zoomed out’ view necessary to see the whole day left enough detail unresolved that we would want to zoom in to a part of the route plot to see the detail, and then we might drag the map around to see different features;
- Shortly thereafter, we would have seen enough of the route plot to warrant closing it.
Note how this usage pattern was actually going to use the cache very little. The initial display of the route plot fetched data and displayed it on tiles at one particular zoom level. Then, for every zoom level the user zooms through, the same thing happens: this particular route has not yet been drawn at this particular zoom level, so the whole process repeats with new tiles. As the user pans around the map, new map tiles are revealed, so again they need to be drawn before they can be cached. It is only when the user pans around and back over a view they have already seen that they begin to take advantage of the cache. Even so, the cache lifetime is likely to be set to just a few minutes, so it is unlikely that anyone else will reuse this cache in that time anyway (if anyone else even had permission to do so)!
Often, we might follow this route plot with another for another person or another date, in which case the whole process would repeat without the benefit of a cache.
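The pattern above can be illustrated with a toy simulation. This is only a sketch under our own assumptions: the viewport size (4 x 3 tiles), the zoom levels, the route identifiers and the `OverlayCache` class are all hypothetical, and cache expiry is not modelled at all (which flatters the cache).

```python
from dataclasses import dataclass, field


@dataclass
class OverlayCache:
    """Toy cache that only counts hits and misses (no expiry modelled)."""
    seen: set = field(default_factory=set)
    hits: int = 0
    misses: int = 0

    def request(self, key):
        # key is (route_id, zoom, x, y): an overlay tile is only reusable
        # for the same route at the same zoom level and grid position.
        if key in self.seen:
            self.hits += 1
        else:
            self.misses += 1
            self.seen.add(key)


def view(cache, route_id, zoom, x0, y0, width=4, height=3):
    """Request every tile in a viewport whose top-left tile is (x0, y0)."""
    for x in range(x0, x0 + width):
        for y in range(y0, y0 + height):
            cache.request((route_id, zoom, x, y))


def simulate_session():
    cache = OverlayCache()
    view(cache, "alice-monday", 10, 0, 0)  # initial whole-day view: all misses
    view(cache, "alice-monday", 12, 0, 0)  # zoom in: every tile is new again
    view(cache, "alice-monday", 12, 1, 0)  # pan right one tile: one new column
    view(cache, "alice-monday", 12, 0, 0)  # pan back: finally all hits
    view(cache, "bob-monday", 10, 0, 0)    # another person's route: all misses
    return cache


session = simulate_session()
print(f"hits={session.hits} misses={session.misses}")
```

Even in this generous model, most tile requests are misses; the only hits come from panning back over ground already covered.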
We believe that the tiling implementation used by Google means that the main map bitmaps are served and cached very effectively; each image will likely be held in memory or cache and be reused millions of times before anyone thinks to redraw the map (though of course there might be different tiles for different map views such as roads or topographical map versions)… but implementing a similar cache for this particular system really did not seem to gain the same benefits.
Misunderstood Workload
I avoided any mention of where the plotting process was getting its journey data from previously, but this process based on tiles actually left the querying of a database to the code that drew an individual tile. The perceived benefit was obviously that ‘the system will only query the limited data it needs to draw a tile’ and thus ‘obviously it is being very efficient’… and this might seem to make sense as you imagine a map-view being dragged, a tile being requested, a map overlay being drawn, and of course just the limited amount of data necessary for that tile should be requested. But in practice, we believe that the spatial nature of the query meant that querying was generally a little slow, and the round-trip to the database for every little query would have added some considerable overhead.
Overall, we believe that a large amount of time was wasted in the design of the system by misunderstanding the complexity of the initial query to get the route data. When it was run just once for a whole day it tended to take considerably less than a second; but a refinement of it ended up getting run sometimes hundreds of times. The addition of spatial criteria simply would not have made it run orders of magnitude faster; in all likelihood spatial filters would make it slower! In short, the new system consumes more time querying the database than the old… but at least the caching layer is going to improve user perception and usability, right?
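A rough cost model makes the point. Every figure below is an illustrative assumption on our part, not a measurement from the system, and the model assumes the queries run one after another rather than in parallel.

```python
def total_seconds(n_queries, query_seconds, round_trip_seconds):
    """Total wall-clock time if the queries run one after another."""
    return n_queries * (query_seconds + round_trip_seconds)


# All figures are illustrative assumptions, not measurements.
ROUND_TRIP = 0.01      # assumed per-query network and connection overhead
WHOLE_DAY_QUERY = 0.5  # "considerably less than a second"
PER_TILE_QUERY = 0.05  # generously assume the spatial filter is 10x faster
TILES_REQUESTED = 200  # "sometimes hundreds of times"

single = total_seconds(1, WHOLE_DAY_QUERY, ROUND_TRIP)
per_tile = total_seconds(TILES_REQUESTED, PER_TILE_QUERY, ROUND_TRIP)
round_trips_alone = total_seconds(TILES_REQUESTED, 0.0, ROUND_TRIP)

print(f"one whole-day query:      {single:.2f}s")
print(f"one query per tile:       {per_tile:.2f}s")
print(f"per-tile round trips only: {round_trips_alone:.2f}s")
```

Under these assumptions, the round trips alone cost more than the single whole-day query, so the per-tile approach loses even if each spatial query were free. Parallel tile requests would shorten the wait for the user, but not the total load placed on the database.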
The One Missing Feature
So, we’ve spotted some potential problems with the system but at this point we’re hoping that the caching of map tile overlays is going to save us. But it doesn’t; because it isn’t even switched on!
Say what? Why isn’t the cache layer switched on? Caching has obviously got to be a good thing, hasn’t it?
Well yes, if it had a way to communicate with the rest of the web site and update as and when necessary; but by design it sat separately from the rest of the website and was unaware of changes made there (essentially the tile process had been retrofitted to the main site by overriding JavaScript functions and so on). In addition to displaying route plots, the same ‘tile’ system was used to draw other overlays on the map such as client sites. If a user added a new site, there was no way to inform the tile process of this, so the site would not be displayed. This inevitably led to lots of errors where the puzzled user would try to recreate the site several more times before giving up, causing all sorts of issues such as duplicated sites with similar names and so on. We wrote about this in our post ‘Two Hard Things’ (where one of the things was cache invalidation).
Thus, the cache had been switched off to allow sites that were added to the system to be viewed instantly… because the ability for the site to inform the tile processes that they should refresh their cache did not exist!
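To make the missing feature concrete, here is a minimal sketch of the kind of hook that was absent. The `TileCache` and `SiteRegistry` classes, the `"sites"` layer name and the `invalidate_layer` method are all hypothetical names of our own; the real system had no such interface, which is exactly the problem.

```python
class TileCache:
    """Toy in-memory cache of rendered overlay tiles."""

    def __init__(self):
        self._tiles = {}  # (layer, zoom, x, y) -> rendered tile

    def get_or_render(self, layer, zoom, x, y, render):
        key = (layer, zoom, x, y)
        if key not in self._tiles:
            self._tiles[key] = render(layer, zoom, x, y)
        return self._tiles[key]

    def invalidate_layer(self, layer):
        """Drop every cached tile for one overlay layer; return how many."""
        stale = [key for key in self._tiles if key[0] == layer]
        for key in stale:
            del self._tiles[key]
        return len(stale)


class SiteRegistry:
    """Stand-in for the main web site's list of client sites."""

    def __init__(self, cache=None):
        self.sites = []
        self._cache = cache  # None models the real system: no hook at all

    def add_site(self, name):
        self.sites.append(name)
        if self._cache is not None:
            # The missing feature: tell the tile process its tiles are stale.
            self._cache.invalidate_layer("sites")


def make_renderer(registry):
    """Pretend 'rendering' a tile just lists the sites known at draw time."""
    return lambda layer, zoom, x, y: ",".join(registry.sites)
```

Dropping the whole layer is deliberately coarse; a finer scheme would invalidate only the tiles covering the new site, but even the blunt version would have let the cache stay switched on.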
Summary
This project featured a number of errors of understanding and appreciation of the consequences of the strategy the developer used. By missing just one non-obvious key feature, another key feature had to be switched off, considerably undermining the value of the sub-system. Furthermore, the changes will have led to increased workload on the database and more overall time spent waiting for potentially thousands of queries to execute.
The analysis here has not been unduly harsh; for example you might recall this: “The map and display would resize and redraw to display the whole set of journeys covered on that day”. Guess how the system managed to know how far the display view needed to be zoomed-out? By running a query to fetch all the data for the whole day, of course! In other words, all those thousands of queries were already duplicating the one single query that was actually all that was necessary to be run!
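As a sketch of the alternative, the result of that single whole-day query could serve both purposes: working out the view bounds, and supplying each tile with its points from memory. The function names are our own, and the tile arithmetic assumes the common Web Mercator XYZ tile scheme rather than anything confirmed about this system.

```python
import math
from collections import defaultdict


def tile_for(lat, lon, zoom):
    """Tile (x, y) containing a point, assuming Web Mercator XYZ tiling."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    return x, y


def bounding_box(points):
    """(min_lat, min_lon, max_lat, max_lon) for the whole day's points."""
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return min(lats), min(lons), max(lats), max(lons)


def bucket_by_tile(points, zoom):
    """Group already-fetched points by tile, with no further queries."""
    buckets = defaultdict(list)
    for lat, lon in points:
        buckets[tile_for(lat, lon, zoom)].append((lat, lon))
    return dict(buckets)


# Pretend this list came back from the one whole-day query.
day_points = [(0.0, 0.0), (51.5, -0.1), (51.6, -0.2)]
print(bounding_box(day_points))
print(bucket_by_tile(day_points, zoom=1))
```

Note that bucketing points only finds tiles that contain a recorded point; a real plotter would also need to handle tiles that a line segment merely crosses.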
Bizarrely, the client is not overtly aware of the problems this new system has introduced. In fact, the older system it replaced was so very inefficient on the original query of route plot data, that the whole idea was presumed to be unfeasible… but ironically they had a much better caching implementation that meant that once the data had finally been retrieved and processed, panning and zooming the journey plot was actually much quicker with much less load on the servers!