Today I am continuing to reminisce about system design problems I have addressed in the past, this time looking specifically at a data structure a colleague of mine used in a particular situation, and the consequences of that decision. Let’s begin!
The scenario was that of running expensive credit checks for loan applicants. The ‘Risk’ team wanted easy access to the results, both for processing on-the-fly as part of the loan application decision process, and for future analysis to see if they could identify markers for successful repayment. Credit checks like the ones my client used are expensive and can return hundreds of varying fields in complex data structures. The general thinking about the process was something like this:
- There are about 30 key fields we regularly retrieve (but not quite always) and we would like access to those fields to be easy;
- We may need to add more important fields to this list of easy-access fields in future, so we don’t want to commit to a fixed structure – ‘that would be wasteful’;
- We have to store the full result set for later analysis as XML, as we paid a lot for it and it may prove to have useful fields deep in the belly which we may need to access one day.
While I am writing this in 2017, the environment I am referring to was at a client from about 2008 – 2010, at a time when the client was using SQL Server 2005. While later versions of SQL Server have introduced functionality that might have helped us here (such as Sparse Columns) they were not available to us then.
The client had Windows services that processed messages from subscribed devices. There were two services: the first did the main processing of each message, primarily handling details of the message type and time, and the second processed the GPS positions associated with each message. For example, a message might indicate that a journey had ended: the first service would calculate the consequences of ending a journey, and the second would work out where the journey had ended and the consequences of that.
In general, messages came in no more than once per minute per device, so most of the time a message could be completely handled by both services before the next one arrived. However, devices sometimes failed to send messages in a timely fashion and might end up sending many at once. In these circumstances, even though individual messages were always processed in the correct order within a given service, the pair of services could not guarantee the order in which all aspects of each message were processed, so outcomes might vary from one day to the next. The result of these complex interactions was odd processing results that could only be understood by considering the potential progress of two different services through a number of messages at a given point in time, hours or days ago!
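The ordering problem described above can be illustrated with a minimal sketch. It is not the client's code: the service names `main` and `gps`, the two-message burst, and the helper functions are all assumptions made for illustration. It enumerates every way two services can interleave their work when each one, on its own, handles messages strictly in order.

```python
from itertools import combinations


def interleavings(a, b):
    """Yield every merge of a and b that keeps the order within each list."""
    n = len(a) + len(b)
    for positions in combinations(range(n), len(a)):
        chosen = set(positions)
        ia, ib = iter(a), iter(b)
        # Slots in `chosen` take the next step from a, the rest from b.
        yield [next(ia) if i in chosen else next(ib) for i in range(n)]


def fully_sequential(order):
    """True if both services finish message 1 before either starts message 2."""
    last_of_1 = max(i for i, (_, msg) in enumerate(order) if msg == 1)
    first_of_2 = min(i for i, (_, msg) in enumerate(order) if msg == 2)
    return last_of_1 < first_of_2


# Hypothetical burst: a device sends messages 1 and 2 at the same time.
messages = [1, 2]
main_steps = [("main", m) for m in messages]  # assumed name for the first service
gps_steps = [("gps", m) for m in messages]    # assumed name for the second service

orders = list(interleavings(main_steps, gps_steps))
sequential = [o for o in orders if fully_sequential(o)]
overlapping = [o for o in orders if not fully_sequential(o)]

for order in overlapping:
    print(order)
```

Even with only two messages, some of the possible interleavings have one service finishing message 2 before the other has touched message 1. Which interleaving actually occurs depends on timing, which is why the same burst could produce different outcomes on different days.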
Sometimes the success or failure of a new system function can end up depending on one feature; and if that feature or function is missing the whole reason for the system existing can be undermined.
This system we are looking at today is one which can track vehicles, and one feature is the ability to plot journeys on a map in a web-browser; it’s actually very cool, but the plotting of the route was a little slow.
We made a mistake recently, breaking one of our own rules: Be Consistent. Now, of course it is not always possible to ‘be consistent’, sometimes because you are doing something truly new, but often because one incorrectly sees differences when one may be better off seeing patterns and similarities (and thus implementing something to fit an existing pattern)!
In the previous post, we hinted at a couple of examples of developer-managed-caches that may not have optimised a system very wisely. Today we will look at an example of a development team that intentionally wanted to optimise some process-or-other, and made mistakes in doing so.
Many modern development tools provide ways to create databases and populate them with test data, often with the idea that unit tests can then be run against them. But there is an alternative approach available to some people, which is to use live data as a source for our test environments. Now, there may be reasons why this is not possible (not the least of which is ‘compliance’), and there are certainly issues of practicality that will need to be considered, but if you are allowed to do this there can be huge benefits.
We have found that communicating our system designs with clients is most usefully done with diagrams rather than large chunks of text. Some years ago, we looked at using UML Use Case Diagrams for this communication – but see what Martin Fowler has to say about them in ‘UML Distilled’:
“But almost all the value of use cases lies in the content, [of the textual cases] and the diagram is of limited value.”
In other words, in his opinion, you should use the textual Use Cases, not the diagrams (which just map those texts visually).
‘Pure’ UML also seems to disappoint in terms of producing very dry monochrome diagrams with stick figures, and simple primitives such as boxes and ovals. Is this really the best we can do to convey the use of a system?
The client had accrued a considerable number of Windows Task Scheduler batch processes over time, but really did not have any documentation of what they all did. This exposed them to a number of risks, one of the most considerable being that in the event of a disaster, reconstruction of the task list could have been problematic. The client’s parent company was pushing through a number of resilience projects, and one thing they wanted was for the batch processes to operate in something like a clustered manner: if one server failed, another would take over.
People sometimes claim that if a performance measure has not been recorded directly, then it is not possible to report on it. Need to know how long a batch job took to run? That’s ok – I have a table that tells me job run times. But if you need to know how much work a batch job did? We don’t record throughput, they say. This may be followed by a suggestion that they could add log messages or something like that.
Baselining is used in IT to record some measurement(s) with a view to comparison later. If you know some statistic on one day – say for example the average CPU usage on a server, then you can measure the same data-point the next day and draw a comparison. It is often used so that you can measure the impact of some change or other, and assess its success.
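As a minimal sketch of that comparison, assuming made-up CPU readings and a hypothetical helper called `percent_change` (neither comes from a real system), the baseline average can be set against a later measurement like this:

```python
from statistics import mean


def percent_change(baseline, current):
    """Percentage change of the current average relative to the baseline average."""
    base_avg = mean(baseline)
    return (mean(current) - base_avg) / base_avg * 100.0


# Illustrative values only: average CPU % sampled before and after a change.
baseline_cpu = [40.0, 50.0, 60.0]
after_cpu = [30.0, 40.0, 50.0]

change = percent_change(baseline_cpu, after_cpu)
print(f"CPU usage changed by {change:+.1f}% relative to the baseline")
```

The arithmetic is trivial; the point is that the comparison is only possible if the first set of readings was actually captured before the change went in.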
Today, I want to talk about the importance of baselining with a counter-example – what can happen if you don’t baseline before making changes to a system; especially where those changes were intended to improve performance!