Fixing Fragile Dependent Processes

The Client had Windows services that processed messages from subscribed devices.  There were two services: the first did the main processing of each message, dealing primarily with the message type and time, and the second processed the GPS positions associated with each message.  For example, a message might indicate that a journey had ended: the first service would calculate the consequences of ending the journey, and the second would work out where the journey had ended and the consequences of that.

In general, messages arrived no more than once per minute per device, so most of the time both services had plenty of opportunity to handle a message completely before the next one came in.  However, devices sometimes failed to send messages in a timely fashion and might end up sending many at once.  In these circumstances, even though each service always processed individual messages in the correct order, the pair of services could not guarantee the order in which all aspects of each message were processed, so outcomes might vary from one day to the next.  These interactions produced odd results that could only be understood by considering how far each of the two services might have progressed through a number of messages at a given point in time, hours or days earlier!
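The ordering problem can be illustrated with a small simulation.  This is a minimal sketch rather than the Client's code: the message names, the service labels `main` and `spatial`, and the use of a seeded random choice to stand in for scheduling are all assumptions made for illustration.

```python
import random

# Hypothetical backlog sent by one device all at once.
MESSAGES = ["journey_start", "position_update", "journey_end"]


def two_services(messages, seed):
    """Interleave two independent consumers; each keeps its own message order."""
    rng = random.Random(seed)
    pending = {"main": list(messages), "spatial": list(messages)}
    steps = []
    while pending["main"] or pending["spatial"]:
        # Whichever service happens to be scheduled next takes its next message.
        ready = [name for name, queue in pending.items() if queue]
        name = rng.choice(ready)
        steps.append((name, pending[name].pop(0)))
    return steps


def integrated(messages):
    """One consumer finishes both aspects of a message before taking the next."""
    steps = []
    for message in messages:
        steps.append(("main", message))
        steps.append(("spatial", message))
    return steps


def distinct_orderings(messages, runs=200):
    """Collect every overall ordering seen across many simulated runs."""
    return {tuple(two_services(messages, seed)) for seed in range(runs)}
```

Each simulated service still sees its messages in order, yet `distinct_orderings(MESSAGES)` contains more than one overall ordering, whereas `integrated(MESSAGES)` always yields the same sequence.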

The Project

I identified the problem, documented it, and reproduced it using a test system that could send in the same messages every day.  Adding small delays to the daily message repeats made it possible to expose the non-deterministic behaviour.  I then took on the project to integrate the processing from both services so that the complete order of processing was guaranteed.
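The idea behind the playback test can be sketched as follows.  This is an illustrative harness under stated assumptions, not the actual test system: `replay`, `fingerprint`, the recorded offsets, and the two stand-in processors `order_only` and `timing_sensitive` are all hypothetical names introduced here.

```python
import hashlib
import json
import random


def fingerprint(outcome):
    """Stable hash of a run's outcome so that runs can be compared cheaply."""
    encoded = json.dumps(outcome, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def replay(recorded, process, runs=50, max_delay=0.5, seed=1):
    """Replay the same messages several times, each with small random delays."""
    rng = random.Random(seed)
    seen = set()
    for _ in range(runs):
        delayed = [(offset + rng.uniform(0.0, max_delay), body)
                   for offset, body in recorded]
        seen.add(fingerprint(process(delayed)))
    return seen


def order_only(delayed):
    """Stand-in for a system whose outcome depends only on message order."""
    return [body for _, body in delayed]


def timing_sensitive(delayed):
    """Stand-in for a system whose outcome depends on arrival gaps."""
    gaps = [later[0] - earlier[0] for earlier, later in zip(delayed, delayed[1:])]
    return [gap < 1.0 for gap in gaps]


# Hypothetical recording: (seconds from start, message body).
RECORDED = [(0.0, "journey_start"), (1.0, "position_update"), (2.0, "journey_end")]
```

A deterministic system produces a single fingerprint however the delays fall, so `replay(RECORDED, order_only)` returns a set of size one, while the timing-sensitive stand-in produces several.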

Actions

The main aim of the project was to move the processing from the ‘spatial’ service into the other service.  A number of business functions had to be moved, each requiring its own analysis and careful decision-making.  In addition to the main reorganisation, a new way of timing a number of occasional background tasks was implemented, which recorded the real-world runtime of each task to inform the next runtime.  Extra logging was also added to record key timings from the processing of messages, intended to allow detailed reports to be added later.
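One way such runtime-informed timing could work is sketched below.  The rule used here (wait a multiple of the last measured runtime, with a minimum gap) is an assumption chosen for illustration, as are the class name `AdaptiveTask` and its parameters; the original scheduling rule is not described in detail.

```python
import time


class AdaptiveTask:
    """Background task whose next due time depends on its last measured runtime."""

    def __init__(self, action, min_interval=60.0, idle_factor=10.0,
                 clock=time.monotonic):
        self.action = action
        self.min_interval = min_interval  # never run more often than this (seconds)
        self.idle_factor = idle_factor    # gap as a multiple of the last runtime
        self.clock = clock                # injectable so the timing can be tested
        self.last_runtime = 0.0
        self.next_due = clock()

    def run_if_due(self):
        """Run the task if it is due; return True when it actually ran."""
        started = self.clock()
        if started < self.next_due:
            return False
        self.action()
        finished = self.clock()
        # Record how long the task really took.
        self.last_runtime = finished - started
        # Slow tasks earn a longer gap before they are allowed to run again.
        gap = max(self.min_interval, self.last_runtime * self.idle_factor)
        self.next_due = finished + gap
        return True
```

With these assumed settings, a task that takes 30 seconds would next become due 300 seconds after it finishes, while a task that takes one second would wait the 60-second minimum.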

Results

The resulting system could guarantee that processing for each message was completed in full, and in sequence, with the result that processing for a given set of messages was deterministic, making for far fewer surprises and support investigations.  The test infrastructure that had helped prove the existence of the issue in the first place was naturally used to confirm that the problem had been resolved by this project, and with this issue removed it could become the basis for far more testing using this ‘playback’ strategy.  This project also resulted in a better understanding of ‘message throughput’ and the ability to record it properly, and the additional data that became available led to useful reports analysing the most intensive parts of the processing.  Ultimately, overall throughput of the system was improved by a later project that made message processing multi-threaded, which was only possible (and reliable) due to the success of this project.