Performance Analysis by Tracing Side Effects

People sometimes claim that if a performance measure has not been recorded directly, then it is not possible to report on it. Need to know how long a batch job took to run? That’s fine – there’s a table that records job run times. But how much work did the batch job do? We don’t record throughput, they say. This may be followed by a suggestion that they could add some log messages, or something like that.

We recently helped analyse the performance of a batch job by tracing its side effects.

The data that was readily available was the job run-times:

The duration of the daily run over a number of days. The maximum duration of 120 minutes was enforced by the Scheduler.

Having understood the general operation of the process that was performing poorly, we started by identifying some records that could act as a marker for its activity. It’s not always possible to do this, of course, but if you look for markers, you can often find something that will help.
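To make the idea concrete, here is a minimal sketch of that first step. The file and column names (a marker_events.csv with a created_at timestamp, and a job_runs.csv holding the scheduler’s start and end times) are hypothetical stand-ins rather than anything from the real system; the point is simply to count the marker records that fall inside each run’s window:

```python
# Sketch only: the file/column names below are hypothetical stand-ins for
# whatever the real system happens to record.
import pandas as pd

# Timestamps of the "marker" records the job writes as a side effect.
markers = pd.read_csv("marker_events.csv", parse_dates=["created_at"])

# The scheduler's record of each daily run: one row per run, with start/end times.
runs = pd.read_csv("job_runs.csv", parse_dates=["start_time", "end_time"])

# Count the marker records created inside each run's window; this acts as a
# surrogate for "records processed in that run".
runs["records_processed"] = runs.apply(
    lambda run: ((markers["created_at"] >= run["start_time"])
                 & (markers["created_at"] < run["end_time"])).sum(),
    axis=1,
)

print(runs[["start_time", "records_processed"]])
```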

Now we could trace the number of records processed in a particular run:

The graph records our assessment of the number of records processed per run.

When aligned with the runtime graph above, you can see that some improvements in processing coincide with decreases in runtime; these represent real performance improvements. Unfortunately, the same can be said of the performance regressions.

This particular job started with a large select, which lowered the recorded throughput in the first minute or so; but this gave us enough information to show that we had a reasonable approach to collecting data. The job runtime and the overall throughput allowed us to graph the average throughput of each run over time.
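Continuing the same hypothetical sketch, the per-run average is simply the marker count divided by the run’s duration:

```python
# Continuing the sketch above: average throughput per run, in records per minute.
runs["duration_min"] = (
    (runs["end_time"] - runs["start_time"]).dt.total_seconds() / 60
)
runs["avg_records_per_min"] = runs["records_processed"] / runs["duration_min"]

# One point per daily run: enough to graph the average over time.
print(runs[["start_time", "duration_min", "avg_records_per_min"]])
```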

The graphs produced so far showed some unusual behaviours, such as a wildly varying runtime. Could we learn anything more from reconsidering the data we had?  The answer was yes: we could plot the surrogate performance of individual job runs!  As this was a manual analysis, we used a simple spreadsheet to collate the data from different runs. In this interpretation of the data, we can see the throughput of specific runs over time:

Analysis of each minute of individual runs is interesting, showing dramatically different performance profiles.
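The collation we did by hand in the spreadsheet could equally be sketched in code (still using the hypothetical names from above): bucket each run’s marker timestamps into minutes since the run started, giving one throughput profile per run:

```python
# Still the same sketch: one per-minute throughput profile per run.
profiles = {}
for _, run in runs.iterrows():
    in_run = markers[(markers["created_at"] >= run["start_time"])
                     & (markers["created_at"] < run["end_time"])]
    minutes = ((in_run["created_at"] - run["start_time"])
               .dt.total_seconds() // 60).astype(int)
    profiles[run["start_time"].date()] = minutes.value_counts().sort_index()

# Rows = minute of the run, columns = run date,
# values = records processed in that minute.
per_minute = pd.DataFrame(profiles).fillna(0).astype(int)
print(per_minute.head())
```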

Again, the varying results on different days do hint at some dramatic issues with performance. The date-based information could help developers identify when certain changes had gone live, and could perhaps be useful in identifying the cause of slow-downs. Meanwhile, the general performance level showed an interesting characteristic: the job slowed down as the run progressed. This was a concern because, in essence, processing a record should – on average – take the same time towards the end of a batch as it does near the start; yet each and every line shows the job slowing.
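If you wanted to quantify that slow-down rather than eyeball it, one rough approach (again, just a sketch over the hypothetical data above) is to fit a straight line to each run’s per-minute counts; a clearly negative slope means throughput decays as the run progresses:

```python
import numpy as np

for run_date in per_minute.columns:
    counts = per_minute[run_date]
    active = counts[counts > 0]   # ignore minutes after this run had finished
    slope = np.polyfit(active.index, active.values, 1)[0]
    print(f"{run_date}: {slope:+.2f} records/min change per minute of runtime")
```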

When it’s possible, monitoring code via its side effects can be a rewarding process, and it also provides baselines against which we can judge future ‘performance improvements’.