The client had accrued a considerable number of Windows Task Scheduler batch processes over time, but had essentially no documentation of what they all did. This exposed them to a number of risks, one of the most significant being that in the event of a disaster, reconstructing the task list could have been problematic. The client’s parent company were pushing through a number of resilience projects, and one thing they wanted was for the batch processes to operate in something like a clustered manner: if one server failed, another would take over.
Separate, but Integral
Due to scheduling constraints, we were confident that this was not going to be a project where ‘bells and whistles’ would be added; as soon as the system was functional, we would be moved onto other projects. So we wanted something that could be considered a separate sub-system, but to reduce overhead for now, we implemented it in the main system’s database. Although the main system could in principle have queried job-run data, we did not allow this, in order to keep the roles distinct.
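One way to keep a sub-system logically separate while sharing a database is a naming convention (or, on SQL Server, a dedicated schema). The sketch below is purely illustrative — the `sched_` prefix, table names, and columns are assumptions, not the client’s actual schema — and uses SQLite so it is self-contained.

```python
import sqlite3

# Illustrative only: the scheduler's tables live alongside the main
# system's, distinguished here by a "sched_" prefix; a separate schema
# would serve the same purpose on SQL Server.
SCHEMA = """
CREATE TABLE sched_job (
    job_id       INTEGER PRIMARY KEY,
    name         TEXT NOT NULL UNIQUE,
    command_line TEXT NOT NULL,
    cron_spec    TEXT NOT NULL,      -- when to run
    enabled      INTEGER NOT NULL DEFAULT 1
);
CREATE TABLE sched_job_run (
    run_id     INTEGER PRIMARY KEY,
    job_id     INTEGER NOT NULL REFERENCES sched_job(job_id),
    started_at TEXT NOT NULL,
    ended_at   TEXT,
    exit_code  INTEGER               -- NULL while still running
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```

The main application can see these tables but, by policy, only the scheduler processes read or write them.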
We also needed a system that could serve multiple different environments on a single server, and we felt that precluded the use of a Windows Service, since only one service of a given name can run on a machine.
Choosing Strategies
We developed the scheduler processes to use the database as their data store, and we used Named Pipes to connect the client ‘Controller’ processes to the schedulers. Although the initial interactions with the schedulers were to be through command-line interfaces, the communication layers were separated into distinct libraries so that it would be relatively easy later to add controller UIs embedded in the administrative pages of the application’s main website. In addition, while jobs would only ever be run by a single ‘primary’ scheduler, the code was written so that any number of servers could collaborate in this way.
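The controller-to-scheduler handshake can be sketched with Python’s `multiprocessing.connection`, which uses Windows named pipes for `\\.\pipe\...` addresses (and sockets elsewhere, so the sketch stays runnable cross-platform). The pipe name, the `"status"` message, and the reply shape are all assumptions for illustration; the real system’s protocol was its own.

```python
import sys
import threading
from multiprocessing.connection import Listener, Client

# Windows named pipe when available; loopback socket elsewhere.
ADDRESS = r"\\.\pipe\scheduler-demo" if sys.platform == "win32" else ("localhost", 0)

listener = Listener(ADDRESS, authkey=b"demo")

def scheduler_side():
    # The scheduler waits for a controller and answers one request.
    with listener.accept() as conn:
        request = conn.recv()
        if request == "status":  # hypothetical message name
            conn.send({"primary": True, "jobs_running": 0})

t = threading.Thread(target=scheduler_side)
t.start()

# The controller connects, asks for status, and reads the reply.
with Client(listener.address, authkey=b"demo") as conn:
    conn.send("status")
    reply = conn.recv()

t.join()
listener.close()
```

Keeping this request/reply layer in its own library is what made it plausible to later bolt a web UI onto the same pipe protocol.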
In Practice
In the nearly two years this system has been running, the pairs of schedulers have behaved very well. As we expected, further development on the system has been very limited, so some of its promise has not been fulfilled, but the business now has several areas of functionality it did not have before:
- A single, structured, up-to-date list of the jobs that are run;
- Resilience in the face of hardware errors and intermittent downtime such as operating system patches;
- Daily reports on jobs which ran, and which failed (or exited with an unusual code);
- Weekly reports on jobs whose runtimes show concerning trends.
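The weekly trend check can be sketched as a simple comparison of recent run durations against each job’s history. The 5-run window and the 1.5× threshold below are illustrative assumptions, as is the input shape (a mapping of job name to durations, oldest first); the real report drew on the job-run table.

```python
from statistics import mean

def concerning_jobs(runtimes_by_job, recent_n=5, factor=1.5):
    """Flag jobs whose recent runs are markedly slower than their history.

    runtimes_by_job maps job name -> list of run durations in seconds,
    oldest first. Window and threshold are illustrative defaults.
    """
    flagged = []
    for job, durations in runtimes_by_job.items():
        if len(durations) <= recent_n:
            continue  # not enough history to form a baseline
        baseline = mean(durations[:-recent_n])  # all but the recent runs
        recent = mean(durations[-recent_n:])    # the last recent_n runs
        if recent > factor * baseline:
            flagged.append((job, baseline, recent))
    return flagged
```

A job that historically took a minute but now averages two would be flagged; a job with steady runtimes would not.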
Overall, we think the development was very worthwhile, but it is always a little sad to see a project not continue and grow to fulfil the hopes you had for it. One particular shortcoming is that new jobs currently have to be added by hand-written SQL inserting a new Job record; we had hoped this would be possible via a UI by now (either in the Controller command-line application or on a website).