Using Live Data as a Source of Test Databases

Many modern development tools are providing ways to create databases and populate them with test data; often with the idea that unit tests can then be run against them. But there is an alternative approach available to some people; which is to use live data as a source for our test environments.  Now, there may be reasons why this is not possible (not the least of which is ‘compliance’), and there are certainly issues of practicality that will need to be considered, but if you are allowed to do this there can be huge benefits.

The client’s development and test databases were woefully out-of-date; frequently causing delays as developers or testers found subsystems that were failing due to a data model that was out-of-step with code, or perhaps that there was no sample data readily available.  For example; if your sales team set up a new ‘broker’ in live last week with particular new settings, it might be convenient to know it was available in test too, so you could check the impact of that particular setting in relation to analysis of a bug.  But in this case, the test databases were shared, and that meant that there was often at least one developer or tester arguing against it being refreshed – for example, refreshing an Integration database might impact testers who set up customers (or whatever) ready to test some other thing the next week.

The DBAs strove for a considerable period to get agreement to refresh the ‘test’ databases on an ad-hoc basis (of which there were several), but finally we realised we were asking for the wrong thing – in fact we needed to ask for more control! We changed our tack to ask for the permission to do this weekly… and got agreement that we could try this.  The impact for all staff would be that any test data created in the week would be overwritten on Sunday, but the advantages would be that the databases would be in a well controlled state – with a far better understood life-cycle… and because this would all happen regularly there was no real challenge in remembering when these consequences would need to be considered.

Reasons to Source Data from Live

  • If the system has any kind of ‘bad’ or ‘historic’ data that is no longer actively created or used, it will still be present in the test databases.  If you are populating a test database with test data, will you bother to create sample data for something old and outdated? What if there are real errors in the data?*
  • Maintain the volume of data as seen in live – this won’t always be possible – but the more realistic your database is in size, the more likely you are to pick up performance and other issues in test stages. We found this to be especially true of ‘migration’ scripts to upgrade the database when code was released – having realistically sized databases made it far more likely we would detect performance problems with those scripts, which were invariably run with the system running;
  • Maintain the balance of data – if 10% of your business is process ‘x’ – it may be advantageous for developers if that is what the test databases showed too – we have certainly seen cases where developers removed functionality that appeared to be historic despite the fact example data could be seen in the test environments (e.g. they did not do proper analysis of ‘historic’);
  • Depending somewhat on the client, the live database may be more likely to have data in all sorts of states, at the start, middle and end of system processes (not if they are transactions!).  This may be advantageous in allowing testers to use real examples of customers to research problems, rather than having to create data to test with… it may be easier to create tools or SQL to help them find data in given states, rather than have them construct it (bear in mind;
  • If test databases actually reflect live data then you can test problems reported by customers on data that is essentially the same, but on a test database (thus not impacting the customer at all).  This might sound a tenuous argument – can’t you recreate data like the customer record to simulate the issue the customer saw and reported?  Well, yes you can, but recreation of complex data accurately often gets overlooked or bypassed, meaning that if the problem is caused by some weird interplay of apparently unrelated data, you may miss what you are looking for.

Reasons to be Careful

  • Depending on your local legal system, and your terms and conditions, you might be prevented from duplicating customer data into test systems.  It turned out that this client was not prohibited from doing so, but we decided to depersonalise the data anyway.  Finding personal data and morphing it into data that was very obviously test data was time consuming but worthwhile, and something we improved as we went.  Some data, especially XML fields, we took a safety-first approach and removed the data altogether;
  • It may not be possible to have several copies of the live database due to size constraints; even if you only want one copy in a performance checking environment, a database of several hundred gigabytes takes some horsepower to move around.  We were forced to reduce the database size because we needed several copies running on our staging server.  We managed to shrink the database dramatically by carefully selecting tables to truncate or dramatically reduce the number of records in that tended to be more ‘logging’ type tables that of particular importance to the business.  However, if you have a network of relationships to disentangle, removing data might be deeply problematic for you;
  • Obviously all this data processing takes server and network time to shift round the system, process, and shift on again.  We did our compress and data-masking within our live database subnet, and only then shifted the compressed and reduced backup across to the testing subnet;
  • Despite the clear procedures we put in place over the refreshing of the test, development, integration test (etc) databases, testers did occasionally forget this. Overall, though, the improvements in the understanding of the state of the test databases was a considerable achievement.

How We Developed This

The client had a pair of databases that needed to be managed as a pair in every environment.  We had written a set of C# utilities that utilised SQL Server ‘Server Management Objects’ (SMO) to control restores of these pairs of databases.  Anyone who has automated database restores will know that you need to control where database files are restored to (unless your computers all precisely echo the live environment) so we built quite a lot of configuration over where to restore files  for these database pairs.  We used this utility to manage daily restores of the database, and then we used SQL and T-SQL managed by a SQL Agent job to process the data shrink and blackout step-by-step – we generally ran this twice per week.  The final steps were to take a new backup (we relied on the powers of Quest LiteSpeed which compresses backups very well) and then used a variety of ftp and file copies to move these backups to the required testing servers.

Overall, the process was undertaken in an iterative / agile manner; delivering a workable solution quickly before addressing additional issues (e.g. removing data from the database came only when it was really necessary to do so due to disk space constraints).


This was a very worthwhile process for the client, and helped developers, testers, and the support team immensely… it seemed particularly valuable for this older system which had quite a lot of ‘bad’ historic data in – data that no-one would ever have bothered to recreate if you were populating an empty database.  The fact that live data was generally present in the test databases (perhaps with different name details) was very useful for problem analysis and product support.

While a more modern system with more controlled development in the early days will not benefit so much from having this historic data, if you value performance testing at any point in your process you may need to embrace this process!

As for the client, it was not the end of the road. We assisted them with setting a development roadmap for them so that every developer can create a model database on their local machine, constructed from a data model stored in version control.  However, it is likely that they will continue to use test data sourced from live for integration and UAT for the foreseeable future.

Under the client's proposed Developer Database roadmap, the developers will create per-ticket databases on their own machines, but progress to using databases with data sourced from live for later steps.

Under the client’s proposed database roadmap, the developers will create per-ticket databases on their own machines, but progress to using databases with data sourced from live for later steps.

One thing we were pleasantly surprised by was how quickly colleagues accepted the shift to the new process. It was remarkable considering the resistance encountered when we were discussing the ‘ad-hoc’ refresh model, that they were willing to consider it done weekly. Perhaps this change in attitude was due to an understanding that this would be a real improvement in development processes?




* Sometimes this is the way things are. A legacy system is known to have bad data in, but you simply don’t get the opportunity to fix it. In fact, the system may not be ‘legacy’ but the functionality might be due to changes in business process.  In this case, the customer had ‘test’ data in their live database that dated from a period when they were just starting out, only had a live database, and frequently did any testing on live with dummy data.  We simply never got the resource to investigate that old test data (but even if we had, there would still have been ‘bad’ data from poor implementations from those early wild-west-development days.