IT Wisdom Misunderstood: Data Stores Should Be Flexible to All Possible Future Data

Today I am continuing my reminiscence about system design problems I have addressed in the past, this time looking specifically at a data structure a colleague of mine used in a particular situation, and the consequences of that decision.  Let’s begin!

The scenario was that of running expensive credit-checks for loan applicants.  The ‘Risk’ team wanted easy access to the results, both for processing on-the-fly as part of the loan application decision process, and also for future analysis to see if they could identify markers for successful repayment.  Credit-checks like the ones my client used are expensive, and can return hundreds of varying fields in complex data structures.  The general thinking about the process was something like this:

  • There are about 30 key fields we regularly (though not always) retrieve, and we would like access to those fields to be easy;
  • We may need to add more important fields to this list of easy-access fields in future, so we don’t want to commit to a fixed structure – ‘that would be wasteful’;
  • We have to store the full result set as XML for later analysis, as we paid a lot for it and it may prove to have useful fields deep in its belly which we may need to access one day.

While I am writing this in 2017, the environment I am referring to was at a client from about 2008 – 2010, at a time when the client was using SQL Server 2005.  While later versions of SQL Server have introduced functionality that might have helped us here (such as Sparse Columns), they were not available to us then.

The basic solution chosen used something like three core tables: one to identify the general Application, one to hold the individual key attributes of that application (ApplicationAttribute), and a lookup table for attribute types (AttributeType), plus at least one further table to store the full XML (which I will not discuss further in this post).  There may also have been another table with extra data in it, but that too is beyond the scope of this post.
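I no longer have the original schema to hand, so here is a minimal sketch of the shape I am describing; the table and column names are my own reconstruction rather than the client’s actual ones, and I have simplified the value columns:

    -- A reconstruction of the 'flexible' design (names are illustrative).
    CREATE TABLE Application (
        ApplicationId BIGINT IDENTITY(1,1) PRIMARY KEY,
        CreatedAt     DATETIME NOT NULL
        -- ...plus the general application-level columns
    );

    CREATE TABLE AttributeType (
        AttributeTypeId INT PRIMARY KEY,
        Name            VARCHAR(100) NOT NULL  -- e.g. 'CustomerId', 'CreditScore'
    );

    CREATE TABLE ApplicationAttribute (
        ApplicationAttributeId BIGINT IDENTITY(1,1) PRIMARY KEY,
        ApplicationId   BIGINT NOT NULL REFERENCES Application (ApplicationId),
        AttributeTypeId INT NOT NULL REFERENCES AttributeType (AttributeTypeId),
        IntValue        INT NULL,            -- only one of these 'value' columns
        DecimalValue    DECIMAL(18, 4) NULL, -- is populated on any given row
        StringValue     VARCHAR(255) NULL,
        DateValue       DATETIME NULL,
        UpdatedAt       DATETIME NULL
    );

One Application row plus roughly 30 ApplicationAttribute rows were written per loan application.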

Though this ‘flexible’ data model had been designed thoughtfully, I hope to show that some key aspects were neglected.  Let’s examine some of the problems:

  • Inserting records was expensive: due to the database design, about 30 records had to be written for each and every customer application.  Because of the underlying technology, this essentially meant that many round-trips to the database… it would have been possible to write several records in a single SQL statement, but that never happened because it would have taken considerable effort and would have had to bypass the ORM in use at the time;
  • Parsing data types is a pain and caused surprisingly deep and costly consequences.  For example, let’s say that one of the attributes we wanted to store was an integer CustomerId.  An ApplicationAttribute record would be written with the relevant AttributeType set and the INT column populated.  A subquery would join ApplicationAttribute to AttributeType to pick out the ‘CustomerId’ rows and read the INT column, and that value would then be used in the ‘outer’ or ‘master’ query to link to the Customer table.  Even in cases where the INT value was guaranteed to be populated (and guaranteed to be a CustomerId), this led to all sorts of problems with queries working mostly, but not always.
    The query had been written with the JOIN from ApplicationAttribute to AttributeType first (to pick out the ‘CustomerId’ attributes) in an inner query, and only then joined to the Customer table.  With that expected order you would think you were ‘guaranteed’ to link to legitimate customer data, but it turned out that the query optimiser considered the complete query text and did not evaluate the subquery first.  It would therefore sometimes decide the joins were more efficient if it joined the Customer table to ApplicationAttribute first, and only then joined to AttributeType.  Naturally this failed as soon as a record with a null INT column was encountered!  Of course, this table design also means we cannot apply Foreign Key constraints from the attribute value columns to the tables they refer to (such as Customer).
  • Queries against these tables by the Risk team typically produced one very wide denormalised result.  The SQL necessary to achieve this required 30 or more joins back to the same attribute table just to get those main columns, in addition to joins to many other tables.  In short, their queries were often hugely repetitive, lengthy, and slow to run (a sketch of this query shape follows after this list);
  • No consideration had been made regarding key overhead.  A single application would lead to (say) 31 rows of data being stored (1 master row with 30 child rows), with each row having an 8-byte Id column and many null columns on all those attribute rows.  In addition, I recall there may have been some other overhead in the form of ‘UpdatedAt’ column values on each and every attribute record.
    In practice, in this particular scenario, this was itself unnecessary overhead, as the time was always the same as (or within a few milliseconds of) the time stored on the parent Application table!  One of the big potential benefits of this data structure would have been realised if data had been arriving over an extended period of time… but in actual fact this data was only created at one time (per application).
  • Although it is somewhat covered by the previous point, the developer had made a mistake in thinking that a fixed table structure with some nullable columns ‘would be wasteful’.  There is a small overhead to store null values in SQL Server – something like ‘2 bytes per record plus 1 bit per nullable column’ (from memory) – but that slight overhead is very efficient compared to their solution of many records, each carrying the key overhead discussed above, which themselves relied on NULL column values!
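To give a flavour of the reporting queries mentioned above, here is a sketch against my reconstructed schema, pivoting just two attributes back into columns and joining to the Customer table; the real queries repeated this pattern for 30 or so attributes (again, all names are illustrative):

    SELECT  a.ApplicationId,
            cid.IntValue   AS CustomerId,
            score.IntValue AS CreditScore,
            c.Surname
    FROM    Application a
    JOIN    ApplicationAttribute cid   ON cid.ApplicationId = a.ApplicationId
    JOIN    AttributeType cidType      ON cidType.AttributeTypeId = cid.AttributeTypeId
                                       AND cidType.Name = 'CustomerId'
    JOIN    ApplicationAttribute score ON score.ApplicationId = a.ApplicationId
    JOIN    AttributeType scoreType    ON scoreType.AttributeTypeId = score.AttributeTypeId
                                       AND scoreType.Name = 'CreditScore'
    -- Nothing forces the optimiser to apply the AttributeType filters before
    -- it attempts this join, whether the query is written flat or with subqueries.
    JOIN    Customer c                 ON c.CustomerId = cid.IntValue;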

Summary: I don’t want to be too critical of the design, because in some respects it seems ‘right’: when you start thinking about applications with many types of attributes, and perhaps are not certain what you may be asked to store in the future, this sort of table structure may naturally enter your thoughts.  As it turned out, however, this system did not experience much change over the next couple of years… so the supposed flexibility turned out to be largely unnecessary.

What We Did Next

Moving forward a couple of years, the business decided to change the credit rating agency it used, and also the way applications were scored… and it became apparent that the credit-score data (and many other fields) were ‘similar, but different’.  As a trivial example, both agencies provided a credit score, but the two were not equatable or on the same scale, so all the risk assessment code had to be changed to refer to the new type of score.  In short, everything of importance within this subsystem was changing, and even though the existing table structures could hold the new data with the new names exactly as they were designed to do, it became apparent that everything else needed to change, and thus the whole approach to the problem should be reviewed.

In short, we picked a design with a single table holding all the original columns from the ‘Application’ table plus the approximately 30 columns required for Risk analysis.

A Single Table
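From memory, and again with illustrative names, the replacement looked roughly like the sketch below; I will call it LoanApplication here purely to distinguish it from the earlier sketch.  The point is that each Risk field became an ordinary, typed, mostly-nullable column:

    CREATE TABLE LoanApplication (
        ApplicationId BIGINT IDENTITY(1,1) PRIMARY KEY,
        CreatedAt     DATETIME NOT NULL,
        -- ...the original application-level columns, plus roughly 30 typed
        -- columns needed by Risk, for example:
        CustomerId    INT NULL REFERENCES Customer (CustomerId),
        CreditScore   INT NULL,
        MonthlyIncome DECIMAL(18, 2) NULL
    );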

These were some of the pros and cons:

  • We saved a considerable amount of ‘insert’ time: all the data for a single loan application could be written with a single round-trip to the database (I would like to claim a 30x speed boost, but in honesty I don’t recall ever proving that);
  • In a general sense, the table was considerably easier to query because it did not take 30 joins just to get the basic 30 (or so) columns;
  • In addition, Foreign Key constraints could be enforced, and we removed all possibility of problems in table joins based on the order of query processing; now we had a CustomerId column that might be null in some circumstances, but if it had a value it was always safe to use it to join to the Customer table;
  • Simple to revise: in the following year or two I think we did adjust one column to be nullable because its meaning changed, and added a new column with the slightly adjusted meaning (see the sketch after this list).  Old records kept whatever value they had in the original column; newer records always set that to null and optionally used the newer column.  While it is true that a similar change in the prior system would not have needed these particular data-model changes, it would still have meant consequential changes to rows in the AttributeType table, and all the complex reporting queries would still have needed their values changing.  In contrast, under the newer system only minor changes to the field name in use were necessary, and because the queries were much smaller they were easier to read, understand, and modify!
  • Perhaps the biggest downside with this approach is the fact that there may be a time when you will need a completely new table structure… and potentially another time after that when the process will need to repeat, and so on!  I will address this concern below.
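The revision mentioned in the list above was then just a small schema change plus edits to the (now much shorter) queries.  Something along these lines, with hypothetical column names, since I no longer remember the real ones:

    -- The old column becomes nullable; a new column carries the adjusted meaning.
    ALTER TABLE LoanApplication ALTER COLUMN RepaymentRiskBand VARCHAR(10) NULL;
    ALTER TABLE LoanApplication ADD RepaymentRiskScore DECIMAL(9, 4) NULL;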

Summary

I want to comment first on one thing that the first approach got right, in my opinion.  I believe it was correct that key data should be parsed, verified, and stored in specifically-typed fields early in the process.  While I am sure that some people have used Xml types and indexes to query xml sources in table columns efficiently (or ‘document databases’ or whatever), I have yet to see it.  Without going into detail on the alternatives to that, or the problems of not doing it, it can be very worthwhile to pre-process data the ‘right amount’ to find the balance between up-front processing cost, storage cost, and querying cost (and the number of times data will be re-queried and re-examined).  In this case, much of this data was written once, but re-read multiple times over the following years.
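For completeness, this is the sort of thing I mean by querying the XML in place.  SQL Server 2005 onwards does let you shred values out of an xml column with XQuery, and XML indexes can speed that up, but in my experience it is no substitute for parsing the key fields early.  A minimal sketch, assuming a hypothetical CreditCheckResult table holding the raw response in a FullResult xml column:

    -- Pull a single value out of the stored XML at query time.
    SELECT  r.ApplicationId,
            r.FullResult.value('(/Report/CreditScore)[1]', 'INT') AS CreditScore
    FROM    CreditCheckResult AS r;

    -- An XML index can help, at the cost of extra storage and maintenance
    -- (it requires a clustered primary key on the table).
    CREATE PRIMARY XML INDEX IX_CreditCheckResult_FullResult
        ON CreditCheckResult (FullResult);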

I noted that the newer solution had the unfortunate side-effect of potentially requiring a new table / data-model every time there are major changes to the system… but I believe that we should try to resist that concern.  My reasoning is that, given enough time, any system is liable to change in ways that cannot reasonably be predicted now.  Perhaps your application volume will increase 10 or 100 times, forcing changes to enhance performance or to change the way data is stored or archived.  Perhaps volume will shrink and you will be forced to apply all sorts of ‘cost saving’ measures.  Or perhaps the provider of your data will change and things will need to be reviewed when (and only when) that occurs.  Therefore, with that one concern addressed, I hope I have convinced you that this latter ‘simple and naive’ (dare I say ‘agile’?) solution is actually the better one, at least for this scenario.  In fact, we even saw some immediate benefit from this strategy: it provided a very natural deprecation of the old tables, which allowed us to tidy up the database without getting bogged down in extensive archiving issues.  We simply passed the ‘old’ tables to the Risk team for their use, and removed them from the production database.

I am sure that there are circumstances where having the capability to store variable types of data is advantageous, and a solution like the one originally chosen here may be necessary — perhaps if the changes are far more regular than the ones we experienced.  As usual, ‘your mileage may vary’!

Finally, would I do anything different with later versions of SQL Server?  I suspect not; though I would certainly research the possible use of Sparse Columns, the number of columns in use at this client was probably not large enough to warrant them, and the solution we used would still hold.
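For reference, a Sparse Column is just an ordinary column declared with the SPARSE keyword (SQL Server 2008 onwards); it costs almost nothing to store when NULL, in exchange for a little extra overhead when populated.  A hedged sketch, not something we ever needed here:

    CREATE TABLE WideRiskData (
        ApplicationId   BIGINT PRIMARY KEY,
        -- Sparse columns are cheap when NULL, slightly more expensive when not;
        -- they only pay off when a column is NULL for the vast majority of rows.
        RarelyUsedScore INT SPARSE NULL,
        RarelyUsedFlag  BIT SPARSE NULL
    );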