As a SQL Server DBA I was fortunate enough to encounter the idea of a numbers table — principally due to the work of Itzik Ben Gan (e.g. in books like ‘Inside Microsoft SQL Server 2005: T-SQL Querying’)… but I’ve not found many internet resources for a numbers tables in PostgreSQL. I’m working with Postgres at the moment, so I thought it would be good to provide a little introduction.
What is a Numbers Table?
Quite simply; a table with sequential integer numbers in; typically starting from 0 or 1 (or perhaps, both, as we shall see)… we can then use this table to join to other tables to provide useful functionality.
Many sources on testing imagine that the developer, or someone close-by, knows exactly how the system works and how it behaves, and imagines that the writing of tests naturally means supplying known inputs to some routine, and checking the results against the expected outputs. But the truth is that for some code, what it does and how it works is effectively unknown … not only do we not know what it does, we don’t even know if it does it correctly!
A ‘Characterization Test’ is a test that you write not knowing what the code under test does or what the correct operation is. The tests will be written in conjunction with examining the code, writing test expectations in response to seeing the actual results; turning the normal expectation for writing tests on its head.
It’s been a while since I posted here… and I just wanted to document something I thought about quite a bit when I was hiring SQL Server Database Administrators (DBA); not every DBA has the same interests, and not every workplace has the same needs for a DBA, and while I initially expected certain things of some DBAs, I was often surprised.
So, please allow me to introduce you to the two main classes of DBA (in my humble opinion, from a limited sample size, etc etc)! Continue reading
I was reminded earlier today of the famous Donald Rumsfeld press conference in which he said:
“Reports that say that something hasn’t happened are always interesting to me, because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – the ones we don’t know we don’t know. And if one looks throughout the history of our country and other free countries, it is the latter category that tend to be the difficult ones.”
There was a considerable uproar in the press at the time, which mostly seemed to focus on these words as if they were a radical new idea… but I think I just considered his words to represent a broad concept that seemed very familiar… probably so familiar in fact that no-one ever says them!
Today I am continuing my reminiscence over system design problems I have addressed in the past, this time looking specifically at a data-structure a colleague of mine used in a particular situation, and the consequences of that decision. Let’s begin!
The scenario was that of running expensive credit-checks for loan applicants. The ‘Risk’ team wanted easy access to the results, both for processing on-the-fly as part of the loan application decision process, and also for future analysis to see if they could identify markers for successful repayment. Credit-checks like the ones my client used are expensive, and can return hundreds of varying fields in complex data structures and the general thinking about the process was something like this:
- There are about 30 key fields we regularly retrieve (but not quite always) and we would like access to those fields to be easy;
- We may need to add more important fields to this list of easy-access fields in future, so we don’t want to commit to a fixed structure – ‘that would be wasteful’;
- We have to store the full result set for later analysis as XML, as we paid a lot for it and it may prove to have useful fields deep in the belly which we may need to access one day.
While I am writing this in 2017, the environment I am referring to was at a client from about 2008 – 2010, at a time when the client was using SQL Server 2005. While later versions of SQL Server have introduced functionality that might have helped us here (such as Sparse Columns) they were not available to us then.
I was reminiscing over my last two main roles (which together account for 10 years!) and I reminded myself of some of the challenges that I faced in that time. In two separate businesses I found coding styles and system-design scenarios that seemed to have been borne from some nugget of wisdom that seemed to me to have been miunderstood, and had led to code that had huge problems. As ever, I do not want to pretend that the lessons I learnt from these scenarios can always be applied; I imagine that in-general my conclusion will be something like ‘your mileage may vary’… but I hope that I can at least demonstrate some harm came about from these misapprehension in the given scenario.
Today I want to discuss the idea that ‘SELECT *’ is bad; and what I really mean in this context is that ‘selecting all the data from a record in a database table is wasteful‘… with a possible additional proviso added ‘…if you don’t need it all‘. To try and put that into a wider context; the ideal some people have in mind is that we should only retrieve data from a database if we really need to use it. It is probably also appropriate to mention that the environment I am referring to was one in which only very simple Object Relational Mapping (ORM) was in-use, meaning that selection of records was generally hand-written and mostly referred to a single table at a time, and not companies with huge volumes of database use.
I believe that this view leaves out several key factors; which will hopefully be obvious once described: Continue reading
The Client had windows services that processed messages from subscribed devices. There were two services: one which did the main processing of messages that primarily processed details of the message type and time, and the second service processed GPS positions associated with each message. For example: a message might indicate that a journey had ended, and the first service would calculate the consequences for ending a journey, and the second service would figure out where the journey had ended and the consequences of that.
In general, messages came in no more than once per minute per device, and the result was that most of the time, processing a message had plenty of opportunity to be completely handled by both services before the next message came in. However, devices sometimes failed to send in messages in a timely fashion and might end up sending in many at once. In these circumstances, even though individual messages were always processed in the correct order within a given service, the pair of processes could not guarantee the correct order to process all aspects of each message, so outcomes might vary from one day to the next. The result of these complex interactions was odd results to processing that could only be understood when considering the potential progess of two different services through a number of messages at a given point in time hours or days ago! Continue reading
Sometimes the success or failure of a new system function can end up depending on one feature; and if that feature or function is missing the whole reason for the system existing can be undermined.
This system we are looking at today is one which can track vehicles, and one feature is the ability to plot journeys on a map in a web-browser; it’s actually very cool, but the plotting of the route was a little slow. Continue reading
Some of the earlier posts on this blog use ‘we’ and ‘us’ to hint or suggest that 42 IT Solutions has a team of like-minded people working for a better future in IT. I’m sure there were any number of clues that this was ‘enterprise-speak’; but as things stand 42 IT Solutions is effectively only one person: me, Nigel Rheam.
Received an email from Amazon today, this email explaining that delivery of my parcel was delayed demonstrates the dangers of duplicate data, I think:
29 September 2016 is actually a Thursday!
The duplication I am thinking of is that in addition to providing the date; 29/09/2016, they have also chosen to provide a refinement; ‘Friday’.