In this post, I’m going to look at three types of data, and the implications for data management. For the purposes of this story, I’m going to associate these types with three contrasting metals: bronze, gold and mercury. (Update: fourth type added – scroll down for details.)
The first type of data represents something that happened at a particular time. For example, transaction data: this customer made this purchase of this product on this date. This delivery was received, this contract was signed, this machine was installed, this notification was sent.
Once this kind of data is correctly recorded, it should never change. Even if an error is detected in a transaction record, the usual procedure is to add two more transaction records – one to reverse out the incorrect values, and one to re-enter the correct values.
For many organizations, this represents by far the largest portion of data by volume. The main data management challenges tend to be focused on the implications of this – how much to collect, where to store it, how to move it around, how soon can it be deleted or archived.
The second type of data represents current reality. This kind of data must be promptly and efficiently updated to reflect real-world changes. For example, the customer changes address, an employee moves to a different department. Although the changes themselves may be registered as Bronze Data, what we really want to know is where the customer now resides and where Sam now works.
Some of these updates can be regarded as simple facts, based on observations or reports (a customer tells us her address). Some updates are derived from other data, using calculation or inference rules. And some updates are based on decisions – for example, the price of this product shall be X.
And not all of these updates can be trusted. If you receive an email from a supplier requesting payment to a different bank account, you probably want to check that this email is genuine before updating the supplier record.
Gold Data typically involves much smaller volumes than Bronze, but getting it wrong is far more costly to the business.
Finally, we have data with a degree of uncertainty, including estimates and forecasts. This data is fluid: it can move around for no apparent reason. It can be subjective or based on unreliable or partial sources. Nevertheless, it can be a rich source of insight and intelligence.
This category also includes projected and speculative data. For example, we might be interested in developing a fictional “what if” scenario – what if we opened x more stores, what if we changed the price of this product to y?
For some reason, an estimate that is generated by an algorithm or mathematical model is sometimes taken more seriously than an estimate pulled out of the air by a subject matter expert. However, as Cathy O’Neil reminds us, algorithms are themselves merely opinions embedded in code.
If you aren’t sure whether to trust an estimate, you can scrutinize the estimation process. For example, you might suspect that the subject matter expert provides more optimistic estimates after lunch. Or you could just get a second opinion. Two independent but similar opinions might give you more confidence than one extremely precise but potentially flawed opinion.
As well as estimates and forecasts, Mercury data may include assessments of various kinds. For example, we may want to know a customer’s level of satisfaction with our products and services. Opinion surveys provide some relevant data points, but what about the customers who don’t complete these surveys? And what if we pick up different opinions from different individuals within a large customer organization? In any case, these opinions change over time, and we may be able to correlate these shifts in opinion with specific good or bad events.
Thus Mercury data tend to be more complex than Bronze or Gold data, and can often be interpreted in different ways.
@tonyjoyce suggests a fourth type.
This is good. I’d like to suggest there is another elemental type, in the taxonomy we use for structures. A substrate for complex data like addresses. Call it glass, for it shatters when stressed.
— tonyjoyce (@tonyjoyce) July 10, 2020
This is a great insight. If you are not careful, you will end up with pieces of broken glass in your data. While this kind of data may be necessary, it is fragile: it has to be treated with due care, and can’t just be chucked around like bronze or gold.
Single Version of Truth (SVOT)
Bronze and Gold data usually need to be reliable and consistent. If two data stores have different addresses for the same customer, this could indicate any of the following errors.
- The data in one of the data stores is incorrect or out-of-date.
- It’s not the same customer after all.
- It’s not the same address. For example, one is the billing address and the other is the delivery address.
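The three possibilities above can be checked in order before assuming a data quality error. A toy sketch, assuming hypothetical record fields such as `address_type`:

```python
def diagnose_address_conflict(rec_a: dict, rec_b: dict) -> str:
    """Classify why two stores hold different addresses for the 'same' customer."""
    if rec_a["customer_id"] != rec_b["customer_id"]:
        return "different customers"       # not the same customer after all
    if rec_a.get("address_type") != rec_b.get("address_type"):
        return "different address types"   # e.g. billing vs delivery
    if rec_a["address"] != rec_b["address"]:
        return "stale or incorrect data"   # one store is wrong or out of date
    return "consistent"

billing = {"customer_id": 42, "address_type": "billing", "address": "1 High St"}
delivery = {"customer_id": 42, "address_type": "delivery", "address": "9 Dock Rd"}
result = diagnose_address_conflict(billing, delivery)  # "different address types"
```

Only the third outcome is a genuine error to be corrected; the first two are modelling questions, not data quality ones.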
For the purposes of data integrity and interoperability, we need to eliminate such errors. We then have a single version of the truth (SVOT), possibly taken from a single source of truth (SSOT).
Facts and derivations may be accurate or inaccurate. In the case of a simple fact, inaccuracy may be attributed to various causes, including translation errors, carelessness or dishonesty. Calculations may be inaccurate either because the input data are inaccurate or incomplete, or because there is an error in the derivation rule itself. (However, the derived data can sometimes be more accurate or useful, especially if random errors and variations are smoothed out.)
For decisions however, it doesn’t make sense to talk about accuracy / inaccuracy, except in very limited cases. Obviously if someone decides the price of an item shall be x pounds, but this is incorrectly entered into the system as x pence, this is going to cause problems. But even if x pence is the wrong price, arguably it is what the price is until someone fixes it.
Plural Version of Truth (PVOT)
But as I’ve pointed out in several previous posts, the Single Version of Truth (SVOT) or Single Source of Truth (SSOT) isn’t appropriate for all types of data. Particularly not Mercury Data. When making sense of complex situations, having alternative views provides diversity and richness of interpretation.
Analytical systems may be able to compare alternative data values from different sources. For example, two forecasting models might produce different estimates of the expected revenue from a given product. Intelligent use of these estimates doesn’t entail choosing one and ignoring the other. It means understanding why they are different, and taking appropriate action.
Or what about conflicting assessments? If we are picking up a very high satisfaction score from some parts of the customer organization, and a low satisfaction score from other parts of the same organization, we shouldn’t simply average them out. The difference between these two scores could be telling us something important; it might be revealing an opportunity to engage differently with the two parts of the customer.
And for some kinds of Mercury Data, it doesn’t even make sense to ask whether they are accurate or inaccurate. Someone may postulate x more stores, but this doesn’t imply that x is true, or even likely; it is merely speculative. And this speculative status is inherited by any forecasts or other calculations based on x. (Just look at the discourse around COVID data for topical examples.)
Master Data Management (MDM)
The purpose of Master Data Management is not just to provide a single source of data for Gold Data – sometimes called the Golden Record – but to provide a single location for updates. A properly functioning MDM solution will execute these updates consistently and efficiently, and ensure all consumers of the data (whether human or software) are using the updated version.
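The "single location for updates" idea can be sketched as a hub that consumers subscribe to. This is a deliberately simplified, in-memory sketch of the pattern, not a real MDM product; all names are invented:

```python
from typing import Callable, Dict, List

class MasterDataHub:
    """Single location for updates: consumers subscribe, and every update propagates."""

    def __init__(self) -> None:
        self._records: Dict[str, dict] = {}
        self._subscribers: List[Callable[[str, dict], None]] = []

    def subscribe(self, callback: Callable[[str, dict], None]) -> None:
        self._subscribers.append(callback)

    def update(self, key: str, fields: dict) -> None:
        record = self._records.setdefault(key, {})
        record.update(fields)
        for notify in self._subscribers:  # all consumers see the same golden record
            notify(key, dict(record))

hub = MasterDataHub()
seen = []
hub.subscribe(lambda key, rec: seen.append((key, rec)))
hub.update("supplier-17", {"bank_account": "GB00-EXAMPLE"})
```

In a real MDM solution the `update` method is where the controls live – who is allowed to change a supplier’s bank account details, and what verification is required first.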
There is an important connection to draw out between master data management and trust.
In order to trust Bronze Data, we simply need some assurance that it is correctly recorded and can never be changed. (“The moving finger writes …”) In some contexts, a central authority may be able to provide this assurance. In systems with no central authority, Blockchain can guarantee that a data item has not been changed, although Blockchain alone cannot guarantee that it was correctly recorded in the first place.
For Gold Data, trustworthiness is more complicated, as there will need to be an ongoing series of automatic and manual updates. Master data management will provide the necessary sociotechnical superstructure to manage and control these updates. For example, what are the controls on updating a supplier’s bank account details?
There will always be requirements for data integrity between Bronze Data and Gold Data. Firstly, there will typically be references from Bronze Data to Gold Data. For example, a transaction record may reference a specific customer purchasing a specific product. And secondly, there may be attributes of the Gold Data that are updated as a result of each transaction. For example, the stock levels of a product will be affected by sales of that product.
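Both kinds of integrity can be seen in a small sketch: the Bronze event must reference an existing Gold record, and the Gold record carries an attribute derived from the events. Again, the names are illustrative:

```python
products = {"widget": {"stock": 100}}  # Gold: current reality
sales_log = []                         # Bronze: immutable events

def record_sale(product_id: str, quantity: int) -> None:
    """A Bronze event references a Gold record and triggers a Gold update."""
    if product_id not in products:
        raise KeyError(f"unknown product {product_id!r}")  # referential integrity
    sales_log.append({"product": product_id, "qty": quantity})  # append-only
    products[product_id]["stock"] -= quantity  # derived Gold attribute

record_sale("widget", 3)
# products["widget"]["stock"] is now 97
```

The sales log is append-only (Bronze), while the stock level is updated in place (Gold) – the two disciplines coexist but stay distinct.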
However, as we’ve seen, the data management challenges of Bronze Data are not the same as the challenges for Gold Data. And the challenges of Mercury Data are different again. So it is better to focus your MDM efforts exclusively on Gold Data. (And avoid splinters of Glass.)
Post prompted by a discussion on LinkedIn with Robert Daniels-Dwyer, Steve Fisher and Steve Lenny. https://www.linkedin.com/posts/danielsdwyer_dataarchitecture-datamanagement-enterprisearchitecture-activity-6673873980866228224-Mu1O
Updated 15 July 2020 following suggestion by Tony Joyce.