Last week, someone asked me how I go about producing a data model. I found out afterwards that my answer was considered too brief. So here’s a longer version of my answer.
The first thing to consider is the purpose of the modelling. Sometimes there is a purely technical agenda – for example, updating or reengineering some data platform – but usually there are some business requirements and opportunities – for example, making the organization more data-driven. I prefer to start by looking at the business model – what services does it provide to its customers, what capabilities or processes are critical to the organization, what decisions and policies need to be implemented, and what kinds of evidence and feedback loops can improve things. From all this, we can produce a high-level set of data requirements – what concepts, how interconnected, at what level of granularity, etc. – and then work top-down from a conceptual data model to more detailed logical and physical models.
But there are usually many data models in existence already – which may be conceptual, logical or physical. Some of these may be formally documented as models, whether using a proper data modelling tool or just contained in various office tools (e.g. Excel, PowerPoint, Visio, Word). Some of them are implicit in other documents, such as written policies and procedures, or can be inferred (“reverse engineered”) from existing systems and from the structure and content of data stores. Some concepts and the relationships between them are buried in people’s heads and working practices, and may need to be elicited.
And that’s just inside the organization. When we look outside, there may be industry models and standards, such as ACORD (insurance) and GS1 (groceries). There may also be models pushed by vendors and service/platform providers – IBM has been in this game longer than most. There may also be models maintained by external stakeholders – e.g., suppliers, customers, regulators.
There are several points to make about this collection of data models.
- There will almost certainly be conflicts between these models – not just differences in scope and level/granularity, but direct contradictions.
- And some of these models will be internally inconsistent. Even the formal ones may not be perfectly consistent, and the inferred/elicited ones may be very muddled. The actual content of a data store may not conform to the official schema (data quality issues).
- You probably don’t have time to wade through all of them, although there are some tools that may be able to process some of these automatically for you. So you will have to be selective, and decide which ones are more important.
- In general, your job is not simply to reproduce these models (minus the inconsistencies) but to build models that will support the needs of the business and its stakeholders. So looking at the existing models is necessary but not sufficient.
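That last point about data stores not conforming to their official schema is easy to check mechanically. Here is a minimal sketch – the schema rules and the sample records are entirely hypothetical:

```python
# Hypothetical "official" schema: field name -> expected Python type.
schema = {
    "customer_id": int,
    "email": str,
    "credit_limit": float,
}

def conformance_report(records, schema):
    """List the ways in which actual records violate the declared schema."""
    violations = []
    for i, record in enumerate(records):
        for field, expected in schema.items():
            if field not in record:
                violations.append((i, field, "missing"))
            elif not isinstance(record[field], expected):
                violations.append((i, field, f"expected {expected.__name__}"))
    return violations

records = [
    {"customer_id": 1, "email": "a@example.com", "credit_limit": 500.0},
    {"customer_id": "2", "email": "b@example.com"},  # wrong type + missing field
]
report = conformance_report(records, schema)
```

A report like this tells you how far you can trust the official schema as a description of the data you actually hold – which in turn tells you how much weight to give that model when deciding which ones are important.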
So why do you need to look at the “legacy” models at all? Here are the main reasons.
- Problems and issues that people may be experiencing with existing systems and processes can often be linked to problems with the underlying data models.
- Inflexibility in these data models may constrain future business strategies and tactics.
- New systems and processes typically need to transition from existing ones – not just data migration but also conceptual migration (people learning and adopting a revised set of business concepts and working practices) – and/or interoperate with them (data integration, joined-up business).
- Some of the complexity in the legacy models may be redundant, but some of it may provide clues about complexity in the real world. (The fallacy of eliminating things just because you don’t understand why they’re there is known as Chesterton’s Fence. See my post on Low-Hanging Fruit.) The requirements elicitation process typically finds a lot of core requirements, but often misses many side details. So looking at the legacy models provides a useful completeness check.
If your goal is to produce a single, consistent, enterprise-wide data model, good luck with that. I’ll check back with you in ten years to see how far you’ve got. Meanwhile, the pragmatic approach is to work at multiple tempos in parallel – supporting short-term development sprints, refactoring and harmonizing in the medium term, and maintaining steady progress towards a longer-term vision. This means accepting that all models are wrong, and prioritizing the things that matter most to the organization.
The important issues tend to be convergence and unbundling. Firstly, while you can’t expect to harmonize everything in one go, you don’t want things to diverge any further. And secondly, where two distinct concepts have been bundled together, you want to tease them apart – at least for future systems and data stores – for the sake of flexibility.
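To illustrate unbundling: a common case is a legacy "customer" record that conflates the party (the person or organization) with the account (the commercial relationship). Teasing them apart lets each concept evolve independently – one party can then hold several accounts, for instance. The record shape and names below are hypothetical:

```python
from dataclasses import dataclass

# Legacy shape: two concepts bundled into one record.
legacy_customer = {
    "name": "Ada Lovelace",
    "email": "ada@example.com",
    "account_number": "AC-1001",
    "account_status": "active",
}

@dataclass
class Party:        # the person or organization
    name: str
    email: str

@dataclass
class Account:      # the commercial relationship
    number: str
    status: str

def unbundle(record):
    """Split a bundled legacy record into the two underlying concepts."""
    return (Party(record["name"], record["email"]),
            Account(record["account_number"], record["account_status"]))
```

The unbundled model costs a join, but it no longer forces the false assumption that a party and an account are one-to-one.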
Finally, how do I know whether the model is any good? On the one hand, I need to be able to explain it to the business, so it had better not be too complicated or abstract. On the other hand, it needs to be able to reflect the real complexity of the business, which means testing it against a range of scenarios to make sure I haven’t embedded any false or simplistic assumptions.
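One lightweight way to do that scenario testing is to treat each scenario as a piece of real-world data the model must be able to represent, and check the model against all of them. This toy sketch deliberately embeds a simplistic assumption – one address per customer – that the scenarios are designed to expose; all the names are hypothetical:

```python
def can_represent(model_fields, scenario):
    """Can a flat record with these fields hold the scenario's data?"""
    return all(key in model_fields for key in scenario)

customer_fields = {"name", "address"}           # one address, baked in

scenarios = [
    {"name": "Ada"},                             # minimal customer
    {"name": "Ada", "address": "1 Main St"},     # the happy path
    {"name": "Ada", "billing_address": "1 Main St",
     "delivery_address": "2 High St"},           # two distinct addresses
]

failures = [s for s in scenarios if not can_represent(customer_fields, s)]
```

The third scenario fails because the model bundles "address" into a single field – exactly the kind of simplistic assumption that only surfaces when you push realistic cases through the model rather than admiring it on a whiteboard.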
Longer answers are also available. Would you like me to run a workshop for you?
Declaration of interest – in 2008(?) I wrote some white papers for IBM concerning the use of their industry models.