Controls and Trust

I am appalled, as I look at systems in the various companies I have consulted with or been employed by, at the lack of system controls in key places. If you are in the data delivery business and you have agreements with your customers, wouldn’t you want to know that you are meeting your service level agreements? Or, better still, know in advance when you are not going to meet them (for whatever reason), so that you can issue warnings or do something about it?

Similarly, when data flows through from one system to another, can you be reasonably assured that everything that was supposed to be processed actually was?

Do you count your cash after going to the ATM? Maybe the machine didn’t deliver correctly because a couple of notes were stuck together. Maybe a new software version caused a miscount under some weird circumstances. The ATM is a “black box” to me. That means that at its boundaries I have to decide what my trust relationship with it will be.

So when I have systems which are supposed to communicate in some way (e.g. by passing data) what controls should be in place to make sure everything is properly accounted for? Should a sending system keep a count of what it has sent? Should receiving systems similarly keep track? How do we reconcile? Should the reconciliation be in-band? Should it be out-of-band? Is logging adequate? Do we have to account for the “value” of the transmission as well as just counts? What tolerances matter if we are concerned with value (perhaps one system rounds off the value differently from another so at the end of the day the total value has a discrepancy)?
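To make a couple of those questions concrete, here is a minimal sketch of a count-and-value reconciliation with a tolerance. Everything in it is an assumption made for illustration: the names `ControlTotals` and `reconcile`, the idea that each side keeps its own running tally, and the tolerance figures. It is one possible shape for such a control, not a recommendation for any particular system.

```python
from dataclasses import dataclass
from decimal import Decimal, ROUND_HALF_UP


@dataclass
class ControlTotals:
    """Running tally kept independently by each side of an interface (hypothetical)."""
    count: int = 0
    value: Decimal = Decimal("0")

    def record(self, amount: Decimal) -> None:
        self.count += 1
        self.value += amount


def reconcile(sent: ControlTotals, received: ControlTotals,
              value_tolerance: Decimal) -> list[str]:
    """Return a list of discrepancies; an empty list means the two sides balance."""
    problems = []
    if sent.count != received.count:
        problems.append(
            f"count mismatch: sent {sent.count}, received {received.count}")
    drift = abs(sent.value - received.value)
    if drift > value_tolerance:
        problems.append(
            f"value mismatch: drift {drift} exceeds tolerance {value_tolerance}")
    return problems


# The sender keeps three decimal places; the receiver rounds each amount to two.
sender, receiver = ControlTotals(), ControlTotals()
for raw in ["10.005", "20.005", "30.00"]:
    amount = Decimal(raw)
    sender.record(amount)
    receiver.record(amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))

print(reconcile(sender, receiver, value_tolerance=Decimal("0.02")))  # balanced
print(reconcile(sender, receiver, value_tolerance=Decimal("0")))     # drift reported
```

The counts agree in both runs, so only the value check is sensitive to how the two systems round. Whether a small drift is acceptable is a business decision, which is why the tolerance is a parameter rather than a constant.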

This need for controls is exacerbated by systems that use Events as the primary means of notification. At the individual event level we can indeed count, maintain value, etc., but often the controls need to be at an aggregate level. One would think that in, for example, an airline boarding system, as long as every boarding event is properly received by the “flight”, the system should be in balance. Try telling that to Easyjet. There is a manual control system whereby the Flight Attendants actually count the number of passengers on the plane and attempt to reconcile that with the “expected” number. How the expected number is derived, I have no idea. It could simply be the number of boarding cards collected (but what about electronic boarding?). It could be the “system’s” view of how many bums there should be on seats. Whatever it is, it doesn’t appear to be reliable. Chris Potts (Twitter @chrisdpotts) told me the story of what happens when the count is wrong: they recount, they look for people in bathrooms, they delay the flight. It’s all a mess.
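As a rough illustration of an aggregate control over events, the sketch below compares the boarding events a “flight” received against an independently derived expected list. The function name `reconcile_events` and the passenger identifiers are hypothetical, and I am not suggesting this is how any airline actually does it. The point is only that a bare head count tells you something is wrong, while an identifier-level comparison tells you what.

```python
from collections import Counter


def reconcile_events(expected_ids, received_ids):
    """Compare received events against an independently derived expected set."""
    seen = Counter(received_ids)   # how many times each event id arrived
    expected = set(expected_ids)
    received = set(seen)
    return {
        "missing": sorted(expected - received),      # expected but never arrived
        "unexpected": sorted(received - expected),   # arrived but not expected
        "duplicated": sorted(e for e, n in seen.items() if n > 1),
    }


# Hypothetical flight: four passengers expected, four boarding events received.
expected = ["P1", "P2", "P3", "P4"]
received = ["P1", "P2", "P2", "P5"]

report = reconcile_events(expected, received)
print(len(expected) == len(received))  # the raw counts agree...
print(report)                          # ...but the detail does not
```

Here the counts match even though two passengers are missing, one event was delivered twice, and one passenger was not expected at all, which is exactly the kind of discrepancy a count-only control would miss.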

In the 1960s, when phone phreaking was at its peak, people could make free calls because the control signals (tones) that managed the connection were carried in the same band as the call itself. When the system detected a signal tone (and you could get whistles that generated these tones), it went into a signalling state. By sending the correct sequence of tones you could then set up a free call. The simple fix: put the controls out of band from what you want to transmit.

In a properly reliable infrastructure, the appropriate controls should be built in from the beginning. Again, you may ask, “What’s this got to do with Enterprise Architecture?” I argue that it has a great deal to do with the architecture of the enterprise. Good controls make for good compliance and a high level of confidence in our business practices. Bad controls can make your corporation star in places you don’t want it to be: on the front page of the WSJ, or in anecdotes passed around the social networks, with a resulting loss of confidence in your organization.