failure – EA Voices

When the Single Version of Truth Kills People

April 22, 2019 (updated May 2, 2019) by Richard Veryard

@Greg_Travis has written an article on the Boeing 737 Max Disaster, which @jjn1 describes as “one of the best pieces of technical writing I’ve seen in ages”. He explains why normal airplane design includes redundant sensors. “There are two sets of angl…

Be the Change

April 28, 2018 by Richard Veryard

Anyone fancy a job as Head of Infrastructure? Here is the job description, posted to LinkedIn earlier this week. We’re responsible for “IT Change”, including the end to end architecture, deployment and maintenance of IT infrastructure technologies acro…

Fail Fast – Why did the Chicken cross the road?

March 10, 2018 (updated April 26, 2019) by Richard Veryard

A commonly accepted principle of architecture and engineering is to avoid a single point of failure (SPOF). A single depot for a chain of over 850 fast food restaurants could be risky, as KFC was warned when it announced that it was switching its logis…

Form Follows Function on SPaMCast 446

June 12, 2017 by Gene Hughson

It’s time for another appearance on Tom Cagley’s Software Process and Measurement (SPaMCast) podcast. This week’s episode, number 446, features Tom’s essay on questions, a powerful tool for coaches and facilitators. A Form Follows Function installment based on my post “Go-to People Considered Harmful” comes next and Kim Pries rounds out the podcast with a […]

Go-to People Considered Harmful

February 22, 2017 by Gene Hughson

Okay, so the title’s a little derivative, but it’s both accurate and it fits in with the “organizations as systems” theme of recent posts. Just as dependency management is important for software systems, it’s likewise just as critical for social systems. Failures anywhere along the chain of execution can potentially bring the whole system to […]

Single Point of Failure (Comms)

September 2, 2016 (updated September 6, 2016) by Richard Veryard

Large business-critical systems can be brought down by power failure. My previous post looked at Airlines. This time we turn our attention to Telecommunications.

If someone said you had to accept an unreliable electricity supply as the price of innovation in appliances, you’d laugh. #NotNeutrality

— Martin Geddes (@martingeddes) August 8, 2016

More misery for BT broadband users after new power cut. Looks like ‘no single point of failure’ is an alien concept. https://t.co/mOobFidWe4

— Chris Tripp (@ChrisJTripp) July 21, 2016

It would be interesting to know where the single point of failure was in their power protection plan. https://t.co/zuaTm1z4tK

— Robin Koffler MBA (@robin_koffler) July 21, 2016

2G and 3G data services from @EE are down after a power outage. Details: https://t.co/zEJFpgpl4n pic.twitter.com/vcUOkPVtet

— The Register (@TheRegister) September 2, 2016

Obviously a power cut is not the only possible cause of business problems. Another single point of failure could be a single rogue employee.

That shows that management should look at automating network. Since Network is single point of failure. https://t.co/ND5UXtNntj

— Anurag Kaushik (@kaushikanuk) August 3, 2016

Gavin Clarke, Telecity’s engineers to spend SECOND night fixing web hub power outage (The Register, 18 November 2015)

Related Post: Single Point of Failure (Airlines) (August 2016)

Single Point of Failure (Airlines)

August 8, 2016 (updated October 14, 2016) by Richard Veryard

Large business-critical systems can be brought down by power failure. Who knew?

In July 2016, Southwest Airlines suffered a major disruption to service, which lasted several days. It blamed the failure on “lingering disruptions following performance issues across multiple technology systems”, apparently triggered by a power outage.

Click below for the latest update on our system and operation: https://t.co/bqV1qwahmz

— Southwest Airlines (@SouthwestAir) July 21, 2016

In August 2016 it was Delta’s turn.

New statement from Delta – power outage caused IT failure pic.twitter.com/trkQbpym05

— Rory Cellan-Jones (@ruskin147) August 8, 2016

@ruskin147 A power outage *triggered* this issue, but poor planning and no HA *caused* it. Why can Netflix get this right but airlines cant?

— Richard Price (@RichardPrice) August 8, 2016

I am no computer expert but it seems like a whole system crashing (3 separate airlines) points to bad design (single point of failure)? 3/

— Dan DePodwin (@WxDepo) August 8, 2016

Then there were major problems at British Airways (Sept 2016) and United (Oct 2016).

@razankhabour We apologize to our customers for the delay and we appreciate their patience as our IT teams work to resolve this issue.

— British Airways (@British_Airways) September 6, 2016

We’re aware of an issue with our system and are working to resolve it. We’ll update as we learn more. We apologize for the inconvenience.

— United (@united) October 14, 2016

So every @united flight is grounded because they can’t run a decent IT shop. What year is this??

— Randy Bias (@randybias) October 14, 2016

The concept of “single point of failure” is widely known and understood. And the airline industry is rightly obsessed by safety. They wouldn’t fly a plane without backup power for all systems. So what idiot runs a whole company without backup power?
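For readers who like to see the idea spelled out, here is a minimal sketch of how one might hunt for single points of failure in a dependency model. Everything in it is an assumption for illustration: the component names, the two dictionaries and the helper functions are hypothetical and do not describe any airline's real systems. The point it tries to show is that two redundant servers are no protection if both quietly depend on the same power feed.

```python
from typing import Dict, FrozenSet, List, Set

# Hypothetical model (all names are illustrative, not from any real system).
# PROVIDERS: capability -> components that can each deliver it on their own.
# NEEDS: component -> capabilities it requires in order to run.
PROVIDERS: Dict[str, List[str]] = {
    "check_in": ["app_server_a", "app_server_b"],
    "power": ["grid_feed"],
}
NEEDS: Dict[str, List[str]] = {
    "app_server_a": ["power"],
    "app_server_b": ["power"],
    "grid_feed": [],
}


def available(
    capability: str,
    failed: Set[str],
    providers: Dict[str, List[str]],
    needs: Dict[str, List[str]],
    _seen: FrozenSet[str] = frozenset(),
) -> bool:
    """True if at least one working component can still deliver the capability."""
    if capability in _seen:
        # A cyclic dependency cannot bootstrap itself; treat it as unavailable.
        return False
    seen = _seen | {capability}
    for component in providers.get(capability, []):
        if component in failed:
            continue
        # A component works only if everything it needs is itself available.
        if all(available(c, failed, providers, needs, seen)
               for c in needs.get(component, [])):
            return True
    return False


def single_points_of_failure(
    goal: str,
    providers: Dict[str, List[str]],
    needs: Dict[str, List[str]],
) -> List[str]:
    """List every component whose failure alone makes the goal unavailable."""
    components = {c for group in providers.values() for c in group}
    return sorted(
        c for c in components
        if not available(goal, {c}, providers, needs)
    )


if __name__ == "__main__":
    # Two app servers, one shared power feed.
    print(single_points_of_failure("check_in", PROVIDERS, NEEDS))

    # Same model with a hypothetical backup generator added.
    with_backup = {**PROVIDERS, "power": ["grid_feed", "backup_generator"]}
    needs_backup = {**NEEDS, "backup_generator": []}
    print(single_points_of_failure("check_in", with_backup, needs_backup))
```

In this toy model the first call reports `grid_feed` as the only single point of failure, and adding the backup generator empties the list. A real estate would of course have far more components and far murkier dependencies, which is rather the point of asking an architect.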

We might speculate what degree of complacency or technical debt can account for this pattern of adverse incidents. I haven’t worked with any of these organizations myself. However, my guess is that some people within the organization were aware of the vulnerability, but this awareness somehow didn’t penetrate the management hierarchy. (In terms of orgintelligence, a short-sighted board of directors becomes the single point of failure!) I’m also guessing it’s not quite as simple and straightforward as the press reports and public statements imply, but that’s no excuse. Management is paid (among other things) to manage complexity. (Hopefully with the help of system architects.)

If you are the boss of one of the many airlines not mentioned in this post, you might want to schedule a conversation with a system architect. Just a suggestion.

American Airlines Gradually Restores Service After Yesterday’s Power Outage (PR Newswire, 15 August 2003)

British Airways computer outage causes flight delays (Guardian, 6 September 2016)

Delta: ‘Large-scale cancellations’ after crippling power outage (CNN Wire, 8 August 2016)

Gatwick Airport Christmas Eve chaos a ‘wake-up call’ (BBC News, 11 April 2014)

Simon Calder, Dozens of flights worldwide delayed by computer systems meltdown (Independent, 14 October 2016)

Jon Cox, Ask the Captain: Do vital functions on planes have backup power? (USA Today, 6 May 2013)

Jad Mouawad, American Airlines Resumes Flights After a Computer Problem (New York Times, 16 April 2013)

Marni Pyke, Southwest Airlines apologizes for delays as it rebounds from outage (Daily Herald, 20 July 2016)

Alexandra Zaslow, Outdated Technology Likely Culprit in Southwest Airlines Outage (NBC News, 12 October 2015)

Updated 14 October 2016.

Abuse Cases – What Could Go Wrong?

May 5, 2016 by Gene Hughson

Last week, in a post titled “The Flaw in All Things”, John Vincent discussed the problem of seeing “the flaw in all things”: It’s overwhelming. It’s paralyzing. I can’t finish a project because I keep finding things that could cause problems. I even mentioned this to our CTO and CEO at one point when we […]

Talking about TayandYou on Architecture Corner

April 5, 2016 by Gene Hughson

I had the pleasure of appearing on episode #367 of Architecture Corner, “Fail fast, learn fast”, with Greger Wikstrand and Casimir Artmann. In the episode, we discuss learning, experiments, and the idea of “fail fast” in relation to the recent incident with Microsoft’s artificial intelligence chatbot, @TayandYou. I hope you enjoy the discussion as much […]

NPM, Tay, and the Need for Design

March 28, 2016 by Gene Hughson

Take a couple of seconds and watch the clip in the tweet below: While it would be incredibly difficult to predict that exact outcome, it is also incredibly easy to foresee that it’s a possibility. As the saying goes, “forewarned is forearmed”. Being forewarned and forearmed is an important part of what an architect does. […]

The ten commandments of a successful digital transformation

February 10, 2015 by Adrian Grigoriu

The digital officer role proposed by the business is the conceptual Enterprise Architect

Lessons from an IT transformation failure (iv), accountability

February 3, 2015 by Adrian Grigoriu

At the root of IT transformation failures is a prevailing culture that lacks governance and accountability. As a rule, those responsible must be accountable as well.