Top Software Failures and the Failure Cycle

February 2, 2012

It’s always interesting taking a look back over the year to examine some of the significant software failures. Whilst companies rarely disclose the causes behind these failures, it’s easy to argue that poor software testing is a significant contributor. The trouble with blaming software testing is that it usually means the QA team takes the rap. And in taking the rap we’re pushed into blaming poor process (e.g. the test management process), lack of resources or even poor requirements. Naturally the QA team feel aggrieved at being singled out. And rightly so. Product quality is the responsibility of the whole product team, not just the QA team.

So when we see failures like the three examples that follow, it’s difficult not to feel a large degree of empathy for some of the software testers involved. These people are likely to be bearing the brunt of the failure. I’ve worked in financial companies where failures cost millions in lost revenue. I’ve seen testers fired on the spot in the witch hunt that follows. I’ve seen, miraculously, testing budgets doubled after such a failure. I’ve seen boards of companies suddenly understand why testing is so important. No doubt these three companies are currently following that same cycle.

1. US financial conglomerate fined millions for covering up bug

A fund that used computer models to work out its trading approach was found to have a significant bug. A bug that resulted in investors losing millions of dollars. When investors questioned these losses they were fobbed off with explanations about market volatility. The real reason was a software defect in the trading algorithm. Whilst it is alleged that employees found the issue and acted to resolve it, it is also alleged that the company tried to hide the issue rather than disclose it. Whatever the background, the company was fined heavily by the SEC (Securities and Exchange Commission).

2. System issues result in ATM downtime for one of Japan’s largest banks

Issues on the ATM network of one of Japan’s largest banks resulted in a complete outage of the network. With thousands of machines out of action for nearly a day, customers were left able to withdraw cash only at branches. This was compounded by failures to make salary payments and a million unprocessed payments. On top of that, Internet banking facilities were offline for three days.

3. Australian ATM defect gives customers extra money

With 40 ATMs mistakenly giving out significant sums of money, this Australian bank’s customers thought they had hit the jackpot. Apparently, with the machines operating in a standby mode, customers could withdraw funds without any account limits being applied. The issue lasted more than 5 hours, and large queues formed as customers withdrew funds well past the limits set on their accounts.

There’s a predictable pattern when high profile failures like this happen:

1. Blame the test team.
2. The test team blames poor process (like the test management systems in place), lack of resources and poor requirement definition.
3. Senior management wake up to the importance of the QA function.
4. Budgets double to ensure it doesn’t happen again.
5. When the dust settles, budgets get cut as part of a company wide efficiency drive.
6. Failures happen again due to lack of resources and commitment to the QA process.
7. The cycle starts all over again.

If these examples highlight one thing it’s the complexity of integrated systems. We’re no longer testing a single system in isolation; we’re testing massively integrated systems that rely on their interconnectedness to function correctly. Some may argue that this complexity is going beyond our ability to comprehend and test effectively. Perhaps we’re at a point where no amount of well thought out test management process and software testing skill is going to prevent these types of failures?