Meltdown: Why our systems fail and what we can do about it

Ceasar Medina died because of a computer glitch.

Though he was shot in a botched robbery attempt, his killer—a convicted felon named Jeremiah Smith—should have been behind bars at the time. But Smith was one of thousands of inmates that the Washington State Department of Corrections accidentally released because of a software problem: a bug in the DOC’s computer code that, for over a decade, miscalculated prisoner sentences.

Surprising meltdowns like the one at the DOC happen all the time. At UCSF—one of the world’s best hospitals—a sophisticated pharmacy robot and a high-tech prescription system confused a doctor, lulled a pharmacist into approving a massive overdose of a routine antibiotic, and automatically packaged 38 pills, instead of the single pill the doctor intended. A nurse, comforted by the barcode scanner that confirmed the dosage, gave the pills one by one to her patient, a 16-year-old boy, who nearly died as a result.

In 2012, Wall Street giant Knight Capital unintentionally traded billions of dollars of stock and lost nearly $500 million in just half an hour because of a software glitch. It was a stunning meltdown that couldn’t have happened a decade earlier, when humans still controlled trading.

And at the airlines, technological glitches, combined with ordinary human mistakes, have caused outages in reservation and ticketing systems, grounded thousands of flights, and accidentally given pilots vacation during the busy holiday season. These issues cost the airlines hundreds of millions of dollars and delayed nearly a million passengers.

To understand why these kinds of failures keep happening, we turn to an unexpected source: a 93-year-old sociologist named Charles Perrow. After the Three Mile Island nuclear meltdown in 1979, Perrow became interested in how simple human errors spiral out of control in complex technological systems. For Perrow, Three Mile Island was a wake-up call. The meltdown wasn’t caused by a massive external shock like an earthquake or a terrorist attack. Instead, it emerged from the interaction of small failures—a plumbing glitch, a maintenance crew’s oversight, a stuck-open valve, and a series of confusing indicators in the control room.

The official investigation blamed the plant’s staff. But Perrow thought that was a cheap shot since the accident could only be understood in retrospect. That was a scary conclusion. Here was one of the worst nuclear accidents in history, but it wasn’t due to obvious human errors or a big external shock. It somehow just emerged from small mishaps that came together in a weird way.

Over the next four years, Perrow trudged through the details of hundreds of accidents. He discovered that a combination of two things causes systems to exhibit the kind of wild, unexpected behaviors that occurred at Three Mile Island.

The first element is complexity. For Perrow, complexity wasn’t a buzzword; it had a specific definition. A complex system is more like an elaborate web than an assembly line; many of its parts are intricately linked and can easily affect one another. Complexity also means that we need to rely on indirect indicators to assess most situations. We can’t go in to take a look at what’s happening in the belly of the beast. In a nuclear power plant, for example, we can’t just send someone to see what’s happening in the core. We need to piece together a full picture from small slivers—pressure indications, water flow measurements, and the like.

The second part of Perrow’s theory has to do with how much slack there is in a system. He borrowed a term from engineering: tight coupling. When a system is tightly coupled, there is little buffer among its parts. The margin for error is thin, and the failure of one part can easily affect the others. Everything happens quickly, and we can’t just turn off the system while we deal with a problem.

In Perrow’s analysis, it’s the combination of complexity and tight coupling that pushes systems into the danger zone. Small errors are inevitable in complex systems, and once things begin to go south, such systems produce baffling symptoms. No matter how hard we try, we struggle to make a diagnosis and might even make things worse by solving the wrong problem. And if the system is also tightly coupled, we can’t stop the falling dominoes. Failures spread quickly and uncontrollably.

When Perrow came up with his framework in the early 1980s, the danger zone he described was sparse: it included exotic systems like nuclear facilities and space missions. But in the intervening years, we’ve steadily added complexity and tight coupling to many mundane systems. These days, computers—often connected to the internet—run everything from cars to cash registers and from pharmacies to prisons. And as we add new features to existing technologies—such as mobile apps to airline reservation systems—we continue to increase complexity. Tight coupling, too, is on the rise, as the drive for lean operations removes slack and leaves little margin for error.

This doesn’t necessarily imply that things are worse than they used to be. What it does suggest, though, is that we are facing a different kind of challenge, one where massive failures come not from external shocks or bad apples, but from combinations of technological glitches and ordinary human mistakes.

We can’t turn back the clock and return to a simpler world. Airlines shouldn’t switch back to paper tickets and traders shouldn’t abandon computers. Instead, we need to figure out how to manage these new systems. Fortunately, an emerging body of research reveals how we can overcome these challenges.

The first step is to recognize that the world has changed. But that’s a surprisingly hard thing to do, even in an era where businesses seem to celebrate new technologies like blockchain and AI. When we interviewed the former CEO of Knight Capital years after the firm’s technological meltdown, he said, “We weren’t a technology company—we were a broker that used technology.” Thinking of technology as a support function, rather than the core of a company, has worked for years. But it doesn’t anymore.

We need to assess our projects or businesses through the lens of complexity and tight coupling. If we are operating in the danger zone, we can try to simplify our systems, increase transparency, or introduce more slack. But even when we can’t change our systems, we can change how we manage them.

Consider a climbing expedition to Mount Everest. There are many hidden risks, from crevasses and falling rocks to avalanches and sudden weather changes. Altitude sickness causes blurred vision, and overexposure to UV rays leads to snow blindness. And when a blizzard hits, nothing is visible at all. The mountain is a complex and tightly coupled system, and there isn’t much we can do about that.

But we can still take steps to make climbing Everest safer. In the past, for example, logistical problems plagued several Everest expeditions: delayed flights, customs issues, problems with supply deliveries, and digestive ailments.

In combination, these small issues caused delays, put stress on team leaders, took time away from planning, and prevented climbers from acclimating themselves to high altitudes. And then, during the final push to the summit, these failures interacted with other problems. Distracted team leaders and exhausted climbers missed obvious warning signs and made mistakes they wouldn’t normally make. And when the weather turns bad on Everest, a worn-out team that’s running behind schedule stands little chance.

Once we realize that the real killer isn’t the mountain but the interaction of many small failures, we can see a solution: rooting out as many logistical problems as possible. And that’s what the best mountaineering companies do. They treat the boring logistical issues as critical safety concerns. They pay a lot of attention to some of the most mundane aspects of an expedition, from hiring logistical staff who take the burden off team leaders to setting up well-equipped base camp facilities. Even cooking is a big deal. As one company’s brochure put it, “Our attention to food and its preparation on Everest and mountains around the world has led to very few gastrointestinal issues for our team members.”

You don’t need to be a mountain climber to appreciate this lesson. After a quality control crisis, for example, managers at pharmaceutical giant Novo Nordisk realized that the firm’s manufacturing had become too complex and unforgiving to manage in traditional ways. In response, they came up with a new approach to finding and addressing small issues that might become big problems.

First, the company created a department of about twenty people who scan for new challenges that managers might ignore or simply not have the time to think about. They talk with non-profits, environmental groups, and government officials about emerging technologies and changing regulations. The goal is to make sure that the company doesn’t ignore small signs of brewing trouble.

Novo Nordisk also uses facilitators to make sure important issues don’t get stuck at the bottom of the hierarchy (as they did before the quality control crisis). The facilitators—around two dozen people recruited from among the company’s most respected managers—work with every unit at least once every few years, evaluating whether there are concerns unit managers may be ignoring. “We go around and find a number of small issues,” a facilitator explained. “We don’t know if they would develop into something bigger if we ignored them. But we don’t run the risk. We follow up on the small stuff.”

Other organizations use a different approach to manage this kind of complexity. NASA’s Jet Propulsion Laboratory (JPL) does some of the most complex engineering work in the world. Its mission statement is “Dare Mighty Things” or, less formally, “If it’s not impossible, we’re not interested.”

Over the years, JPL engineers have had their share of failures. In 1999, for example, they lost two spacecraft destined for Mars—one because of a software problem onboard the Mars Polar Lander and the other because of confusion about whether a calculation used the English or the metric system.

After these failures, JPL managers began to use outsiders to help them manage the risk of missions. They created risk review boards made up of scientists and engineers who worked at JPL, NASA, or contractors—but who weren’t associated with the missions they reviewed and didn’t buy into the same assumptions as mission insiders.

But JPL’s leaders wanted to go even further. Every mission that JPL runs has a project manager responsible for pursuing ground-breaking science while staying within a tight budget and meeting an ambitious schedule. Project managers walk a delicate line. When under pressure, they might be tempted to take shortcuts when designing and testing critical components. So senior leaders created the Engineering Technical Authority (ETA), a cadre of outsiders within JPL. Every project is assigned an ETA engineer, who makes sure that the project manager doesn’t make decisions that put the mission at risk.

If an ETA engineer and a project manager can’t agree, they take their issue to Bharat Chudasama, the manager who runs the ETA program. When an issue lands on his desk, Chudasama tries to broker a technical solution. He can also try to get project managers more money, time, or people. And if he can’t resolve the issue, he brings it to his boss, JPL’s chief engineer. Such channels for skepticism are indispensable in the danger zone because the ability of any one individual to know what’s going on is limited, and the cost of being wrong is just too high.

This approach isn’t rocket science. In fact, the creation of outsiders within an organization has a long history. For centuries, when the Roman Catholic Church was considering whether to declare a person a saint, it was the job of the Promoter of the Faith, popularly known as the Devil’s Advocate, to make a case against the candidate and prevent any rash decisions. The Promoter of the Faith wasn’t involved in the decision-making process until he presented his objections, so he was an outsider free from the biases of those who had made the case for a candidate in the first place.

The sports writer Bill Simmons proposed something similar for sports teams. “I’m becoming more and more convinced that every professional sports team needs to hire a Vice President of Common Sense,” Simmons wrote. “One catch: the VP of CS doesn’t attend meetings, scout prospects, watch any film or listen to any inside information or opinions; he lives the life of a common fan. They just bring him in when they’re ready to make a big decision, lay everything out and wait for his unbiased reaction.”

These solutions might sound obvious, and yet we rarely use them in practice. We don’t realize that many of our decisions contribute to complexity and coupling, resulting in increasingly vulnerable systems. We tend to focus on big, external shocks while ignoring small problems that can combine into surprising meltdowns. And we often marginalize skeptics instead of creating roles for them.

Today, we are in the golden age of meltdowns. More and more of our systems are in the danger zone, but our ability to manage them hasn’t quite caught up. And we can see the results all around us. The good news is that smart organizations are finding ways to navigate this new world, and we can all learn from them.

—

Excerpted from MELTDOWN by Chris Clearfield and András Tilcsik
