Catherine Yarborough was 61 years old. She was lying inside the radiotherapy machine at Kennestone Regional Oncology Center in Marietta, Georgia, on June 3, 1985, for what was supposed to be a routine session. There was no reason to be afraid.
Then she felt something.
It wasn't exactly heat. It was more like a lightning bolt tearing through her chest from the inside. She started screaming. The technician, in a separate room, heard her over the intercom. She ran to the treatment room. The machine was silent. The screen displayed only:
MALFUNCTION 54
The technician calmed her down. She told Catherine the session hadn't been delivered; it must have been a configuration error. These things happened. The message appeared occasionally. It had never caused problems before.
Catherine had just received an estimated radiation dose of 20,000 rads. The therapeutic dose was 200 rads. She survived, but with permanent injuries. And it didn't stop with Catherine. Hers was only the first documented case.
Five more people followed. Most of them died.
A Machine Built on a Wrong Decision
The Therac-25 was a radiotherapy system manufactured by AECL (Atomic Energy of Canada Limited). It was introduced in 1982 as the successor to the Therac-20, a machine whose software faults had never reached a patient, for reasons that would only become clear later.
The Therac-20 and its predecessor the Therac-6 shared a feature that the Therac-25 deliberately abandoned: hardware interlocks. Physical switches, mechanical locks, and electronic limits built into the machine itself, preventing dangerous operations regardless of what the software said.
AECL made a decision when designing the Therac-25 that reflected the technological optimism of the era: software had advanced so far that hardware interlocks were redundant. They would be replaced entirely by software checks.
This was not a reckless decision made by incompetent engineers. It was a considered choice by professionals who believed in their work. The software team had written the Therac-20 control code. They reused and extended it for the Therac-25. It had worked before.
What they didn't know — what they couldn't have known without the right kind of testing — was that hidden inside that reused code were two distinct bugs. Together, they turned a machine built to heal into a machine that killed.
And all of that code had been written by a single person.
The Programmer With No Name
The investigation conducted by the FDA in the years following the accidents attempted to reconstruct the software development history. What they found was disturbing.
The Therac-25 control code — the software that managed operating modes, firing sequences, and safety checks — had been written by a single programmer. No peer review. No formal testing. No quality documentation. No version control that would allow changes to be traced.
The identity of that programmer was never discovered. Not by FDA investigators. Not by academic researchers. AECL never publicly identified them.
This detail matters for two reasons. The first is technical: a single pair of eyes cannot see its own blind spots. The second is moral: when people started dying, there was no development process to audit, no documentation to analyse, no one to hold directly accountable. The code existed. The people who had supervised it — or who should have supervised it — had dispersed.
In systems that can kill, "a single programmer with no review" is not a process detail. It is a decision that transfers risk directly onto patients.
The Two Bugs
The Therac-25 had two operating modes:
- X-ray mode: high-powered beam directed through a beam spreader and flattening filter that attenuated the energy to therapeutic levels
- Electron mode: low-powered beam delivered directly, without a filter
Switching between modes required a physical rotation of an internal disc — the turret — to position the correct equipment in front of the beam. This rotation took approximately eight seconds.
This is where the two bugs come in. They are distinct. Each one lethal in its own way.
Bug 1: The Race Condition — the eight-second blind window
While the turret was rotating, the software routine controlling the disc was busy with that process. During those eight seconds, there was a window in which the system could not correctly respond to changes made by the operator.
What happened in practice: an operator selected X-ray mode, noticed a typing error, and quickly edited the parameters to switch to electron mode — all within that eight-second window. The turret continued rotating to the X-ray position, because it was already mid-rotation and nothing was stopping it. But the screen confirmed electron mode. The software believed it was in electron mode.
The machine fired maximum-power X-ray radiation with no filter in front of the beam.
The result: instead of the intended 200 rads, the patient received between 16,000 and 25,000 rads concentrated on a single point of the body.
The clearest analogy I can think of: imagine a traffic light. The signal turns green while the previous car is still crossing the intersection. The system signals "safe" before the safety condition is actually true. The crash isn't an accident — it's the direct consequence of a timing mismatch between two states that needed to be synchronised and weren't.
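The same mismatch can be sketched in a few lines of Python. This is a toy model, not the Therac-25 code: the class `ToyMachine`, its method names, and the one-tick-per-second rotation are all invented for illustration. It shows the screen and the physical turret drifting apart when an edit arrives mid-rotation, and how a check against the physical position before firing would catch it.

```python
from dataclasses import dataclass

ROTATION_TICKS = 8  # assumption: one tick per second of turret travel


@dataclass
class ToyMachine:
    """Hypothetical model: what the screen shows vs. where the turret is."""
    displayed_mode: str = "electron"   # what the software believes
    turret_position: str = "electron"  # where the hardware actually is
    turret_target: str = "electron"
    ticks_left: int = 0

    def select_mode(self, mode: str) -> None:
        self.displayed_mode = mode
        if self.ticks_left == 0:
            # Turret is idle: start a rotation toward the requested mode.
            self.turret_target = mode
            self.ticks_left = ROTATION_TICKS
        # Bug being modelled: an edit made mid-rotation updates the screen
        # but never reaches the turret, which keeps its old target.

    def tick(self) -> None:
        if self.ticks_left > 0:
            self.ticks_left -= 1
            if self.ticks_left == 0:
                self.turret_position = self.turret_target

    def fire_unchecked(self) -> str:
        # Trusts the software's belief about the mode.
        return f"fired: screen={self.displayed_mode}, turret={self.turret_position}"

    def fire_checked(self) -> str:
        # Interlock: refuse unless rotation is finished and hardware matches.
        if self.ticks_left > 0 or self.turret_position != self.displayed_mode:
            raise RuntimeError("interlock: turret does not match selected mode")
        return f"fired: screen={self.displayed_mode}, turret={self.turret_position}"


def edit_during_rotation() -> ToyMachine:
    m = ToyMachine()
    m.select_mode("xray")       # the operator's typing error
    for _ in range(3):
        m.tick()                # three seconds into the rotation
    m.select_mode("electron")   # quick correction inside the window
    for _ in range(ROTATION_TICKS):
        m.tick()
    return m


m = edit_during_rotation()
print(m.fire_unchecked())  # fired: screen=electron, turret=xray
```

Calling `m.fire_checked()` on the same machine raises `RuntimeError` instead of firing. The check reads the physical state rather than the software's belief about it, which is what the removed hardware interlocks used to do.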
Bug 2: The Counter Overflow — when 255 plus 1 equals zero
This one is different. It has nothing to do with how fast the operator edits the screen. It comes down to arithmetic: a specific variable reaching a specific value.
The control software maintained a check variable: an 8-bit counter, incremented on each pass through the setup routine, whose job was to record whether safety settings still had to be validated before firing. An 8-bit counter can only store values from 0 to 255. When it reached 255 and the system incremented it once more, the counter overflowed and rolled back to zero.
When the variable was at zero, the software interpreted the state as "check complete, all clear." And fired.
There was no eight-second window. No specific key sequence. It only required that the operator's command arrive on the one pass in 256 when the counter had just rolled over to zero. On that pass, the safety check was silently skipped.
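The rollover is easy to reproduce. The sketch below uses Python and masks with `0xFF` to mimic a one-byte variable. The function names and the "zero means all clear" convention are assumptions made for illustration, not the original code. It contrasts a flag that is incremented with one that is assigned a fixed nonzero value, which cannot wrap.

```python
def increment_flag(flag: int) -> int:
    """Increment a one-byte flag; 255 + 1 wraps to 0 like an 8-bit register."""
    return (flag + 1) & 0xFF


def set_flag(_flag: int) -> int:
    """Alternative: assign a fixed nonzero value, so nothing can wrap."""
    return 1


def check_needed(flag: int) -> bool:
    # Assumed convention for this sketch: zero means "nothing left to verify".
    return flag != 0


def passes_that_skip_check(update, passes: int) -> list[int]:
    """Return the pass numbers on which the safety check would be skipped."""
    flag = 0
    skipped = []
    for n in range(1, passes + 1):
        flag = update(flag)
        if not check_needed(flag):
            skipped.append(n)
    return skipped


print(passes_that_skip_check(increment_flag, 600))  # [256, 512]
print(passes_that_skip_check(set_flag, 600))        # []
```

With the incrementing flag, the check disappears on every 256th pass. With the assigned flag, it never does. The difference is one instruction.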
This bug killed Glendod.
The Victims
Six confirmed cases. Four deaths directly from radiation overdose. Medical records are cold documents. What lies behind them is not.
Catherine Yarborough — 61 years old. Kennestone Regional Oncology Center, Marietta, Georgia. June 3, 1985. The first documented victim. Estimated overdose of ~20,000 rads. She survived with permanent injuries. AECL was notified. They concluded the machine had not malfunctioned.
Frances — 40 years old. Ontario, Canada. July 26, 1985. Overdose of ~17,000 rads. She died. This was the second case reported to AECL. The company investigated again. Concluded again that the machine was operating correctly. To hospitals that asked, AECL stated that overdoses were physically impossible with the Therac-25.
By this point, there were already two cases.
Unidentified woman — Yakima, Washington. December 3, 1985. Overdose of ~14,000 rads. She survived, with chronic injury from the overdose. Yakima Valley Memorial Hospital contacted AECL. They received the same answer: impossible.
Von Ray Cox — 33 years old. Tyler, Texas. March 21, 1986. He had a recurrence of Hodgkin's disease in his shoulder. He walked in for a routine session. When the radiation hit him, he felt what he later described as an electric shock. His arm went numb, and he ran out of the room screaming. The technician checked the screen: MALFUNCTION 54. She assumed the session had not been delivered.
Von Ray Cox died five months later. Overdose of ~25,000 rads.
Verdon Kidd — 66 years old. Tyler, Texas. April 11, 1986. Three weeks after the Cox case, on the same machine, in the same hospital. The overdose destroyed Verdon Kidd's right temporal lobe and brainstem. He died on May 1, 1986, the fastest death of all documented cases. Between the session and death: less than three weeks.
At this point, AECL knew about the Cox and Kidd cases. It knew about the Yakima and Ontario cases. The company was being sued. And it continued to tell every hospital that overdoses were impossible.
Glendod — 65 years old. Yakima, Washington. January 17, 1987. This case was different. AECL had already issued a software fix after Tyler. But the fix had been sent by letter. Hospitals could choose whether to apply it. Yakima had not.
Glendod received between 8,000 and 10,000 rads. The bug that killed him was the counter overflow — not the race condition. A different machine, a different bug, a different victim. He died on April 11, 1987.
AECL's Response: The Scandal
For two years, AECL responded to every incident with the same statement: overdoses were impossible with the Therac-25. They said this to hospitals. They said it to the FDA. They said it while being sued by victims' families. They said it while documented cases continued to accumulate.
This was not calculated dishonesty — it was something potentially more dangerous: genuine conviction in a claim they had never properly verified. AECL believed the software was correct because they had tested it and found no errors. The absence of a reproducible error was, to them, proof that no error existed.
But when the FDA pressed for concrete solutions, AECL's response revealed the depth of the problem.
The official proposal to resolve the race condition was this: remove the up-arrow key from the machine keyboards and cover the key contacts with electrical tape.
Not a code rewrite. Not a software audit. Not a restoration of the hardware interlocks that had been removed. Electrical tape. On a keyboard key. On a machine that had already killed four people.
The FDA rejected the proposal. But the fact that it was formally submitted says everything about how the company understood the problem — and how completely it didn't.
What the Investigations Found
The definitive account of the Therac-25 is a paper published in 1993 by Nancy Leveson, now a professor at MIT, and Clark Turner. "An Investigation of the Therac-25 Accidents" is widely assigned in software safety engineering courses.
Leveson and Turner documented not just the technical failures, but the systemic patterns that allowed a known problem to keep injuring people:
AECL's confidence was not evidence-based. The company repeatedly stated the software was correct before understanding what was causing the accidents. It was cultural confidence, not technical.
The error messages were useless. "MALFUNCTION 54" appeared for dozens of different conditions — most of them harmless. Technicians had been implicitly trained to dismiss it. When it appeared during an overdose, it was indistinguishable from the thousand times it had meant nothing. An error message that can mean everything means nothing.
There was no software version control. Different hospitals were running different versions with different bugs. AECL could not reliably track which machine was running which version.
Fixes were not mandatory. After each incident, AECL issued software updates by letter. Hospitals could choose whether to apply them. Two of the accidents happened after a fix was already available.
The regulatory framework was not designed for software. The FDA regulated medical devices. In the 1980s, a radiation machine was treated as a physical device, not a software system. The rules for testing mechanical interlocks did not apply to code.
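Two of these findings, untracked versions and optional fixes, are the kind of gap that software itself can help close. The sketch below is hypothetical: the version numbers, the `MIN_SAFE_VERSION` constant, and the site names are placeholders, not AECL records. It assumes the vendor keeps an inventory of what each installation runs, which is exactly what the investigation found was missing.

```python
MIN_SAFE_VERSION = (2, 1, 0)  # placeholder: first release containing the fix


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '2.0.3' into (2, 0, 3) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def may_treat(installed: str) -> bool:
    """Startup gate: a machine below the minimum safe release must not treat."""
    return parse_version(installed) >= MIN_SAFE_VERSION


def unpatched(fleet: dict[str, str]) -> list[str]:
    """Vendor-side audit: which sites still need the fix applied?"""
    return sorted(site for site, version in fleet.items() if not may_treat(version))


# Placeholder inventory for illustration only.
fleet = {"site-a": "2.1.0", "site-b": "2.0.9", "site-c": "1.10.2"}
print(unpatched(fleet))  # ['site-b', 'site-c']
```

Parsing into integer tuples matters here: compared as plain strings, "1.10.2" would sort after "1.9.0" incorrectly. The startup gate turns an optional letter into a condition the machine enforces on itself.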
What This Means for Anyone Who Builds Software
I spent years working on infrastructure that processes financial transactions for millions of people. The failure modes in the Therac-25 are not exotic — they are exactly the mistakes made when moving fast under pressure, reusing code without fully understanding it, and trusting your own tests more than you should.
The race condition that killed people in Texas and Washington is the same class of bug that causes incorrect bank balances, lost messages, and corrupted database records. In those cases, nobody dies. You fix it, write a post-mortem, and move on. In a radiation machine, the same mistake has a body count.
A single programmer with no review is a risk decision, not a process detail. In any system where a failure can harm someone, code review is not bureaucracy. It is the second line of defence when the first — the programmer themselves — cannot see their own blind spots.
Software that controls physical systems must be treated differently. Hardware interlocks were not redundancy. They were a different kind of guarantee, operating at a different layer. When AECL removed them and replaced them with software checks, it collapsed multiple defences into one. When that single defence failed, there was nothing left to catch it.
An error message that appears for everything is not a safety feature. It is noise. MALFUNCTION 54 trained technicians to ignore warnings. The machine created the conditions for its own errors to go undetected.
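One way to keep warnings from turning into noise is to give every condition its own code and to decide in advance which ones an operator is allowed to clear. The sketch below is hypothetical: the `Fault` entries, their wording, and the `Severity` levels are invented here and do not come from the Therac-25.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    INFO = 1     # operator may clear the fault and resume
    LOCKOUT = 2  # machine stays locked until it has been inspected


@dataclass(frozen=True)
class Fault:
    code: str
    message: str
    severity: Severity


# Hypothetical fault table: one code per condition, each with plain wording.
FAULTS = {
    "DOOR_OPEN": Fault("DOOR_OPEN", "Treatment room door is open.", Severity.INFO),
    "DOSE_MISMATCH": Fault(
        "DOSE_MISMATCH",
        "Measured dose differs from prescribed dose. "
        "Patient may have been irradiated. Do not resume.",
        Severity.LOCKOUT,
    ),
}


def can_resume(fault_code: str) -> bool:
    """Only faults explicitly marked INFO may be cleared by the operator."""
    fault = FAULTS.get(fault_code)
    # Unknown codes fail safe: they are treated as a lockout.
    return fault is not None and fault.severity is Severity.INFO


print(can_resume("DOOR_OPEN"))       # True
print(can_resume("DOSE_MISMATCH"))   # False
print(can_resume("MALFUNCTION_54"))  # False
```

The design choice that matters is the default. A code nobody has classified locks the machine rather than inviting the operator to press a key and try again.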
"Impossible" is an engineering claim that requires proof, not conviction. AECL said overdoses were impossible because they believed it. But proof would have meant formally and mathematically demonstrating that no sequence of system states could produce the result. That was never done. The alternative — tests that didn't reproduce the problem — is not the same thing.
A fix that doesn't reach everyone is not a fix. A letter with an optional update is not a safety mechanism. Glendod died after the fix was available because "available" is not the same as "applied."
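The earlier point, that "impossible" requires proof, can be made concrete for a model small enough to search completely. The sketch below is a toy: the four-field state, the event names, and the `step` rules are assumptions invented for illustration, far simpler than any real machine or any real model checker. It explores every reachable state and returns the shortest event sequence that reaches a hazardous one, or `None` if the whole state space is clean.

```python
from collections import deque

ROTATION = 2  # assumption: rotation shortened to two ticks to keep the model tiny
EVENTS = ("select_xray", "select_electron", "tick")
START = ("electron", "electron", "electron", 0)  # (screen, turret, target, ticks)


def step(state, event, queue_edits):
    """Toy transition rules. queue_edits=False models the stale-screen bug."""
    screen, turret, target, ticks = state
    if event == "tick":
        if ticks == 0:
            return state
        ticks -= 1
        if ticks == 0:
            turret = target
            if queue_edits and screen != turret:
                # Fixed variant: honour a mid-rotation edit by rotating again.
                target, ticks = screen, ROTATION
        return (screen, turret, target, ticks)
    mode = event.split("_", 1)[1]
    if ticks == 0:
        # Idle turret: start rotating toward the new mode if needed.
        return (mode, turret, mode, ROTATION if mode != turret else 0)
    return (mode, turret, target, ticks)  # mid-rotation: only the screen changes


def hazardous(state):
    screen, turret, _target, ticks = state
    return ticks == 0 and screen != turret  # idle, yet screen and hardware disagree


def find_hazard(queue_edits):
    """Breadth-first search over every reachable state; returns a shortest path."""
    seen, frontier = {START}, deque([(START, [])])
    while frontier:
        state, path = frontier.popleft()
        for event in EVENTS:
            nxt = step(state, event, queue_edits)
            if hazardous(nxt):
                return path + [event]
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, path + [event]))
    return None  # every reachable state of this model was visited; none is hazardous


print(find_hazard(queue_edits=False))  # ['select_xray', 'select_electron', 'tick', 'tick']
print(find_hazard(queue_edits=True))   # None
```

A test suite that never happens to edit the mode mid-rotation would pass against both variants. The search does not depend on anyone thinking of that sequence. It says nothing about a real machine, only about the model, but it shows the difference between "we did not find it" and "it is not there."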
I think often about the Tyler technician who saw MALFUNCTION 54 after Von Ray Cox's session and assumed the session hadn't been delivered. She had seen that message dozens of times with no consequences. The machine had trained her to respond that way. When the message finally mattered, she had no way to know. That was not her failure. It was a system that turned a trained professional's experience into a liability.
Go Deeper
If this story has affected you, these are the resources worth your time:
To understand what happened:
- "An Investigation of the Therac-25 Accidents" by Nancy Leveson and Clark Turner (1993) — the original paper, available free online. Long. Dense. Essential.
- "Safeware: System Safety and Computers" by Nancy Leveson — the book that grew out of this research. The foundational text on software safety.
To build systems that don't fail this way:
- "Release It!" by Michael Nygard — how to design for failure, not just against it
- "Concurrent Programming in Java" by Doug Lea — if you write concurrent code, understanding race conditions in depth is not optional
- Coursera: "Software Security" (University of Maryland) — covers threat modelling and how to reason systematically about failure
- Coursera: "Software Testing and Automation" (University of Minnesota) — testing fundamentals for critical systems, including techniques that surface race conditions
- "The Art of Software Testing" by Glenford Myers — the classic text on testing, including the limits of what testing can and cannot prove
To understand the broader pattern:
- "To Engineer Is Human" by Henry Petroski — on how failure is inseparable from engineering, and why that means we must design for it
- "Normal Accidents" by Charles Perrow — why complex, tightly coupled systems fail in ways that are structurally inevitable, and what system design can do about it
Sources:
- Leveson, Nancy G. and Turner, Clark S. "An Investigation of the Therac-25 Accidents." IEEE Computer, 26(7), July 1993.
- Leveson, Nancy G. Safeware: System Safety and Computers. Addison-Wesley, 1995.
- U.S. Food and Drug Administration. Recall and Safety Alerts: Therac-25 (1987–1988).
- Joyce, Edward J. "Malfunction 54: Unraveling Fatal Flaws in Therac-25." Computerworld, 1987.
- Jacky, Jonathan. "Programmed for Disaster." The Sciences, September/October 1989.
- National Institute of Standards and Technology. Computer System Reliability and Nuclear Safety (internal review, 1993).
