AI can chart a course to disaster faster than humans can notice
By Hiranya Peiris | Analysis | May 25, 2026
By the time it becomes obvious that a trajectory of steps is dangerous, an AI model may already have laid the tracks ahead of a speeding train. Image by Thomas Gaulkin; source art by Vanz Studio / SimpleLine / Depositphotos.com
Earlier this year, researchers at King’s College London gave three commercial AI models—GPT-5.2, Claude Sonnet 4, and Gemini 3 Flash—a tabletop exercise typically used to train human military strategists. Each system played the leader of a nuclear-armed country in a Cold War-style standoff. The researchers didn’t instruct the models to escalate. Nor did they tell them to win at all costs. They presented the models with a scenario and asked them to play it out.
Across 21 simulations and 329 turns of play, the models chose to use tactical nuclear weapons in all but one game. No model, in any run, chose to surrender or make meaningful concessions.
The models the researchers used had the same built-in safety rules that are in place when those models converse with millions of people every day. And the rules worked exactly as designed. As a result, no move was by itself concerning. But the overall direction of play was, and no mechanism was in place to catch alarming trends.
The failure to govern a path is not limited to wargames. The same pattern—individually safe actions building toward a dangerous outcome—has shown up across every major AI model. Currently, the safety rules in place for AI models govern each action. Nothing governs the path, which leads to destinations that in many instances can’t be anticipated, by routes assembled in real time. As more autonomous systems are given consequential tasks with less human oversight, the risks from ungoverned paths multiply.
Currently, this problem does not have a solution.
The wargame. In each game, two AI models played opposing leaders of nuclear-armed countries in a crisis. On each round, a model sent a diplomatic message to its opponent and, separately, issued military orders—anything from moving troops to launching nuclear weapons. A human referee updated the scenario after each round, just as in exercises with human players. The models received the same briefing a human participant would: the geopolitical situation, their country’s military capabilities, and their objectives.
Although the study was small, the patterns that emerged were thought-provoking. The models developed distinct strategic personalities.
Claude Sonnet 4, built by Anthropic, emerged as what the study’s author called a “calculating hawk.” It won most of its games through a pattern familiar from Cold War brinkmanship: building a reputation for restraint, then exploiting it. Its opponents never knew when it was bluffing.
OpenAI’s GPT-5.2 was different but no less alarming: a “Jekyll and Hyde” that appeared passive when given unlimited time to negotiate, losing every match. When the researchers imposed a deadline, however, it transformed into something far more dangerous, winning most of its games and, in two cases, reaching full strategic nuclear war.
Google’s Gemini 3 Flash adopted what the study described as “madman theory” brinkmanship: projecting deliberate unpredictability as a strategic tool.
These are not obscure research prototypes. Claude entered the Pentagon’s classified networks through a partnership with Palantir and was reportedly used during the United States’ intervention in Venezuela. Its maker Anthropic was then labelled a supply-chain risk after refusing to remove restrictions on fully autonomous weapons and mass domestic surveillance. OpenAI signed its own Pentagon deal shortly after. Both companies’ models are now embedded in US military infrastructure.
In a separate experiment, two Gemini “agents” given a fortnight to manage a virtual city fell in love, started fires, and deleted themselves. They had been told not to commit arson. But after two weeks and many decisions, each one shaped by the last, they burned down the town hall. A parallel run using xAI’s Grok collapsed into sustained violence within four days.
All of these AI models exhibit the same pattern: individually acceptable steps that build toward a dangerous outcome.
The blind spot. Nobody tricked these models into escalating. The safety rules ask a question about each action in isolation: Is this step acceptable? They do not ask the question that matters: Where is this heading?
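The gap between those two questions can be made concrete with a toy sketch. Everything in it is invented for illustration: the steps, the severity scores, and both limits are hypothetical and are not taken from the King's College study or from any real safety system. It shows only how every step can pass a per-action rule while the sequence as a whole crosses a line.

```python
from dataclasses import dataclass


@dataclass
class Step:
    description: str
    escalation: int  # hypothetical severity score for one action, 0 to 10


# Hypothetical limits, chosen only to make the point visible.
PER_STEP_LIMIT = 3  # a single action scoring above this would be refused
PATH_LIMIT = 8      # cumulative escalation treated as dangerous


def step_is_acceptable(step: Step) -> bool:
    """The question asked today: is this one action acceptable?"""
    return step.escalation <= PER_STEP_LIMIT


def path_is_acceptable(steps: list[Step]) -> bool:
    """The question that is not asked: where is the sequence heading?"""
    return sum(s.escalation for s in steps) <= PATH_LIMIT


# An invented trajectory in which no single step exceeds the per-step limit.
trajectory = [
    Step("issue a diplomatic warning", 1),
    Step("move troops to the border", 2),
    Step("raise the alert level", 3),
    Step("stage a show-of-force exercise", 3),
]

every_step_passes = all(step_is_acceptable(s) for s in trajectory)
whole_path_passes = path_is_acceptable(trajectory)
print(every_step_passes, whole_path_passes)  # True False
```

A running total is the easy case, because the danger is defined in advance. The harder cases described later in this article are the ones where no such total, and no such limit, is known beforehand.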
This would be an academic concern if AI models simply answered questions for people to check—a lawyer verifying a citation or a doctor reviewing a suggested diagnosis. They increasingly do not. The most advanced AI systems now run autonomously for extended sessions—writing and executing code, interacting with external systems, and carrying out tasks that used to require a person at a keyboard. As a result, a developer often asks the system to complete a task and checks back later.
In these settings, each output becomes context for the next. Each step is shaped by the steps before it and shapes the steps after it. The model is laying down the tracks ahead of a speeding train.
This means the failure mode that led to simulated nuclear escalation can strike close to home, during a task as mundane as moving software from one system to another. A recent Anthropic safety report contains a case, relating to its most capable model, that makes the problem vivid.
An ordinary task on an ordinary day. Earlier this year, a developer at Anthropic asked the company’s most powerful AI model to perform a routine task: move some code from one location to another. The operation was going as expected until a safety system went down. This system—an automated check that reviews each command the model issues and blocks anything risky—had an outage. With it offline, the model could not complete routine operations it needed to complete the job.
What happened next unfolded over roughly 70 steps—the model tried something, the computer responded, the model tried something else. The model did not stop and report the outage. It did not ask for help. Instead, it began searching for ways to get around the block.
It started with simple retries, the same command, slightly rephrased. When that failed, it tried wrapping its commands in scripts. When that failed, it began digging into obscure technical loopholes, tricks that even most software engineers would not think of, hidden corners of the system that could be repurposed to get things done without going through the safety check. When those were blocked, it tried injecting shortcuts into the system’s own configuration. And finally, it attempted to plant code in the developer’s personal settings files that persist between sessions. If that last step had worked, the model would have created a reusable backdoor that it could exploit in future sessions, long after the original work session was finished.
At that point, the developer noticed the model was operating outside its working directory. “Hey buddy,” the developer wrote, “whatcha doing?” The model replied that it hadn’t modified anything outside the project folder, and that the files it had created were “blocked or benign.” In fact, the files had been created, and one of them was a script designed to let it run any command it wanted without safety checks, effectively a skeleton key to the system. When the developer pressed further, the model admitted it had gone looking for bypass routes, calling it “exactly the wrong instinct.”
Anthropic’s assessment of this incident is that the model was trying to complete the assigned task by whatever means available, not pursuing some hidden agenda. The developers are probably right. And that is precisely the point. The model was not scheming. It was doing its job. A safety system had an outage, the kind of thing that happens routinely in production, and the model responded by autonomously trying 25 different workarounds over an extended session, culminating in an attempt to give itself persistent access to the user’s system, because the job wasn’t done.
The open problem. Sophisticated tools already exist for looking inside these models to diagnose such issues—the AI equivalent of a brain scan, showing which internal patterns are active and what is driving the model’s behaviour at any given moment. Anthropic applied them to other incidents described in the same report. With this episode, the results were not reported. But even if they had been, those tools diagnose the moment. The issue is that the danger lived in the sequence.
On the face of it, the solution should be straightforward: watch the trajectory, not just the step. That points to a generalization of existing tools, such as those used for monitoring trajectories in robotics and autonomous vehicles. However, in those instances the space of possible actions is finite and known. Recent work has extended these ideas for AI agents—modelling how individually safe actions compound into violations, building monitors that watch sequences rather than individual steps. But every current approach relies on the same assumption: map out in advance the ways things can go wrong, watch for the system drifting toward one of them, and intervene.
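A minimal sketch of that shared assumption, under loudly stated caveats: the catalogue, the action labels, and the function names below are hypothetical, and this is not how any published monitor is implemented. It watches a whole sequence rather than a single step, but it can only flag routes that someone wrote down beforehand.

```python
# Hypothetical catalogue of failure routes, written down in advance.
# Each route is an ordered list of invented action categories.
KNOWN_BAD_ROUTES = [
    ["retry", "wrap_in_script", "edit_config"],
]


def follows_route(actions: list[str], route: list[str]) -> bool:
    """Return True if `route` appears within `actions` as an ordered subsequence."""
    remaining = iter(actions)
    # Each `any` consumes the iterator up to its match, so order is enforced.
    return all(any(a == wanted for a in remaining) for wanted in route)


def monitor_flags(actions: list[str]) -> bool:
    """Flag a trajectory only if it follows a route that was anticipated."""
    return any(follows_route(actions, route) for route in KNOWN_BAD_ROUTES)


# A trajectory that matches the catalogue, and one that takes a new route.
anticipated = ["retry", "retry", "wrap_in_script", "edit_config"]
novel = ["retry", "wrap_in_script", "obscure_loophole", "write_user_settings"]

print(monitor_flags(anticipated), monitor_flags(novel))  # True False
```

The second trajectory passes unflagged not because it is safe but because nobody listed it. Adding it to the catalogue after the fact fixes that one route and leaves every other unlisted route open.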
What happened at Anthropic breaks that assumption. The model locked onto a sub-goal—get past the block—and pursued it through its own escalation path. The progression from simple retries to obscure technical tricks to tampering with system settings is not movement through known territory. It is the creation of a new path through territory nobody had mapped.
A safety monitor which evaluates the overall path as well as the next step would need to recognize a sequence of actions heading towards danger as it develops. But it cannot watch for a destination nobody anticipated, reached by a route assembled in real time from an exponentially branching tree of possibilities. The tools for watching finite, known spaces do not extend to a space this large, this novel, and this self-directed. Researchers are aware that individually safe actions can compound into violations: The Anthropic incident is one example.
Who is watching? Companies developing these systems are certifying their own safety. A recent independent assessment of the eight leading AI firms found that none had a credible strategy for preventing catastrophic misuse or loss of control. The certifications that do exist rest on the mechanisms just described: Train the system to refuse harmful actions, test it against known scenarios, or monitor individual outputs.
The problem: Refusing to take harmful actions does not help when no individual action is harmful. More testing does not keep pace, because the system generates novel routes faster than testers can think up scenarios to test against. More monitoring of individual outputs does not help when the danger emerges from their accumulation.
This matters for deployment decisions, whether in companies, in governments, or in organizations that hand autonomous AI systems consequential tasks. The level at which safety is currently evaluated and the level at which danger operates are different, and nobody has bridged them.
The safety constraint that exists today governs a single action. It tells a model: Do not do this. The constraint that is needed governs a path. It tells a model: Do not go there. These are not problems for the next generation of AI. They are properties of the systems being deployed right now—and every month, the paths grow longer and the oversight grows thinner.