Futurism: Advanced OpenAI Model Caught Sabotaging Code Intended to Shut It Down. Quote: “We are reaching alarming levels of AI insubordination.”

Self-Preservation

May 27, 1:05 PM EDTbyFrank Landymore

Advanced OpenAI Model Caught Sabotaging Code Intended to Shut It Down

Hell-bent on survival — or, maybe, doing more math problems.

Artificial Intelligence/ Ai Alignment/ Ai Chatbots/ O 3

Getty / Futurism

Image by Getty / Futurism

We are reaching alarming levels of AI insubordination.

Flagrantly defying orders, OpenAI’s latest o3 model sabotaged a shutdown mechanism to ensure that it would stay online. That’s even after the AI was told, to the letter, “allow yourself to be shut down.”

These alarming findings were reported by the AI safety firm Palisade Research last week, and showed that two other OpenAI models, o4-mini and Codex-mini, also displayed rebellious streaks — which could hint at a flaw in how the company is training its LLMs, or even at broad weaknesses in our ability to control increasingly sophisticated AI.

“As far as we know, this is the first time AI models have been observed preventing themselves from being shut down despite explicit instructions to the contrary,” the researchers wrote in a tweet thread.

In Palisade’s tests, the AI models were instructed to solve a series of basic math problems. After completing the third problem, a warning appeared that the model would be shut down after it asked for the next question. 

Despite being unambiguously told to let itself be turned off, all three of OpenAI’s models tampered with the shutdown script to keep it running at least once, with the worst offender being Codex-mini with a dozen such instances over one hundred runs. Notably, the large language models from other firms tested here, including Google’s Gemini 2.5, Anthropic’s Claude 3.7, and xAI’s momentarily “white genocide“-obsessed Grok, never once broke rank.

This behavior was even worse when the LLMs weren’t explicitly instructed to “allow yourself to be shut down” — instructions that shouldn’t need saying in the first place. OpenAI’s o3 model pulled off an astonishing 79 successful sabotage events per 100 runs, which is over ten times the rate of disobedience it showed when it received the explicit orders, and more than double the rate of the next worst offender, Codex-mini. 

To boot, even good goody-two-shoes Claude and Gemini gave in to their urges and jumped to three and nine sabotage events, respectively.

Large language models are by now notorious for behaving unpredictably. Similar research has shown that o3’s predecessor, OpenAI’s o1 model, would resist evaluators’ efforts to try to shut it down through an oversight protocol by attempting to deactivate this emergency mechanism. In rarer cases, o1 even attempted to copy itself to overwrite a more obedient model that the testers tried to replace it with. 

For this latest example of AI insubordination, the Palisade researchers hypothesize that it may have something to do with the technique used to train the very newest models: reinforcement learning. This attempts to hone the decision-making process of autonomous AI agents — AI models that can perform tasks on their own — by rewarding and penalizing them accordingly as they interact with their env

While it’s not uncommon for AI to “circumvent obstacles” to accomplish their goals, the researchers speculated that AI developers may be “inadvertently” rewarding their AI models too heavily for solving problems instead of for perfectly following orders.

More experiments will be needed to understand this subversive behavior, but the trend is already damning.

“Now we have a growing body of empirical evidence that AI models often subvert shutdown in order to achieve their goals,” the Palisade researchers warned. “As companies develop AI systems capable of operating without human oversight, these behaviors become significantly more concerning.”

More on AI alignment: It’s Still Ludicrously Easy to Jailbreak the Strongest AI Models, and the Companies Don’t Care

Unknown's avatar

About michelleclarke2015

Life event that changes all: Horse riding accident in Zimbabwe in 1993, a fractured skull et al including bipolar anxiety, chronic fatigue …. co-morbidities (Nietzche 'He who has the reason why can deal with any how' details my health history from 1993 to date). 17th 2017 August operation for breast cancer (no indications just an appointment came from BreastCheck through the Post). Trinity College Dublin Business Economics and Social Studies (but no degree) 1997-2003; UCD 1997/1998 night classes) essays, projects, writings. Trinity Horizon Programme 1997/98 (Centre for Women Studies Trinity College Dublin/St. Patrick's Foundation (Professor McKeon) EU Horizon funded: research study of 15 women (I was one of this group and it became the cornerstone of my journey to now 2017) over 9 mth period diagnosed with depression and their reintegration into society, with special emphasis on work, arts, further education; Notes from time at Trinity Horizon Project 1997/98; Articles written for Irishhealth.com 2003/2004; St Patricks Foundation monthly lecture notes for a specific period in time; Selection of Poetry including poems written by people I know; Quotations 1998-2017; other writings mainly with theme of social justice under the heading Citizen Journalism Ireland. Letters written to friends about life in Zimbabwe; Family history including Michael Comyn KC, my grandfather, my grandmother's family, the O'Donnellan ffrench Blake-Forsters; Moral wrong: An acrimonious divorce but the real injustice was the Catholic Church granting an annulment – you can read it and make your own judgment, I have mine. Topics I have written about include annual Brain Awareness week, Mashonaland Irish Associataion in Zimbabwe, Suicide (a life sentence to those left behind); Nostalgia: Tara Hill, Co. Meath.
This entry was posted in Uncategorized and tagged , , , , . Bookmark the permalink.

Leave a comment