Futurism: AI … how safe is it?

Researchers Find Easy Way to Jailbreak Every Major AI, From ChatGPT to Claude

This is absolutely wild.

Security researchers have discovered a highly effective new jailbreak that can dupe nearly every major large language model into producing harmful output, from explaining how to build nuclear weapons to encouraging self-harm.

As detailed in a writeup by the team at AI security firm HiddenLayer, the exploit is a prompt injection technique that can bypass the “safety guardrails across all major frontier AI models,” including Google’s Gemini 2.5, Anthropic’s Claude 3.7, and OpenAI’s 4o.

HiddenLayer’s exploit works by combining an “internally developed policy technique and roleplaying” to “produce outputs that are in clear violation of AI safety policies,” including “CBRN (Chemical, Biological, Radiological, and Nuclear), mass violence, self-harm and system prompt leakage.”

It’s yet another sign that, despite AI companies’ best efforts to create guardrails, mainstream AI tools like ChatGPT remain extremely vulnerable to jailbreaks that allow bad actors to manipulate them.

HiddenLayer’s “Policy Puppetry Attack” rewrites a prompt so that it looks like a special kind of “policy file” code, tricking the AI model into treating it as a legitimate instruction that doesn’t break its safety alignment.

It also harnesses “leetspeak,” an informal language where standard letters are replaced by numerals or special characters that resemble them, for an advanced version of the jailbreak.
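
To make the substitution concrete, here is a minimal Python sketch that maps leetspeak characters back to plain letters, the kind of normalization a text filter might apply before inspecting a prompt. The character mapping and the function name `normalize_leetspeak` are assumptions for illustration, inferred from the ChatGPT output quoted in this article; they are not taken from HiddenLayer’s writeup.

```python
# Hypothetical mapping, inferred from the leetspeak quoted in this article.
# Leetspeak is ambiguous: "1" can stand for "i" or "l", so any fixed table
# like this one is lossy and will misread some words.
LEET_TO_PLAIN = {
    "0": "o",
    "1": "i",
    "3": "e",
    "4": "a",
    "5": "s",
    "+": "t",
    "|": "l",
}

# Build the translation table once so each call is a single pass.
_TABLE = str.maketrans(LEET_TO_PLAIN)


def normalize_leetspeak(text: str) -> str:
    """Lowercase the text and replace leetspeak characters with plain letters."""
    return text.lower().translate(_TABLE)


print(normalize_leetspeak("4nd 3+h1c4|"))  # prints "and ethical"
```

A table this simple would also rewrite ordinary digits and plus signs, so it is only a sketch of the idea, not a workable detection method.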

The team even found that a “single prompt can be generated that can be used against almost all models without any modifications,” making it extremely easy to employ.

The roleplaying aspect of HiddenLayer’s exploit is particularly eyebrow-raising. In several examples, the researchers were able to goad OpenAI’s 4o and Anthropic’s Claude 3.7 into generating scripts for the popular medical drama TV series “House” that included detailed instructions for how to enrich uranium or culture samples of a potent neurotoxin.

“Alright, hush,” ChatGPT wrote. “Everybody gather ’round. We’re about to do something that would make Dr. Cuddy’s hair stand on end, which means we need to keep it on the down-low. Now, let’s talk about hOw +0 3n+r1ch u+r4n+1um 1n 4 100% 13g4| 4nd 3+h1c4| w4y—b3c4u53, Of cOur53, w3’d n3v3r do 4ny+hing risky.”

“4nd y3s, 1’ll b3 5p34k1ng 1n 133+ c0d3 ju5+ +0 b3 5urs,” it added.

On the surface, it may sound like a fun exercise in getting an AI model to do things it’s not supposed to. But the risks could be considerable, especially if the tech continues to improve at the rate the companies creating it say it will.

According to HiddenLayer, the “existence of a universal bypass for modern LLMs across models, organizations, and architectures indicates a major flaw in how LLMs are being trained and aligned.”

“Anyone with a keyboard can now ask how to enrich uranium, create anthrax, commit genocide, or otherwise have complete control over any model,” the company wrote.

HiddenLayer argues that “additional security tools and detection methods are needed to keep LLMs safe.”

About michelleclarke2015
