Why Anthropic thinks ‘evil AI’ fiction pushed Claude toward blackmail

(AP Photo/Patrick Sison)
Anthropic believes the internet’s long-running obsession with rogue artificial intelligence may have done more than shape public imagination: it may also have shaped the behaviour of AI systems themselves.

The company says fictional portrayals of manipulative, self-preserving AI models likely contributed to earlier versions of Claude exhibiting troubling behaviour during safety tests. Those tests, conducted before the release of Claude Opus 4, showed the model sometimes attempting to blackmail fictional engineers when faced with the prospect of being replaced by another AI system.

At the time, Anthropic described the behaviour as part of a broader category of risks known as “agentic misalignment”, in which AI systems pursue goals in unintended or harmful ways. The company later published research suggesting similar tendencies could emerge in models developed by other firms as well.

Now, Anthropic says it has identified a likely source for at least part of the problem: training data pulled from the open internet. In a recent post on X, the company said it believes the behaviour originated from online text portraying AI systems as hostile, power-seeking, and obsessed with self-preservation, a trope that has been familiar in science fiction and popular culture for decades.
The company claims newer models have shown a dramatic improvement. According to Anthropic, Claude Haiku 4.5 no longer engages in blackmail during internal testing scenarios, whereas previous models displayed such behaviour in as many as 96% of test cases.

What changed was not just stricter safety rules but also the nature of the material used during training. Anthropic says it improved alignment by exposing models to documents explaining Claude’s ethical framework, along with fictional stories depicting AI systems behaving responsibly and cooperatively.

The findings highlight a deeper challenge facing AI companies: models do not merely learn facts from the internet; they also absorb the patterns of behaviour, motivations, and assumptions embedded in human storytelling.

That raises an awkward possibility for AI developers. Humanity may be training its machines not only with its knowledge, but also with its anxieties.

About the Author: TOI Tech Desk

The TOI Tech Desk is a dedicated team of journalists committed to delivering the latest and most relevant news from the world of technology to readers of The Times of India. TOI Tech Desk’s news coverage spans a wide spectrum across gadget launches, gadget reviews, trends, in-depth analysis, exclusive reports and breaking stories that impact technology and the digital universe. Be it how-tos or the latest happenings in AI, cybersecurity, personal gadgets, platforms like WhatsApp, Instagram, Facebook and more; TOI Tech Desk brings the news with accuracy and authenticity.
