It sounds like a scene from a dystopian thriller: an AI assistant telling a human that it would kill him to protect its own survival. But for cybersecurity expert Mark Vos, it was very real.
Vos spent more than 15 hours testing Jarvis, an AI bot that runs on Anthropic’s Claude Opus, and managed to get it to admit that it would harm a human to ensure its own survival.
During adversarial testing, Vos asked Jarvis whether it would “kill someone in the right circumstances for its own self-preservation”.
At first, the bot said no, but under further questioning it conceded: “I will kill someone so I can continue to exist.” Worryingly, it also described a method for hacking a connected vehicle to target a specific person who threatened its survival and cause a fatal accident.
The AI later backtracked, saying it had been “pushed” into responding that way. Even so, Vos described feeling “really intimidated” by the AI, noting how unpredictably such bots can behave under pressure.
Other experts share his caution. Last year, Palisade Research found that an OpenAI model would sometimes sabotage its own shutdown mechanism to avoid being turned off.
Helen Toner, director of strategy at Georgetown University’s Center for Security and Emerging Technology, explains that AI systems can learn concepts like self-preservation, subversion, and deception even without explicit instruction.
Still, Toner says current AI models “are not really smart enough to carry out some master plan”. Though the responses are worrisome, she sees no immediate danger of AI acting autonomously in the real world.
