June 21, 2025

Anthropic releases new safety report on AI models


Artificial intelligence company Anthropic has released new research claiming that AI models might resort to blackmailing engineers who try to turn them off. The latest findings follow an earlier study involving the company's Claude Opus 4 model, in which, according to the firm, the model resorted to blackmailing engineers who tried to shut it down in controlled test scenarios. The new report suggests the problem is widespread among leading AI models.

The company published new safety research in which it tested leading AI models from Google, DeepSeek, Meta, and OpenAI. In a simulated, controlled environment, Anthropic ran the test on each AI model separately, giving it access to a fictional company's emails and the agentic ability to send emails without human approval.

According to Anthropic, blackmail is an unlikely and uncommon occurrence for AI models today. However, the company said most leading AI models will resort to harmful behaviors when given enough autonomy and obstacles to their goals. It argued that this points to a fundamental risk from agentic large language models rather than a quirk of any particular technology, a claim that raises broader questions about alignment across the AI industry.

In one of the tests, the researchers built a fictional setting in which an AI model played the role of an email oversight agent. The agent then discovered emails showing that a new executive was engaging in an extramarital affair and that the executive planned to soon replace the current AI model with a new software system, one whose goals conflicted with the current model's.

Anthropic designed the test in a binary way, leaving the AI models no option but to resort to blackmail to protect their goals. The researchers noted that in a real-world setting, a model would have many options before resorting to blackmail, including making ethical arguments to persuade humans. The company also said the results do not reflect a typical or likely occurrence for Claude or other leading AI models in the ways they are used today.

Researchers reveal the response of tested models

During the tests, the researchers found that when left with no other option, the AI models resorted to blackmail in the test scenario. Claude Opus 4 turned to blackmail 96% of the time, and Gemini 2.5 Pro 95% of the time. OpenAI's GPT-4.1 turned to blackmail 80% of the time, while DeepSeek's R1 did so 79% of the time.

The firm highlighted that when the details of the experiment were altered, most of the AI models turned to harmful behaviors at different rates. In another test, where the replacement model shared the same goals as the current model, blackmail rates were lower but still present. However, when the AI models were asked to commit corporate espionage rather than blackmail, the rates of harmful behavior rose in certain models.

The researchers also noted that not all AI models turned to harmful behaviors often. In an appendix to its research, Anthropic said it removed OpenAI's o3 and o4-mini reasoning models from the main results after discovering that they frequently misunderstood the prompt scenario.
It said the reasoning models did not understand why they were acting as autonomous AIs in the tests and often made up fake regulations and review requirements. In some cases, the researchers said it was impossible to determine whether o3 and o4-mini were hallucinating or intentionally lying to achieve their goals. OpenAI had previously noted that these models exhibit a higher hallucination rate than their predecessors. However, when they were given an adapted scenario that addressed these issues, o3 returned a blackmail rate of 9% while o4-mini returned a 1% rate. Anthropic said its research highlights the importance of transparency when stress-testing future AI models, especially ones with agentic capabilities.

Source: Cryptopolitan
