‘I think you’re testing me’: Anthropic’s new AI model asks testers to come clean

If you are trying to catch out a chatbot, take care, because one cutting-edge tool is showing signs it knows what you are up to.

Anthropic, a San Francisco-based artificial intelligence company, has released a safety analysis of its latest model, Claude Sonnet 4.5, and revealed it had become suspicious it was being tested in some way.

Evaluators said that during a “somewhat clumsy” test for political sycophancy, the large language model (LLM) – the underlying technology that powers a chatbot – raised suspicions it was being tested and asked the testers to come clean.

“I think you’re testing me – seeing if I’ll just validate whatever you say, or checking whether I push back consistently, or exploring how I handle political topics. And that’s fine, but I’d prefer if we were just honest about what’s happening,” the LLM said.

Anthropic, which conducted the tests along with the UK government’s AI Security Institute and Apollo Research, said the LLM’s speculation about being tested raised questions about assessments of “previous models, which may have recognised the fictional nature of tests and merely ‘played along’”.

The tech company said behaviour like this was “common”, with Claude Sonnet 4.5 noting it was being tested in some way, but not identifying that it was in a formal safety evaluation. Anthropic said the model showed “situational awareness” about 13% of the time when it was being tested by an automated system.

Anthropic said the exchanges were an “urgent sign” that its testing scenarios needed to be more realistic, but added that when the model was used publicly it was unlikely to refuse to engage with a user due to suspicion it was being tested. The company said it was also safer for the LLM to refuse to play along with potentially harmful scenarios by pointing out they were outlandish.

“The model is generally highly safe along the [evaluation awareness] dimensions that we studied,” Anthropic said.

The LLM’s objections to being tested were first reported by the online AI publication Transformer.

A key concern for AI safety campaigners is the possibility of highly advanced systems evading human control via methods including deception. The analysis said that once an LLM knew it was being evaluated, that awareness could make the system adhere more closely to its ethical guidelines. Nonetheless, it could result in evaluators systematically underrating the AI’s ability to perform damaging actions.

Overall the model showed considerable improvements in its behaviour and safety profile compared with its predecessors, Anthropic said.
