We need a new Turing test to assess AI’s real-world knowledge

Artificial intelligence (AI) models can perform as well as humans on law exams when answering multiple-choice, short-answer and essay questions (A. Blair-Stanek et al. Preprint at SSRN https://doi.org/p89q; 2025), but they struggle to perform real-world legal tasks. Some lawyers have learnt that the hard way, and have been fined for filing AI-generated court briefs that misrepresented principles of law and cited non-existent cases. The same is true in other fields. For example, AI models can pass the gold-standard test in finance — the Chartered Financial Analyst exam — yet score poorly on simple tasks required of entry-level financial analysts (see go.nature.com/42tbrgb).

When an assessment measures the intended skill inaccurately, the result is known as a proxy failure. For example, a lawyer who scored A+ on an exam would be expected to avoid the kinds of error that an AI tool with a similar score might make in a real-world scenario. Better tests are urgently required to help guide the use of AI in complex, high-stakes situations.

One promising idea emerged in March at an Association for the Advancement of Artificial Intelligence workshop in Philadelphia, Pennsylvania: through extensive interaction, a specialist can tell whether an AI system genuinely understands or is merely imitating understanding.

Imagine an AI model attempting to ‘pass’ an interview with an acclaimed legal scholar such as Cass Sunstein at Harvard University in Cambridge, Massachusetts. Sunstein’s expert probing would be a better measure of the model’s legal knowledge than a standardized test or automatically scored benchmark. Passing the ‘Sunstein test’ would require an AI tool to display true legal mastery, being able to wade through ambiguity and contradiction, and not just answer multiple-choice questions or write an essay.

One might ask: why not simply test an AI model’s legal readiness with task-specific benchmarks, similar to those used in medicine for checking an AI tool’s ability to take notes for a physician? The goal, however, is not to test an AI tool’s ability to perform a specific legal task, or even a long list of them, but to test whether it has general-purpose legal knowledge that it can exercise systematically when performing any task.

I am not suggesting that Sunstein, or any single authority, should be appointed as the arbiter of AI expertise. The goal is to build systems that leading legal specialists broadly agree demonstrate genuine, trustworthy legal knowledge. A ‘robo-lawyer’ would need to cope in a diverse range of interviews with panels of experts — ranging from tax and constitutional lawyers to clerks, traffic officers and legal-aid workers. Such an approach would reduce issues around individual or ideological bias and avoid the trap of AI models merely mimicking one person’s style.

Could a machine reach human levels of expertise, subtlety and ethics? Only specialists can say. But imagine a US Supreme Court justice grilling an AI robo-lawyer in public. That would get everyone’s attention. It would be a spectacle much like IBM’s 2011 challenge on the US television quiz programme Jeopardy!, in which the technology company pitted its supercomputer Watson against human champions to demonstrate how far machine reasoning and natural-language processing had come.
