In the early 1980s, Apple co-founder Steve Jobs described the computer as “a bicycle for our minds”. He was inspired by a Scientific American graphic he’d encountered as a boy, showing that a human on a bicycle is more energy-efficient than any animal1. The metaphor captured the promise of personal computing: tools that enable people to go further and faster with less effort. But the deeper brilliance of bicycles lies in what they do not do: they do not mimic human biology, nor any form found in nature. The bicycle reimagined motion entirely.
Extending that metaphor, I propose that artificial-intelligence agents are aeroplanes for the mind: they can speed things up for humans even more than bicycles do, but they are harder to control and the consequences of mistakes can be huge. And scientists are particularly well placed to benefit from these tools. Scientific research is, at its core, a journey into the unknown. Yet working in new terrains brings unexpected challenges2 and frequent failures3.
To push the frontiers of knowledge forwards quickly and responsibly, science and scientists urgently need a playbook for flying these aeroplanes. In my view, effective use of AI in research will probably require the development of AI agents that are grounded in robust, domain-specific scientific information. The real question is not whether machines will replace scientists, but what kind of scientists we will become when we learn to fly them.
To put this into practice, my team developed SciSciGPT4, a prototype multi-agent system in which several specialized AI agents divide and coordinate research workflows. As a test bed, we used the science of science5, a field that combines large data sets and computational methods to probe the dynamics of scientific progress.
At the heart of SciSciGPT is the ResearchManager agent. It orchestrates the workflow, dividing a researcher’s natural-language query, entered through a chat interface, into tasks and delegating them to agents that specialize in literature review, data extraction or analysis. These agents plan and execute sub-tasks — retrieving publications, writing code, running analyses and generating figures — while the EvaluationSpecialist continuously audits their output. Each step is logged, creating a transparent end-to-end provenance record.
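The orchestration pattern can be sketched in a few lines of code. The following is a minimal, hypothetical illustration of the idea described above, not SciSciGPT’s actual implementation: a manager splits a query into tasks, delegates them to specialist agents, has each result audited and appends every step to a provenance log. All names and behaviours here are illustrative stand-ins.

```python
# Hypothetical sketch of a manager-and-specialists workflow with provenance logging.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Step:                    # one entry in the provenance record
    agent: str
    task: str
    output: str
    passed_review: bool

@dataclass
class ResearchManager:
    specialists: dict[str, Callable[[str], str]]
    log: list[Step] = field(default_factory=list)

    def run(self, query: str) -> list[Step]:
        # In a real system an LLM would decompose the query; here it is hard-coded.
        tasks = [("literature", f"review prior work on: {query}"),
                 ("analysis", f"analyse data relevant to: {query}")]
        for role, task in tasks:
            output = self.specialists[role](task)           # delegate to a specialist
            ok = self.evaluate(output)                       # continuous audit step
            self.log.append(Step(role, task, output, ok))    # append to provenance log
        return self.log

    def evaluate(self, output: str) -> bool:
        # Stand-in for an evaluation agent auditing each output.
        return len(output) > 0

manager = ResearchManager(specialists={
    "literature": lambda t: f"[stub literature summary for '{t}']",
    "analysis":   lambda t: f"[stub analysis result for '{t}']",
})
for step in manager.run("collaboration patterns across universities"):
    print(step)
```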
In our case studies4, SciSciGPT created a visualization of collaborations between a group of universities and tested whether a figure in a paper could be replicated from data in its repository. It completed these research tasks faster and with higher-quality results than experienced researchers did using AI tools.
Here, I outline the lessons that my team learnt from building a research-focused AI agent and the principles that scientists should consider when using agents for science.
Collaboration beats automation
The temptation today is to fully automate scientific workflows6–8, switching to ‘AI scientists’ or ‘self-driving laboratories’ that generate hypotheses, design experiments and draft manuscripts end-to-end. These systems can be dazzling, but science is not an assembly line, nor does it have fixed objectives to optimize. It is an enterprise that is built on interpretation, contestation and responsibility, in which human judgement is crucial.
For example, a fully automated system could conduct Newton’s prism experiments, measuring how white light splits as it passes through a prism and fitting those data to a model. But Newton did something categorically different: he reversed the set-up, recomposing the coloured beams back into white light, decisively showing that colour belongs to light itself, not to the glass. That act — deciding that an apparent anomaly was the phenomenon rather than an error to eliminate — was a leap of interpretation, not computation. Automated workflows, by design, smooth out anomalies and optimize towards fit. Scientists, by contrast, exploit surprise.
As AI tools become central to research, science faces not only a technological inflection point but also a civic one. The legitimacy of science rests on a shared social contract: that conclusions are open to scrutiny, that authors stand behind their evidence and that knowledge is produced in good faith for the public good.
In an era when public confidence in science is already fragile, this is the moment to strengthen the foundations that sustain it and to renew that contract by embedding transparency, traceability and accountability into the infrastructure of discovery itself. Full automation might deliver some answers, but it would erode the credibility that gives those answers meaning.
The more durable path is pilot-in-command science, in which the researcher — the captain — is assisted by an ensemble of agents that act as the crew and serve the captain’s best interests. The crew would be made up of an analyst agent to draft, a critic to probe, a planner to map next steps and an orchestrator to keep them in sync.
Interfaces should be built for steerability and disagreement, inviting researchers to inspect reasoning, compare alternatives and override conclusions. And human scientists should retain authority over — and responsibility for — framing the question, validating the path and signing off on conclusions. Making this model robust will require deliberate collaboration between scientists specializing in the domain of the study, engineers who work on AI, designers and ethicists to ensure that agents amplify human creativity rather than replace it.
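One way to make ‘pilot-in-command’ concrete is to treat every agent conclusion as a proposal that carries its reasoning and remains pending until the human researcher accepts, overrides or rejects it. The sketch below is a hypothetical illustration of that design choice, not a description of any existing system; the class and function names are invented for the example.

```python
# Hypothetical sketch: agent output is a proposal; only the human changes its status.
from dataclasses import dataclass

@dataclass
class Proposal:
    agent: str                 # e.g. "analyst", "critic", "planner"
    claim: str
    reasoning: str
    status: str = "pending"    # pending -> accepted / overridden / rejected

def sign_off(p: Proposal, decision: str, note: str = "") -> Proposal:
    # Only the human captain signs off; the note records disagreement,
    # so the override itself becomes part of the provenance trail.
    assert decision in {"accepted", "overridden", "rejected"}
    p.status = f"{decision} ({note})" if note else decision
    return p

draft = Proposal(agent="analyst",
                 claim="Collaboration intensity rose after 2010",
                 reasoning="Based on co-authorship counts in the sample data")
print(sign_off(draft, "overridden", "re-run with field-normalized counts"))
```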
Throughout history, discoveries have been made by humans. As AI becomes capable of contributing to discovery, the central question is not what machines can do alone, but how we design them to keep science accountable and reproducible.
Speed by itself is transformative
When the cost of failure collapses, riskier and more ambitious ideas become rational, making it practical to test questions that were once too costly or time-consuming.
Genomics illustrates this shift: decoding the first human genome took more than a decade and billions of dollars; today, sequencing costs less than US$1,000 and takes hours, transforming the field from studies of individual genes to broad exploration of entire genomes. And with the shift came fresh vantage points, enabling researchers to see connections across the scientific landscape that were otherwise invisible.
Speed also changes who can ask the questions. Lowered technical and temporal barriers enable small labs, newcomers and even individual researchers to tackle analyses that once required large teams and months of coordination. But the same forces that accelerate discovery can also amplify error. Fast science without reflection risks converging on mistakes at scale. This reinforces the importance of human–AI collaboration rather than full automation.
Agents should specialize
SciSciGPT was a natural first test case: the science of science is rich in data and methodologically diverse, and it studies how discovery itself works. But the same idea applies across disciplines, although the training data that grounds these agents will differ. Each field has its own foundations — its texts, data sets, tools and standards. In chemistry, this might mean databases tied to kinetic models that predict reaction rates and highlight where experiments tend to fail; in biomedicine, clinical guidelines linked to trial data, diagnostic protocols and multimodal patient information; in mathematics, formalized proof libraries.
AI research agents will look different in each field, but they should follow the same basic rules: results should be traceable, methods verifiable and responsibilities assigned clearly. Establishing those rules will require coordination between scientific societies, funders, journals, public research infrastructures and the AI labs building today’s models. The goal is a shared public–private framework for interoperability — for instance, common standards for logging agent decisions so that an analysis run in one lab can be audited or reproduced by another.
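What such a shared logging standard might look like in practice is an open question; the record below is only an illustrative sketch, with field names I have invented rather than any existing specification. The point is that one append-only entry per agent decision, pointing to the exact code and artifacts involved, would let a run from one lab be replayed or audited by another.

```python
# Hypothetical provenance record: one JSON line per agent decision.
import json, datetime

record = {
    "run_id": "2025-06-collab-map",                 # stable identifier for the analysis
    "agent": "data-extraction",                      # which specialist acted
    "model": "example-llm-v1",                       # model and version used
    "input": "fetch co-authorship records, 2000-2020",
    "output_ref": "artifacts/coauthorship.parquet",  # pointer to the produced artifact
    "code_ref": "steps/extract.py@3f2a91c",          # exact code revision that ran
    "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
}

with open("provenance.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")               # append-only, one decision per line
```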
Some laboratories are trying to automate research work by using ‘AI scientists’ that perform projects from start to finish. Credit: Qilai Shen/Bloomberg via Getty
My team’s research shows that AI’s benefits to science are widespread across disciplines9. But when we analysed university syllabuses to examine how much each discipline teaches AI-related courses, we found a systematic mismatch: AI education is concentrated in computer science, mathematics and engineering, even though disciplines that could benefit just as much — from medicine and psychology to economics — offer much less training9.
At the same time, academia remains organized around departmental silos that drift farther apart as the burden of knowledge rises. Science policymakers should recognize these frictions and support institute-style structures that cut across disciplines and enable sustained collaboration between AI experts and domain scientists.
Trust must be engineered
When bicycles crash, the consequences are generally localized. Aeroplanes are different: when they crash, it can be catastrophic for everyone on board, often with collateral damage on the ground. That is the difference in scale that we face with AI agents. As they flourish, their failures won’t just inconvenience a single researcher; they could mislead fields, redirect funding and erode public trust in science.
One crucial advantage of large language models (LLMs) is that they can write. This means that they can document everything that they do in plain-language text. In our SciSciGPT project, every step, every line of code and every decision that the system generated was logged automatically by an LLM. The result was an overwhelming amount of data, but it was transformative. Even when my best students do an experiment, I cannot expect to see or reconstruct every step that led them to a result. With agents, I can.
Yet this brings another challenge: too much information. Some researchers complain that checking the AI’s output can take longer than doing the work themselves. The solution is to log not more, but better: to design systems that turn raw provenance into understanding.
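‘Logging better’ could be as simple as collapsing a long provenance trail into a short digest that surfaces only what a reviewer needs to check. The sketch below is a hypothetical example that assumes the invented record format from the earlier logging sketch; it counts steps per agent and flags any entry marked as failing review.

```python
# Hypothetical sketch: turn a raw provenance trail into a short human-readable digest.
import json
from collections import Counter

def digest(path: str = "provenance.jsonl") -> str:
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    by_agent = Counter(r["agent"] for r in records)
    flagged = [r for r in records if not r.get("passed_review", True)]
    lines = [f"{len(records)} logged steps across {len(by_agent)} agents"]
    lines += [f"  {agent}: {n} steps" for agent, n in by_agent.items()]
    lines += [f"  NEEDS REVIEW: {r['agent']} -> {r['input']}" for r in flagged]
    return "\n".join(lines)

print(digest())
```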
