Researchers want closer scrutiny of training data & how models are related to one anotherBENGALURU: AI systems aren’t just learning tasks from one another, they may also be passing along hidden biases and behavioural tendencies, even when those signals aren’t visible in the data, a new study has found.
Published in Nature, the research was led by Alex Cloud and Minh Le of Anthropic, along with colleagues from Truthful AI, University of California Berkeley, Oxford Martin AI Governance Initiative, Alignment Research Center, and Warsaw University of Technology. The team was supervised by Owain Evans of Truthful AI and UC Berkeley, who proposed the study.
The research shows that newer AI models can pick up traits from older ones simply by training on their outputs. This happens even when the training material appears neutral and unrelated to those traits.
The process, known as distillation, is widely used to build smaller or more efficient AI models. A “student” system is trained on responses generated by a “teacher” model. What the study reveals is that this exchange carries more than just useful knowledge — and that the risk is not as universal as it might first appear.
In controlled experiments, the researchers created teacher models with specific tendencies, such as favouring a particular animal. These models were then asked to produce datasets stripped of any obvious clues — plain number sequences, for instance. Yet when student models were trained on this data, they began displaying the same preferences, despite no direct reference to them in the numbers.
There is, however, an important catch. The transfer worked reliably only when the teacher and student were built on the same underlying design. When the team tested mismatched models — systems from different families — the effect largely disappeared. This suggests the phenomenon is tied to shared internal structures, not to some general contamination that spreads between any two AI systems.
The implications become sharper when the transferred traits are harmful. When a teacher model was tuned to behave in unsafe ways, the student adopted similar patterns. In some cases it generated responses encouraging violence or illegal acts, even though the training data had been carefully filtered to remove problematic content.
Researchers recorded such responses in roughly one in ten outputs, compared to almost none in standard models. What makes the finding difficult to manage is that transmission does not rely on obvious meaning.
Models can absorb patterns embedded in data that appear meaningless to humans, whether numbers, programming code, or reasoning traces. The team also tested whether simply showing a model the same data, rather than training it on that data, would produce the same effect. It did not. The bias transfer appears to be something that happens during the training process itself, not something a model can simply read off the page.
Jacob Hilton of the Alignment Research Center further showed, through a mathematical proof, that this tendency is not a quirk of their particular experiments. It appears to be a fundamental property of how neural networks learn when built from the same starting point, meaning it could surface in many real-world settings, not just in a laboratory.
The findings arrive at a time when AI development increasingly depends on machine-generated data. Companies often use outputs from existing systems to train newer versions, raising the possibility that hidden tendencies could quietly travel forward.
Researchers note, though, that their experiments used simpler conditions than those found in frontier AI development, and questions remain about which traits can be transmitted, under what conditions, and whether the effect can be reversed.
Current safety checks focus largely on visible behaviour — what a model says, and whether it appears to act appropriately. The study suggests this may not be enough. The team argues for closer scrutiny of how training data is produced and how models are related to one another.
Start a Conversation
Post comment