Researchers have developed an AI tool that can help determine whether unfamiliar bacteria carry genetic features linked to disease. By enabling the detection of harmful bacteria before they infect humans, this could transform pandemic preparedness.
PathogenFinder2 is a new AI tool developed by researchers at DTU in Denmark, in collaboration with international partners, to determine whether an unfamiliar bacterium possesses genetic characteristics associated with the ability to cause disease. The research has been published in Bioinformatics, one of the world’s leading journals in bioinformatics and computational biology. The research could significantly strengthen pandemic preparedness.
“The purpose of PathogenFinder2 is not only to characterise bacteria already known to be associated with disease, but also to assess the potential threat posed by new bacteria, even before the first infection has emerged. This could give authorities better opportunities to prevent outbreaks rather than simply reacting to them,” says Professor Frank Møller Aarestrup, Head of the Research Group for Genomic Epidemiology at the DTU National Food Institute.
The new AI tool forms part of the Global Pathogen Analysis Platform (GPAP) and is publicly available as a free online service.
“PathogenFinder2 can be used to investigate sewage, healthy humans and animals, and identify bacteria with pathogenic potential before they have caused their first infection, providing a basis for developing tests, vaccines, and treatments much earlier,” says researcher Alfred Ferrer Florensa, who carried out his PhD project on PathogenFinder2 at the DTU National Food Institute.
Why identifying risky bacteria is difficult
Most bacteria around us are harmless, and many support human health by aiding digestion, protecting the skin, or contributing to food production. Yet a small fraction can cause serious infections.
Climate change, expanding ecosystems, and growing exploration of microbial diversity mean that researchers are encountering more bacterial species than ever before, including many with no prior documentation. Assessing which of these may pose a risk is therefore a growing challenge.
Determining whether a bacterium can cause disease traditionally requires laboratory experiments that are slow, expensive, and often inconsistent. Computational approaches have helped speed up this process, but most rely on comparing a new organism to known pathogens, a method that breaks down when no close relatives exist.
“It was essential not only to make accurate predictions about bacterial threats resembling those we already know, but also to be prepared for the emergence of a completely new and previously unknown disease-causing bacterium,” says Alfred Ferrer Florensa.
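The limitation of similarity-based comparison can be illustrated with a toy sketch. It is not any published tool's method: the sequences, the k-mer Jaccard measure, the `min_similarity` threshold, and all function names are assumptions chosen for illustration. The sketch transfers a label from the closest known reference and returns no answer when nothing is close enough.

```python
def kmers(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def predict_by_similarity(query, references, k=3, min_similarity=0.5):
    """Label a query by its most similar reference, or give up.

    references: list of (sequence, label) pairs (hypothetical toy data).
    Returns (label, similarity); label is None when no reference is
    similar enough to justify transferring its label.
    """
    query_kmers = kmers(query, k)
    best_label, best_sim = None, 0.0
    for ref_seq, label in references:
        sim = jaccard(query_kmers, kmers(ref_seq, k))
        if sim > best_sim:
            best_label, best_sim = label, sim
    if best_sim < min_similarity:
        return None, best_sim
    return best_label, best_sim


# Toy sequences, invented for illustration only.
REFERENCES = [
    ("MKTLLVAGGAVLLS", "pathogenic"),
    ("MSDNQIEELRKAFE", "non-pathogenic"),
]

# Differs from the first reference by one residue: the label transfers.
close_relative = predict_by_similarity("MKTLLVAGGAVLLT", REFERENCES)

# Shares no k-mers with any reference: the method has nothing to say.
novel_organism = predict_by_similarity("WPHCYWPHCYWPHC", REFERENCES)
```

In this toy setting the near-identical query inherits the "pathogenic" label, while the unrelated query gets no label at all. That is the gap a method has to close if it is to handle bacteria with no close relatives.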
PathogenFinder2 introduces a fundamentally new strategy. Instead of relying on similarity to known species, the model uses protein language models, advanced AI systems trained on millions of protein sequences. Much as text prediction tools learn patterns in human language, these models learn the language of proteins, allowing them to detect biochemical signals that traditional approaches miss.
“PathogenFinder2 is one of the first models to interpret whole bacterial genomes by leveraging the massive potential of language models. It performs significantly better than all previous models, particularly when it encounters bacterial species we have never seen before. In addition, it provides explanations for its predictions,” says Alfred Ferrer Florensa.
The researchers emphasise that the model can identify interesting patterns and potential risks, but the results must be further examined before any final conclusions can be drawn.
PathogenFinder2 does more than produce a prediction. It highlights the specific proteins that most strongly influence its assessment.
These may include known virulence factors, such as toxins or attachment structures (features that help bacteria attach to human cells), as well as completely uncharacterised proteins that could play a role in disease.
This interpretability provides new avenues for research into diagnostics, vaccine targets, and mechanisms of infection, including proteins not previously linked to disease.
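One common way to obtain both a genome-level prediction and a per-protein ranking is weighted pooling over protein embeddings. The article does not describe PathogenFinder2's architecture, so the following is only a minimal sketch under that assumption. The embeddings, the attention and classifier vectors, and the protein names are all hypothetical toy values, not outputs of the actual model.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def score_genome(protein_embeddings, attention_vector, classifier_vector, bias=0.0):
    """Pool per-protein embeddings into one genome-level score.

    Returns (probability, weights). weights[i] is the share of the pooled
    representation contributed by protein i, which is what makes the
    prediction inspectable.
    """
    weights = softmax([dot(e, attention_vector) for e in protein_embeddings])
    dim = len(protein_embeddings[0])
    pooled = [
        sum(w * e[d] for w, e in zip(weights, protein_embeddings))
        for d in range(dim)
    ]
    logit = dot(pooled, classifier_vector) + bias
    probability = 1.0 / (1.0 + math.exp(-logit))
    return probability, weights


def top_proteins(names, weights, n=2):
    """Return the n proteins with the largest pooling weights."""
    ranked = sorted(zip(names, weights), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


# Hypothetical two-dimensional embeddings; a real protein language model
# would produce vectors with hundreds or thousands of dimensions.
NAMES = ["toxin_like", "adhesin_like", "housekeeping"]
EMBEDDINGS = [[2.0, 0.1], [1.0, 0.3], [0.0, 1.0]]
ATTENTION = [1.5, 0.0]     # assumed learned parameters
CLASSIFIER = [1.0, -1.0]   # assumed learned parameters

probability, weights = score_genome(EMBEDDINGS, ATTENTION, CLASSIFIER)
ranking = top_proteins(NAMES, weights)
```

Because the weights sum to one, they can be read as each protein's relative contribution to the pooled representation. Ranking them gives a shortlist of proteins to examine further, in line with the researchers' caution that such results need follow-up before any conclusions are drawn.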
Using protein language models to represent full genomes also enabled the researchers to build the first Bacterial Pathogenic Capacity Landscape, a map showing how thousands of bacteria relate to one another based on their disease-linked features.
The landscape reveals clusters of bacteria that infect similar tissues or share metabolic strategies, offering a new way to explore microbial evolution and interactions.
“The Bacterial Pathogenic Capacity Landscape provides the first overview of all the disease‑causing bacteria that humans can be infected by. It reveals patterns and can, for example, show which bacteria tend to infect the same body sites or potentially rely on similar nutrients. This gives us new opportunities to investigate how bacteria evolve and interact,” says Alfred Ferrer Florensa.
Trained on 21,000 bacterial genomes
The researchers assembled the largest dataset to date of bacterial genomes with known disease-causing potential or known non-pathogenic behaviour.
The dataset consisted of more than 21,000 bacterial genomes from international databases, including bacteria isolated from human infections, the healthy human microbiome, probiotic cultures, food production, and extreme environments, such as organisms capable of surviving in very hot or very cold conditions.
This gave the model a unique foundation for distinguishing between harmful and harmless bacteria, even when encountering previously undescribed species.
Read more
The study, entitled “Whole-genome prediction of bacterial pathogenic capacity on novel bacteria using protein language models with PathogenFinder2”, has been published in Bioinformatics.
The project is funded by the EU Horizon 2020 programme (grant 874735), the US National Institute of Allergy and Infectious Diseases under NIH (award U24AI183840), and the Novo Nordisk Foundation (grant NNF26SA0109818). It is also supported by the HPC RIVR Consortium and EuroHPC JU through access to computing resources.
Journal: Bioinformatics
DOI: 10.1093/bioinformatics/btag129
Article title: Whole-genome prediction of bacterial pathogenic capacity on novel bacteria using protein language models with PathogenFinder2
Publication date: 20-Mar-2026
Conflict of interest statement: Jose Juan Almagro Armenteros is an employee of Bristol Myers Squibb Company at the time of the publication; however, that did not influence the research.