Machine learning generates realistic genomes for imaginary humans

February 05, 2021

Thanks to novel algorithms and advances in computing, machines can now learn complex models and generate high-quality synthetic data, such as photo-realistic images or even résumés of imaginary humans. A study recently published in the international journal PLOS Genetics uses machine learning to mine existing biobanks and generate chunks of human genomes that do not belong to real humans but have the characteristics of real genomes.

"Existing genomic databases are an invaluable resource for biomedical research, but they are either not publicly accessible or shielded behind long and exhausting application procedures due to valid ethical concerns. This creates a major scientific barrier for researchers. Machine-generated genomes, or artificial genomes as we call them, can help us overcome the issue within a safe ethical framework," said Burak Yelmen, first author of the study and Junior Research Fellow of Modern Population Genetics at the University of Tartu.

The multidisciplinary team performed multiple analyses to assess the quality of the generated genomes compared to real ones. "Surprisingly, these genomes emerging from random noise mimic the complexities that we can observe within real human populations and, for most properties, they are not distinguishable from other genomes from the biobank we used to train our algorithm, except for one detail: they do not belong to any gene donor," said Dr Luca Pagani, one of the senior authors of the study and a Mobilitas Pluss fellow.

The study also assesses how close the artificial genomes are to real genomes, to test whether the privacy of the original samples is preserved. "Although detecting privacy leaks among thousands of genomes could appear as looking for a needle in a haystack, combining multiple statistical measures allowed us to check all models carefully. Excitingly, the detailed exploration of complex leakage patterns can lead to improvements in generative model evaluation and design, and will fuel back the machine learning field," said Dr Flora Jay, the coordinator of the study and CNRS researcher in the Interdisciplinary computer science laboratory (LRI/LISN, Université Paris-Saclay, French National Centre for Scientific Research).

All in all, machine learning approaches have already provided faces, biographies and multiple other features to a handful of imaginary humans; now we know more about their biology. These imaginary humans with realistic genomes could serve as proxies for the real genomes that are not publicly available or that require long application procedures or collaborations, removing an important accessibility barrier in genomic research, in particular for underrepresented populations.

Estonian Research Council
