
Sifting gold from the data deluge

November 08, 2017

Next-generation DNA sequencing technologies have flooded databases and hard drives worldwide with large data sets, but are researchers getting the most they can out of this deluge of data? In a new study in the October issue of Applications in Plant Sciences, Dr. Brent Berger and colleagues propose one way to sift the remaining gold out of large sequence data sets. The authors show that a new data mining technique can be used to glean valuable information from existing data sets, and prove the concept by retrieving sequence from genes influencing the peculiar floral structures seen in the plant family Goodeniaceae.

DNA sequencing has become so cheap that even if a researcher is only really interested in the sequence of a few genes, it is often most practical to just sequence the whole genome. Bioinformatic techniques can pick out the desired gene sequence later, with less hassle than targeting specific genes to sequence. This practice, known as "genome skimming," has become an increasingly popular way to answer questions about relationships between plant species.

The premise of genome skimming is to use low-coverage shotgun sequencing to retrieve DNA sequence from high-copy fractions of the genome. In shotgun sequencing, the genome is broken up into small chunks for sequencing, and then stitched back together computationally using the overlaps between the chunks, a process called assembly. The amount of "coverage" corresponds to how many of those small chunks are sequenced; the higher the coverage, the easier it is to stitch the genome back together, resulting in a more complete genome sequence.
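As a rough illustration of what "coverage" means numerically, the sketch below computes average sequencing depth and a simple estimate of how much of a genome is touched by at least one read. The read counts, read length, and genome size are hypothetical values chosen for illustration, not figures from the study, and the estimate assumes reads land uniformly at random (a Lander-Waterman-style simplification).

```python
import math


def mean_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth: total sequenced bases divided by genome length."""
    return num_reads * read_length / genome_size


def expected_fraction_covered(coverage: float) -> float:
    """Expected fraction of bases hit by at least one read.

    Assumes reads are placed uniformly at random, so the number of reads
    over any base is roughly Poisson-distributed with mean `coverage`.
    """
    return 1.0 - math.exp(-coverage)


# Hypothetical 1 Gbp genome sequenced with 100 bp reads.
GENOME_SIZE = 1_000_000_000
low = mean_coverage(num_reads=2_000_000, read_length=100, genome_size=GENOME_SIZE)
high = mean_coverage(num_reads=300_000_000, read_length=100, genome_size=GENOME_SIZE)

for label, cov in (("low-coverage run", low), ("high-coverage run", high)):
    print(f"{label}: {cov:.1f}x depth, "
          f"~{expected_fraction_covered(cov):.1%} of bases seen at least once")
```

Under these assumptions the low-coverage run leaves most of the genome unsampled, which is why stitching the whole genome back together is far easier at high depth.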

But higher coverage is more expensive, and some questions can be answered with a cheaper, low-coverage sequencing run. "High-copy fractions" of total genomic DNA, such as chloroplast genomes or nuclear ribosomal DNA, are in higher abundance in the sequence pool, and so can be fully sequenced even in cheap, low-coverage runs. Sequences from these high-copy genomic fractions are typically used to resolve evolutionary relationships between different species and groups. But in the process of genome skimming, researchers produce and then discard huge amounts of potentially valuable sequence data. "Many genome-skimming data sets are used for assembling the chloroplast genome, which in our case, only used 3% of the sequenced data," remarked Dr. Dianella Howarth, a co-author on the study.
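The arithmetic behind "high-copy fractions" can be sketched as follows. If reads are drawn in proportion to the amount of DNA each fraction contributes per cell, then the depth on any one fraction scales with its copy number. All sizes and copy numbers below are hypothetical placeholders, not measurements from the Goodeniaceae data set, and the function name is made up for this example.

```python
def depth_by_fraction(total_bases: float, fractions: dict) -> dict:
    """Estimate sequencing depth for each genomic fraction.

    `fractions` maps a name to (length_bp, copies_per_cell). Reads are
    assumed to be sampled in proportion to DNA mass per cell, so the
    length cancels out and depth is proportional to copy number.
    """
    dna_per_cell = sum(length * copies for length, copies in fractions.values())
    return {
        name: total_bases * copies / dna_per_cell
        for name, (length, copies) in fractions.items()
    }


# Hypothetical plant cell: diploid 1 Gbp nuclear genome and
# 1,000 copies of a 150 kbp chloroplast genome.
fractions = {
    "nuclear": (1_000_000_000, 2),
    "chloroplast": (150_000, 1_000),
}

# Hypothetical skimming run: 10 million reads of 100 bp.
depths = depth_by_fraction(total_bases=10_000_000 * 100, fractions=fractions)

for name, depth in depths.items():
    print(f"{name}: ~{depth:.1f}x")
```

With these made-up numbers the nuclear genome is sampled at under 1x while the chloroplast genome reaches several hundred-fold depth, which is the reason a cheap run can still yield a complete chloroplast sequence.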

In this study, the authors took a second look at a genome-skimming data set previously used to resolve evolutionary relationships in the Goodeniaceae, a family of plants commonly called "fan flowers" or "half flowers" due to their intriguing flower shape, which looks like somebody cut the flower in half. The authors wanted to see if this genome-skimming data set could be plumbed for more information on the genetics behind this unique floral structure. They used several software packages to assemble previously unused sequence fragments from the low-copy fraction of the original genome-skimming data set. They then searched the resulting assembly for sequence from a set of genes called CYCLOIDEA genes, which are involved in floral structure and symmetry.
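The article does not name the software the authors used, so the following is only a toy sketch of the general idea of screening assembled fragments for a gene of interest. It flags contigs that share short subsequences (k-mers) with a query sequence on either strand. The sequences, the function names, and the `k` and `min_shared` thresholds are all invented for illustration; a real analysis would rely on an established alignment or search tool.

```python
def kmers(seq: str, k: int) -> set:
    """Return the set of all length-k substrings of `seq` (uppercased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def screen_contigs(contigs: dict, query: str, k: int = 8, min_shared: float = 0.3) -> list:
    """Rank contigs by the fraction of query k-mers they contain.

    Both strands of each contig are checked. Returns (name, score)
    pairs with score >= min_shared, best first. The query must be at
    least k bases long.
    """
    query_kmers = kmers(query, k)
    hits = []
    for name, seq in contigs.items():
        contig_kmers = kmers(seq, k) | kmers(revcomp(seq), k)
        score = len(query_kmers & contig_kmers) / len(query_kmers)
        if score >= min_shared:
            hits.append((name, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Made-up query and contigs, not real CYCLOIDEA sequence.
query = "ATGGCTTCAGGAACCTTGACGGATCCA"
contigs = {
    "contig_a": "TTTT" + query + "CCCC",                  # full match, forward strand
    "contig_b": "AAAA" + revcomp(query[5:25]) + "AAAA",   # partial match, reverse strand
    "contig_c": "GGGGGGGGGGGGGGGGGGGGGGGG",               # unrelated
}

hits = screen_contigs(contigs, query)
for name, score in hits:
    print(f"{name}: {score:.0%} of query k-mers found")
```

Exact k-mer matching is deliberately simplistic here; it would miss diverged gene copies across species, which is one reason dedicated similarity-search tools are the usual choice.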

The authors were able to retrieve enough portions of the genes, from multiple species, to create full alignments of all four CYCLOIDEA genes in the core Goodeniaceae. These data could prove useful for future studies on the evolution of the bizarre floral structure seen in this group. "Comparing sequences from CYCLOIDEA-like genes across this clade could provide clues about the precise sequence changes that result in changes in floral morphology," explained Dr. Howarth.

More generally, Dr. Howarth continued, "Pieces of any gene of interest could potentially be mined from genome-skimming data sets that have already been completed." A piece of a gene may not sound like much, but there are a surprising number of uses for these fragments. "These data could provide enough information to determine useful nuclear regions for phylogenetic analyses or pinpoint possible gene duplication events. Additionally, probes for target enrichment sequencing could be generated quickly across a clade to examine candidate genes and their regulatory regions in evo-devo studies."

Data mining approaches like these allow for much fuller use of genome-skimming data sets. This allows important questions to be answered with existing data, and opens the door to scientists who lack the resources to produce large-scale data sets--for example, scientists at smaller colleges or in countries without large grant-making bodies. As DNA sequence data continue to flood in, studies such as this one point to ways to make sure we don't let valuable information float by.
Brent A. Berger, Jiahong Han, Emily B. Sessa, Andrew G. Gardner, Kelly A. Shepherd, Vincent A. Ricigliano, Rachel S. Jabaily, and Dianella G. Howarth. 2017. The unexpected depths of genome-skimming data: A case study examining Goodeniaceae floral symmetry genes. Applications in Plant Sciences 5(10): 1700042. doi:10.3732/apps.1700042

Applications in Plant Sciences (APPS) is a monthly, peer-reviewed, open access journal focusing on new tools, technologies, and protocols in all areas of the plant sciences. It is published by the Botanical Society of America, a nonprofit membership society with a mission to promote botany, the field of basic science dealing with the study and inquiry into the form, function, development, diversity, reproduction, evolution, and uses of plants and their interactions within the biosphere. APPS is available as part of BioOne's Open Access collection.

For further information, please contact the APPS staff at

Botanical Society of America
