Rapid validation for genome assemblies? Introducing KAT: K-mer Analysis Toolkit

December 05, 2016

Genome assembly projects are costly in both time and money, so discovering problems with your data only after assembly can be a real setback. With the K-mer Analysis Toolkit (KAT), researchers can assess and confirm their results at every stage.

Genome assembly with NGS technologies is like trying to do the hardest jigsaw puzzle you can imagine. The final jigsaw represents the full genome, and the individual pieces represent small fragments of the genome read out by the sequencer. Counterintuitively, to make the data more manageable, it is actually easier to first break these pieces into even smaller pieces called K-mers.

K-mers represent small fragments of the original genome with a fixed number (K) of DNA base pairs. A computer can work efficiently with large quantities of K-mers, then identify connections between these fragments to build up a representation of the original genome.
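
To make the idea concrete, here is a minimal Python sketch of breaking reads into K-mers and counting them. It is an illustration only, not KAT's implementation; the function names are hypothetical, and treating a K-mer and its reverse complement as the same "canonical" K-mer is an assumption borrowed from common K-mer counting practice.

```python
from collections import Counter

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmers(read, k):
    """Yield the canonical form of every K-mer in a read.

    The canonical form is the lexicographically smaller of a K-mer and
    its reverse complement (an assumption of this sketch). K-mers that
    contain anything other than A, C, G or T are skipped.
    """
    read = read.upper()
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if set(kmer) <= set("ACGT"):
            yield min(kmer, reverse_complement(kmer))


def count_kmers(reads, k):
    """Count how often each canonical K-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(canonical_kmers(read, k))
    return counts
```

For example, `count_kmers(["ACGTAC"], 3)` splits the read into ACG, CGT, GTA and TAC, which collapse into two canonical K-mers, ACG and GTA, each counted twice.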

K-mer-based techniques are commonly used to generate genome assemblies efficiently. KAT, however, is built to examine and compare K-mer datasets, using each distinct K-mer's underlying properties, such as frequency and nucleotide composition.
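
The two properties named above can be sketched in a few lines of Python. This is a hedged illustration rather than KAT's own code: it assumes a dictionary that maps each distinct K-mer to its count, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def kmer_spectrum(kmer_counts):
    """Map each frequency to the number of distinct K-mers seen that often."""
    return Counter(kmer_counts.values())


def gc_fraction(kmer):
    """Fraction of bases in a (non-empty) K-mer that are G or C."""
    return sum(base in "GC" for base in kmer) / len(kmer)


def mean_gc_by_frequency(kmer_counts):
    """Average GC fraction of the distinct K-mers at each frequency."""
    totals = defaultdict(float)
    sizes = defaultdict(int)
    for kmer, freq in kmer_counts.items():
        totals[freq] += gc_fraction(kmer)
        sizes[freq] += 1
    return {freq: totals[freq] / sizes[freq] for freq in sizes}
```

In a toy dataset, a group of K-mers whose frequency and GC content both differ from the bulk of the data would stand out in these summaries, which is the kind of signal that could prompt a closer look for contamination or bias.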

Initially, KAT can analyse sequencing data to identify error levels, biases and contamination. Information from this analysis can help researchers decide whether to proceed with downstream tasks such as genome assembly. KAT can then internally back-check your assembly to determine completeness and accuracy without any external reference data - a really useful feature when studying new organisms.
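
One way to picture a reference-free check is to compare K-mer counts from the reads with K-mer counts from the assembly. The Python sketch below is an assumption-laden illustration, not KAT's method: the function names and the `min_read_freq` threshold (used to ignore rare, probably erroneous K-mers) are hypothetical.

```python
def assembly_completeness(read_counts, assembly_counts, min_read_freq):
    """Fraction of well-supported read K-mers that appear in the assembly.

    K-mers seen fewer than min_read_freq times in the reads are ignored,
    on the assumption that they are mostly sequencing errors.
    """
    solid = [k for k, freq in read_counts.items() if freq >= min_read_freq]
    if not solid:
        return 0.0
    present = sum(1 for k in solid if assembly_counts.get(k, 0) > 0)
    return present / len(solid)


def unsupported_assembly_kmers(read_counts, assembly_counts):
    """K-mers found in the assembly but never observed in the reads."""
    return {k for k in assembly_counts if read_counts.get(k, 0) == 0}
```

A low completeness value suggests content in the reads that the assembly has dropped, while unsupported K-mers point to sequence in the assembly that the data does not back up.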

Lead Software Developer Daniel Mapleson said of the new tool: "Imagine genome assembly like Lego. Instead of trying to piece together long, 8x2-stud pieces with 6x2-stud pieces and 5x2-stud pieces, it's more like making a staircase pattern out of the smaller 2x2-bit pieces, overlapping one stud at a time.

"However, K-mers are not only useful for assembling a genome. By counting the number of K-mers in a sequencing dataset, you can learn a lot about it. By looking at the K-mer frequency profiles (K-mer spectra) we can assess the quality of the sequencing data in the first instance, such as working out if the dataset is clean, contains contaminants or is biased in some way. KAT can give answers to these questions quickly, even for non-model organisms where a reference is not available."

Project Leader and corresponding author Bernardo Clavijo commented: "The first thing many researchers do after sequencing a genome is to check the K-mer spectra of their data. This tells you if the information you will need to assemble the genome is there before you spend a lot of time, effort and money on doing the rest of the analysis. Now with KAT, researchers can do all kinds of validation and information comparison at this initial stage; but to also carry this forward to validation, we have included the relevant information at the end of the assembly.

"In terms of assembly validation, the tool is particularly useful with diploid genomes that can carry more than one copy of a gene. Certain regions can be falsely duplicated or deleted during assembly, leading the researcher to believe there's more or less copies of a gene than there really are. KAT can help to detect these artefacts by tracking both the data generated from the sequencer and data from the assembler, ultimately leading to faster, more accurate conclusions."

The paper, titled 'KAT: A K-mer Analysis Toolkit to quality control NGS datasets and genome assemblies', is published in Bioinformatics.

For more information, read our article: KAT got your tongue? An analysis tool to quickly detect problems in sequencing data and genome assemblies.

Earlham Institute
