Science Current Events | Science News | Brightsurf.com
 

Improved method for comparing genomes as well as written text

January 29, 2009

BERKELEY - Taking a hint from the text comparison methods used to detect plagiarism in books, college papers and computer programs, University of California, Berkeley, researchers have developed an improved method for comparing whole genome sequences.

With nearly a thousand genomes partly or fully sequenced, scientists are jumping on comparative genomics as a way to construct evolutionary trees, trace disease susceptibility in populations, and even track down people's ancestry.

To date, the most common techniques have relied on comparing a limited number of highly conserved genes - no more than a couple dozen - in organisms that have all these genes in common.

The new method can compare even distantly related organisms, or organisms whose genomes differ vastly in size and diversity. It also compares the entire genome, not just a small selected fraction of the protein-coding portion, which in the human genome is only 1 percent of the DNA.

The technique produces groupings of organisms largely consistent with current groupings, but with some interesting discrepancies, according to Sung-Hou Kim, professor of chemistry at UC Berkeley and faculty researcher at Lawrence Berkeley National Laboratory. However, the relative positions of the groups in the family tree - that is, how recently these groups evolved - are quite different from those based on conventional gene alignment methods.

The computational results have surprised scientists by classifying some bacteria and viruses that until now were enigmatic.

The technique, which employs feature frequency profiles (FFP), is described in a paper to appear this week in the early online edition of the journal Proceedings of the National Academy of Sciences.

Whole-genome vs. gene-centric methods

Current methods for comparing the genomes of different organisms focus on a small set of genes that the organisms being compared have in common. The genes are then lined up in order to count the sequence similarities and differences, from which a computer program constructs a family tree, with near relatives assumed to have more similar sequences than distant relatives.
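As a rough illustration of the counting step, the sketch below computes the fraction of differing positions between two already-aligned sequences, a simple measure often called the p-distance. The function name `p_distance`, the use of "-" to mark alignment gaps, and the toy sequences are assumptions made for this example; real alignment pipelines rely on more elaborate evolutionary models.

```python
def p_distance(seq_a, seq_b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        # Skip positions where either sequence has an alignment gap.
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return differing / compared

# Toy aligned sequences (hypothetical, not real genes).
close_pair = p_distance("ACGTACGT", "ACGTACGA")
far_pair = p_distance("ACGTACGT", "TGCAACGA")
```

A table of such pairwise values, one for every pair of organisms, is the kind of input a tree-building program works from: smaller values are read as closer kinship.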

This technique assumes organisms have genes in common, however, and that these "homologous" genes can be identified. When comparing distantly related species - such as bacteria that live in vastly different environments - this gene-centric method may not work, Kim said.

"What do you do when one gene tells you the organisms are closely related, and another gene tells you they're distantly related?" he asked. "It happens."

Kim, who in the past focused on creating three-dimensional demographic maps of all known protein structures, wanted a technique that could be used to compare genomes of all sizes, and even genomes only partially sequenced. He also wanted a method that would compare all regions of the genome, not just the exons - that is, the DNA transcribed into mRNA, the blueprint for proteins. Exons make up only 1 percent of the human genome, with the remainder being non-coding "introns," regulatory DNA, duplicate or redundant DNA and transposons - genes that have jumped from other places in the genome.

Kim thought that traditional text comparison - used, for example, to assess the authorship of a work of literature or to identify plagiarized text - might provide a model for whole genome comparison and a way to test comparison methods. But while text comparison involves looking at word frequency, genomes cannot be broken down into words.

"I can compare two books in two different ways. I can pick a few sentences, say a hundred that I subjectively decided are important, and compare them, but some are very similar and some very different in the two books," he explained. "So, how can I decide? I need a second method to compare some features representing one whole book to those of the other whole book."

A different vocabulary

Teaming up with biophysicist Gregory E. Sims, statistical mathematician Se-Ran Jun and theoretical physicist Guohong A. Wu, Kim decided to try a simple variant of the word frequency technique. They eliminated all punctuation and spaces from a text, created a dictionary of all the two-letter, three-letter, and longer letter combinations in the books, and counted the frequency of each fixed-length "word" or feature. The features were not back-to-back blocks of letters, but overlapping sequences obtained by sliding a two-, three- or more-letter window along the text, advancing one letter at a time.
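The sliding-window count described above can be sketched in a few lines of Python. This is an illustrative sketch, not the researchers' code: the function name `feature_frequency_profile`, the choice to lowercase the text, and the normalization of raw counts to relative frequencies are assumptions made for the example.

```python
from collections import Counter

def feature_frequency_profile(text, length):
    """Relative frequency of every overlapping window of `length` letters."""
    # Keep letters only, mirroring the removal of punctuation and spaces.
    letters = "".join(ch.lower() for ch in text if ch.isalpha())
    # Slide the window along the text, advancing one letter at a time.
    counts = Counter(
        letters[i:i + length] for i in range(len(letters) - length + 1)
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    # Assumption: normalize so that texts of different sizes are comparable.
    return {feature: n / total for feature, n in counts.items()}

profile = feature_frequency_profile("To be, or not to be", 2)
```

Here the text collapses to "tobeornottobe", which yields twelve overlapping two-letter features; "to", "ob" and "be" each occur twice, and the rest occur once.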

In a test of free online books obtained through Project Gutenberg, they found that this method, which they called the feature frequency profile (FFP) method, was more successful at identifying related books - books by the same author, books of the same genre, books from the same historical era - than word frequency profile analysis. In fact, a good tree can be constructed by looking at a single "optimal" feature length, such as nine letters, where the "vocabulary" is very large, instead of looking at all possible lengths.
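How two profiles are scored against each other is not spelled out here. As a hedged illustration, the sketch below uses the Jensen-Shannon divergence, a standard measure of how far apart two frequency distributions are. Treating it as the distance, along with the helper names `profile` and `jensen_shannon` and the toy sentences, is an assumption of this example.

```python
import math
from collections import Counter

def profile(text, length):
    """Relative frequencies of overlapping letter windows (spaces removed)."""
    letters = "".join(ch.lower() for ch in text if ch.isalpha())
    counts = Counter(
        letters[i:i + length] for i in range(len(letters) - length + 1)
    )
    total = sum(counts.values())
    if total == 0:
        raise ValueError("text is shorter than the feature length")
    return {feature: n / total for feature, n in counts.items()}

def jensen_shannon(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical profiles,
    1 for profiles that share no features."""
    divergence = 0.0
    for feature in set(p) | set(q):
        pf = p.get(feature, 0.0)
        qf = q.get(feature, 0.0)
        mf = (pf + qf) / 2  # frequency in the averaged profile
        if pf > 0:
            divergence += 0.5 * pf * math.log2(pf / mf)
        if qf > 0:
            divergence += 0.5 * qf * math.log2(qf / mf)
    return divergence

# Toy "books" (hypothetical): two near-duplicates and one unrelated string.
a = profile("the cat sat on the mat", 3)
b = profile("the cat sat on the hat", 3)
c = profile("zzzzqqqqxxxx", 3)
near = jensen_shannon(a, b)
far = jensen_shannon(a, c)
```

Computing such a value for every pair of texts gives a distance table from which a tree can be built, with low values grouping related texts together.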

"I was just stunned when I saw this," Kim said. One of the reasons this method works better, he said, may be that, while word frequency analysis treats each word independently, feature frequency analysis picks up syntax.

"Here, if I take a 9-letter window and slide it along the text," he said, "I am actually picking up the relationship between the first and second words - the local syntax - which was impossible to pick up from the word frequency method. Apparently, that is very important."

Mammalian and bacterial genomes

Buoyed by this success, the researchers applied the technique to whole genomes of mammals, where there is the least controversy in evolutionary relationship. "We treat the genome like a book without spaces," Kim said.

Since these genomes are very large, the researchers translated the genome sequences using a reduced, two-letter alphabet - R for the purine bases, adenine and guanine, and Y for the pyrimidine bases, thymine and cytosine - to reduce the complexity of calculation. Using an optimal feature length of 18 base pairs, this test created a family tree identical to the phylogenetic trees constructed by scientists using genetic, morphological, anatomical, fossil and behavioral information. This was surprising, especially since the overwhelming majority of each mammalian genome does not code for proteins, Kim said.
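The purine/pyrimidine recoding is a simple letter substitution. The sketch below shows the idea; the names `to_purine_pyrimidine` and `count_features` and the toy sequence are hypothetical, and this is not the researchers' code. The payoff is in the vocabulary size: with two letters there are 2^18 = 262,144 possible 18-letter features, compared with 4^18 (about 69 billion) for the four-letter DNA alphabet.

```python
from collections import Counter

# Purines (A, G) become R; pyrimidines (C, T) become Y.
RY_TABLE = str.maketrans("AGCT", "RRYY")

def to_purine_pyrimidine(dna):
    """Recode a DNA string into the two-letter R/Y alphabet."""
    # Characters outside A, C, G, T (such as N) pass through unchanged.
    return dna.upper().translate(RY_TABLE)

def count_features(sequence, length):
    """Count overlapping windows of `length` letters."""
    return Counter(
        sequence[i:i + length] for i in range(len(sequence) - length + 1)
    )

recoded = to_purine_pyrimidine("ATGCGTTA")  # toy sequence, not real data
features = count_features(recoded, 3)       # short window for the toy example
```

For a real genome the window would be set to the chosen feature length, 18 in the mammalian test described above.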

Next, they tried the FFP method on 518 genomes, the bulk of them bacteria and Archaea, but also six eukaryotes of varying complexity and two random sequences. The eukaryotic genomes used were as much as 1,000 times longer than the bacterial and Archaeal genomes. Because most of the bacterial and Archaeal genomes code for proteins, as opposed to very little of the genomes of higher eukaryotes, the researchers used a different alphabet and vocabulary for the FFP method: short strings of amino acids, the building blocks of proteins, with a 20-letter alphabet representing the 20 possible amino acids.

"The question is: Can we then group all living organisms based on the whole proteome, that is, the assembly of all proteins, instead of using just a selection of a small set of proteins, which is equivalent to using a small set of genes?" said Kim.

The researchers found that the FFP method clearly segregates whole proteomes of all bacteria, archaea, eukaryotes and random sequences into separate groups or domains. Most of the phylum groups within each domain and class groups in each phylum also were well segregated, with some interesting discrepancies compared to the currently accepted groupings.

In most of the cases where the FFP grouping disagreed with an accepted phylogenetic grouping, the problem organism had been the subject of debate among biologists because of conflicting conclusions from genetics, behavior and morphology, Kim said. The new method did classify several so-far unclassified bacteria, however.

The major differences are found not in how the organisms are grouped, but in the relative position of these groups in the organism trees, he said.

Viral genomes

Finally, Kim and his colleagues analyzed the genomes of several hundred viruses, including several that could not be classified.

"Some viruses have no or few highly conserved common genes to other viruses, thus, the gene alignment-based method cannot find relationship among such groups, but we think we can," he said.

Because of the vast amount of whole genome sequence data, all of Kim's analyses monopolized a computer cluster of 320 CPUs (central processing units) for over a year.

Kim stressed the major difference between FFP and gene-centric comparisons of genomes: FFP takes into account all or most of the DNA or protein sequences in the genome, while gene alignment analysis chooses a small set of genes out of the large number of genes in each organism, and uses that set to represent the organism.

"The fallacy of the view that organisms can be represented by a small set of their genes is really due to our prejudice that genes are us," Kim said. "We know now, more and more, that this is oversimplification.

"It is likely that some of the observations we come up with will turn out to be wrong, but the method will evolve and get better and better as experts come in and tell us where we have gone wrong. The math is there, now we have to remove the human bias as much as possible."

In addition to applying the method to comparative genomics, Kim expects it will help in grouping and finding relationships among sets of other information, such as electronic information encoding text, sounds and images. It may also help in tracing human ancestry and disease demography using whole genome sequences, and in grouping of metagenomic data - the sequences of genome fragments from many organisms, most of which are unknown species, found in a given environmental niche or body organ.

Kim hopes someday to return to Shakespearean texts and sort out their provenance as well.

The work was funded by the National Institutes of Health and by a grant from the Korean Ministry of Education, Science and Technology.

University of California, Berkeley




© 2012 BrightSurf.com