Chinese genomics giant BGI releases latest bioinformatics software and datasets

November 12, 2011

November 12, 2011, Shenzhen, China - BGI, the world's largest genomic organization, announces several bioinformatics analysis pipelines and software, including assembly and binning tools, genetic variation software, as well as two cloud-based green solutions for genomic-based research. In addition, GigaScience, an upcoming research journal published by BGI, announces the launch of its new, freely accessible, large-scale database: GigaDB. The launch of GigaDB is heralded by today's release of numerous large datasets of different types and from a variety of organisms. GigaDB is unique because it is directly affiliated with a journal and all of its datasets are assigned a Digital Object Identifier (DOI), which allows these data to be directly cited in future publications.

New Software and Pipelines

Today, on the first day of the "6th International Conference of Genomics" (ICG-6) hosted by BGI, researchers reported the availability of updated and newly released bioinformatics applications, pipelines, and tools. These include the Short Oligonucleotide Analysis Package (SOAP) series and cloud-based software (Hecate 2, Gaea 2, GAMA, GSNP, and Adam) for next-gen data analysis, among others.

According to BGI's researchers, the updated SOAP series released today includes SOAP3, a GPU-accelerated short read alignment tool; SOAPindel, an indel finder; SOAPfusion, a gene fusion detector; SOAPsplice, a splice-junction detector; SOAPdenovo-Trans, a de novo transcriptome assembler; and Metacluster 4.0, a binning tool for metagenomics data. The SOAP toolkit is freely available at

Dr. Zhiyu Peng, Vice President of the Research & Cooperation Division at BGI, gave a detailed introduction to SOAPsplice and SOAPfusion, two RNA-Seq-based analysis tools designed to detect splice junctions and gene fusions, respectively. Tests of SOAPsplice on both simulated and real datasets showed high sensitivity and high specificity, and these qualities became more pronounced at low sequencing depth. Analyses using SOAPfusion showed that it has the highest sensitivity and lowest false discovery rate of all currently published gene-fusion detection tools.

In regard to these new tools, Dr. Peng stated that the "Emergence of the RNA-Seq technology provides unprecedented opportunities and accelerates the speed in the detection of fusion genes and splice junction sites. In particular, the gene fusion discovery performed by SOAPfusion provides an accurate and specific way which will greatly accelerate the study of genomic alternations in cancer as well as the therapeutic cancer studies."

SOAPdenovo-Trans is an assembler designed to handle alternative splicing and differing expression levels among transcripts for de novo transcriptome assembly using short RNA-Seq reads. Discussing this assembler, Dr. Yin Long Xie, Senior Bioinformatician of BGI, said, "We evaluated SOAPdenovo-Trans on samples of mouse and rice as the animal and plant models, and the results showed this assembler could provide a more accurate, complete and faster way to construct the transcript sets."

Another area that requires extensive next-gen data analysis is metagenome studies. Metagenomic data poses a fundamental computational problem, known as binning: how to group together sequence reads that come from the same or similar species. At the release conference, Prof. Sim-Ming Yiu from the University of Hong Kong gave a presentation on some existing solutions and on Metacluster 4.0, the latest software tool for solving this binning problem. According to Prof. Yiu, this tool is able to handle 100 species at varying abundance ratios.
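To make the binning problem concrete, the following is a minimal toy sketch of composition-based binning: reads with similar k-mer frequency profiles are placed in the same bin. It is not Metacluster's algorithm, which this release does not describe; the function names, the k-mer length, the distance threshold, and the example reads are all assumptions chosen for illustration.

```python
from collections import Counter
from itertools import product
import math

K = 2  # k-mer length; deliberately tiny, for illustration only


def kmer_profile(read, k=K):
    """Return the normalized k-mer frequency vector of a read."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(read[i:i + k] for i in range(len(read) - k + 1))
    total = sum(counts[km] for km in kmers) or 1
    return [counts[km] / total for km in kmers]


def distance(u, v):
    """Euclidean distance between two frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def bin_reads(reads, threshold=0.3):
    """Greedy binning: assign each read to the first bin whose seed
    profile lies within `threshold`, otherwise open a new bin."""
    bins = []  # list of (seed_profile, member_reads)
    for read in reads:
        profile = kmer_profile(read)
        for seed, members in bins:
            if distance(seed, profile) <= threshold:
                members.append(read)
                break
        else:
            bins.append((profile, [read]))
    return [members for _, members in bins]


# Hypothetical reads from two "species" with very different composition.
reads = ["ATATATATATAT", "TATATATATATA", "GCGCGCGCGCGC", "CGCGCGCGCGCG"]
groups = bin_reads(reads)
print(groups)
```

Real metagenomic reads are far noisier than this example, and species at low abundance contribute few reads, which is part of what makes the problem hard at the scale Prof. Yiu described.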

Cloud-based Green Solutions

With the rapid development of high-throughput sequencing technology over the past ten years, genomic studies have gradually become a standard approach in a wide range of research areas. Given that such research creates huge amounts of data, cloud computing is becoming a favorable solution for large-scale bioinformatic analysis, offering better resource utilization, flexibility, and efficiency, as well as time and cost savings for massive data generation and computation.

Many IT companies and large genomic organizations have been gradually shifting their analytical methods to cloud-based green (more energy-efficient) solutions for processing enormous amounts of biological data. "With the cooperation with BGI, we have made many achievements in software development on green cloud computing," said Dr. Mian Lu from the Hong Kong University of Science and Technology. "A data processing pipeline has been re-implemented on GPU platform, and we have improved its efficiency: which could take only 6 hours to finish processing the data which needed 90 hours before."

An important part of these green solutions is the much shorter computation time achieved by software developed for specialized hardware. GSNP and GAMA are two discovery tools for genetic variation implemented on the GPU platform. GSNP is used to detect single-nucleotide polymorphisms, and GAMA is a software tool used to estimate allelic frequencies. Compared with its predecessor SOAPsnp, GSNP achieves higher performance through an improved sparse representation of base information and massive data parallelism on the GPU. Dr. Lu noted, "Within about 2 hours, a former three days process on human genome, can be done using GSNP." The original version of GAMA could take a year or more to compute the allele frequencies for a group of 1,000 individuals; however, Dr. Lu noted that the new version of "GAMA can generate the result in two days."
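As a rough worked calculation, the running times quoted above translate into the following approximate speedup factors. The conversions (a day as 24 hours, "a year" as 365 days) and the `speedup` helper are assumptions made for this sketch, and the quoted times are themselves approximate.

```python
def speedup(before_hours, after_hours):
    """Return how many times faster the new running time is."""
    return before_hours / after_hours


# Figures quoted in this release, converted to hours.
pipeline = speedup(90, 6)            # re-implemented GPU pipeline: 90 h -> 6 h
gsnp = speedup(3 * 24, 2)            # GSNP: three days -> about 2 hours
gama = speedup(365 * 24, 2 * 24)     # GAMA: about a year -> two days

print(f"pipeline: {pipeline:.1f}x")  # 15.0x
print(f"GSNP:     {gsnp:.1f}x")      # 36.0x
print(f"GAMA:     {gama:.1f}x")      # 182.5x
```

Because the original GAMA run is described as taking a year or more, the last figure is best read as a lower bound.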

Dr. Lu also described another tool, Adam, which was "developed by exploiting hardware features, which could sort and remove duplicate from massive data. Its performance has been improved by three times, handling 150GB data with a node of 25GB memory." For further information about the new software and pipelines, please visit
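Sorting and de-duplicating data that is larger than available memory is commonly done with an external merge sort. The sketch below shows that general technique only; it is not Adam's implementation, which this release does not describe. The function names and `chunk_size` parameter are hypothetical, and records are assumed to be strings without newlines.

```python
import heapq
import os
import tempfile
from itertools import groupby, islice


def _write_run(sorted_records):
    """Write one sorted chunk to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(text=True)
    with os.fdopen(fd, "w") as handle:
        handle.writelines(record + "\n" for record in sorted_records)
    return path


def _read_run(path):
    """Stream the records of one sorted run back from disk."""
    with open(path) as handle:
        for line in handle:
            yield line.rstrip("\n")


def sort_unique(records, chunk_size):
    """Yield records in sorted order with duplicates removed.

    At most `chunk_size` records are sorted in memory at a time;
    sorted runs are spilled to disk and then merged lazily.
    """
    iterator = iter(records)
    run_paths = []
    try:
        while True:
            chunk = sorted(islice(iterator, chunk_size))
            if not chunk:
                break
            run_paths.append(_write_run(chunk))
        merged = heapq.merge(*(_read_run(p) for p in run_paths))
        # Equal records are adjacent after the merge, so groupby drops repeats.
        for record, _ in groupby(merged):
            yield record
    finally:
        for path in run_paths:
            os.remove(path)


result = list(sort_unique(["read3", "read1", "read3", "read2", "read1"], chunk_size=2))
print(result)
```

The design keeps peak memory proportional to the chunk size rather than the input size, which is the same constraint as in the quoted example of handling 150GB of data on a node with 25GB of memory.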

In addition to its announcements on new software developed for specialized hardware, the BGI Bioinformatics Department also revealed its updated "flexible computing" solutions for de novo assembly and resequencing analyses: Hecate 2 and Gaea 2. The original versions, Hecate 1 and Gaea 1, were released in July of this year and drew significant attention worldwide from many biological researchers and news reporters.

Compared with its predecessor, Hecate 2 has greater scalability, especially in terms of cost and time. "Hecate 2 adopts more sophisticated models for solving massive scale constraint optimization problems in de novo assembly in a fine-grained manner, which enables data from different sequencing platform to be assembled simultaneously and leads a dramatic improvement of the assembly quality in terms of accuracy, length and coverage," said Evan Xiang, R&D Director at the Flexible Computing Center of BGI.

Xiang also commented on Gaea 2, saying that its processing speed increases linearly with cluster size, and added that "the performance of Gaea 2 could surpass current available alignment software by aggregating their advanced functionalities into a unified cloud based solution."

GigaDB launched with release of 17 additional large-scale datasets

GigaDB hosts publicly available, large-scale datasets and provides every dataset with a unique DOI. A DOI enables researchers to specifically reference these datasets in independent publications where the data are used. GigaDB is associated with the journal GigaScience, an upcoming research journal published by BGI and BioMed Central.
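As a small illustration of how a dataset DOI can be used in a reference, the sketch below builds a resolver link and a citation string. The DOI, authors, and title are placeholders rather than real GigaDB records, the citation layout is an assumption and not a format prescribed by GigaDB or GigaScience, and the helper names are hypothetical. The `https://doi.org/` prefix is the standard DOI resolver.

```python
def doi_url(doi):
    """Return the resolver URL for a DOI."""
    return "https://doi.org/" + doi


def cite_dataset(authors, year, title, repository, doi):
    """Format a simple data citation (layout is illustrative only)."""
    return f"{authors} ({year}): {title}. {repository}. {doi_url(doi)}"


# Placeholder values; not a real dataset or DOI.
example = cite_dataset(
    authors="Example A, Example B",
    year=2011,
    title="Hypothetical genome dataset",
    repository="GigaDB",
    doi="10.1234/example-dataset",
)
print(example)
```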

Today's launch of GigaDB is accompanied by the release of seventeen large datasets on top of those already hosted, such as the genome of the recent deadly outbreak strain of E. coli O104. These datasets now span much of the tree of life, with data hosted from plants, animals (vertebrate and invertebrate), and microbes. The plant data includes whole-genome data from the foxtail millet, the potato, the Chinese cabbage, the domestic cucumber, the pigeonpea, and sweet and grain sorghums. The animal data includes whole-genome data from three species of ants, a roundworm (Ascaris suum), the naked mole rat, the domestic sheep, domestic and wild silkworms, the Tibetan antelope, and three different datasets (whole genome, transcriptome, and methylome) from a single Asian man.

These data are all freely accessible and will be of great use for analyses in a wide range of life-science fields. The DOI issued to each dataset allows researchers to cite the data itself directly, as a separate entity from the data analysis papers. This is a major step in promoting extremely rapid data release. Because data can now be cited directly, data producers can be properly acknowledged and recognized for their work and no longer need to wait to release the data until a more extensive analysis paper has been written, reviewed, revised, and published. Additionally, DOIs make these data permanently accessible, easy to find and use, and available for replicating previous work. Five of these newly released GigaDB datasets illustrate the future of early data release: they are made available with a DOI, allowing the data producers to receive citable credit, for rapid use by the community before the analysis papers are published. The analysis paper for the sorghum genome has recently been accepted in Genome Biology and is expected to be published later this month, demonstrating a new gold standard of placing a dataset citation in the references where it can be easily tracked.

GigaDB is available at

About BGI

BGI was founded in Beijing, China, in 1999 with the mission to become a premier scientific partner for the global research community. The goal of BGI is to make leading-edge genomic science highly accessible, which it achieves through its investment in infrastructure, leveraging the best available technology, economies of scale, and expert bioinformatics resources. BGI, and its affiliates, BGI Americas and BGI Europe, have established partnerships and collaborations with leading academic and government research institutions as well as global biotechnology and pharmaceutical companies, supporting a variety of disease, agricultural, environmental, and related applications.

BGI has a proven track record of excellence, delivering results with high efficiency and accuracy for innovative, high-profile research: research that has generated over 170 publications in top-tier journals such as Nature and Science. BGI's many accomplishments include sequencing one percent of the human genome for the International Human Genome Project; contributing 10 percent to the International Human HapMap Project; carrying out research to combat SARS and the deadly German E. coli outbreak; playing a key role in the Sino-British Chicken Genome Project; completing the sequences of the rice genome, the silkworm genome, the first Asian diploid genome, and the potato genome; and, more recently, sequencing the human gut metagenome and a significant proportion of the genomes for the 1000 Genomes Project.

For more information about BGI, please visit

About GigaScience

GigaScience is an upcoming journal co-published by BGI and BioMed Central. The journal aims to revolutionize data dissemination, organization, understanding, and use. GigaScience is an online open-access, open-data journal that publishes "big data" studies from the entire spectrum of life and biomedical sciences. To achieve its goals, the journal has a novel publication format: one that links standard manuscript publication with an extensive database that hosts all associated data and provides data analysis tools and cloud-computing resources. The scope of GigaScience covers large-scale life-science data, including imaging, neuroscience, ecology, medicine, systems biology, 'omics' and other types of large-scale sharable data. The impact of data sharing in these fields is enormous: it increases the visibility of labs and of researchers' work, enables stronger data analysis and interpretation, leads to better experimental reproducibility, promotes the development of new tools and methods, and creates new training opportunities for students.

Editor-in-Chief: Laurie Goodman, PhD; Editor: Scott Edmunds, PhD; Assistant Editor: Alexandra Basford, PhD. Contact:; Twitter: @gigascience

For more information about GigaScience and GigaDB, please visit: and

Contact Information:

Yingrui Li, Director
Science and Technology Department

Dr. Bicheng Yang
Public Communication Officer

Laurie Goodman, PhD
Natick, MA 01760 USA
Tel: 516-984-7477

Scott Edmunds, PhD
Tel: (852) 9249-0853; +86 13418914644

BGI Shenzhen
