How the names of organisms help to turn 'small data' into 'Big Data'

June 01, 2016

Innovation in 'Big Data' helps address problems that were previously overwhelming. What we know about organisms is spread across hundreds of millions of pages published over 250 years. New software tools from the Global Names project find scientific names, quickly index digital documents, and correct and update names. These advances help in "making small data big" by linking together the content of many research efforts. The study was published in the open access journal Biodiversity Data Journal.

Science is being transformed by the 'Big Data' vision, in which computing resources capture, manage, and interrogate the deluge of information coming from new technologies, from infrastructural projects that digitise physical resources (such as the literature held by the Biodiversity Heritage Library), and from digital versions of specimens and specimen records produced by museums.

Increased bandwidth has made dialogue among distributed data centres feasible, and this is how new insights into biology are arising. In the case of biodiversity sciences, data centres range in size from the large GenBank for molecular records and the Global Biodiversity Information Facility for records of occurrences of species, to a long tail of tens of thousands of smaller datasets and websites that carry information compiled by individuals, research projects, funding agencies, and local, state, national and international governmental agencies.

The large biological repositories do not yet approach the scale of those in astronomy and nuclear physics, but the very large number of sources in the long tail of useful resources does present biodiversity informaticians with a major challenge: how to discover, index, organize and interconnect the information held in a very large number of locations.

In this regard, biology is fortunate that, since the middle of the 18th century, the community has accepted the use of Latin binomials such as Homo sapiens or Ba humbugi for species. All names are listed by taxonomists. Name recognition tools can call on large expert compilations of names (Catalogue of Life, ZooBank, Index Fungorum, Global Names Index) to find matches in sources of digital information. This allows for the rapid indexing of content.

Even when we do not know a name, we can 'discover' it because scientific names have certain distinctive characteristics: they are written in italics and most often consist of two successive words in a latinised form, with the first one capitalised. These properties allow names not yet present in compilations of names to be discovered in digital data sources.
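To illustrate how such characteristics can be used, here is a minimal sketch of pattern-based name discovery. It is not the Global Names software; the function name `find_candidate_names` and the short list of Latin-looking endings are assumptions made for this example, and italics are ignored because they are not available in plain text.

```python
import re

# Heuristic pattern: a capitalised word (candidate genus) followed by
# a lowercase word of at least three letters (candidate species epithet).
BINOMIAL = re.compile(r"\b([A-Z][a-z]+)\s+([a-z]{3,})\b")

# Illustrative, incomplete list of latinised endings (an assumption of
# this sketch, not a rule taken from the study).
LATIN_ENDINGS = ("us", "um", "is", "ae", "ens", "ii", "i", "a")


def find_candidate_names(text):
    """Return strings in `text` that look like Latin binomials."""
    candidates = []
    for match in BINOMIAL.finditer(text):
        genus, epithet = match.groups()
        if epithet.endswith(LATIN_ENDINGS):
            candidates.append(f"{genus} {epithet}")
    return candidates


sample = "Specimens of Homo sapiens and Ba humbugi were compared."
print(find_candidate_names(sample))
```

A heuristic this simple also flags ordinary phrases such as "The data", so candidates would still need to be checked against other evidence, for example a compilation of known names.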

The idea of a names-based cyberinfrastructure is to use names to interconnect large and small sites of expert knowledge distributed across the Internet. This is the concept behind the Global Names project, which carried out the work described in this paper.

The effectiveness of such an infrastructure is compromised by the changes to names over time because of taxonomic and phylogenetic research. Names are often misspelled, or there might be errors in the way names are presented. Meanwhile, increasing numbers of species have no names, but are distinguished by their molecular characteristics.

To assess the challenge that these problems may present to the realization of a names-based cyberinfrastructure, we compared names from GenBank and DRYAD (a digital data repository) with names from Catalogue of Life to see how well they matched.

We found that fewer than 15% of the names in pair-wise comparisons of these data sources could be matched. However, by using a names parser to break the scientific names into all of their component parts, the parts that present the greatest number of problems could be removed to produce a simplified, or canonical, version of the name. With such tools, name-matching improved to almost 85%, and in some cases to 100%.
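The effect of reducing names to a canonical form can be sketched as follows. This is a simplified illustration under stated assumptions, not the parser used in the study: `canonical_form` and `match_rate` are hypothetical helpers, and the rule applied here (keep the genus plus the lowercase words that follow, drop parenthesised parts, authorship and year) ignores many real-world cases, such as lowercase author particles and rank markers.

```python
import re


def canonical_form(name_string):
    """Reduce a name string to genus plus epithet(s) (simplified rule)."""
    # Drop parenthesised parts such as a subgenus or bracketed authorship.
    cleaned = re.sub(r"\([^)]*\)", " ", name_string)
    tokens = cleaned.split()
    if not tokens:
        return ""
    genus = tokens[0].capitalize()
    epithets = []
    for token in tokens[1:]:
        # Treat purely lowercase words as epithets; stop at the first
        # token that is not one (author, year, rank marker and so on).
        if re.fullmatch(r"[a-z-]+", token):
            epithets.append(token)
        else:
            break
    return " ".join([genus] + epithets)


def match_rate(names, reference):
    """Fraction of `names` found verbatim in `reference`."""
    reference_set = set(reference)
    hits = sum(1 for name in names if name in reference_set)
    return hits / len(names) if names else 0.0


raw = ["Homo sapiens Linnaeus, 1758", "Ba humbugi Solem, 1983", "Canis lupus"]
reference = ["Homo sapiens", "Ba humbugi", "Canis lupus"]

print(match_rate(raw, reference))
print(match_rate([canonical_form(n) for n in raw], reference))
```

In this toy data set only one of the three raw strings matches the reference list exactly, while all three match after reduction to canonical form. The figures reported in the study come from the real data sources, not from this example.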

The study confirms the potential for the use of names to link distributed data and to make small data big. Nonetheless, it is clear that we need to continue to invest in more and better names-management software specially designed to address the problems in the biodiversity sciences.
-end-
Original source:

Patterson D, Mozzherin D, Shorthouse D, Thessen A (2016) Challenges with using names to link digital biodiversity information. Biodiversity Data Journal, doi: 10.3897/BDJ.4.e8080.

Additional information:

The study was supported by the National Science Foundation.

Pensoft Publishers
