Big data: Searching in large amounts of data quickly and efficiently

March 01, 2013

The term "big data" refers to a volume of digital information so large and so complex that conventional database technology cannot process it. Scientific institutes such as the nuclear research center CERN are not the only ones that store huge amounts of data. Companies like Google and Facebook do so as well, and analyze the data to make better strategic business decisions. How successful such an approach can be was shown in a New York Times article published last year. It reported on the US-based company Target which, by analyzing a young woman's buying patterns, knew about her pregnancy before her father did.

The data to be analyzed is distributed across several servers on the internet, and search queries are sent to several servers in parallel. Traditional database management systems do not fit every use case: either they cannot cope with big data, or they overwhelm the user. Data analysts therefore favor tools that are based on the open-source software framework Apache Hadoop and use its efficient file system HDFS. These tools do not require expert knowledge. "If you are used to the programming language Java, you can already do a lot with it," explains Jens Dittrich, professor of information systems at Saarland University. He adds, however, that Hadoop cannot query big datasets as efficiently as database systems designed for parallel processing.

The solution developed by Dittrich and his colleagues is the "Hadoop Aggressive Indexing Library", or HAIL for short. It stores enormous amounts of data in HDFS in such a way that queries are answered up to 100 times faster. The researchers use a method familiar from the telephone book: the entries are sorted by surname so that you do not have to read the complete list of names. This sorting of the names produces the so-called index.
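The telephone book idea can be illustrated with a few lines of Python. This is only a minimal sketch of a sorted index with binary search, not HAIL's actual implementation; the class name `SortedIndex` and the sample entries are made up for illustration.

```python
import bisect

# Hypothetical phone book entries: (surname, phone number).
ENTRIES = [
    ("Schmidt", "0681-1111"),
    ("Becker", "0681-2222"),
    ("Wagner", "0681-3333"),
    ("Meyer", "0681-4444"),
    ("Becker", "0681-5555"),
]


class SortedIndex:
    """Keeps the rows sorted by one field so lookups can use binary search."""

    def __init__(self, rows, key_pos):
        # Sorting once up front is the "indexing" step.
        self.rows = sorted(rows, key=lambda row: row[key_pos])
        self.keys = [row[key_pos] for row in self.rows]

    def lookup(self, value):
        # Binary search finds the range of matching rows
        # without reading the whole list.
        lo = bisect.bisect_left(self.keys, value)
        hi = bisect.bisect_right(self.keys, value)
        return self.rows[lo:hi]


by_surname = SortedIndex(ENTRIES, key_pos=0)
print(by_surname.lookup("Becker"))
```

Without the sorted copy, every lookup would have to scan all entries. With it, the number of comparisons grows only logarithmically with the number of entries.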

The researchers generate such an index for the datasets they distribute across several servers. In contrast to the telephone book, however, they sort the data by several criteria at once and store multiple copies of it. "The more criteria you provide, the higher the probability that you find the specified data very fast," Dittrich explains. "To use the telephone book example again, it means that you have six different books. Every one contains a different sorting of the data - according to name, street, ZIP code, city and telephone number. With the right telephone book you can search according to different criteria and will succeed faster." In addition, Dittrich and his research group managed to generate the indexes at no additional cost. They organized the indexing in such a way that it requires no additional computing time and causes no delay. Even the additional storage space required is low.
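The idea of keeping several differently sorted copies can be sketched as follows. This is a toy in-memory model under stated assumptions, not HAIL's API: the names `Replica` and `query`, the choice of three copies and the sample records are all hypothetical.

```python
import bisect

FIELDS = ("name", "street", "zip_code", "city", "phone")

# Hypothetical sample records, one tuple per phone book entry.
RECORDS = [
    ("Becker", "Hauptstrasse 1", "66123", "Saarbruecken", "0681-2222"),
    ("Meyer", "Bahnhofstrasse 7", "66111", "Saarbruecken", "0681-4444"),
    ("Schmidt", "Am Markt 3", "66424", "Homburg", "06841-1111"),
    ("Wagner", "Hauptstrasse 9", "66111", "Saarbruecken", "0681-3333"),
]


class Replica:
    """One full copy of the data, sorted by a single field."""

    def __init__(self, records, field):
        pos = FIELDS.index(field)
        self.field = field
        self.rows = sorted(records, key=lambda r: r[pos])
        self.keys = [r[pos] for r in self.rows]

    def lookup(self, value):
        lo = bisect.bisect_left(self.keys, value)
        hi = bisect.bisect_right(self.keys, value)
        return self.rows[lo:hi]


def query(replicas, records, field, value):
    """Use the copy sorted by `field` if one exists, else scan all records."""
    for replica in replicas:
        if replica.field == field:
            return replica.lookup(value), "index"
    pos = FIELDS.index(field)
    return [r for r in records if r[pos] == value], "scan"


# Three copies, each sorted by a different criterion (an arbitrary choice).
replicas = [Replica(RECORDS, f) for f in ("name", "zip_code", "city")]

hits, path = query(replicas, RECORDS, "zip_code", "66111")
print(path, [r[0] for r in hits])
```

A query on a field that has its own sorted copy takes the fast path. A query on any other field, such as `street` here, falls back to a full scan, which is why covering more criteria raises the chance of a fast answer.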

Computer science research on the campus of Saarland University

The Department of Computer Science is not the only research institution on campus that explores new aspects of computer science. Only a few yards away, you can also find the Max Planck Institute for Informatics, the Max Planck Institute for Software Systems, the Center for Bioinformatics, the Center for IT-Security, Privacy and Accountability, the German Research Center for Artificial Intelligence, the Intel Visual Computing Institute and the recently renewed Cluster of Excellence "Multimodal Computing and Interaction".

See also:

Website for HAIL:

Press pictures:

Further questions are answered by:

Prof. Dr. Jens Dittrich
Tel. +49 681 302 70141

Gordon Bolduan
Science Communication
Cluster of Excellence
Phone: +49 681 302-70741
Cebit booth: +49 511/ 89497024

Saarland University
