
A better statistical estimation of known Syrian war victims

June 05, 2018

HOUSTON -- (June 5, 2018) -- Researchers from Rice University and Duke University are using the tools of statistics and data science in collaboration with Human Rights Data Analysis Group (HRDAG) to accurately and efficiently estimate the number of identified victims killed in the Syrian civil war.

In a paper available online and due for publication in the June issue of the Annals of Applied Statistics, the scientists report on a four-year effort to combine a data-indexing method called "hashing" with statistical estimation. The new method produces real-time estimates of documented, identified victims with a far lower margin of error than existing statistical methods for finding duplicate records in databases.

"Throwing out duplicate records is easy if all the data are clean -- names are complete, spellings are correct, dates are exact, etc.," said study co-author Beidi Chen, a Rice graduate student in computer science. "The war casualty data isn't like that. People use nicknames. Dates are sometimes included in one database but missing from another. It's a classic example of what we refer to as a 'noisy' dataset. The challenge is finding a way to accurately estimate the number of unique records in spite of this noise."

Using records from four databases of people killed in the Syrian war, Chen, Duke statistician and machine learning expert Rebecca Steorts and Rice computer scientist Anshumali Shrivastava estimated there were 191,874 unique individuals documented from March 2011 to April 2014. That's very close to the estimate of 191,369 compiled in 2014 by HRDAG, a nonprofit that helps build scientifically defensible, evidence-based arguments of human rights violations.

But while HRDAG's estimate relied on the painstaking efforts of human workers to carefully weed out potential duplicate records, hashing with statistical estimation proved to be faster, easier and less expensive. The researchers said hashing also had the important advantage of a sharp confidence interval: The range of error is plus or minus 1,772, or less than 1 percent of the total number of victims.
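As a quick check on the figures quoted above, the following sketch recomputes the relative margin of error and compares the two estimates. It uses only numbers from this release; the variable names are illustrative, and the release does not state the confidence level behind the interval.

```python
# Figures quoted in the release.
hashing_estimate = 191_874   # unique individuals, hashing with statistical estimation
hrdag_estimate = 191_369     # HRDAG's 2014 estimate from manual deduplication
margin = 1_772               # reported plus-or-minus range of error

# Margin of error as a share of the hashing estimate.
relative_margin = margin / hashing_estimate

# Endpoints of the reported interval.
interval = (hashing_estimate - margin, hashing_estimate + margin)

# Gap between the two estimates, and whether HRDAG's figure falls in the interval.
gap = hashing_estimate - hrdag_estimate
hrdag_inside = interval[0] <= hrdag_estimate <= interval[1]

print(f"relative margin: {relative_margin:.2%}")
print(f"interval: {interval[0]:,} to {interval[1]:,}")
print(f"gap between estimates: {gap:,} (HRDAG inside interval: {hrdag_inside})")
```

The margin works out to a little over 0.9 percent of the estimate, consistent with the "less than 1 percent" figure, and the HRDAG count lies within the reported range.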

"The big win from this method is that we can quickly calculate the probable number of unique elements in a dataset with many duplicates," said Patrick Ball, HRDAG's director of research. "We can do a lot with this estimate."

Shrivastava said the sharpness of the hashing estimate is due to the technique used to index the casualty records. Hashing involves converting a complete data record -- a name, date, place of death and gender in the case of each Syrian war casualty -- into one number called a hash. Hashes are produced by an algorithm that considers the alphanumeric information in a record, and they are stored in a hash table that works much like the index in a book. The more textual similarity there is between two records, the closer together their hashes are in the table.
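The idea that similar records end up near each other in a hash table can be illustrated with a small sketch. The release does not say which hashing scheme the researchers used; the code below assumes MinHash over character trigrams, one common locality-sensitive hashing scheme, purely for illustration. The function names, the field layout and the three sample records are invented placeholders, not data from the study.

```python
import hashlib

def shingles(text, k=3):
    """Overlapping k-character substrings of a lowercased record string."""
    text = text.lower()
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def seeded_hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(text, num_hashes=64):
    """For each seed, keep the smallest hash over the record's shingles."""
    tokens = shingles(text)
    return [min(seeded_hash(seed, t) for t in tokens) for seed in range(num_hashes)]

def signature_similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def bucket_keys(signature, rows_per_band=4):
    """Split the signature into bands; each band acts as one hash-table key."""
    return {
        (start, tuple(signature[start:start + rows_per_band]))
        for start in range(0, len(signature), rows_per_band)
    }

# Invented placeholder records in an assumed "name|date|place|gender" layout.
rec_a = "ahmad khalil|2012-07-14|aleppo|m"
rec_b = "ahmed khalil|2012-07-14|aleppo|m"   # same person, variant spelling
rec_c = "layla haddad|2013-02-03|homs|f"     # unrelated record

sig_a, sig_b, sig_c = (minhash_signature(r) for r in (rec_a, rec_b, rec_c))
sim_ab = signature_similarity(sig_a, sig_b)
sim_ac = signature_similarity(sig_a, sig_c)
shared_ab = bucket_keys(sig_a) & bucket_keys(sig_b)

print(f"similarity a/b: {sim_ab:.2f}, similarity a/c: {sim_ac:.2f}")
print(f"buckets shared by a and b: {len(shared_ab)}")
```

In this sketch, records that share most of their trigrams agree on most signature positions, so they tend to share at least one bucket key, while unrelated records rarely do. That is the sense in which textually similar records sit "close together" in the table.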

"Our method -- unique entity estimation -- could prove to be useful beyond just the Syrian conflict," said Steorts, assistant professor of statistical science at Duke.

She said the algorithm and methodology could be used for medical records, official statistics and industry applications.

"As we collect more and more data, duplication is becoming a more timely and socially important problem," Steorts said. "Entity resolution problems need to scale to millions and billions of records. Of course, the most accurate way to find duplicate records is having an expert check every record. But this is impossible for large data sets, since the number of pairs that needs to be compared grows dramatically as the number of records increases."

For example, a record-by-record analysis of all four Syrian war databases would entail some 63 billion paired comparisons, she said.
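The growth in comparisons follows from simple counting: n records contain n(n-1)/2 unordered pairs. The sketch below computes that count and inverts it. The helper names are hypothetical, and the record total it prints is a back-of-the-envelope inference from the rounded "63 billion" figure, not a number stated in the release.

```python
import math

def pair_count(n):
    """Number of unordered pairs among n records (n choose 2)."""
    return n * (n - 1) // 2

def records_for_pairs(pairs):
    """Smallest n whose pair count reaches `pairs`, inverting n(n-1)/2."""
    n = (1 + math.isqrt(1 + 8 * pairs)) // 2
    while pair_count(n) < pairs:
        n += 1
    return n

# Pair counts grow quadratically: ten times the records, about 100 times the pairs.
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7,} records -> {pair_count(n):>13,} pairs")

# Approximate record total implied by roughly 63 billion comparisons.
implied_records = records_for_pairs(63_000_000_000)
print(f"about {implied_records:,} records imply 63 billion pairs")
```

Under this reading, some 63 billion comparisons correspond to roughly 355,000 records across the four databases.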

Shrivastava, assistant professor of computer science at Rice, said, "If you make assumptions, like dates that are close might be duplicates, you can reduce the number of comparisons that are needed, but every assumption comes with a bias, and ultimately you want an unbiased estimate. One statistical approach that avoids bias is random sampling. So perhaps choose 1 million random pairs out of the 63 billion, see how many are duplicates and then apply that rate to the entire dataset. This produces an unbiased estimate, which is good, but the likelihood of finding duplicates purely by random is quite low, and that gives a high variance.

"In this case, for example, random sampling could also estimate the documented counts at around 191,000," he said. "But it couldn't tell us with any certainty whether the count was 176,000 or 216,000 or some number in between.

"In recent work, my lab has shown that hashing algorithms that were originally designed to do search can also be used as adaptive samplers that precisely mitigate the high variance associated with random sampling," Shrivastava said.
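The trade-off Shrivastava describes for plain random sampling, an unbiased estimate with high variance, can be seen in a small simulation. This is a sketch on synthetic data, not the researchers' code or their adaptive sampler: each synthetic record is just the id of the entity it refers to, and the dataset sizes, sample counts and function names are assumptions chosen so the example runs quickly.

```python
import random
import statistics

def make_records(num_entities, max_copies, rng):
    """Synthetic data: each record is the id of the entity it refers to."""
    records = []
    for entity in range(num_entities):
        records.extend([entity] * rng.randint(1, max_copies))
    rng.shuffle(records)
    return records

def true_duplicate_pairs(records):
    """Exact number of record pairs that refer to the same entity."""
    counts = {}
    for entity in records:
        counts[entity] = counts.get(entity, 0) + 1
    return sum(c * (c - 1) // 2 for c in counts.values())

def sampled_estimate(records, num_samples, rng):
    """Scale the duplicate rate among uniformly sampled pairs to all pairs."""
    n = len(records)
    hits = 0
    for _ in range(num_samples):
        i, j = rng.sample(range(n), 2)   # two distinct positions
        hits += records[i] == records[j]
    return hits / num_samples * (n * (n - 1) // 2)

rng = random.Random(0)
records = make_records(num_entities=2_000, max_copies=3, rng=rng)
truth = true_duplicate_pairs(records)

# Repeat the sampling experiment to see how much the estimate moves around.
estimates = [sampled_estimate(records, 10_000, rng) for _ in range(10)]

print(f"true duplicate pairs: {truth:,}")
print(f"mean of estimates:    {statistics.mean(estimates):,.0f}")
print(f"std of estimates:     {statistics.stdev(estimates):,.0f}")
print(f"min / max estimate:   {min(estimates):,.0f} / {max(estimates):,.0f}")
```

Because duplicate pairs are rare among all pairs, each run catches only a handful of them, so individual estimates scatter widely even though they are correct on average. The adaptive sampling with hashing described in the release is aimed at reducing exactly this spread.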

"Resolving every duplicate seems very appealing," he said, "but it is the harder way of estimating the number of unique entities. The new theory of adaptive sampling with hashing allows us to directly estimate unique entity counts efficiently, with high confidence, without resolving the duplicates."

"At the end of the day, it's been phenomenal to make methodological and algorithmic progress motivated by such an important problem," Steorts said. "HRDAG has paved the way. Our goal and hope is that our efforts will prove useful to their work."

Shrivastava and Steorts said they are planning future research to apply the hashing technique for unique entity approximation to other types of datasets.

The research was funded by the National Science Foundation and the Air Force Office of Scientific Research.


The paper, "Unique Entity Estimation with Application to the Syrian Conflict," is available at:


