Media Contact: pr@cos.io
Embargoed for release until 11:00 a.m. ET on Wednesday, April 1, 2026
Findings from the Systematizing Confidence in Open Research and Evidence (SCORE) program—a collaborative effort involving 865 researchers—have been published in Nature as a collection of three papers alongside a release of five additional preprints. The SCORE program offers new empirical evidence on the reproducibility, robustness, and replicability of research across the social and behavioral sciences, and the predictability of replicability.
The SCORE program examined the capability of humans and machines to predict the replicability of research findings. In the process, SCORE accumulated an enormous database of information about the credibility of a large sample of findings from across the social and behavioral sciences. The program’s outcomes will contribute to strengthening how research is interpreted and communicated, supporting the understanding and use of research evidence by authors, reviewers, funders, policymakers, and readers. Improving credibility assessment will help direct attention and resources for further research to where they have the greatest impact in accelerating the production of knowledge and solutions.
Funded by the U.S. Defense Advanced Research Projects Agency (DARPA), SCORE is a large-scale, multi-method research initiative designed to improve how scientific credibility is assessed in the social and behavioral sciences. The program examines multiple dimensions of research repeatability—including reproducibility, robustness, and replicability—to better understand the credibility of published findings from multiple perspectives. The SCORE team sampled claims from 3,900 papers published from 2009 to 2018 in 62 journals spanning criminology, economics, education, finance, health, management, marketing, organizational behavior, psychology, political science, public administration, and sociology. These claims were subjected to a variety of methods of credibility assessment.
The contributions of hundreds of researchers were coordinated by several lead teams. Sampling of claims, gathering of credibility measures, and conducting of replication and reproduction studies were coordinated by the Center for Open Science (COS). Human expert assessments were conducted by two independent teams, the repliCATS project and Replication Markets, to evaluate the viability and accuracy of forecasting research replicability. Three teams, led by researchers at Pennsylvania State University, TwoSix Technologies, and the University of Southern California, implemented machine-learning and algorithmic approaches to predicting replicability. The Metascience Lab at Eötvös Loránd University coordinated the robustness assessments.
A basic contribution of the program is to affirm emerging standards for some terminology related to credibility and trustworthiness of research. Specifically, reproducibility, robustness, and replicability refer to distinct aspects of the repeatability of evidence—an important component of creating generalizable knowledge. A preprint from Nosek and colleagues explains the terminology to support clear and consistent understanding.
Across its studies, SCORE findings suggest that reproducibility, robustness, and replicability each capture distinct aspects of research credibility, and that published claims vary in how well they hold up under these distinct forms of scrutiny. The following are brief summaries of each of the three papers appearing in Nature.
Reproducibility refers to conducting the same analysis on the same data and assessing whether the finding is the same as reported in the original paper.
As reported by Miske and 127 co-authors, SCORE revealed limited transparency, which often makes reproducibility and robustness assessments infeasible. Data were available for only 24% of a sample of 600 assessed papers. Of the 143 papers subjected to reproduction tests, 74% reproduced at least approximately and 54% precisely. Success was associated with how much material the original paper shared: approximate (91%) and precise (77%) reproducibility were highest for papers that shared both original data and code, and lowest (38% and 11%) when reanalysis required reconstructing the original dataset from public sources (e.g., retrieving census data and reconstructing the data management and analysis steps reported in the paper).
Robustness refers to conducting alternative reasonable analyses on the same data and assessing whether the findings are similar to what was reported in the original paper.
As reported by Aczel and 490 co-authors, SCORE revealed hidden uncertainty in research findings through systematic testing of the analytical robustness of 100 papers. For each paper, at least five independent analysts tested the same question with the same data, applying their own decisions about how best to analyze it. 34% of independent reanalyses produced the same result as the original finding within a narrow tolerance range (+/- 0.05 Cohen’s d units), and 57% did so within a tolerance range four times that size. Regarding the conclusions drawn, 74% of analyses arrived at the same conclusion as the original investigation, 24% at a null or inconclusive result, and 2% at the opposite effect.
Replicability refers to testing the same question in new data and assessing whether the findings are similar to what was reported in the original paper.
As reported by Tyner and 291 co-authors, SCORE revealed that it is challenging to replicate original findings with independent data. Of 164 papers subjected to replication attempts, 49% replicated successfully according to the most common criterion for assessing replication (statistical significance with the same pattern as the original study), and the observed effect sizes in replication studies (0.10 in Pearson’s r units) were less than half the magnitude of those in the original studies (0.25).
The five preprints released alongside the Nature collection provide additional evidence about credibility and predictability of research findings:
Together, these eight papers offer the following conclusions:
“The main message of SCORE is a simple one: research is hard. And, in some ways, the hard work begins after making a discovery. A tremendous amount of effort is needed to verify and have enough confidence in new discoveries to build foundations for further discovery,” said Tim Errington, Senior Director of Research at COS and one of the SCORE project leaders.
The results reveal that there is no single indicator of the repeatability of evidence, or of research credibility more generally. There is substantial opportunity for innovation in developing credibility indicators that diversify our understanding of how trustworthy findings are established.
As another SCORE project leader, Fiona Fidler, Professor at the University of Melbourne, shared, “There are a lot of open questions about the factors that foster credibility and repeatability of research findings. Like many productive research efforts, SCORE generated insights, and has prompted even more questions about how to evaluate research in practice.”
In addition to its primary scientific findings, SCORE has generated openly accessible datasets, algorithms, and replication and reanalysis materials. These outputs will support further research on scientific credibility, potentially including development and validation of indicators to improve credibility assessment and accelerate discovery.
“With contributions from almost 900 researchers, the SCORE program provides an enormous amount of evidence to explore and inspire hypotheses for the next round of research. The data and materials are shared publicly so that others might build on this work,” said Sarah Rajtmajer, a SCORE project leader and Associate Professor at Pennsylvania State University.
Visit the website for an overview of the SCORE program, links to the papers, press releases for each paper, and other context to understand the project, findings, and implications.
###
About A+
The A+ system for automated assessment of replicability of claims was developed at TwoSix Technologies (Principal Investigator: James Gentile).
About the Center for Open Science (COS)
Founded in 2013, COS is a nonprofit culture change organization with a mission to increase openness, integrity, and trustworthiness of scientific research. COS pursues this mission by building communities around open science practices, supporting metascience research, and developing and maintaining free, open source software tools, including the Open Science Framework (OSF).
About MACROSCORE
A team led by Principal Investigator Jay Pujara at the University of Southern California developed the MACROSCORE system for automated assessment of replicability of claims.
About Metascience Lab
The Metascience Lab at Eötvös Loránd University (Principal Investigator: Balazs Aczel) led the robustness studies conducted in association with the SCORE program.
About Replication Markets
A team of researchers led by Charles Twardy at Amentum developed and conducted prediction markets of human assessments of the replicability of research claims.
About the repliCATS project
A team led by Fiona Fidler at the University of Melbourne used a structured group deliberation approach to crowdsource human assessments of the replicability of claims.
About Synthetic Markets
A team led by Sarah Rajtmajer at The Pennsylvania State University developed bot-populated prediction markets to predict the replicability of claims.