Researchers run largest known transparent checkpointing process

November 23, 2016

A team of researchers led by Jiajun Cao, a PhD candidate in the College of Computer and Information Science (CCIS) at Northeastern University, recently completed what appears to be the largest known instance of transparent checkpointing.

Transparent checkpointing allows computer scientists and engineers working on large projects to save and reopen programs without modifying any code. This assures researchers working across hundreds or thousands of computers that their work will be safe in case of a computer failure. Their programs are run on CPU cores, and computers can contain multiple cores, allowing them to run one program simultaneously across multiple cores. Transparent checkpointing could simplify the work of computer scientists handling large amounts of data and using supercomputers to process that data. For example, with transparent checkpointing software, meteorologists can process and analyze billions of pieces of weather data without the fear that a computer crash could erase that work.

"The idea of checkpointing is that one can take a running computation, automatically stop it in the middle and save the state of everything to a file on disk," Gene Cooperman, a professor at CCIS and Cao's advisor, explains. "Then you can copy that file to another computer or keep it on the same one. When you restart, the program continues running from where it left off." Cooperman's work with Distributed Multi-Threaded CheckPointing (DMTCP) software, which is responsible for checkpointing, is now in its second decade.

What makes this example of transparent checkpointing significant is the massive amount of data that was run and saved in a short period of time. The MVAPICH software supporting the Message Passing Interface (MPI) was used to run the High Performance Conjugate Gradients (HPCG) program for linear algebra in parallel over 32,768 CPU cores on 2,048 computers. It used a total memory of 38 terabytes, and was checkpointed in 10 minutes and 53 seconds. A second program, Nanoscale Molecular Dynamics (NAMD), was run in parallel over 16,368 CPU cores on 1,024 computers, using a total memory of 10 terabytes. It was checkpointed in two minutes and 38 seconds. Checkpointing these amounts of data in 11 minutes or less is a breakthrough for scientists usually restricted by having to run their programs before modifying and saving them within 24-hour time slots.

These processes were carried out on the Stampede supercomputer at the Texas Advanced Computer Center (TACC). Stampede is one of the world's largest supercomputers. The research was supported by a grant from the National Science Foundation awarded to Cooperman's DMTCP project, under which Cao's checkpointing research falls.

"These results show how the Extended Collaborative Support Services from the National Science Foundation-supported Extreme Science and Engineering Discovery Environment can help scientists and developers improve the scalability and efficiency of their code on high performance computing clusters," says Jérôme Vienne, a research associate at TACC.

Dhabaleswar K. Panda, who leads the MVAPICH team at Ohio State, explains that "the results of this collaborative work push the existing capabilities of the MVAPICH2 library further in terms of fault-tolerance and check-pointing."

Cao's collaborators include Kapil Arya of Mesophere, Inc.; Rohan Garg and Gene Cooperman of Northeastern University; Shawn Matott of the Center for Computational Research at the State University of New York at Buffalo; Dhabaleswar K. Panda and Hari Subramoni of Ohio State University; and Jérôme Vienne of the Texas Advanced Computing Center at the University of Texas at Austin.

The paper, titled "System-level Scalable Checkpoint-Restart for Petascale Computing," is available to read online. This work will be published at the 22nd Institute of Electrical and Electronics Engineers International Conference on Parallel and Distributed Systems (ICPADS) in December 2016.

Northeastern University

Related Data Articles from Brightsurf:

Keep the data coming
A continuous data supply ensures data-intensive simulations can run at maximum speed.

Astronomers are bulging with data
For the first time, over 250 million stars in our galaxy's bulge have been surveyed in near-ultraviolet, optical, and near-infrared light, opening the door for astronomers to reexamine key questions about the Milky Way's formation and history.

Novel method for measuring spatial dependencies turns less data into more data
Researcher makes 'little data' act big through, the application of mathematical techniques normally used for time-series, to spatial processes.

Ups and downs in COVID-19 data may be caused by data reporting practices
As data accumulates on COVID-19 cases and deaths, researchers have observed patterns of peaks and valleys that repeat on a near-weekly basis.

Data centers use less energy than you think
Using the most detailed model to date of global data center energy use, researchers found that massive efficiency gains by data centers have kept energy use roughly flat over the past decade.

Storing data in music
Researchers at ETH Zurich have developed a technique for embedding data in music and transmitting it to a smartphone.

Life data economics: calling for new models to assess the value of human data
After the collapse of the blockchain bubble a number of research organisations are developing platforms to enable individual ownership of life data and establish the data valuation and pricing models.

Geoscience data group urges all scientific disciplines to make data open and accessible
Institutions, science funders, data repositories, publishers, researchers and scientific societies from all scientific disciplines must work together to ensure all scientific data are easy to find, access and use, according to a new commentary in Nature by members of the Enabling FAIR Data Steering Committee.

Democratizing data science
MIT researchers are hoping to advance the democratization of data science with a new tool for nonstatisticians that automatically generates models for analyzing raw data.

Getting the most out of atmospheric data analysis
An international team including researchers from Kanazawa University used a new approach to analyze an atmospheric data set spanning 18 years for the investigation of new-particle formation.

Read More: Data News and Data Current Events is a participant in the Amazon Services LLC Associates Program, an affiliate advertising program designed to provide a means for sites to earn advertising fees by advertising and linking to