Bug repellent for supercomputers proves effective

November 14, 2012

Livermore, Calif. -- Lawrence Livermore National Laboratory (LLNL) researchers have used the Stack Trace Analysis Tool (STAT), a highly scalable, lightweight tool, to debug a program running more than one million MPI processes on the IBM Blue Gene/Q (BGQ)-based Sequoia supercomputer.

The achievement is a significant milestone in LLNL's multi-year collaboration with the University of Wisconsin-Madison (UW) and the University of New Mexico (UNM) to ensure supercomputers run more efficiently.

Playing a significant role in scaling up the Sequoia supercomputer, STAT, a 2011 R&D 100 Award winner, has helped both early access users and system integrators quickly isolate a wide range of errors, including particularly perplexing issues that only manifested at extremely large scales up to 1,179,648 compute cores. During the Sequoia scale-up, bugs in applications as well as defects in system software and hardware have manifested themselves as failures in applications. It is important to quickly diagnose errors so they can be reported to experts who can analyze them in detail and ultimately solve the problem.

"STAT has been indispensable in this capacity, helping the multi-disciplined integration team keep pace with the aggressive system scale-up schedule," said LLNL computer scientist Greg Lee.

"While testing a subsystem of Blue Gene/Q, my test program consistently failed only when scaled to 1,179,648 MPI processes. Although the test program was simple, the sheer scale at which this program ran made debugging efforts highly challenging. But when I applied STAT, it quickly revealed that one particular rank process was consistently stuck in a system call," said Dong Ahn, a computer scientist in Livermore Computing.

Based on this finding, a system expert took a close look at the compute core on which this rank process was running and discovered a hardware defect. "Replacing the component suddenly got the entire Sequoia system back to life," Ahn said. "Putting this exercise into perspective, this error was due to a defect in a tiny hardware unit, the decrementor, of a single hardware thread out of a total of 4.7 million hardware threads. I felt it was like finding a needle in a haystack over a coffee break."
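The idea behind this kind of diagnosis can be illustrated with a small sketch: collect a stack trace from every process, group processes whose traces are identical, and look at the smallest groups first. The Python code below is a simplified illustration only, not STAT's implementation or interface. The function names, the rank numbers and the stack frames are hypothetical and chosen purely for the example.

```python
from collections import defaultdict


def group_ranks_by_trace(traces):
    """Group MPI ranks whose call stacks are identical.

    `traces` maps a rank number to a sequence of function names,
    outermost frame first. Returns (trace, ranks) pairs sorted so that
    the smallest groups, the likely outliers, come first.
    """
    groups = defaultdict(list)
    for rank, trace in traces.items():
        groups[tuple(trace)].append(rank)
    return sorted(groups.items(), key=lambda item: (len(item[1]), item[0]))


def make_example_traces(num_ranks=1024, stuck_rank=517):
    """Build hypothetical traces: every rank waits in a barrier except one."""
    common = ("main", "run_test", "MPI_Barrier")
    stuck = ("main", "run_test", "write_results", "sys_write")
    return {
        rank: (stuck if rank == stuck_rank else common)
        for rank in range(num_ranks)
    }


if __name__ == "__main__":
    for trace, ranks in group_ranks_by_trace(make_example_traces()):
        print(f"{len(ranks):>5} rank(s): {' > '.join(trace)}")
```

With the made-up data above, the first group reported contains a single rank whose trace ends in a system call, while all other ranks share one trace. At real scale the number of distinct traces is usually tiny compared with the number of processes, which is what makes this style of summary readable.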

Sequoia delivers 20 petaflops of peak performance and was ranked No. 1 on the TOP500 list in June of this year. It is currently ranked No. 2, behind Oak Ridge National Laboratory's Titan.

LLNL plans to use Sequoia's impressive computational capability to advance understanding of fundamental physics and engineering questions that arise in the National Nuclear Security Administration's (NNSA) program to ensure the safety, security and effectiveness of the United States' nuclear deterrent without testing. Sequoia also will support NNSA/DOE programs at LLNL that focus on nonproliferation, counterterrorism, energy, security, health and climate change.

As LLNL takes delivery of the Sequoia system and works to move it into production, computer scientists will migrate applications that have been running on earlier systems to this newer architecture. This is a period of intense activity for LLNL's application teams as they gain experience with the new hardware and software environment.

"Having a highly effective debugging tool that scales to the full system is vital to the installation and acceptance process for Sequoia. It is critical that our development teams have a comprehensive parallel debugging tool set as they iron out the inevitable issues that come up with running on a new system like Sequoia," said Kim Cupps, leader of the Livermore Computing Division at LLNL.

STAT is particularly important for LLNL because supercomputer simulations are essential in virtually every mission area of the Laboratory. The tool also has been used at other sites and proved to be effective on a wide range of supercomputer platforms, including Linux clusters and Cray systems.

The team is actively pursuing further optimization of STAT technologies and is exploring commercialization strategies. More information about STAT, including a link to the source code, is available on the Web.

Founded in 1952, Lawrence Livermore National Laboratory provides solutions to our nation's most important national security challenges through innovative science, engineering and technology. Lawrence Livermore National Laboratory is managed by Lawrence Livermore National Security, LLC for the U.S. Department of Energy's National Nuclear Security Administration.

DOE/Lawrence Livermore National Laboratory
