Announcements

We are pleased to announce the SuperCheck-SC21 plenary speaker, invited speaker, and panelists.

Plenary speaker

Prof. Anthony (Tony) Skjellum

Anthony (Tony) Skjellum studied at Caltech (BS, MS, PhD). His PhD work emphasized portable, parallel software for large-scale dynamic simulation, with a specific emphasis on message-passing systems, parallel nonlinear and linear solvers, and massive parallelism. From 1990 to 1993, he was a computer scientist at LLNL focusing on performance-portable message passing and portable parallel math libraries. From 1993 to 2003, he was on the faculty in Computer Science at Mississippi State University, where his group co-invented the MPICH implementation of the Message Passing Interface (MPI) together with colleagues at Argonne National Laboratory. From 2003 to 2013, he was professor and chair of the Department of Computer and Information Sciences at the University of Alabama at Birmingham. In 2014, he joined Auburn University as Lead Cyber Scientist and led R&D in cyber and high-performance computing for over three years. In Summer 2017, he joined the University of Tennessee at Chattanooga as Professor of Computer Science, Chair of Excellence, and Director, SimCenter, where he continues work in HPC (emphasizing MPI, scalable libraries, and heterogeneous computing) and cybersecurity (with strong emphases on IoT and blockchain technologies). He is a senior member of ACM, IEEE, ASEE, and AIChE, and an Associate Member of the American Academy of Forensic Science (AAFS), Digital & Multimedia Sciences Division.

Here is Prof. Skjellum's plenary abstract:

In this SuperCheck plenary, the audience will undoubtedly hear about newer and better ways to checkpoint and restart scalable applications, typically MPI+X (where X = GPU, OpenMP, or another accelerator model). This plenary reviews all the pieces that make up an MPI+X application. It looks at silos, policies, and opportunities for communities to work better together. As a designer of message-passing libraries for over 30 years and of MPI implementations since MPI's inception, the speaker brings his perspective on "the other components," such as resource managers and checkpoint-restart (CPR) libraries. Since this is a SuperCheck workshop, the focus will be on MPI+X with CPR, but interactions with other important components, including what we could do better, are also discussed. Opportunities to standardize additional interfaces are described.


The following themes are considered:

* The goal of using MPI+X applications in places where resources are more ephemeral or subject to major cost changes, motivating malleability;

* An MPI designer's viewpoint of how checkpoint-restart systems (both explicit and transparent) fit within a more open, policy-integrated world (a minimal sketch of the explicit style follows this list);

* What MPI-5 and beyond could or should do to help the entire program stack, including the CPR library, work better; and

* The lack of open, potentially standardized mechanisms for multiple components to work smoothly to manage malleable resources, faults, migration, etc.
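
To make the explicit style concrete, here is a minimal sketch (our illustration, not material from the talk) of application-level checkpoint/restart in an MPI program: each rank periodically writes its local state to a per-rank file at a globally consistent point and reloads it on restart. The file naming, state layout, and checkpoint interval are illustrative assumptions.

```c
/* Minimal application-level checkpoint/restart sketch for an MPI program.
 * Illustrative only: file naming, state layout, and checkpoint interval
 * are assumptions, not part of any speaker's system.
 * Compile: mpicc -o ckpt_demo ckpt_demo.c   Run: mpirun -np 4 ./ckpt_demo
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1024
#define TOTAL_STEPS 100
#define CKPT_INTERVAL 10

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double state[N];
    int step = 0;
    char fname[64];
    snprintf(fname, sizeof fname, "ckpt_rank%d.bin", rank);

    /* Restart path: if a checkpoint file exists, resume from it. */
    FILE *f = fopen(fname, "rb");
    if (f) {
        fread(&step, sizeof step, 1, f);
        fread(state, sizeof(double), N, f);
        fclose(f);
        if (rank == 0) printf("Restarting from step %d\n", step);
    } else {
        for (int i = 0; i < N; i++) state[i] = rank;  /* initial condition */
    }

    for (; step < TOTAL_STEPS; step++) {
        for (int i = 0; i < N; i++) state[i] += 1.0;  /* stand-in for real work */

        if ((step + 1) % CKPT_INTERVAL == 0) {
            /* Reach a globally consistent point before writing. */
            MPI_Barrier(MPI_COMM_WORLD);
            FILE *out = fopen(fname, "wb");
            int next = step + 1;
            fwrite(&next, sizeof next, 1, out);
            fwrite(state, sizeof(double), N, out);
            fclose(out);
        }
    }

    MPI_Finalize();
    return 0;
}
```

A production code would write checkpoints atomically (e.g., write to a temporary file, then rename it) so that a failure mid-write cannot corrupt the last good checkpoint; transparent tools instead capture process state from outside, with none of this logic in the application.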

Invited Speaker

Dr. Debbie Bard

Debbie Bard is a physicist and data scientist with more than 15 years' experience in scientific computing, both as a physicist and as an HPC expert. Her career spans research in particle physics, cosmology, and computing, with a common theme of using supercomputing for scalable data analytics. She currently leads the Data Science Engagement Group at the National Energy Research Scientific Computing Center (NERSC), supporting HPC for users of the DOE's experimental and observational facilities. Debbie also leads the Superfacility project, a cross-disciplinary project of over 30 researchers and engineers that coordinates research and development to support computing for experimental science at LBNL.


Here is Dr. Bard's invited talk abstract:

Computing has been an important part of the scientist's toolkit for decades, but the increasing volume and complexity of scientific datasets are transforming the way we think about the use of computing for experimental science. DOE supercomputing facilities have begun to expand services and provide new capabilities in support of experiment workflows via powerful computing, storage, and networking systems. Experiment teams increasingly look to HPC centers for real-time data analysis to monitor, steer, and analyze a running experiment. In this talk I will introduce how supercomputing at NERSC is being leveraged in experimental science to change the way we collect and analyze data in fields as diverse as particle physics, cosmology, materials science, and structural biology. Through case studies and real-life challenges, I will describe the science requirements that are driving our work and how they translate into technical innovations, with a particular focus on scheduling and policy decisions.

Panel Discussion: Can checkpoint/restart tools ever keep pace with fast-changing HPC architectures, technologies, and workloads?

Panel Moderator: Dr. Rebecca Hartman-Baker

Rebecca Hartman-Baker leads the User Engagement Group at the National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory. She is a computational scientist with expertise in the development of scalable parallel algorithms. Her career has taken her to Oak Ridge National Laboratory, where she worked on the R&D100-award-winning team developing MADNESS and served as a scientific computing liaison in the Oak Ridge Leadership Computing Facility; the Pawsey Supercomputing Centre in Australia, where she coached two teams to the Student Cluster Competition at SC and led the decision-making process for determining the architecture of Australia's first petascale supercomputer; and NERSC, where she is responsible for the center's engagement with the user community to increase user productivity via advocacy, support, training, and the provisioning of usable computing environments. Rebecca earned a PhD in Computer Science, with a certificate in Computational Science and Engineering, from the University of Illinois at Urbana-Champaign.

SuperCheck-SC21 Panelists

Prof. Gene Cooperman

Prof. Cooperman leads the DMTCP project (Distributed Multi-Threaded CheckPointing) for transparent checkpointing. The project began in 2004 and has benefited from a series of PhD theses. Over 100 refereed publications cite DMTCP as having contributed to their research projects. Prof. Cooperman's current interests center on the frontiers of extending transparent checkpointing to new architectures. The DMTCP project has been applied by others to VLSI circuit simulators, circuit verification (e.g., by Intel, Mentor Graphics, and others), formalization of mathematics, bioinformatics, network simulators, high energy physics, cyber-security, big data, middleware, mobile computing, cloud computing, virtualization of GPUs, and of course high performance computing (HPC). Prof. Cooperman is currently involved in a collaboration with NERSC to create a robust, easy-to-use platform for transparent checkpointing for MPI (the MANA sub-project) and CUDA (the CRAC sub-project). This platform will be freely available to HPC sites and others, everywhere.
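
For contrast with the explicit sketch earlier in this page, transparent checkpointing in the DMTCP style requires no checkpoint logic in the application at all: the tool saves and restores the entire process state externally. A rough illustration follows; the command names are DMTCP's standard tools, but the exact flags and checkpoint-file names are assumptions to verify against the DMTCP documentation.

```c
/* An ordinary long-running program with no checkpoint/restart logic of its
 * own. Under transparent checkpointing it runs unmodified; a tool such as
 * DMTCP captures and restores the whole process state from the outside,
 * along the lines of (flags and file names indicative, not verified):
 *
 *   dmtcp_launch ./count               # start the program under DMTCP
 *   dmtcp_command --checkpoint         # ask the coordinator to checkpoint
 *   dmtcp_restart ckpt_count_*.dmtcp   # resume later from the image
 */
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* The only "state" here is the loop counter; a transparent tool would
     * preserve it, along with open files, threads, sockets, etc. */
    for (long i = 0; ; i++) {
        printf("step %ld\n", i);
        fflush(stdout);
        sleep(1);
    }
    return 0;  /* unreachable: the loop runs until killed or checkpointed */
}
```

The trade-off the panel title alludes to is visible even in these toy examples: the explicit style must be maintained by every application team, while the transparent tool must instead track every change in the underlying system software and hardware.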

Dr. Bogdan Nicolae

Bogdan Nicolae is a Computer Scientist at Argonne National Laboratory, USA. In the past, he held appointments at Huawei Research Germany and IBM Research Ireland. He specializes in scalable storage, data management, and fault tolerance for large-scale distributed systems, with a focus on high-performance architectures and cloud computing. He holds a PhD from the University of Rennes 1, France, and a Dipl. Eng. degree from Politehnica University of Bucharest, Romania. His interests lie at the intersection of high performance computing, cloud computing, and machine learning, where he has authored numerous papers in areas such as data and metadata decentralization and availability, big data analytics, multi-versioning, checkpointing, storage elasticity and virtualization, and live migration.


Dr. Sarp Oral

Dr. Sarp Oral is the Group Leader of the Technology Integration Group and a Distinguished Research Scientist in the National Center for Computational Sciences (NCCS) Division of Oak Ridge National Laboratory. Sarp holds a PhD in Computer Engineering from the University of Florida. He joined ORNL in 2006; his research and development interests are parallel I/O and file system technologies, benchmarking, high-performance computing and networking, and fault tolerance.

Dr. Eric Roman

Eric Roman is a computer systems engineer at Lawrence Berkeley National Laboratory. He joined LBNL in 1999. From 2004 to 2006 he took leave to pursue a Ph.D. in physics at the University of California, Berkeley, where he performed ab initio simulations of nonlinear optical properties of semiconductors, spin transport in metals, and the anomalous Hall effect. He completed his doctoral dissertation, "Orientation Dependence of the Anomalous Hall Effect in 3d Ferromagnets," in 2010. His research at LBNL focuses on operating systems for high-performance computing. He has participated in the development of Berkeley Lab Checkpoint/Restart (BLCR) since the start of the project in 2001. In 2018, he joined the Computational Systems Group at the National Energy Research Scientific Computing Center (NERSC).

Dr. John Shalf

John Shalf is Department Head for Computer Science at Lawrence Berkeley National Laboratory and was recently deputy director of Hardware Technology for the DOE Exascale Computing Project. Shalf is a coauthor of over 80 publications in the field of parallel computing software and HPC technology, including three best papers and the widely cited report "The Landscape of Parallel Computing Research: A View from Berkeley" (with David Patterson and others). He also coauthored the 2008 "ExaScale Software Study: Software Challenges in Extreme Scale Systems," which set the Defense Advanced Research Projects Agency's (DARPA's) information technology research investment strategy. Prior to coming to Berkeley Lab, John worked at the National Center for Supercomputing Applications and the Max Planck Institute for Gravitational Physics (Albert Einstein Institute, AEI), where he was co-creator of the Cactus Computational Toolkit.

Description

Checkpoint/restart (C/R) tools of the past and present are constrained by the software and hardware architectures of the systems they are developed to run on. They then chase the underlying hardware architecture as it evolves. In practical terms, this means C/R tools may not be ready to use until midway through the life span of a cutting-edge supercomputing system, which has severely limited the HPC community's ability to reap the benefits of C/R. As software and hardware evolve rapidly and grow more complicated and heterogeneous, the development cycles needed for C/R tools to support new hardware and the workloads on it will only get longer. Can checkpointing tools ever catch up with fast-changing HPC architectures, technologies, and workloads?


This panel brings together experts on application-level and transparent checkpointing, operating systems, I/O and storage systems, and computer architectures and future technologies to debate and discuss the most productive approaches for developing ready-to-use checkpointing tools for future HPC systems.