NERSC hosted the First International Symposium on Checkpointing for Supercomputing (SuperCheck21) on February 4-5, 2021, in collaboration with Northeastern University. The goals of the symposium were to showcase the latest research on checkpoint/restart (C/R), to motivate the development of usable C/R tools, to boost the adoption of C/R tools in HPC production workloads, and to build an active and strong C/R community.
The 22 program committee (PC) members were drawn from academia, industry, national labs, and computing centers around the world. They represented all career stages, including a PhD student.
SuperCheck21 had two themes: research and production. Submissions were required to be original, high-quality work. Research-theme submissions had to show some level of novelty. Production-theme submissions had to describe new effort to make checkpointing tools or approaches production-ready, or offer previously unpublished insights into the internal design of a production system. The symposium was presentation-only (no papers). Authors submitted a two-page extended abstract for peer review and were allowed to expand it to three pages upon acceptance.
Review was double-blind. Each submission received three reviews (a couple of submissions received four), and acceptance decisions were made at a Program Committee meeting. Of the 14 submissions, 9 were accepted, and 4 of those were assigned a shepherd (a PC member) to help improve them. One submission was converted to an invited talk.
There were two keynotes, given by Michela Taufer from the University of Tennessee, Knoxville (UTK) and Gene Cooperman from Northeastern University, and one invited talk, given by Kathryn Mohror from Lawrence Livermore National Laboratory. The panel discussion featured Kapil Arya (Microsoft Research, USA), Franck Cappello (Argonne National Laboratory, USA), Dhabaleswar K. (DK) Panda (Ohio State University, USA), Martin Schulz (Technical University of Munich, Germany), and Osman Unsal (Barcelona Supercomputing Center, Spain), with Rebecca Hartman-Baker from NERSC as moderator. There were nine contributed talks: four in the research theme and five in the production theme. The symposium schedule is available here.
The symposium was held online via Zoom over two half-days. Most presentations were pre-recorded; one keynote was delivered live. Attendees typed their questions in the Zoom chat, and the session chair relayed them to the presenter. The panel session was a live discussion. During breaks, attendees were invited to mingle with other participants in breakout rooms. The symposium was recorded and professionally captioned.
There were 253 registrations from more than 170 institutions in 14 countries, including US labs (ANL, LANL, LLNL, ORNL, Sandia, BNL, SLAC, PNNL, Fermilab, and LBNL) and industry (NVIDIA, HPE, SchedMD, IBM, Google, Microsoft, DDN, and AWS). In total, 191 people participated: 164 on Day 1 and 134 on Day 2. More than 50% of the participants stayed for more than three hours.
Abstracts were published on arXiv, with an index table on our website, where presentation slides and recordings are also published. Recordings of the whole symposium are available to registered attendees only, for at least six months after the symposium. If you missed the symposium and would like to access the recordings, please register here.
The organizers would like to see the symposium held annually with alternating hosts, and may move it to SC or other HPC conferences in the future.
The symposium would not have been possible without help from many parties. The organizers first would like to thank the program committee members for their effort in ensuring a quality program. They reviewed the submissions, helped make acceptance decisions, shepherded the accepted submissions, served as panelists, and chaired the sessions. The organizers would especially like to acknowledge Gene Cooperman at Northeastern University for his contributions in helping organize the research content of this symposium. They would also like to thank Madelyn Blair and Zaida McCunney at NERSC for meeting logistics and IT help, and Rebekah Jin, a UCLA student, for designing the symposium logo. Finally, they would like to thank all the authors and participants of SuperCheck21, without whom the symposium would not have been possible.