Call for Participation

Third International Symposium on Checkpointing for Supercomputing (SuperCheck-SC22)

The Third International Symposium on Checkpointing for Supercomputing will be held on Monday, November 14, 2022, in Dallas, TX, USA, in conjunction with SC22: The International Conference for High Performance Computing, Networking, Storage and Analysis. This workshop will feature the latest work in checkpoint/restart research, tools development, and production use.

About the Workshop

As a primary approach to fault-tolerant computing, Checkpoint/Restart (C/R) is essential to a wide range of HPC communities. While there has been much C/R research and tools development, continued C/R research is indispensable to keep pace with ever-changing HPC architectures, technologies, and workloads. More effort is also needed to narrow the gap between proof-of-concept C/R research codes and production-quality codes capable of deployment in real-world workloads. This workshop will bring together C/R researchers, tools developers, practitioners, application developers, and end users to focus on C/R research and successes in production use. Its goals are to motivate the development of usable C/R tools, to close the gap between state-of-the-art research and production, and to harness the full benefits of C/R for the HPC community. Paper submissions will be peer-reviewed, and the accepted papers will be published with the IEEE Computer Society. We especially encourage PhD students and HPC end users to participate.

Background

Checkpointing is widely used in high performance computing (HPC). It involves capturing key states of a distributed application while it runs, so that they can be reused later during the same or a subsequent run. Initially applied widely in the HPC community for resilience purposes (checkpoint periodically, then roll back and restart the application from a previously known correct state in case of failures), it has seen increasing adoption in many other scenarios: suspend-resume (checkpoint in response to an event, such as a reservation running out of time or a job being preempted to make room for another job, then resume at a later time when more resources are available), migration (checkpoint on one machine, restart on another, potentially on different hardware), and debugging (checkpoint close to a problematic region of code and replay that region multiple times instead of starting from the beginning). More recently, with the increasing convergence of HPC, big data analytics, and machine learning, checkpointing is becoming an essential pattern that allows applications to progress with their computations. For example, it is used to communicate states between tasks in a workflow, to revisit previous states (e.g., adjoint computations), or to explore alternative directions starting from a common ancestor (e.g., checkpoint models and/or training data to explore variations of architecture and/or training paths).

On the other hand, checkpointing is challenging for several reasons. States are distributed, so checkpoints require coordination to capture globally consistent states. Checkpoints incur high I/O overheads due to their size and competition for I/O bandwidth. They can also be either explicitly defined by users or transparently determined at the system level, and each approach brings its own difficulties. With the increasing scale and heterogeneity of supercomputing architectures, from both a computational and an I/O perspective, such challenges are becoming even more difficult to overcome.

As a consequence, there is a need to form a community around this essential yet difficult-to-address topic, which is currently underserved in the HPC community. This workshop aims to fill that gap. It encourages interaction and cross-pollination among application developers who have both traditional and novel use cases for checkpointing; researchers who develop checkpointing approaches and runtimes/middleware at all levels (system-level, application-level, transparent, hybrid); storage and I/O experts who need to manage the massive data sizes generated by checkpointing; and architecture experts who need to provide means of capturing the state of devices and other subsystems (which are needed in addition to user-level in-memory data structures). In this context, it aims to become a forum where participants can (1) highlight challenges, opportunities, and solutions for novel research directions; (2) share their experience and best practices for production runs; and (3) engage in co-design activities (users learn about approaches and new capabilities of runtimes and middleware, and runtime developers learn about the needs of users).

Workshop Scope

The workshop scope includes but is not limited to:

Application-level checkpointing: APIs to define critical states, techniques to capture critical states (e.g. efficient serialization)
Transparent/system-level checkpointing: techniques to capture state of devices and accelerators (CPUs, GPUs, network interfaces, etc.)
I/O and storage solutions that leverage heterogeneous storage to persist checkpoints at scale
Checkpoint size reduction techniques (compression, deduplication)
Alternative techniques that avoid persisting checkpoints to storage (e.g. erasure coding)
Synchronous vs. asynchronous checkpointing strategies
Multi-level and hybrid strategies combining application-level, system-level, transparent checkpointing on heterogeneous hardware
Application-specific techniques combined with checkpointing (e.g. ABFT)
Performance evaluation and reproducibility, study of real failures and their recovery
Research on optimal checkpointing interval, C/R-aware job scheduling and resource management

Furthermore, contributions on C/R use in production are also welcome:

Experience with traditional use cases of checkpointing on novel platforms
New use cases of checkpointing beyond resilience
Support on HPC systems (e.g., resource scheduling, system utilization, batch system integration, best practices)

We propose two tracks of paper submissions within the workshop: research and production. For the production track, we broaden the definition of novelty to include the work of incorporating novel research results into practice, resulting in real-life impact.

Submission Guidelines

We invite authors to submit their original, high-quality work in the following categories:

(a) Regular papers:

Intended for submissions describing original work and ideas that have NOT appeared in another conference or journal, and are NOT currently under review for any other conference or journal. Regular papers can be submitted to both the research and production tracks. Regular paper submissions must be at least six (6) pages and must not exceed eight (8) pages in the IEEE format. The page limit will be increased to 10 for accepted submissions.

Accepted regular papers (subject to post-review revisions) will be published in the workshop proceedings in cooperation with IEEE Computer Society.

(b) Short papers:

Intended for material that is not mature enough for a full paper, allowing authors to present novel, interesting ideas or preliminary results that will be formally submitted elsewhere later. Short papers are also for authors sharing their new efforts on adopting C/R tools in production use. Short paper submissions must not exceed two (2) pages in the IEEE format. The page limit will be increased to 3 for accepted submissions.

Accepted short papers will NOT be included in the workshop proceedings published with the IEEE Computer Society; instead they will be published on arXiv. We will provide links to those short papers on our workshop website, as we did for our previous workshop.

Note that the page limit above includes figures and tables, but does not include references, for which there is no page limit.

All submissions should be made electronically through the SC22 submission website and must follow the IEEE format. Submissions must be double blind, i.e., authors should remove their names, institutions or hints found in references to earlier work. When discussing past work, they need to refer to themselves in the third person, as if they were discussing another researcher’s work. Furthermore, authors can identify any conflict of interest with the program committee members (reviewers) at the SC22 submission site after their papers are submitted (using the “My Conflicts” tab).

In addition to the paper categories above, which require new and unpublished work, authors can submit a short abstract (no more than 250 words) for a 5-minute lightning talk, for which both previously published and unpublished work are welcome. Lightning talks are intended to help the HPC community stay informed about existing C/R libraries and tools, as well as C/R needs, support, approaches, and challenges in HPC applications and workflows, and to share experience with adopting C/R tools and libraries in production. They are also an opportunity for authors to share ideas or proposals for addressing challenges in C/R, enabling C/R on fast-changing HPC architectures and workloads, and generating real-life impact. Authors will use the same SC22 submission website (selecting Lightning Talks for the Submission Track option). The workshop organizers will review the submissions based on the quality of the work and its relevance to the intended purposes of the lightning talks. The accepted abstracts will be made available on the SuperCheck-SC22 website.

Reproducibility Initiative

While an Artifact Description (AD) Appendix and the Artifact Evaluation (AE) are optional, we encourage authors to follow the SC22 reproducibility and transparency initiative. The SC22 details can be found at: https://sc22.supercomputing.org/submit/reproducibility-initiative/

Important Dates

Paper Submission Deadline: September 5, 2022 AOE (extended from August 26, 2022 AOE)
Author Notification: September 20, 2022 AOE (revised from September 9, 2022 AOE)
Workshop Ready Deadline: September 30, 2022 AOE
Workshop @SC22: November 14, 2022 (8:30am - 12:00pm)
Camera Ready Deadline: October 14, 2022 (11:59pm PT)

Organizing Committee

Zhengji Zhao, National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory (LBNL)

Rebecca Hartman-Baker, NERSC at LBNL
Gene Cooperman, Northeastern University
Bogdan Nicolae, Argonne National Laboratory

Program Committee

Kapil Arya, Microsoft Research
Leonardo Bautista-Gomez, Barcelona Supercomputing Center
Franck Cappello, Argonne National Laboratory
Rohan Garg, Nutanix Corp
Twinkle Jain, Northeastern University
Zbigniew T Kalbarczyk, University of Illinois at Urbana-Champaign
Jack Kosaian, Carnegie Mellon University
Olga Kuchar, Oak Ridge National Laboratory
Kathryn Mohror, Lawrence Livermore National Laboratory, USA
Sarp Oral, Oak Ridge National Laboratory
Preeti Malakar, IIT Kanpur
Rafael Mayo-García, CIEMAT, Madrid
Dhabaleswar K. (DK) Panda, Ohio State University
Yves Robert, ENS Lyon
Kento Sato, RIKEN Center for Computational Science
Martin Schulz, Technical University of Munich
Tony Skjellum, University of Tennessee at Chattanooga
Michael Sullivan, NVIDIA, USA
Osman Unsal, Barcelona Supercomputing Center, Spain

Contact:

Zhengji Zhao, zzhao@lbl.gov

Workshop Website: https://supercheck.lbl.gov
