SuperCheck@SC'23: Fourth International Symposium on Checkpointing for Supercomputing

November 12, 2023

Held in conjunction with SC23 and in cooperation with ACM

The Fourth International Symposium on Checkpointing for Supercomputing (SuperCheck-SC23) will be held on November 12, 2023 in Denver, USA, in conjunction with SC23: The International Conference for High Performance Computing, Networking, Storage and Analysis. This workshop will feature the latest work in checkpoint/restart research, tools development and production use.

Important Dates


About the Workshop

As a primary approach to fault-tolerant computing, Checkpoint/Restart (C/R) is essential to a wide range of HPC communities. While there has been much C/R research and tools development, continued C/R research is indispensable to keep pace with ever-changing HPC architectures, technologies, and workloads. More effort is also needed to narrow the gap between proof-of-concept C/R research codes and production-quality codes capable of deployment in real-world workloads. In this workshop, we will bring together C/R researchers and tools developers, practitioners, application developers, and end users to focus on C/R research and successes in production use, motivating the development of usable C/R tools, the closing of the gap between state-of-the-art research and production, and the harnessing of the full benefits of C/R for the HPC community. Paper submissions will be peer-reviewed, and the accepted papers will be published with the IEEE Computer Society. We especially encourage PhD students and HPC end users to participate. 


Workshop Scope

Checkpointing is widely used in high performance computing (HPC). It involves capturing key states of a distributed application during runtime (checkpointing) so that they can be reused later. Initially applied in the HPC community mainly for resilience purposes (checkpoint periodically, then roll back and restart the application from a previously known correct state in case of failures), it has seen increasing adoption in many other scenarios: suspend-resume (checkpoint as a response to an event, such as a reservation running out of time or a job being preempted to make room for another job, then resume at a later time when more resources are available), migration (checkpoint on one machine, restart on another, potentially on different hardware), and debugging (checkpoint close to a problematic region of code and replay that region multiple times instead of starting from the beginning). More recently, with the increasing convergence of HPC, big data analytics and machine learning, checkpointing is becoming an essential pattern in allowing applications to progress with their computations. For example, it is used to communicate states between tasks in a workflow, to revisit previous states (e.g., adjoint computations), or to explore alternative directions starting from a common ancestor (e.g., checkpoint models and/or training data to explore variations of architecture and/or training paths).
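To make the resilience scenario concrete, the following is a minimal single-process sketch of application-level checkpoint/restart in Python. It is an illustration only, not a tool endorsed by the workshop: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the `fail_at` parameter used to simulate a failure, and the use of `pickle` for serialization are all assumptions made for this example. Real HPC checkpointing must additionally coordinate many processes and manage far larger I/O volumes.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic, so a crash never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, interval, fail_at=None):
    """Sum 0..total_steps-1, checkpointing every `interval` steps.

    `fail_at` is a hypothetical knob that raises an error before the given
    step is executed, standing in for a node failure.
    """
    # Restart from the last known correct state if one exists.
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

Calling `run` a second time with the same `path` after a failure resumes from the most recent checkpoint rather than from step 0; work done between that checkpoint and the failure is recomputed, which is the usual trade-off governed by the checkpoint interval.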

On the other hand, checkpointing is challenging: states are distributed, so checkpoints require coordination to capture globally consistent states; they incur high I/O overheads due to their size and competition for I/O bandwidth; and they can be either explicitly defined by users or transparently determined at the system level. With the increasing scale and heterogeneity of supercomputing architectures, from both a computational and an I/O perspective, such challenges are becoming even more difficult to overcome.

As a consequence, there is a need to form a community around this essential yet difficult topic, which is currently underserved in the HPC community. This workshop aims to fill that gap. It encourages interaction and cross-pollination between application developers who have both traditional and novel use cases for checkpointing, researchers who develop checkpointing approaches and runtimes/middleware at all levels (system-level, application-level, transparent, hybrid), storage and I/O experts who need to manage the massive data sizes generated by checkpointing, and architecture experts who need to provide means of capturing the state of devices and other subsystems (which are needed in addition to user-level in-memory data structures). In this context, it aims to become a forum where participants can (1) underline challenges, opportunities and solutions for novel research directions; (2) share their experience and best practices for production runs; and (3) engage in co-design activities (users learn about the approaches and new capabilities of runtimes and middleware, while runtime developers learn about the needs of users).

The workshop scope includes but is not limited to: 

Furthermore, contributions on C/R use in production are also welcome:

We propose two tracks of paper submissions within the workshop: research and production. For the production track, we broaden the definition of novelty to include the work of incorporating novel research results into practice with real-life impact.

Participation

We encourage participation from researchers and end-users, professionals and students. 


Organizing Committee