Second International Symposium on Checkpointing for Supercomputing (SuperCheck-SC21)

∗ November 15, 2021 ∗

Held in conjunction with SC21 and in cooperation with TCHPC

The Second International Symposium on Checkpointing for Supercomputing (SuperCheck-SC21) will be held on November 15, 2021 in St. Louis, USA, in conjunction with SC21: The International Conference for High Performance Computing, Networking, Storage and Analysis. This workshop will feature the latest work in checkpoint/restart research, tools development, and production use.

Important Dates

  • Call for Participation Release: June 14, 2021

  • Paper Submission Due: September 20, 2021 AOE (extended from September 13, 2021 AOE)

  • Acceptance Notification: October 1, 2021 AOE

  • Workshop Ready Submission: October 7, 2021

  • Presentation slides and recordings: November 1, 2021

  • Workshop Date: November 15, 2021 (full day)

About the Workshop

As a primary approach to fault-tolerant computing, Checkpoint/Restart (C/R) is essential to a wide range of high performance computing (HPC) communities. While there has been much C/R research and tools development, continued C/R research is indispensable to keep pace with ever-changing HPC architectures, technologies, and workloads. More effort is also needed to narrow the gap between proof-of-concept C/R research codes and production-quality codes capable of deployment in real-world workloads. In this workshop, we will bring together C/R researchers and tools developers, practitioners, application developers, and end users to focus on C/R research and successes in production use, motivating the development of usable C/R tools, the closing of the gap between state-of-the-art research and production, and the harnessing of the full benefits of C/R for the HPC community. Paper submissions will be peer-reviewed, and a venue for accepted papers will be identified. We especially encourage PhD students and HPC end users to participate.

Workshop Scope

The workshop scope includes any and all aspects of checkpointing for science and engineering in the High Performance Computing (HPC) context, including the latest research results and development, deployment, and application experiences. The workshop scope includes but is not limited to:

C/R research and tools development:

  • C/R targeting the full range of supercomputing software, including MPI, OpenMP, GPGPU software, FPGAs, and cloud, container, and serverless applications

  • Both pure and hybrid approaches to transparent checkpointing (some examples of hybrid approaches are: application-specific plugins to aid in checkpointing; and integrated modules for transparent checkpointing as part of larger scientific/engineering toolkits)

  • Frameworks for multi-level checkpointing

  • The development of new methods for low-overhead checkpointing, newer fundamental algorithms, software development methods, the impact of future supercomputer hardware, performance evaluation, reproducibility, and fault recovery

  • Research on optimal checkpointing intervals, C/R-aware job scheduling, and resource management

C/R use in production (including all levels of checkpointing: application, job, and system levels):

  • The adoption of transparent C/R tools in production workloads (C/R use cases)

  • The application-initiated use of C/R tools (alternative to built-in internal checkpointing)

  • C/R applications and support on HPC systems (e.g., resource scheduling, system utilization, batch system integration, and best practices)

We propose two tracks of paper submissions within the workshop: research and production. For the production track, we broaden the definition of novelty to include the work of incorporating novel research results into practice, resulting in real-life impact.

Participation

We encourage participation from researchers and end users, professionals and students.


Organizing Committee

  • Zhengji Zhao, National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory (LBNL)