Schedule
Program:
All times are in MST (UTC-7):
2:00 pm - Welcome
2:05 pm - Keynote: AI-Augmented SWARM Based Resilience for Integrated Research Infrastructures (Cappello)
2:50 pm - Lightning Talk: Diaspora – Resilient Event Processing for Irregular, Distributed Scientific Applications (Wozniak)
3:00 pm - Afternoon Break
3:25 pm - Regular Talk Session
3:25 pm - Checkpoint/Restart for CUDA Kernels (Eiling, Lankes, Monti)
3:50 pm - Implementation-Oblivious Transparent Checkpoint-Restart for MPI (Xu, Belyaev, Jain, Schafer, Skjellum, Cooperman)
4:15 pm - Asynchronous Multi-Level Checkpointing: An Enabler of Reproducibility using Checkpoint History Analytics (Assogba, Nicolae, Van Dam, Rafique)
4:40 pm - Lightning Talk Session
4:40 pm - Update on Checkpointing and Localized Recovery for Nested Fork-Join Programs (Fohry)
4:50 pm - Toward Efficient Asynchronous Checkpointing for Large-Language Models (Maurya)
5:00 pm - Inherent Checkpointing Properties of Nested Parallelism (Bratanov)
5:10 pm - Trade-Offs For Developing File Aggregated I/O For Asynchronous Checkpointing (Gossman)
5:20 pm - Datastates for Debugging – Using Productive Checkpointing for Improved Debugging (Underwood)