Announcements

Keynote speakers

We are pleased to announce the two keynote speakers of the symposium! The first-day keynote speaker will be Prof. Michela Taufer (University of Tennessee, Knoxville), and the second-day keynote speaker will be Prof. Gene Cooperman (Northeastern University).

Prof. Michela Taufer

Michela Taufer is an ACM Distinguished Scientist and holds the Jack Dongarra Professorship in High Performance Computing in the Department of Electrical Engineering and Computer Science at the University of Tennessee, Knoxville (UTK). She earned her undergraduate degrees in Computer Engineering from the University of Padova (Italy) and her doctoral degree in Computer Science from the Swiss Federal Institute of Technology (ETH) in Switzerland. From 2003 to 2004 she was a La Jolla Interfaces in Science Training Program (LJIS) Postdoctoral Fellow at the University of California San Diego (UCSD) and The Scripps Research Institute (TSRI), where she worked on interdisciplinary projects in computer systems and computational chemistry.

Michela has a long history of interdisciplinary work with scientists. Her research interests include scientific applications on heterogeneous platforms (i.e., multi-core platforms and accelerators); performance analysis, modeling, and optimization; Artificial Intelligence (AI) for cyberinfrastructures (CI); and the integration of AI into scientific workflows, computer simulations, and data analytics. She has been serving as principal investigator on several collaborative NSF projects. She also has significant experience mentoring a diverse population of students in interdisciplinary research. Michela's training expertise includes efforts to broaden participation in high-performance computing in undergraduate education and research, as well as efforts to increase the interest and participation of diverse populations in interdisciplinary studies.

Here is Prof. Taufer's keynote abstract:

AI4IO: A SUITE OF AI-BASED TOOLS FOR IO-AWARE HPC RESOURCE MANAGEMENT

High performance computing (HPC) is undergoing many changes at the system level. While scientific applications can reach petaflops or more in computing performance, potentially generating data at higher rates and checkpointing more frequently, data movement to the parallel file system remains costly due to constraints that HPC centers impose on IO bandwidth. In other words, the bandwidth to the file system is outpaced by the rate of data generation; the associated IO contention increases job runtime and delays execution. The situation is aggravated by the fact that when users submit their jobs to an HPC system, they rely on resource managers and job schedulers to monitor and manage the computing resources (i.e., nodes), yet both resource managers and job schedulers remain blind to the impact of IO contention on overall simulation performance.

In this talk we discuss how Artificial Intelligence (AI) can augment HPC systems to prevent and mitigate IO contention while dealing with IO bandwidth constraints. Our solution, called Analytics for IO (AI4IO), consists of a suite of AI-based tools that enable IO-awareness on HPC systems. Specifically, we present two AI4IO tools: PRIONN and CanarIO. PRIONN automates predictions of user-submitted jobs' resource usage, including per-job IO bandwidth; CanarIO detects, in real time, the presence of IO contention on HPC systems and predicts which jobs are affected by that contention (e.g., because of their frequent checkpointing). Working in concert, PRIONN and CanarIO provide the a priori knowledge necessary to prevent and mitigate IO contention through IO-aware scheduling. We integrate AI4IO into the Flux scheduler and show how AI4IO improves simulation performance: we observe up to 6.2% improvement in the makespan of HPC job workloads, which amounts to more than 18,000 node-hours saved per week on a production-size cluster. Our work is a first step toward implementing IO-aware scheduling on production HPC systems.
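To make the scheduling idea concrete, here is a minimal, hypothetical Python sketch of how an IO-aware scheduler could combine a PRIONN-style per-job IO bandwidth prediction with a CanarIO-style contention signal. Every name, threshold, and data structure below is an illustrative assumption, not the actual AI4IO, PRIONN, CanarIO, or Flux code.

```python
"""A hypothetical sketch of IO-aware scheduling in the spirit of AI4IO.

This is NOT the actual PRIONN/CanarIO/Flux implementation; all names,
thresholds, and data structures are illustrative assumptions.
"""
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    name: str
    nodes: int
    predicted_io_gbps: float  # per-job IO bandwidth, as a PRIONN-like model might predict
    checkpoint_heavy: bool    # jobs that checkpoint frequently suffer most under contention


def filesystem_is_contended(fs_load_gbps: float, fs_capacity_gbps: float) -> bool:
    """A CanarIO-like signal: flag contention when the file system nears saturation."""
    return fs_load_gbps > 0.8 * fs_capacity_gbps  # 0.8 is an arbitrary illustrative threshold


def pick_next_job(queue: List[Job], free_nodes: int,
                  fs_load_gbps: float, fs_capacity_gbps: float) -> Optional[Job]:
    """Pick the next job to launch, deferring IO-heavy jobs while contention is detected."""
    contended = filesystem_is_contended(fs_load_gbps, fs_capacity_gbps)
    for job in queue:
        if job.nodes > free_nodes:
            continue  # not enough free nodes for this job
        if contended and (job.checkpoint_heavy or job.predicted_io_gbps > 1.0):
            continue  # defer IO-heavy work until contention subsides
        return job
    return None  # nothing schedulable right now


if __name__ == "__main__":
    queue = [
        Job("climate-sim", nodes=128, predicted_io_gbps=4.0, checkpoint_heavy=True),
        Job("md-analysis", nodes=32, predicted_io_gbps=0.2, checkpoint_heavy=False),
    ]
    # With the file system near saturation, the low-IO job is launched first.
    print(pick_next_job(queue, free_nodes=256, fs_load_gbps=90.0, fs_capacity_gbps=100.0))
```

The design point mirrored here is the one in the abstract: predictions made before and during execution give the scheduler the a priori knowledge it needs to keep IO-heavy and checkpoint-heavy jobs from piling onto an already saturated file system.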

Prof. Gene Cooperman

Professor Cooperman currently works in high-performance computing. He received his B.S. from the University of Michigan in 1974 and his Ph.D. from Brown University in 1978. He then spent six years in basic research at GTE Laboratories. He came to Northeastern University in 1986 and has been a full professor there since 1992. His visiting research positions include a five-year IDEX Chair of Attractivity at the University of Toulouse/CNRS in France, and sabbaticals at Concordia University, at CERN, and at Inria (France). He is one of the more than 100 co-authors of the foundational Geant4 paper, whose current citation count is 29,000. The extension of the million-line Geant4 code to use multi-threading (Geant4-MT) was accomplished in 2014 on the basis of joint work with his PhD student, Xin Dong.

Prof. Cooperman leads the DMTCP project (Distributed Multi-Threaded CheckPointing) for transparent checkpointing. The project began in 2004 and has benefited from a series of PhD theses. Over 100 refereed publications cite DMTCP as having contributed to their research. Prof. Cooperman's current interests center on the frontiers of extending transparent checkpointing to new architectures. The DMTCP project has been applied by others to VLSI circuit simulators, circuit verification (e.g., by Intel, Mentor Graphics, and others), formalization of mathematics, bioinformatics, network simulators, high energy physics, cyber-security, big data, middleware, mobile computing, cloud computing, virtualization of GPUs, and of course high performance computing (HPC). Prof. Cooperman is currently involved in a collaboration with NERSC to create a robust, easy-to-use platform for transparent checkpointing of MPI (the MANA sub-project) and CUDA (the CRAC sub-project). This platform will be freely available to HPC sites and others, everywhere.
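As a concrete illustration of what "transparent" means here, the following small Python sketch drives DMTCP's command-line tools (dmtcp_launch, dmtcp_command, dmtcp_restart are real DMTCP utilities) around an unmodified application. The application name, checkpoint interval, and paths are placeholders, and the exact flags should be checked against the installed DMTCP version.

```python
"""A minimal sketch of running an unmodified application under DMTCP.

dmtcp_launch, dmtcp_command, and dmtcp_restart are real DMTCP tools, but the
application name, interval, and paths below are placeholders; verify the flags
against your installed DMTCP version.
"""
import glob
import subprocess
import time

# Run the unmodified application under DMTCP, checkpointing every 300 seconds.
app = subprocess.Popen(["dmtcp_launch", "--interval", "300", "./my_simulation"])

time.sleep(60)                                     # let the job make some progress
subprocess.run(["dmtcp_command", "--checkpoint"])  # ask the coordinator for an extra checkpoint
app.wait()

# After a crash or a queue time-out, resume from the saved checkpoint images.
images = sorted(glob.glob("ckpt_*.dmtcp"))
if images:
    subprocess.run(["dmtcp_restart"] + images)
```

The application itself is launched unmodified; the checkpoint-and-restart capability is supplied entirely at run time, which is what makes the checkpointing transparent.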

Here is Prof. Cooperman's abstract:

SO WHY CAN'T I CHECKPOINT THAT?

This talk has three parts: a brief history of checkpointing in supercomputing; challenges from the past (new hardware); and a proposal for the future (a general framework for checkpointing). The talk is informed by the speaker's experience with a 15-year project: DMTCP (Distributed MultiThreaded CheckPointing). The talk argues that the key to the future of checkpointing is a better understanding of boundaries and plumbing:

(i) Where can we find natural boundaries for the portion of the software that needs to be checkpointed?

(ii) How can we build more efficient plumbing to poke through these boundaries?