Internship on Fault-tolerant scheduling strategies for iterative algorithms
Anne Benoit and Yves Robert are looking for their next M2 student.
Context
Launched in 2023 for a duration of 6 years, the NumPEx PEPR aims to contribute to the design and development of numerical methods and software components that will equip future European Exascale and post-Exascale machines. NumPEx also aims to support scientific and industrial applications in fully exploiting the potential of these machines.
Several error sources may impact the execution of iterative algorithms on large-scale platforms. They include fail-stop errors, which are immediately detected, and silent errors (a.k.a. silent data corruptions), which can be detected through some verification mechanism. Fail-stop errors correspond to permanent failures, e.g., processor crashes. Silent errors are disruptions that strike and stay undetected until they eventually manifest through strange application behavior. Silent errors arise from two main sources: computation errors and memory bit-flips.
Protecting algorithms and software libraries from all these errors is a major concern within the HPC community.
The standard way to deal with fail-stop errors is checkpoint-restart, and the optimal checkpointing period is well known, at least for memoryless IID error inter-arrival times. However, mitigating the impact of silent errors remains an open challenge. On the one hand, replication (or even triplication to avoid a sequential re-execution) does a perfect job but at a prohibitive cost. On the other hand, numerous application-specific detectors have been introduced, such as Algorithm-Based Fault Tolerance (ABFT) checksums, recomputing a residual, checking orthogonality of some vectors, applying space and time filters across a neighborhood, etc. These detectors are usually limited to a particular error type. A major problem is that they may either fail to detect some errors or raise many false alarms. In other words, these detectors are not perfect: their recall and precision are below 100%. Most, if not all, published works assume perfect detectors, which is not realistic.
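As an illustration of the well-known checkpointing period mentioned above, here is a minimal sketch of the classical first-order approximation (often called the Young/Daly formula), which applies to fail-stop errors with exponentially distributed inter-arrival times. The function names, parameter names, and numerical values are hypothetical and chosen for this example only; the formula is a first-order approximation that assumes the checkpoint cost is small compared to the platform MTBF.

```python
import math


def young_daly_period(checkpoint_cost: float, mtbf: float) -> float:
    """First-order optimal amount of work between two checkpoints.

    Assumes fail-stop errors with exponential inter-arrival times of
    mean `mtbf`, and a checkpoint cost much smaller than `mtbf`.
    Both arguments use the same time unit (e.g., seconds).
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def first_order_overhead(period: float, checkpoint_cost: float, mtbf: float) -> float:
    """First-order expected overhead: checkpointing cost per unit of work,
    plus the expected re-execution after an error (half a period on average)."""
    return checkpoint_cost / period + period / (2.0 * mtbf)


# Hypothetical values: 60 s checkpoints, one failure per day on average.
C, MU = 60.0, 86400.0
W = young_daly_period(C, MU)
print(f"period = {W:.1f} s, overhead = {first_order_overhead(W, C, MU):.4f}")
```

Checkpointing either more or less often than this period increases the first-order overhead, which is why the period is regarded as optimal in this model. No equally simple formula is available once imperfect silent-error detectors enter the picture.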
Mission
The first (and main) objective of this internship is to design and assess scheduling strategies based upon a combination of checkpoints and imperfect detectors to guarantee protection from a single source of silent errors with a high probability. This requires introducing some assumptions, such as an upper bound on the detection latency, or introducing randomized tests on the data.
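To give a feel for what "protection with a high probability" can mean with an imperfect detector, here is a minimal sketch under strong simplifying assumptions that are ours and not part of the internship description: each verification detects an existing silent error independently with probability equal to the detector's recall, and false alarms are ignored. All names and numerical values are hypothetical.

```python
def recall_and_precision(true_pos: int, false_pos: int, false_neg: int) -> tuple[float, float]:
    """Recall = detected errors / actual errors.
    Precision = true alarms / all alarms raised."""
    recall = true_pos / (true_pos + false_neg)
    precision = true_pos / (true_pos + false_pos)
    return recall, precision


def miss_probability(recall: float, num_checks: int) -> float:
    """Probability that an error escapes `num_checks` verifications,
    assuming (hypothetically) that verifications are independent."""
    return (1.0 - recall) ** num_checks


def checks_needed(recall: float, target_miss: float) -> int:
    """Smallest number of independent verifications such that the
    probability of missing an error is at most `target_miss`."""
    if not (0.0 < recall <= 1.0) or not (0.0 < target_miss < 1.0):
        raise ValueError("recall must be in (0, 1], target_miss in (0, 1)")
    checks, miss = 0, 1.0
    while miss > target_miss:
        miss *= 1.0 - recall
        checks += 1
    return checks


# Hypothetical detector: 80 errors caught, 20 missed, 10 false alarms.
r, p = recall_and_precision(true_pos=80, false_pos=10, false_neg=20)
print(f"recall = {r:.2f}, precision = {p:.2f}")
print("verifications needed for a miss probability <= 1e-3:", checks_needed(r, 1e-3))
```

In practice, successive verifications of the same data are unlikely to be independent, and false alarms trigger unnecessary rollbacks, which is precisely why the assumptions mentioned above (bounded detection latency, randomized tests) are needed.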
The second step (which may come later during a PhD) is to provide a resilient holistic methodology to protect iterative algorithms from all error types, namely fail-stop errors and all sources of silent errors.
This internship is expected to continue with a PhD thesis, for which funding from the NumPEx PEPR program is already secured.
Required Skills
Some knowledge of algorithm design, complexity, and probability.
The work is on the algorithmic side of the problem, with potential simulations to validate the results.
More information and references
For further information, please contact Anne Benoit ([email protected]), Yves Robert ([email protected]) or Emmanuel Agullo ([email protected]).