Internship on Fault-tolerant scheduling strategies for iterative algorithms

Anne Benoit and Yves Robert are looking for their next M2 student.

General information

This internship is expected to continue with a PhD thesis, for which funding from the NumPEx PEPR program is already secured.

The successful candidate will join the ROMA team at ENS de Lyon, in the LIP laboratory.

Laboratoire de l’Informatique du Parallélisme
ENS de Lyon
46, allée d’Italie
69364 LYON CEDEX 07

Duration: 6 months

Starting date: Spring 2025 (with the PhD thesis starting in Fall 2025).

The internship (and PhD thesis) will be supervised by Anne Benoit and Yves Robert, and co-advised by Emmanuel Agullo.

How to apply

Applications should be sent to Anne Benoit ([email protected]). They should include in a single PDF file:

A motivation letter
A detailed CV
Academic transcripts (bachelor/master)
A one-page summary of the research work conducted so far, and a list of publications (if any)
At least two recommendation letters

Context

Launched in 2023 for a duration of 6 years, the NumPEx PEPR aims to contribute to the design and development of numerical methods and software components that will equip future European Exascale and post-Exascale machines. NumPEx also aims to support scientific and industrial applications in fully exploiting their potential.

Several error sources may impact the execution of iterative algorithms on large-scale platforms. They include fail-stop errors, which are immediately detected, and silent errors (a.k.a. silent data corruptions), which can be detected through some verification mechanism. Fail-stop errors correspond to permanent failures, e.g., processor crashes. Silent errors are disruptions that strike and stay undetected until they eventually manifest through abnormal application behavior. Silent errors arise from two main sources: computation errors and memory bit-flips. Protecting algorithms and software libraries from all these errors is a major concern within the HPC community.
The standard way to deal with fail-stop errors is checkpoint-restart, and the optimal checkpointing period is well known, at least for memoryless IID error inter-arrival times. However, mitigating the impact of silent errors remains an open challenge. On the one hand, replication (or even triplication, to avoid a sequential re-execution) does a perfect job, but at a prohibitive cost. On the other hand, numerous application-specific detectors have been introduced, such as Algorithm-Based Fault Tolerance (ABFT) checksums, recomputing a residual, checking the orthogonality of some vectors, applying space and time filters across a neighborhood, etc. These detectors are usually limited to a particular error type. A major problem is that they may either fail to detect some errors or raise many false alarms. In other words, these detectors are not perfect: their recall and precision are below 100%. Most, if not all, published works assume perfect detectors, which is not realistic.
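As an illustration of the well-known checkpointing period mentioned above: for fail-stop errors with memoryless (exponentially distributed) inter-arrival times, the classical first-order approximation attributed to Young and Daly sets the amount of work between two checkpoints to sqrt(2 * MTBF * C), where C is the time needed to take one checkpoint. The Python sketch below is only an illustration and not part of the internship material; the function name and the numerical values are assumptions chosen for the example.

```python
import math


def young_daly_period(mtbf: float, checkpoint_cost: float) -> float:
    """Return the first-order optimal work duration between two checkpoints.

    Assumptions (illustrative, not taken from the posting):
      - fail-stop errors only, with exponentially distributed inter-arrival times;
      - mtbf is the platform mean time between failures;
      - checkpoint_cost is the time to write one checkpoint;
      - both values use the same time unit and checkpoint_cost << mtbf.
    """
    if mtbf <= 0 or checkpoint_cost <= 0:
        raise ValueError("mtbf and checkpoint_cost must be positive")
    return math.sqrt(2.0 * mtbf * checkpoint_cost)


if __name__ == "__main__":
    # Hypothetical platform: MTBF of 2 hours, 100-second checkpoints.
    period = young_daly_period(mtbf=7200.0, checkpoint_cost=100.0)
    print(f"Checkpoint after every {period:.0f} seconds of work")
```

The approximation is only accurate when the checkpoint cost is small compared to the MTBF, and it does not account for silent errors, which is precisely the gap discussed in this section.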

Mission

The first (and main) objective of this internship is to design and assess scheduling strategies that combine checkpoints and imperfect detectors to guarantee, with high probability, protection from a single source of silent errors. This requires introducing some assumptions, such as an upper bound on the detection latency, or introducing randomized tests on the data.
The second step (which may come later, during the PhD) is to provide a resilient holistic methodology to protect iterative algorithms from all error types, namely fail-stop errors and all sources of silent errors.
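To make the notion of protection "with high probability" concrete, here is a minimal Python sketch of how the recall of an imperfect detector limits what a single verification can guarantee. It relies on a strong simplifying assumption that is not made in this posting: successive verifications miss a given error independently of each other. Under that assumption, an error escapes k verifications with probability (1 - recall)^k. All function names are hypothetical.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Fraction of actual errors that the detector reports."""
    return true_positives / (true_positives + false_negatives)


def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of raised alarms that correspond to actual errors."""
    return true_positives / (true_positives + false_positives)


def verifications_needed(detector_recall: float, max_miss_probability: float) -> int:
    """Smallest number k of verifications such that (1 - recall)^k is at most
    max_miss_probability.

    Assumption (illustrative only): each verification misses a given error
    independently with probability 1 - detector_recall.
    """
    if not 0.0 < detector_recall <= 1.0:
        raise ValueError("detector_recall must be in (0, 1]")
    if not 0.0 < max_miss_probability < 1.0:
        raise ValueError("max_miss_probability must be in (0, 1)")
    single_miss = 1.0 - detector_recall
    count, miss_probability = 1, single_miss
    # Add verifications until the probability of missing the error is low enough.
    while miss_probability > max_miss_probability:
        count += 1
        miss_probability *= single_miss
    return count


if __name__ == "__main__":
    # Hypothetical detector that catches half of the errors.
    print(verifications_needed(detector_recall=0.5, max_miss_probability=0.01))
```

In practice the independence assumption may not hold, since a detector that is blind to a given error type will miss it every time; this is one reason why combining detectors with checkpoints is a scheduling problem rather than a simple repetition count.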

Required Skills

Some knowledge of algorithm design, complexity, and probability.
The work is on the algorithmic side of the problem, with potential simulations to validate the results.

More information and references

For further information, please contact Anne Benoit ([email protected]), Yves Robert ([email protected]) or Emmanuel Agullo ([email protected]).
