PhD student (M/F) - Energy and performance monitoring and models towards sustainable Exascale computing
Context
Launched in 2023 for a duration of 6 years, the NumPEx PEPR aims to contribute to the design and development of numerical methods and software components that will equip future European Exascale and post-Exascale machines. NumPEx also aims to support scientific and industrial applications in fully exploiting their potential.
The Exa-Soft project aims at consolidating the European Exascale software ecosystem by providing a coherent, exascale-ready software stack enabling HPC applications to efficiently exploit heterogeneous supercomputers featuring heavily accelerated compute nodes. The project will achieve breakthrough research advances in programming languages and models, code optimization, runtime systems, performance profiling and analysis, and numerical libraries to address major scientific challenges.
The SEPIA team works on resource management in various distributed systems (cloud datacenters, HPC centers, edge architectures, IoT…) and is especially interested in the ecological transition, notably by reducing energy consumption and CO2 emissions and by using renewable energy.
Mission
High Performance Computing usage is growing, from climate science studies to chemical research. The increasing impact of these computations opens a field of research on how to manage and reduce their energy consumption. In the NumPEx project, we aim at developing state-of-the-art skills and infrastructures in the field of exascale computing. One of the pillars of NumPEx focuses on making exascale computing sustainable.
To make informed cluster-level scheduling decisions and to provide feedback to users, information on the whole infrastructure is needed. At any time, several applications use cluster resources. Each of these applications uses the resources differently, leading to different patterns of power consumption. A high level of abstraction is needed to tackle the complexity of the large number of simultaneous applications. Several academic proofs of concept exist that simplify such applications and use a high-level representation of them (including resource and power consumption) instead of time series of measurements.
The objectives of the PhD are the following:
- Monitoring of large-scale applications that have a stable behavior using limited data: detect the behavior of the whole application by only using the monitoring data of a small number of servers; change the frequency of monitoring depending on the needs.
- Modeling and characterization of applications: detect when an application switches from one phase to another; determine properties of the phases (whether they are IO-bound, memory-bound, CPU-bound…). Software will be developed to detect and characterize phases of HPC applications during their execution.
- Modeling the impact of various levers (DVFS, network and IO reconfiguration…) on performance and energy.
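As an illustration of the phase-detection objective, here is a minimal sketch, assuming a single power time series sampled at a fixed rate. The function names, the sliding-window criterion, the window size and the threshold are all hypothetical choices made for this example; they are not the method of the project or of any existing tool.

```python
from statistics import mean


def detect_phase_changes(power, window=5, threshold=20.0):
    """Return sample indices where the mean power changes abruptly.

    Hypothetical criterion: compare the mean of the `window` samples
    before index i with the mean of the `window` samples starting at i,
    and keep the local maxima of that difference above `threshold` (watts).
    """
    n = len(power)
    scores = {}
    for i in range(window, n - window + 1):
        before = mean(power[i - window:i])
        after = mean(power[i:i + window])
        scores[i] = abs(after - before)

    changes = []
    for i, score in scores.items():
        if score <= threshold:
            continue
        neighbours = [scores[j] for j in range(i - window, i + window + 1)
                      if j in scores]
        # Keep only the strongest point of each transition.
        if score == max(neighbours) and (not changes or i - changes[-1] >= window):
            changes.append(i)
    return changes


def summarise_phases(power, changes):
    """Replace the time series by (start, end, mean power) per phase."""
    bounds = [0] + list(changes) + [len(power)]
    return [(start, end, mean(power[start:end]))
            for start, end in zip(bounds, bounds[1:])]


# Synthetic trace (made-up values): three phases of ten samples each.
trace = [100] * 10 + [200] * 10 + [120] * 10
boundaries = detect_phase_changes(trace)
phases = summarise_phases(trace, boundaries)
```

On this synthetic trace, the 30 samples reduce to three (start, end, mean power) tuples, which is the kind of high-level representation that can stand in for the raw time series.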
The PhD structure will be as follows:
- State of the art on phase-based application models (such as https://theses.hal.science/tel-00946583)
- Experiments to acquire data on actual HPC applications on multiple hardware configurations
- Analysis of the data to build energy and performance models taking into account the hardware configuration
- Analysis of the impact of reducing the amount of acquired data
- A demonstrator using the phase detection system along with the model of levers to drastically reduce the power consumption of HPC datacenters
Monitoring software (such as MojitO/S) will be used during the PhD, and contributions to it might be made. A large-scale experimental platform (Grid’5000) will be used.
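To give an idea of what a model of DVFS impact on performance and energy can look like, here is a toy first-order sketch. It assumes that only the CPU-bound fraction of a phase slows down when the frequency is lowered, and that dynamic power scales with the cube of the frequency. Both assumptions, the function name and every numeric parameter are hypothetical and chosen for illustration; real models would be fitted on measurements.

```python
def phase_time_and_energy(duration_s, cpu_fraction, freq_ghz,
                          f_max_ghz=3.0, p_static_w=50.0, p_dyn_max_w=100.0):
    """Estimate duration (s) and energy (J) of a phase run at `freq_ghz`.

    duration_s   -- phase duration measured at f_max_ghz
    cpu_fraction -- share of that duration that scales with 1/frequency
    All power parameters are made-up values, not measurements.
    """
    slowdown = cpu_fraction * f_max_ghz / freq_ghz + (1.0 - cpu_fraction)
    time_s = duration_s * slowdown
    # Assumed power model: static part plus cubic dynamic part.
    power_w = p_static_w + p_dyn_max_w * (freq_ghz / f_max_ghz) ** 3
    return time_s, time_s * power_w


# Same 100 s phase at full and half frequency, for two phase types.
baseline = phase_time_and_energy(100.0, 0.1, 3.0)
memory_bound = phase_time_and_energy(100.0, 0.1, 1.5)
cpu_bound = phase_time_and_energy(100.0, 1.0, 1.5)
```

Under these assumptions, halving the frequency costs a mostly memory-bound phase about 10% more time, while a fully CPU-bound phase takes twice as long. This is why knowing the type of the current phase matters before applying such a lever.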
Required Skills
A Master in Computer Science is required.
A taste for experimental approaches, C or Rust programming, and Python or R data analysis are strongly recommended.
A background in performance optimization, performance evaluation and modeling, usage of remote computing servers will be appreciated.
More information
The position is located in a sector under the protection of scientific and technical potential (PPST), and therefore requires, in accordance with the regulations, that your arrival is authorized by the competent authority of the MESR.
For further information, please visit the CNRS website.