Summary: According to the TOP500 list, supercomputers keep gaining more and more components. This growth leads to (1) an increase in the energy consumed by supercomputers and (2) an increase in the number of failures encountered.
- Reducing application energy consumption: My work focuses on reducing the energy footprint of HPC applications by exploiting both architecture and application characteristics. I am currently studying the impact of vectorized instructions, uncore frequency, and power capping on the power and energy consumption of processors.
- Fault tolerance: I worked on designing new protocols and techniques to reduce the impact of failures. These protocols are based on checkpointing and message logging. I also worked on modelling checkpointing protocols in order to compute the optimal checkpointing frequency. Finally, I took part in designing a failure detector for MPI applications.
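As an illustration of what computing an optimal checkpointing frequency involves, here is a minimal sketch based on the classical first-order approximation attributed to Young. It is the textbook formula, not necessarily the model developed in the work described above. The function names and the example values (a 60-second checkpoint, a 24-hour mean time between failures) are assumptions chosen for illustration.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the best checkpoint period.

    checkpoint_cost_s: time needed to write one checkpoint (seconds).
    mtbf_s: mean time between failures of the platform (seconds).
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def waste_fraction(period_s: float, checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order fraction of time lost for a given checkpoint period.

    The first term is the overhead of writing checkpoints; the second is
    the expected work lost when a failure strikes, on average, halfway
    through a period.
    """
    return checkpoint_cost_s / period_s + period_s / (2.0 * mtbf_s)


if __name__ == "__main__":
    cost, mtbf = 60.0, 24 * 3600.0  # hypothetical example values
    best = young_interval(cost, mtbf)
    print(f"period: {best:.0f} s, waste: {waste_fraction(best, cost, mtbf):.2%}")
```

Checkpointing more often than this period pays too much checkpoint overhead, while checkpointing less often loses too much work per failure. The approximation assumes the checkpoint cost is small compared with the mean time between failures.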
Selected publications:
George Bosilca, Aurélien Bouteiller, Amina Guermouche, Thomas Herault, Yves Robert, Pierre Sens and Jack Dongarra. Failure Detection and Propagation in HPC Systems. In Supercomputing (SC) 2016. [pdf]
Jean-Philippe Halimi, Benoît Pradelle, Amina Guermouche, William Jalby. FoREST-mn: Runtime DVFS Beyond Communication Slack. In IEEE Green Computing Conference (IGCC) 2014. [pdf]
Amina Guermouche, Thomas Ropars, Elisabeth Brunet, Marc Snir and Franck Cappello. Uncoordinated Checkpointing Without Domino Effect for Send-Deterministic Message Passing Applications. In 25th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2011). [pdf]
Last modified on 28 February 2020