Feature selection in high-dimensional genomics data

Chloé-Agathe Azencott (CBIO, Mines Paris–PSL)

26/03/2026 11:00 - 12:00
Location: Aurigny Room


High-throughput technologies now make it possible to routinely measure large numbers of genomic features (such as single nucleotide polymorphisms, RNA transcripts, or methylation levels) along entire genomes. This allows us to frame biomarker discovery, or the construction of hypotheses to explain biological mechanisms, as a feature selection problem: which of these numerous features are relevant to explain a phenotype of interest? However, the data sets we build often pose multiple statistical challenges: they contain orders of magnitude more features than samples, and the features exhibit a non-trivial correlation structure. In my talk, I will present how incorporating structured prior knowledge into feature selection procedures can help improve discovery. I will also show how two recent statistical frameworks, post-selection inference and statistical knockoffs, can provide statistical guarantees for these more complex feature selection procedures.

Relevant publications:
– Asma Nouira and Chloé-Agathe Azencott (2025). Sparse multitask group lasso for genome-wide association studies. https://pubmed.ncbi.nlm.nih.gov/40938940/
– Lotfi Slim, Clément Chatelain, and Chloé-Agathe Azencott (2022). Nonlinear post-selection inference for genome-wide association studies. https://pubmed.ncbi.nlm.nih.gov/34890162/
– Julie Cartier, Johanna Lagoas, Adeline Fermanian, Chloé-Agathe Azencott, and Florian Massip (2025). Statistical knockoffs improve biomarker discovery from transcriptomic data. https://www.biorxiv.org/content/10.1101/2025.07.04.663147