Feature selection in high-dimensional genomics data

Chloé-Agathe Azencott (CBIO, Mines Paris–PSL)

26/03/2026 11:00 - 12:00
Location: Aurigny Room

High-throughput technologies now make it possible to routinely measure large numbers of genomic features (such as single nucleotide polymorphisms, RNA transcripts, or methylation levels) along entire genomes. This allows us to frame biomarker discovery, or the construction of hypotheses to explain biological mechanisms, as a feature selection problem: which of these numerous features are relevant to explain a phenotype of interest? However, the data sets we build often pose multiple statistical challenges: they contain orders of magnitude more features than samples, and the features exhibit a non-trivial correlation structure. In my talk, I will present how incorporating structured prior knowledge into feature selection procedures can help improve discovery. I will also show how two recent statistical frameworks, post-selection inference and statistical knockoffs, can provide statistical guarantees for these more complex feature selection procedures.

Relevant publications:
– Asma Nouira and Chloé-Agathe Azencott (2025). Sparse multitask group lasso for genome-wide association studies. https://pubmed.ncbi.nlm.nih.gov/40938940/
– Lotfi Slim, Clément Chatelain, and Chloé-Agathe Azencott (2022). Nonlinear post-selection inference for genome-wide association studies. https://pubmed.ncbi.nlm.nih.gov/34890162/
– Julie Cartier, Johanna Lagoas, Adeline Fermanian, Chloé-Agathe Azencott, and Florian Massip (2025). Statistical knockoffs improve biomarker discovery from transcriptomic data. https://www.biorxiv.org/content/10.1101/2025.07.04.663147