Helping systems biology deliver on its promises: Deciphering gene function by using semantic analysis from heterogeneous knowledge resources

Aarón Ayllón Benítez (Umeå University)

01/04/2021 10:30 - 12:00

The ongoing development of new sequencing technologies has greatly increased the production of omics data and made sequencing analysis affordable, enabling researchers to better understand the relations between a collection of genes and a phenotype (the observable characteristics encoded by these genes). To understand a given disease, drug, or vaccine, it is useful to know which biological processes are carried out by the group of genes (gene set) involved in that case. Advances in differential gene expression analysis have led to strong interest in studying gene sets that show similar expression under the same experimental condition. Classical approaches to interpreting the biological information of gene sets are based on statistical methods. However, these methods focus on the best-known genes and generate redundant information, which can be eliminated by taking into account the structure of the knowledge resources that provide the annotation. To address these issues, I developed a new method for analyzing the impact of using different semantic similarity measures on gene set annotation.

An important obstacle to extracting functional insights from genomics studies in non-model species is the limited availability of functional gene annotation. At present, the functional annotation of genes in Norway spruce (a gymnosperm tree) relies on identifying the orthologous gene with the most similar sequence in Arabidopsis thaliana (a model organism) and then using the description of that Arabidopsis gene as a proxy. Unfortunately, this approach is error-prone and rests on an often-false inference that gene function is conserved between the two disparate species. For this reason, I am combining two methodologies, gene network inference and semantic similarity, to generate functional gene annotations based on a combination of pooled probabilities from an analysis of available plant genomes and empirical probabilities derived from co-expression network analyses. While network analysis based on gene expression identifies relationships between genes on the basis of their activity patterns, semantic similarity identifies relationships on the basis of the linguistic terms used to describe their function.
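
The contrast between the two kinds of relationship can be illustrated with a minimal Python sketch. It is not the method presented in the talk: the toy ontology, the term names, the expression values, and the choice of measures (Jaccard overlap of ancestor sets for semantic similarity, Pearson correlation for co-expression) are all assumptions made purely for illustration.

```python
from math import sqrt

# Hypothetical toy ontology: each term maps to its direct parent terms.
PARENTS = {
    "biological_process": set(),
    "metabolic_process": {"biological_process"},
    "response_to_stimulus": {"biological_process"},
    "lipid_metabolism": {"metabolic_process"},
    "sugar_metabolism": {"metabolic_process"},
}


def ancestors(term):
    """Return the term itself plus every term reachable via parent links."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def semantic_similarity(term_a, term_b):
    """Jaccard overlap of ancestor sets (one simple measure among many)."""
    a, b = ancestors(term_a), ancestors(term_b)
    return len(a & b) / len(a | b)


def coexpression(x, y):
    """Pearson correlation between two expression profiles of equal length."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    norm = sqrt(sum(v * v for v in dx)) * sqrt(sum(v * v for v in dy))
    if norm == 0:
        return 0.0  # a constant profile carries no co-expression signal
    return sum(p * q for p, q in zip(dx, dy)) / norm


# Relationship based on the terms that describe function.
close_terms = semantic_similarity("lipid_metabolism", "sugar_metabolism")
distant_terms = semantic_similarity("lipid_metabolism", "response_to_stimulus")

# Relationship based on activity patterns (made-up expression values).
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]
activity_link = coexpression(gene_a, gene_b)
```

In this sketch the two metabolism terms share more ancestors than the metabolism and stimulus terms, so they score higher semantically, while the two expression profiles are related only through how their values rise and fall together.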