ARN - Algorithms for the identification of Genetic Regulatory Networks

Principal Investigator of the IST team: Isabel Sá-Correia

Contract: PTDC/EIA/67722/2006

Start date: 01.01.2008

Duration: 36 months


The next advances in the understanding of biological systems depend critically on the identification of gene control mechanisms and regulatory networks. Information needed for this task comes from four main sources: genomic sequence data, whole-genome measures of gene expression obtained using microarrays and quantitative proteomics, structural information (proteins, RNA and DNA) and biological literature. Using this information to infer network structures has emerged as the only realistic avenue for identifying, mapping and documenting the complex architectures of the gene regulatory networks of a living organism. The central goal of this project is the development of methods that will partially automate the identification of mechanisms that control gene expression. Cellular processes are regulated by interactions between various types of molecules such as proteins, DNA, RNA and metabolites. Among these, the interactions between proteins and the interactions between transcription factors and their target genes play a prominent role, controlling the activity of proteins and the expression levels of genes. A significant number of such interactions have been revealed recently by means of high-throughput technologies. Moreover, recent discoveries have highlighted the regulatory roles of small functional RNA motifs in the control of gene expression. This work aims first to obtain a better understanding of the biochemistry of molecular recognition and then to incorporate this understanding accurately into the mathematical models used for the inference procedure. By putting all these interactions together, one can build a network of interactions and thus describe the circuitry responsible for a variety of cellular processes.
However, obtaining good models of a network of signals or of interacting genes is a particularly difficult problem, both because we lack general knowledge of all the biological processes at play and because we have at our disposal a growing collection of heterogeneous data based on very specialized knowledge. Integration of this diversity of data requires expertise in different types of computational techniques such as text algorithms, combinatorial algorithms, machine learning and statistics, as well as graph theory. Teams that have expertise in such diverse areas need to cooperate closely to address and solve these challenges.
To achieve all the objectives of this project, the work will be structured into seven tasks. The first task is aimed at devising new algorithms for motif inference based on an accurate model for transcriptional DNA sequence signals. The second is focused on the development of new algorithms and models for predicting small functional RNA motifs. The third task is devoted to the development of novel algorithms for the analysis of gene expression data obtained from microarrays and quantitative proteomics. The fourth task is devoted to the development of text-mining tools that will automatically identify gene regulations in the BioLiterature and to the development of new methods for inferring gene regulations from groups of gene annotations. The fifth task is dedicated to the integration of heterogeneous data for inference of large-scale genetic networks. The sixth task is dedicated to the experimental support and validation of genetic networks inferred from bioinformatics analyses. The seventh task is the integration of the computational tools obtained in the previous tasks into an information system that will support the work of researchers interested in genomics and regulatory networks.
Given the interdisciplinary nature of the endeavor, this project brings together specialists from computer science (KDBIO/INESC-ID, XLDB/LaSIGE/FCUL, HELIX/INRIA Rhône-Alpes and BIA/INRA Toulouse) and biology (BSRG/IBQF-Lisboa/IST). The INESC-ID team will focus on the algorithmic aspects of the problem, applying their expertise to the extraction of knowledge from genome sequence information and experimental gene activation data. The FCUL team will focus on information integration techniques for automatically identifying gene regulations from publicly available data sources, such as the Gene Ontology and BioLiterature (a shorter designation for the biological and biomedical scientific literature). The IST team will contribute their knowledge of the biological issues, leveraging the algorithmic contributions of the INESC-ID and FCUL teams. Aside from these three teams, this research will be pursued in cooperation with top international research centers, in some cases as a byproduct of ongoing cooperation work. It is worth stressing that the teams involved in this project are leaders in the areas needed to address these challenges. They possess unique competences in the fields of motif analysis, gene expression analysis and small RNA structure determination. Finally, releasing the obtained results to the scientific community in a way that makes them directly usable is another important goal of this project.