Research Overview

The bioinformatics problems that we focus on pertain to the phenomenon of “gene regulation” and its evolution. Gene regulation refers to how genes in a cell are switched on (or off) to determine the cell’s functions. It is the reason why, for example, skin and muscle cells are different despite having the same DNA. It is central to a range of biological phenomena such as development and disease. Moreover, the evolution of gene regulation underlies the amazing diversity of life forms around us.

We explore two broad research areas related to gene regulation:

(1) Identifying regulatory influences on genes: Different biological processes involve different sets of genes acting together in precisely regulated patterns. Thus, to understand a biological process one needs to identify the participating genes and their regulators. We develop computational methods to analyze large biological data sets for clues about regulatory influences on specific genes, by finding DNA segments (“regulatory sequences”) mediating such influence.

(2) Evolutionary comparisons of regulatory sequences: Cross-species comparison helps us understand how gene regulation and its encoding in DNA evolve. This is a fundamental open problem in biology, and is linked to the diversity of life forms. Signatures of evolution present in regulatory sequences, once characterized, can also be a powerful guide to their genome-wide identification.

Specific problems/topics that we are working on include: 

Comprehensive maps of gene regulation in various organisms. “Transcription factors” are proteins that bind to short segments of DNA near a gene to regulate the gene’s activity. A major challenge today is to predict the binding sites (and thus the target genes) of a transcription factor across the entire genome. Such a “map” of gene regulation provides valuable clues for investigating the details of a biological process, eventually leading to high-level insights about the process. We have developed a series of computational tools to meet this challenge. These tools rely on a probabilistic description (“motif”) of the transcription factor’s binding sites, and scan the genome for matches to the motif using statistical scores. In contrast to the common approach of finding individual above-threshold matches to the motif, we have pioneered the examination of longer DNA segments for evidence of one or more matches to the motif, without imposing any thresholds. Such evidence is quantified as the likelihood of the sequence under a Hidden Markov model (HMM) describing regulatory sequences. This new manner of scoring motif matches helps reduce mispredictions, and mirrors the thermodynamic nature of the protein-DNA interactions underlying the regulatory process. We use this approach to catalog the genes targeted by each transcription factor, and then to assign biological functions to the transcription factor by examining the common functions of its target genes. This strategy has yielded significant biological findings in the context of several processes and species. For example, by analyzing the human genome we found that age-dependent changes in cells are regulated by a transcription factor called NFκB. This was then experimentally validated by our collaborators, who showed that blocking the action of this factor could rejuvenate old skin cells (in mice) so that they appear like younger cells (Genes & Dev. ’07, Genome Res. ’08). 
We used a similar method to predict transcription factors that regulate genes in the brain of the zebra finch, as it learns how to sing (Nature ’10). This songbird is a model organism for studying vertebrate brain and behavior, and for human neuroscience. 
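
The threshold-free scoring idea described above can be illustrated with a minimal Python sketch. It is not the code of the published tools: the motif, the uniform background, the `site_prob` parameter, and the function names are all assumptions made for illustration. The sketch computes the log-likelihood ratio of a DNA segment under a two-state HMM (background versus motif site) by summing over every possible placement of non-overlapping sites, rather than keeping only matches above a cutoff.

```python
import math

# Hypothetical 4-position motif: one dict of nucleotide probabilities per position.
MOTIF = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
# Assumed uniform background nucleotide frequencies.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def site_ratio(seq, start, motif=MOTIF, bg=BACKGROUND):
    """Motif-to-background likelihood ratio for the site beginning at `start`."""
    ratio = 1.0
    for offset, column in enumerate(motif):
        base = seq[start + offset]
        ratio *= column[base] / bg[base]
    return ratio


def segment_score(seq, site_prob=0.05, motif=MOTIF, bg=BACKGROUND):
    """Log-likelihood ratio of `seq` under a two-state HMM versus background only.

    forward[i] holds the likelihood ratio of seq[:i], summed over all ways of
    placing non-overlapping motif sites in that prefix (forward algorithm).
    """
    width = len(motif)
    forward = [0.0] * (len(seq) + 1)
    forward[0] = 1.0
    for i in range(1, len(seq) + 1):
        # Case 1: position i-1 is emitted by the background state.
        forward[i] = (1.0 - site_prob) * forward[i - 1]
        # Case 2: a motif site ends at position i-1.
        if i >= width:
            forward[i] += site_prob * forward[i - width] * site_ratio(seq, i - width, motif, bg)
    return math.log(forward[len(seq)])


if __name__ == "__main__":
    # Several weak or strong matches all contribute; no match is discarded by a cutoff.
    print(segment_score("TTAGCTTTAGCTTT"))
    print(segment_score("TTTTTTTTTTTTTT"))
```

For long segments the products above would underflow, so a practical version would work in log space or rescale the forward values; the sketch omits this for brevity.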

Gene regulation and social behavior. One of the most noteworthy successes of our methodology was in providing the first glimpses of how genes respond to social signals received by the brain. This was the result of a collaboration with Prof. Gene Robinson, who studies socially regulated genes in the honeybee brain. We discovered potential regulators of such genes using our motif scanning methods, after adapting the methods to certain peculiarities of the honeybee genome (PNAS ’06). This and follow-up studies have demonstrated a robust association among social behavior, brain gene activity, and transcription factor binding sites. More recently, we have shown that gene regulation is also intimately linked to the evolution of a social behavior, viz., aggression in honeybees (PNAS ’09). 

Cis-regulatory modules and their discovery through comparative genomics. Another major thrust of our research is the genome-wide discovery of “cis-regulatory modules”, which are DNA segments harboring several binding sites in close proximity. A module encodes the combined action of multiple transcription factors that turns a gene on and off in precise patterns. Modules are hard to find because they may not lie next to the gene, and the key to their discovery is the clustering of binding sites within them. We had previously developed an HMM-based approach to quantify this property and predict module locations (ISMB 2003). Since then, we have worked on fundamental improvements to this approach through comparative genomics. Put simply, we examine not just the genome of interest, but also the genomes of closely related species. We use inter-species genome alignments to pinpoint evolutionarily conserved segments, and exploit this information in our module discovery tools. To do this in a principled manner, we have developed probabilistic models of the evolution of modules. This was a major technical achievement, since no such models had previously existed, let alone the methods required for inference under the models. We combined our HMM framework with these evolutionary models to build efficient probabilistic tools that achieve improved accuracy in module prediction in the Drosophila genome (PLoS Comp. Bio. ’07, ’09). Another significant advance in our tools was to make cross-species comparisons robust to alignment errors, by efficiently summing over all possible alignments of two sequences. We also adapted these tools to be usable through a convenient web interface in real time (NAR ’08). 
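
As a rough illustration of what summing over all possible alignments of two sequences means, here is a minimal Python sketch of a one-state pair model. It is a generic textbook-style construction, not the evolutionary model from the cited papers; the function name and the `gap_prob` and `identity` parameters are assumptions chosen for illustration. Passing `combine=sum` adds up the probability of every alignment (the forward algorithm), while `combine=max` keeps only the single best alignment.

```python
def pair_likelihood(x, y, combine=sum, gap_prob=0.1, identity=0.8):
    """Likelihood of sequences x and y under a simple one-state pair model.

    combine=sum totals the probability over all alignments (forward algorithm);
    combine=max keeps only the best single alignment (Viterbi-style).
    """
    match_prob = 1.0 - 2.0 * gap_prob  # probability of emitting an aligned pair
    base_freq = 0.25                   # assumed uniform frequency for unaligned bases

    def pair_emit(a, b):
        # Joint probability of an aligned pair: the 4 identical pairs share
        # `identity`, the 12 mismatched pairs share the remainder.
        return identity / 4.0 if a == b else (1.0 - identity) / 12.0

    table = [[0.0] * (len(y) + 1) for _ in range(len(x) + 1)]
    table[0][0] = 1.0
    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            if i == 0 and j == 0:
                continue
            terms = []
            if i > 0 and j > 0:  # x[i-1] aligned to y[j-1]
                terms.append(match_prob * pair_emit(x[i - 1], y[j - 1]) * table[i - 1][j - 1])
            if i > 0:            # x[i-1] aligned to a gap
                terms.append(gap_prob * base_freq * table[i - 1][j])
            if j > 0:            # y[j-1] aligned to a gap
                terms.append(gap_prob * base_freq * table[i][j - 1])
            table[i][j] = combine(terms)
    return table[len(x)][len(y)]


if __name__ == "__main__":
    total = pair_likelihood("ACGTTGCA", "ACGTGCA")
    best = pair_likelihood("ACGTTGCA", "ACGTGCA", combine=max)
    # Share of the total probability carried by the single best alignment.
    print(best / total)
```

When the best alignment carries only a small share of the total, committing to that one alignment discards much of the evidence, which is the motivation for summing instead.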

Evolution of modules. The wonderful diversity of life forms we see in Nature is the result of evolutionary forces, whose footprints are visible in DNA. It is becoming increasingly clear that a majority of the morphological diversity (i.e., variations of “form”) comes from changes in gene regulation, rather than in the genes themselves. We have used inter-species sequence comparison to understand how modules are shaped by evolution. We compared modules from 12 Drosophila genomes and found statistical patterns in how they differ among these species (PLoS Gen. ’09), e.g., how binding sites turn over during evolution. Such patterns can then help refine the evolutionary models used in our module discovery tools. 

Alignment-free comparison of modules. The ability to quantify the similarity between sequences, through alignment algorithms, has been a keystone of bioinformatics for over three decades now. However, alignment-based methods break down when similarity must be detected between sequences that are greatly diverged or that are not descended from a common ancestral sequence. Recognizing that even in these scenarios, regulatory sequences with similar function may have detectable shared features (common binding sites), we have developed a suite of alignment-free scores of sequence similarity. These are statistical scores based on frequency distributions of short words in the sequences being compared. We took special measures to incorporate the properties of regulatory sequences, e.g., variation within binding sites and deviations from genome-wide nucleotide frequencies, into these scores (ISMB 2007). Importantly, we showed how such scores could be used to discover novel modules, by scanning the genome for sequences similar to a set of known modules (Dev. Cell ’09, Genome Bio. ’08). We validated our discoveries experimentally in two model systems: fruit fly and mouse. We thus established the new paradigm of “motif-blind” module discovery, making this critical task possible in the common scenario where the transcription factors relevant to a biological system are not known a priori. 
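
The general idea of a word-frequency score can be sketched in a few lines of Python. This is a generic illustration (cosine similarity between k-mer count vectors), not one of the scores from the cited papers, and it omits their adjustments for binding-site variation and background nucleotide frequencies. The function names and the default word length `k=3` are assumptions.

```python
import math
from collections import Counter


def word_counts(seq, k=3):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def word_similarity(x, y, k=3):
    """Cosine similarity between the k-mer count vectors of x and y.

    No alignment is computed, so shared words contribute regardless of
    where they occur or in what order.
    """
    counts_x, counts_y = word_counts(x, k), word_counts(y, k)
    dot = sum(counts_x[w] * counts_y[w] for w in counts_x)
    norm_x = math.sqrt(sum(v * v for v in counts_x.values()))
    norm_y = math.sqrt(sum(v * v for v in counts_y.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # at least one sequence is shorter than k
    return dot / (norm_x * norm_y)


if __name__ == "__main__":
    # The same two hypothetical sites appear in swapped order in the second sequence.
    print(word_similarity("GATAAGTTTTCACGTG", "CACGTGCCCCGATAAG"))
    print(word_similarity("GATAAGTTTTCACGTG", "TCTCTCTCTCTCTCTC"))
```

Scanning a genome with such a score would amount to sliding a window along it and comparing each window with a set of known modules.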

Models of regulatory function. Our goal goes beyond the discovery and cataloging of modules, to the explanatory and predictive principles underlying their function, the so-called “regulatory code”. We have proposed two strategies for integrating DNA sequence information with the cellular context to predict whether a module will turn a gene on. The first strategy is based on logistic regression. Here, gene activity is dictated by a linear combination of contributions from multiple transcription factors. Each factor’s contribution, in turn, is determined heuristically by the product of its cellular concentration and its binding affinity to the module. Our second strategy is based on principles of statistical mechanics. It models gene activity as the macroscopic behavior of an ensemble of “microstates” comprising different combinations of transcription factor molecules bound to the DNA. Both approaches have been highly successful in explaining gene activity patterns underlying development in Drosophila, a model biological system (manuscripts in review). We have used these strategies to quantify the regulatory influence of each transcription factor on a gene, and to reveal mechanistic insights into the regulatory process. We believe our quantitative models will fundamentally alter how we analyze regulatory sequences in the future. 
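
The two modeling strategies can be made concrete with a small Python sketch. It is a toy version under strong assumptions and does not reproduce the models in the manuscripts: the function names, the weights, the absence of interactions between bound factors, and the rule that a microstate is “on” when at least one activator and no repressor is bound are all hypothetical choices made for illustration.

```python
import math
from itertools import product


def logistic_activity(concentrations, affinities, weights, bias=0.0):
    """Strategy 1 (sketch): logistic function of a linear combination.

    Each factor contributes weight * concentration * affinity.
    """
    total = bias + sum(
        w * c * a for w, c, a in zip(weights, concentrations, affinities)
    )
    return 1.0 / (1.0 + math.exp(-total))


def ensemble_activity(sites):
    """Strategy 2 (sketch): average over microstates of bound/unbound sites.

    `sites` is a list of (concentration, affinity, role) tuples, where role is
    "activator" or "repressor". A microstate's statistical weight is the
    product of concentration * affinity over its bound sites. Assumption:
    the gene is on when an activator is bound and no repressor is bound.
    """
    partition = 0.0  # sum of weights over all microstates
    active = 0.0     # sum of weights over microstates with the gene on
    for state in product([0, 1], repeat=len(sites)):
        weight = 1.0
        bound_roles = []
        for bound, (conc, affinity, role) in zip(state, sites):
            if bound:
                weight *= conc * affinity
                bound_roles.append(role)
        partition += weight
        if "activator" in bound_roles and "repressor" not in bound_roles:
            active += weight
    return active / partition


if __name__ == "__main__":
    # One activator (positive weight) and one repressor (negative weight).
    print(logistic_activity([2.0, 1.0], [1.5, 1.0], [1.0, -1.0]))
    print(ensemble_activity([(2.0, 1.5, "activator"), (1.0, 1.0, "repressor")]))
```

The enumeration in `ensemble_activity` grows as 2^n with the number of sites, so it is only practical for a handful of sites; larger modules would call for dynamic programming.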

Probabilistic alignment. The standard approach to examining the relationship among multiple sequences has been to align them using information-theoretic scores. Such scores are usually not motivated by realistic models of evolution, and are instead optimized for high speed. With computing power now being orders of magnitude greater than what it was three decades ago, when the basic alignment algorithms were formulated, researchers are actively re-examining the very fundamentals of sequence alignment. We are interested in developing a more realistic framework for multiple sequence comparison that models evolutionary events more explicitly (Bioinformatics ’07) and is aware of the organization of functional elements in the sequence (PLoS Gen. ’09, PLoS Comp. Bio. ’09). This involves the development of probabilistic models, and inference on these models using genomic data, which is often a computationally challenging task. We have also been developing suitable benchmarks for alignment programs (BMC Bioinf. ’10), especially those meant for non-coding DNA sequences.