The bioinformatics problems that we focus on pertain to the phenomenon of “gene regulation” and its evolution. Gene regulation refers to how genes in a cell are switched on (or off) to determine the cell’s functions. It is the reason why, for example, skin and muscle cells are different despite having the same DNA. It is central to a range of biological phenomena such as development and disease. Moreover, evolution of gene regulation underlies the amazing diversity of life forms around us. We develop innovative computational methods, based on probabilistic inference, machine learning, and biophysics-inspired models, to answer unsolved and topical questions related to gene regulation in diverse biological processes. We collaborate extensively with experimental biologists to confirm the predictions made by our models.
One of our main research goals is to develop novel approaches to identify regulatory influences on genes. Different biological processes involve different sets of genes acting together in precisely regulated patterns. Thus, to understand a biological process one needs to identify the participating genes and their regulators. We develop computational methods to analyze large biological data sets for clues about regulatory influences on specific genes.
We also strive to build powerful approaches for evolutionary comparisons of regulatory sequences. Cross-species comparison helps us understand how gene regulation and its encoding in DNA evolves. This is a fundamental open problem of biology, and is linked to the diversity of life forms. Signatures of evolution present in regulatory sequences, once characterized, can also be a powerful guide to their genome-wide identification.
Four specific directions of research in our lab are outlined below.
Most of us are aware that genes are DNA segments that encode properties of living organisms; in fact, we even know how to “read” gene sequences and interpret their function. What is less commonly appreciated is that there is another major class of DNA segments called “enhancers” that control the activities of genes, and arguably play as important a role as genes do. Indeed, 90% of known disease-related mutations in the human genome lie outside of genes, in regulatory sequences such as enhancers. However, enhancers encode molecular function in a “language” different from the genetic code, and are notoriously hard to read. Our research over the last ten years has sought to decode the “cis-regulatory code” of enhancers. This will ultimately allow us to predict consequences of small changes to non-gene DNA sequences, such as single nucleotide polymorphisms, that underlie interpersonal differences in disease susceptibility and drug response.
Enhancers function by influencing the activity (“expression”) of nearby genes. Thus, to decipher their “code” means to be able to predict the expression level of a gene from the sequence of an enhancer. Given numerous examples of enhancers and corresponding gene expression levels, we should be able to learn a function that maps enhancer sequence to expression. However, the domain of this function is vast, training data are sparse, and the function represents complex biochemical phenomena, making the inference problem daunting. Furthermore, our goal is not merely to learn any function but one that is well grounded in biological reality. This function is the main focus of our studies of the cis-regulatory code.
Cancer is a disease of the genome and scientists are applying the very best that modern genomics (as well as other “omics”) technology has to offer today to cure cancer. One such approach involves measurement of gene expression under varying experimental conditions (e.g., cancer versus healthy), identifying genes whose expression varies across those conditions, and finally determining transcription factors (TFs) that regulate those genes. Such regulatory relationships are often represented in a network form, called the “gene regulatory network” or GRN. There is tremendous interest today in GRNs underlying cancer progression as well as patient sensitivity to cancer drugs.
Animal behavior arises from the coordinated activities of cells in the brain, commonly modeled with neuronal networks. A significant body of work has shown that behavior is also shaped by the activities of genes that operate in brain cells, with significant changes in brain gene expression profiles accompanying behavioral responses to particular environmental stimuli. These findings suggest that a second layer of network biology – that of gene regulatory networks (GRNs) – also underlies behavior. Our research over the last ten years has sought to reveal GRNs associated with social behavior, a topic that is beginning to be widely appreciated in the behavior research community.
Evolution of gene regulation
Enhancers, introduced above, are intriguing not only in terms of their encoding, but also from an evolutionary perspective. Major questions remain unanswered regarding their evolution, e.g., how long such evolution might have taken, and how such elements can change in sequence while maintaining their function. Characterizing patterns of evolution present in enhancers can also help discover novel enhancers in less data-rich species as well as help interpret DNA variations within a population. Driven by these goals, we have developed a variety of approaches to model enhancer evolution.
Comprehensive maps of gene regulation in various organisms. “Transcription factors” are a type of protein that binds to short segments of DNA near a gene to regulate the gene’s activity. A major challenge today is to predict binding sites (and thus the target genes) of a transcription factor in the entire genome. Such a “map” of gene regulation provides valuable clues for investigating the details of a biological process, eventually leading to high-level insights about the process. We have developed a series of computational tools to meet this challenge. These tools rely on a probabilistic description (“motif”) of the transcription factor binding sites, and scan the genome for matches to the motif using statistical scores. In contrast to the common approach of finding individual above-threshold matches to the motif, we have pioneered the examination of longer DNA segments for evidence of one or more matches to the motif, without imposing any thresholds. Such evidence is quantified in the form of the likelihood of the sequence under a hidden Markov model (HMM) describing regulatory sequences. This new manner of scoring motif matches helps reduce mispredictions, and mirrors the thermodynamic nature of the protein-DNA interactions underlying the regulatory process. We use this approach to catalog the genes targeted by each transcription factor, and then to assign biological functions to the transcription factor by examining the common functions of its target genes. This strategy has yielded significant biological findings in the context of several processes and species. For example, by analyzing the human genome we found that age-dependent changes in cells are regulated by a transcription factor called NFκB. This was then experimentally validated by our collaborators, who showed that blocking the action of this factor could rejuvenate old skin cells (in mice) to appear like younger cells (Genes & Dev. ’07, Genome Res. ’08).
We used a similar method to predict transcription factors that regulate genes in the brain of the zebra finch, as it learns how to sing (Nature ’10). This songbird is a model organism for studying vertebrate brain and behavior, and for human neuroscience.
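The contrast between thresholded and threshold-free motif scoring described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the published tools: the 4-bp motif, the uniform background, the site probability `p_site`, and the function names are all hypothetical, and the model is a deliberately simple two-state (background or binding site) HMM.

```python
import math

# Hypothetical 4-bp motif with consensus AGCT; each column is P(base | motif position).
MOTIF = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = 0.25  # assumed uniform background frequency for every base


def site_ratio(window):
    """Likelihood ratio of one window under the motif versus the background."""
    ratio = 1.0
    for column, base in zip(MOTIF, window):
        ratio *= column[base] / BACKGROUND
    return ratio


def thresholded_hits(seq, log_threshold):
    """Common approach: count individual windows whose log ratio clears a threshold."""
    w = len(MOTIF)
    return sum(
        1
        for i in range(len(seq) - w + 1)
        if math.log(site_ratio(seq[i:i + w])) >= log_threshold
    )


def hmm_log_ratio(seq, p_site=0.05):
    """Threshold-free score: log likelihood ratio of the whole segment under a
    motif-plus-background model versus a background-only model.

    r[i] accumulates, over every placement of non-overlapping sites in seq[:i],
    the likelihood relative to background (a forward recursion). Plain floats
    are fine for short segments; long sequences would need log-space arithmetic.
    """
    w = len(MOTIF)
    r = [1.0] + [0.0] * len(seq)
    for i in range(1, len(seq) + 1):
        r[i] = (1.0 - p_site) * r[i - 1]  # base i-1 emitted by the background
        if i >= w:
            # a binding site occupies seq[i-w:i]
            r[i] += p_site * site_ratio(seq[i - w:i]) * r[i - w]
    return math.log(r[-1])


with_sites = "TTTTAGCTTTTTAGCTTTTT"
without_sites = "TTTTTTTTTTTTTTTTTTTT"
print(thresholded_hits(with_sites, 3.0), thresholded_hits(without_sites, 3.0))
print(hmm_log_ratio(with_sites) > hmm_log_ratio(without_sites))
```

Because the forward recursion adds up the contribution of every window, several weak matches in one segment can jointly raise the score even when none of them would clear a fixed cutoff on its own.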
Gene regulation and social behavior. One of the most noteworthy successes of our methodology was in providing the first glimpses of how genes respond to social signals received by the brain. This was the result of a collaboration with Prof. Gene Robinson, who studies socially regulated genes in the honeybee brain. We discovered potential regulators of such genes using our motif scanning methods, after adapting the methods to certain peculiarities of the honeybee genome (PNAS ’06). This and follow-up studies have demonstrated a robust association among social behavior, brain gene activity, and transcription factor binding sites. More recently, we have shown that gene regulation is also intimately linked to the evolution of a social behavior, viz., aggression in honeybees (PNAS ’09).
Cis-regulatory modules and their discovery through comparative genomics. Another major thrust of our research is the genome-wide discovery of “cis-regulatory modules”, which are DNA segments harboring several binding sites in close proximity. A module encodes the combined action of multiple transcription factors that turns a gene on and off in precise patterns. Modules are hard to find because they may not lie next to the gene, and the key to their discovery is the clustering of binding sites within them. We had previously developed an HMM-based approach to quantify this property and predict module locations (ISMB 2003). Since then, we have worked on fundamental improvements to this approach through comparative genomics. Put simply, we examine not just the genome of interest, but also the genomes of closely related species. We use inter-species genome alignments to pinpoint evolutionarily conserved segments, and exploit this information in our module discovery tools. We have developed probabilistic models of the evolution of modules, in order to do this in a principled manner. This was a major technical achievement, since no such models had previously existed, let alone the methods required for inference under the models. We combined our HMM framework with these evolutionary models to build efficient probabilistic tools that achieve improved accuracy in module prediction in the Drosophila genome (PLoS Comp. Bio. ’07, ’09). Another significant advance in our tools was to make cross-species comparisons robust to alignment errors, by efficiently summing over all possible alignments of two sequences. We also adapted these tools to be usable through a convenient web interface in real-time (NAR ’08).
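The idea of summing over all possible alignments, rather than committing to a single best one, can be illustrated with a small dynamic program. The sketch below is an assumption-laden toy, not the evolutionary models referred to above: the function name and the `match`, `mismatch`, and `gap` weights are hypothetical, gaps are not affine, and the weights are not normalized probabilities as they would be in a proper pair HMM.

```python
def alignment_sums(x, y, match=0.9, mismatch=0.1, gap=0.2):
    """Return (best, total) for two sequences.

    best  : weight of the single highest-weight alignment (Viterbi-style).
    total : sum of the weights of all alignments (forward-style).
    An alignment's weight is the product of its column weights.
    """
    n, m = len(x), len(y)
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    total = [[0.0] * (m + 1) for _ in range(n + 1)]
    best[0][0] = total[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            candidates = []  # weights of the best alignment via each last column
            acc = 0.0        # running sum over all alignments ending at (i, j)
            if i > 0 and j > 0:
                s = match if x[i - 1] == y[j - 1] else mismatch
                candidates.append(s * best[i - 1][j - 1])
                acc += s * total[i - 1][j - 1]
            if i > 0:  # x[i-1] aligned to a gap
                candidates.append(gap * best[i - 1][j])
                acc += gap * total[i - 1][j]
            if j > 0:  # y[j-1] aligned to a gap
                candidates.append(gap * best[i][j - 1])
                acc += gap * total[i][j - 1]
            best[i][j] = max(candidates)
            total[i][j] = acc
    return best[n][m], total[n][m]


best, total = alignment_sums("ACGT", "AGGT")
print(best, total)
```

Both quantities cost the same quadratic time, which is why the sum over an exponentially large set of alignments remains tractable. Setting every weight to 1 makes `total` simply count the alignments.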
Evolution of modules. The wonderful diversity of life forms we see in Nature is the result of evolutionary forces, whose footprints are visible in DNA. It is becoming increasingly clear that a majority of the morphological diversity (i.e., variations of “form”) comes from changes in gene regulation, rather than the genes themselves. We have used inter-species sequence comparison to understand how modules are shaped by evolution. We compared modules from 12 Drosophila genomes and found statistical patterns in how they differ among these species (PLoS Gen. ’09), e.g., how binding sites turn over during evolution. Such patterns can then help refine the evolutionary models used in our module discovery tools.
Alignment-free comparison of modules. The ability to quantify the similarity between sequences, through alignment algorithms, has been a keystone of bioinformatics for over three decades now. However, alignment-based methods break down when similarity must be detected between sequences that are greatly diverged or that are not descended from a common ancestral sequence. Recognizing that even in these scenarios, regulatory sequences with similar function may have detectable shared features (common binding sites), we have developed a suite of alignment-free scores of sequence similarity. These are statistical scores based on frequency distributions of short words in the sequences being compared. We took special measures to incorporate the properties of regulatory sequences, e.g., variation within binding sites and deviations from genome-wide nucleotide frequencies, into these scores (ISMB 2007). Importantly, we showed how such scores could be used to discover novel modules, by scanning the genome for sequences similar to a set of known modules (Dev. Cell ’09, Genome Bio. ’08). We validated our discoveries experimentally in two model systems: fruit fly and mouse. We thus established the new paradigm of “motif blind” module discovery, making this critical task possible in the common scenario where the transcription factors relevant to a biological system are not known a priori.
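To make the notion of a word-frequency similarity score concrete, the following Python sketch computes two of the simplest such measures: the raw shared-word count (often called the D2 statistic) and its cosine-normalized form. It is only a baseline illustration with hypothetical function names and toy sequences; it omits the refinements mentioned above, such as tolerance for variation within binding sites and correction for background nucleotide frequencies.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2_score(a, b, k=3):
    """Sum over all words of (count in a) * (count in b)."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    return sum(n * cb[word] for word, n in ca.items())


def cosine_similarity(a, b, k=3):
    """D2 normalized by the lengths of the two word-count vectors (0 to 1)."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    dot = sum(n * cb[word] for word, n in ca.items())
    norm_a = math.sqrt(sum(n * n for n in ca.values()))
    norm_b = math.sqrt(sum(n * n for n in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy sequences: `shuffled` reuses the words of `query` in a different order,
# so no good alignment exists, yet the word content overlaps.
query = "ACGTACGTTTGACA"
shuffled = "TTGACAGGACGTAC"
unrelated = "GGGGGGCCCCCCGG"
print(cosine_similarity(query, shuffled), cosine_similarity(query, unrelated))
```

Scanning a genome with such a score amounts to sliding a window along the sequence and ranking windows by their similarity to a set of known modules.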
Models of regulatory function. Our goal goes beyond the discovery and cataloging of modules, to the explanatory and predictive principles underlying their function, the so-called “regulatory code”. We have proposed two strategies for integrating DNA sequence information with the cellular context to predict whether a module will turn a gene on. The first strategy is based on logistic regression. Here, gene activity is dictated by a linear combination of contributions from multiple transcription factors. Each factor’s contribution, in turn, is determined heuristically by the product of its cellular concentration and its binding affinity to the module. Our second strategy is based on principles of statistical mechanics. It models gene activity as the macroscopic behavior of an ensemble of “microstates” comprising different combinations of transcription factor molecules bound to the DNA. Both approaches have been highly successful in explaining gene activity patterns underlying development in Drosophila, a model biological system (manuscripts in review). We have used these strategies to quantify the regulatory influence of each transcription factor on a gene, and to reveal mechanistic insights into the regulatory process. We believe our quantitative models will fundamentally alter how we analyze regulatory sequences in the future.
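The two modeling strategies can be sketched side by side in Python. Everything below is a simplified illustration under explicit assumptions rather than the models in the manuscripts: the function names, parameter values, and the polymerase weight `q_pol` are hypothetical, and the statistical-mechanical version assumes independent, non-overlapping sites whose bound factors each scale polymerase recruitment by a factor `omega` (greater than 1 for an activator, less than 1 for a repressor).

```python
import math
from itertools import product


def logistic_activity(concentrations, affinities, weights, bias=0.0):
    """Strategy 1: activity is a logistic function of a weighted sum, where each
    factor contributes weight * concentration * binding affinity."""
    z = bias + sum(
        w * c * a for w, c, a in zip(weights, concentrations, affinities)
    )
    return 1.0 / (1.0 + math.exp(-z))


def thermodynamic_activity(sites, q_pol=0.1):
    """Strategy 2: activity is the probability that polymerase is bound,
    computed by enumerating every microstate (which sites are occupied).

    sites : list of (q, omega), with q = concentration * binding affinity and
            omega = assumed interaction of the bound factor with polymerase.
    """
    z_off = 0.0  # total weight of microstates without polymerase
    z_on = 0.0   # total weight of microstates with polymerase bound
    for state in product((0, 1), repeat=len(sites)):
        weight, boost = 1.0, 1.0
        for bound, (q, omega) in zip(state, sites):
            if bound:
                weight *= q
                boost *= omega
        z_off += weight
        z_on += weight * q_pol * boost
    return z_on / (z_on + z_off)


basal = thermodynamic_activity([])
activated = thermodynamic_activity([(2.0, 10.0)])
repressed = thermodynamic_activity([(2.0, 0.1)])
print(basal, activated, repressed)
print(logistic_activity([1.0, 0.5], [0.8, 0.4], [2.0, -3.0]))
```

Explicit enumeration grows exponentially with the number of sites, so it is practical only for small examples; realistic modules would call for a dynamic-programming or closed-form evaluation of the same sums.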
Probabilistic alignment. The standard approach to examining the relationship among multiple sequences has been to align them using information-theoretic scores. Such scores are usually not motivated by realistic models of evolution, and are instead optimized for high speed. With computing power now being orders of magnitude greater than what it was three decades ago, when the basic alignment algorithms were formulated, researchers are actively re-examining the very fundamentals of sequence alignment. We are interested in developing a more realistic framework for multiple sequence comparison that models evolutionary events more explicitly (Bioinformatics ’07), and is aware of the organization of functional elements in the sequence (PLoS Gen. ’09, PLoS Comp. Bio. ’09). This involves development of probabilistic models, and inference on these models using genomic data, which is often a computationally challenging task. We have also been busy developing suitable benchmarks for alignment programs (BMC Bioinf. ’10), especially those meant for non-coding DNA sequences.