CS 598SS: Advanced Bioinformatics

Announcement (updated 5 pm, 8/19):

The first several weeks of the class will be on Zoom for both sections (SS and SSO). I will decide later whether this arrangement should continue beyond that.

The Zoom link for the class (you will need an Illinois account to enter) is: https://illinois.zoom.us/j/96673460522?pwd=dEVocTBKR2RVWXJxcmdvT0tsbWJ4Zz09

If you are registered for the class (by 8/19), you should be able to log on to the course web site (see URL below).

If you are not registered but plan to register later, please email the instructor at sinhas@illinois.edu with the subject "598", requesting to be added to the course web site.


Course Information

Graduate Course in Bioinformatics. Fall 2020.

Instructor: Saurabh Sinha. http://sinhalab.net

Meets: T/Th 11:00 AM - 12:15 PM, in 0216 Siebel Center and online. CRNs 46042 (lec/disc), 42377 (online).

Web site: https://learn.illinois.edu/course/view.php?id=51603

This is a graduate course on bioinformatics and biological data science, being offered this Fall in the Computer Science department, under the rubric CS 598SS. This course introduces a selection of topics in bioinformatics. Methodologically, it will focus on probabilistic methods and statistical analysis, as well as selected topics in machine learning. Equally important will be applications of these techniques to current topics in genomics, especially regulatory genomics and systems biology. This course is being taught as a prototype of the future course ‘ML in Bioinformatics’.

Who this is for: The course is designed for graduate students, and will help students aspiring to become bioinformatics researchers. We will discuss how to ask and answer questions about what goes on in cells, what it means, and what makes it all work, using high-throughput data of various types and appropriate statistical and computational methods. The course may also appeal to students who are interested in data science in general and are looking for interesting applications.

Who this is not for: The course is less ideal for students interested in a casual exposure to the buzz surrounding machine learning and bioinformatics.

Grading: The tentative grade breakdown is: Project (30%), Paper reviews (20%), Paper presentation (20%), and one midterm exam (30%).

Tentative format: The majority of the sessions will be led by the instructor and held online. Approximately ten sessions will be student-led; some of these may be held in a classroom, in which case they will also be available to online students.


Overview of Topics

Introduction to Probability/Statistics. Sample topics: Bayesian Inference, Hidden Markov Models, Sampling with Markov chains.

Introduction to Machine Learning. Sample topics: Dimensionality reduction, Classification (SVM, Random Forests, CNN), Regression and regularization.

Introduction to Network Analysis with Random Walks.

Introduction to Bioinformatics Problems. Sample topics: Gene regulation, enhancers, motif finding, cistromics, epigenomics, regulatory networks, single cell transcriptomics, variant interpretation, gene prioritization, pharmacogenomics.

Note: topics related to processing of next-generation sequencing data (e.g., assembly, variant calling) will not be covered in this course; rather, the emphasis will be on biological discovery using data arising from high-throughput technologies, in conjunction with rigorous computational methods.

Session details:

1. Basic molecular biology.

2. Basic statistics: Hypothesis testing. Null and alternative hypotheses. Difference of two groups. Parametric and non-parametric tests. Testing by permutation. Multiple hypothesis correction. Bio examples: gene expression and differential expression.

3. Probabilistic modeling: Bayesian Inference, priors, likelihood. Likelihood maximization, MAP and MP. Example: Single die model with counts. Model comparison, LRT, BIC, Bayes factor. Bio examples: DNA sequence models.

4. Transcription factor biology, motifs.

5. Bioinformatics paper: Unsupervised Learning of Multiple Motifs in Biopolymers Using Expectation Maximization. Motifs and Position Weight Matrices, Expectation-Maximization, Information Content.

6. Classification (general concept, k-NN, naive Bayes).

7. Classification (Linear classifiers, Logistic Regression, evaluations).

8. Bioinformatics paper: Accurate inference of transcription factor binding from DNA sequence and chromatin accessibility data. ChIP data. Epigenomics. DNA accessibility and histone modifications data. Data integration.

9. Graphical models. Hidden Markov Models.

10. Bioinformatics paper: Discovery and characterization of chromatin states for systematic annotation of the human genome.

11. Classification with Random Forests.

12. Classification with Support Vector Machines.

13. Bioinformatics paper: Inferring Regulatory Networks from Expression Data Using Tree-Based Methods. Gene expression data. Gene Regulatory Networks.

14. Bioinformatics paper: A prior-based integrative framework for functional transcriptional regulatory network inference. Multi-omics data integration.

15. Clustering, including ARI.

16. Dimensionality Reduction (PCA, t-SNE).

17. Bioinformatics paper: A single-cell expression simulator guided by gene regulatory networks. Single cell RNA-seq. Differentiation.

18. Regression.

19. Random Walks.

20. Bioinformatics paper: Knowledge-guided gene prioritization reveals new insights into the mechanisms of chemoresistance. Cancer pharmacogenomics. Gene prioritization. Network-guided genomics.

21. Bioinformatics paper: Characterizing gene sets using discriminative random walks with restart on heterogeneous biological networks. Network-guided genomics. Gene set membership prediction.

22. Bioinformatics paper: Enhanced Regulatory Sequence Prediction Using Gapped k-mer Features. Predict regulatory function from sequence.

23. Artificial Neural Networks.

24. Deep Neural Networks.

25. Variant interpretation, GWAS, eQTL.

26. Bioinformatics paper: Predicting effects of noncoding variants with deep learning–based sequence model.

27. Bioinformatics paper: Principled Multi-Omic Analysis Reveals Gene Regulatory Mechanisms Of Phenotype Variation.

28. Evolutionary models.