COMP-364: Tools for the Life Sciences

 

MTR 10:35am-11:25am

ENGTR – Trottier Building 2120
Jan 7th 2016 – April 15th 2016
Prof. Mike Hallett
Location: McIntyre Building 903
Office Hours: Tuesday 9:30am @ Trottier lobby
Website 


Bioinformatics is the use of computation, statistics and mathematics to investigate problems and test hypotheses in biological systems and disease. This course aims to provide students from the life sciences and clinical studies (e.g. biology, cell biology, biochemistry, immunology, physiology) with instruction in the basic techniques of bioinformatics. The course makes extensive use of bioinformatic applications related to breast cancer, since this disease has been extensively investigated using modern genomics and a rich toolkit of bioinformatic methods exists for it.

The course assumes no previous experience in computer science, statistics or genomics, although a cursory knowledge of these would certainly help. Regardless, students will leave this course with the ability to program in R, a computer language specifically designed for statistics with a long history of application in bioinformatics. Students will learn specific techniques in the analysis of DNA information (single nucleotide polymorphisms, copy number variations, chromosomal aberrations, association studies), RNA expression (class discovery, class distinction, class prediction), pathway analysis, survival analysis, and integration of different levels of gene and post-gene regulation. Within these applications, students will be introduced to some genomic technologies such as Next Generation Sequencing (NGS), with emphasis on DNA-, exome- and RNA-seq, microarrays, and protein expression arrays.

All of these concepts from bioinformatics are developed using tools from computation and statistics, including programming, optimization, hypothesis testing, probabilistic models, association tests, and some basic statistical tests. Additional concepts from computer science include the basics of programming, software versioning systems (e.g. GIT), cloud computing, recursion, and introductory aspects of algorithm design.


Teaching Assistant
Mohamed Ghadie
Office Hours: Thursday 1pm, McConnell 333


Information regarding computer infrastructure for the course:

The slides for M1 L3 (Installing R, RStudio, GIT and accessing BitBucket) are here.


 


Course Notes

Links to Software and Tutorials

Textbooks, Manuals and On-line Courses (electronic)

Video Series and On-line Courses:

Data

Related and Alternative Software


Course Evaluation:

  • Assignment 0 (due February 5th, 2016) 10% of overall grade
  • Assignment 1 (due February 25th, 2016) 10% of overall grade
  • Assignment 2 (due March 17th, 2016) 10% of overall grade
  • Midterm in class (March 10th, 2016) 20% of overall grade
  • Assignment 3 (due March 29th, 2016) 10% of overall grade
  • Assignment 4 (due April 18th, 2016) 10% of overall grade
  • Final Exam 30% of overall grade

Module 1 – The Basics and Programming.

Lecture 1: What is bioinformatics? And some basic resources.

Links to related material:


Lecture 2: Breast cancer informatics: the example for the course.

Links to related material:


Lecture 3: R, RStudio, and a Unix Primer

Lecture 4: R basics, data types, operators, c, sets, vectors, matrices, arrays, lists

Lecture 5: R factors, data.frames, conditional execution, looping
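
The R building blocks named in Lectures 4 and 5 can be sketched in a few lines. This is only an illustrative sketch: the gene names, patient identifiers, subtype labels and expression values below are made up for the example and do not come from the course data.

```r
# Vectors are built with c(); arithmetic operators work element-wise.
expr_values <- c(7.2, 8.5, 6.9, 9.1)   # hypothetical expression values
doubled <- expr_values * 2

# Set operations on vectors.
shared <- intersect(c("TP53", "BRCA1", "ESR1"), c("ESR1", "TP53", "MYC"))

# A matrix is a vector with two dimensions; an array generalises this.
m <- matrix(1:6, nrow = 2)

# A list can mix types.
patient <- list(id = "P1", age = 54, genes = c("TP53", "ESR1"))

# Factors store categorical data; a data.frame holds one column per variable.
df <- data.frame(
  id = c("P1", "P2", "P3", "P4"),
  subtype = factor(c("luminal", "basal", "luminal", "her2")),
  expr = expr_values,
  stringsAsFactors = FALSE
)

# Conditional execution inside a loop: collect patients with expr > 8.
high <- character(0)
for (i in seq_len(nrow(df))) {
  if (df$expr[i] > 8) {
    high <- c(high, df$id[i])
  }
}
print(high)
```

In practice the loop could be replaced by the vectorised form df$id[df$expr > 8]; the explicit loop is shown only because looping and conditionals are lecture topics.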

Lecture 6: Gene expression in R
You will need hucMini.R, available through GIT, for M1.L6.

Links to related material:

Lecture 7: Exploration of prognostic value of TP53 expression in breast cancer

  1. differential expression
  2. patient clinical outcome
  3. descriptive statistics
  4. hypothesis testing: t-test, Wilcoxon, Kolmogorov-Smirnov tests in R
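
As a rough illustration of the three tests listed above, the sketch below runs them on simulated data. The two groups, their sizes and their means are assumptions invented for the example; they are not TP53 measurements and not the course dataset.

```r
# Simulated data only: hypothetical log2 expression for two patient groups.
set.seed(364)
good_outcome <- rnorm(40, mean = 8.0, sd = 1)
poor_outcome <- rnorm(40, mean = 9.0, sd = 1)

# Descriptive statistics for each group.
print(summary(good_outcome))
print(summary(poor_outcome))

# Welch two-sample t-test (R's default; does not assume equal variances).
t_res <- t.test(good_outcome, poor_outcome)

# Wilcoxon rank-sum test: a rank-based alternative without a normality assumption.
w_res <- wilcox.test(good_outcome, poor_outcome)

# Two-sample Kolmogorov-Smirnov test: compares the two empirical distributions.
ks_res <- ks.test(good_outcome, poor_outcome)

p_values <- c(t = t_res$p.value, wilcoxon = w_res$p.value, ks = ks_res$p.value)
print(p_values)
```

All three functions are in the stats package that loads with R by default, so no extra installation should be needed.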

Lecture 8: R probability distributions

  1. Example using the normal distribution: dnorm, pnorm, qnorm, rnorm.
  2. Simple plotting: hist, plot, lines
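
A minimal sketch of the four normal-distribution functions and the plotting calls named above. In R the density is dnorm, the cumulative distribution function is pnorm, the quantile function is qnorm, and random draws come from rnorm. The mean of 10 and standard deviation of 2 are arbitrary values chosen for illustration.

```r
# Density of the standard normal at 0, which equals 1 / sqrt(2 * pi).
d0 <- dnorm(0)

# Cumulative probability P(X <= 1.96) for a standard normal.
p196 <- pnorm(1.96)

# The quantile function is the inverse of pnorm.
q975 <- qnorm(0.975)

# Random draws; mean and sd are arbitrary illustration values.
set.seed(364)
x <- rnorm(1000, mean = 10, sd = 2)

# Histogram on the density scale with the theoretical curve overlaid.
hist(x, freq = FALSE, main = "Simulated normal sample")
grid <- seq(min(x), max(x), length.out = 200)
lines(grid, dnorm(grid, mean = 10, sd = 2))

# plot() with type = "l" draws the cumulative distribution as a line.
plot(grid, pnorm(grid, mean = 10, sd = 2), type = "l",
     xlab = "x", ylab = "P(X <= x)")
```

The same d/p/q/r prefix pattern applies to R's other distributions (for example dbinom, pbinom, qbinom, rbinom).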

Lecture 9: R functions, scoping and algorithmics

  1. Writing functions
  2. Variable scoping
  3. Top down recursive approaches (examples)
  4. Bottom up dynamic programming approaches (examples)
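
To contrast the two approaches listed above, here is a small sketch using Fibonacci numbers. The choice of Fibonacci is an assumption made for illustration; the lecture's own examples are not given in this outline.

```r
# Top-down: the function calls itself on smaller inputs.
# Simple, but it recomputes the same subproblems many times.
fib_recursive <- function(n) {
  if (n <= 1) return(n)
  fib_recursive(n - 1) + fib_recursive(n - 2)
}

# Bottom-up dynamic programming: fill a table from the smallest
# subproblem upward, so each value is computed exactly once.
fib_bottom_up <- function(n) {
  if (n <= 1) return(n)
  memo <- numeric(n + 1)   # memo[i + 1] stores F(i), since R indexes from 1
  memo[2] <- 1             # F(1) = 1; memo[1] is already 0 for F(0)
  for (i in 2:n) {
    memo[i + 1] <- memo[i] + memo[i - 1]
  }
  memo[n + 1]
}

print(fib_recursive(10))
print(fib_bottom_up(10))
```

The variable memo exists only inside fib_bottom_up, which also illustrates function-level scoping: it is not visible from the global environment after the call returns.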

Lecture 10: Input/output; packages; Bioconductor

  1. Reading data into R and writing data out of R
  2. R libraries and packages
  3. The Bioconductor Project

Module 2 – RNA Level Analysis of Breast Carcinoma

Lecture 1: Class Discovery – Discovering subtypes in breast cancer mRNA expression data.

  1. Gene clusters
  2. Patient clusters: M1.L2 Breast Cancer Subtypes
  3. Distance measures (Euclidean and Pearson correlation distance)
  4. k-means algorithm (why multivariate (gene) analysis has clear advantages over single-gene analysis)
  5. Hierarchical clustering.
  6. Visualization through heatmaps.
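
The clustering steps above can be sketched with base R alone. This sketch does not use hucMini.R or k-means.eg.R from the course repository; the expression matrix is simulated, and the choices of k = 2 and average linkage are assumptions made for illustration.

```r
set.seed(364)
# Hypothetical expression matrix: 20 patients (rows) x 5 genes (columns),
# simulated as two groups with different mean expression.
expr <- rbind(
  matrix(rnorm(50, mean = 0), nrow = 10),
  matrix(rnorm(50, mean = 4), nrow = 10)
)
rownames(expr) <- paste0("patient", 1:20)
colnames(expr) <- paste0("gene", 1:5)

# Euclidean distance between patients.
d_euclid <- dist(expr, method = "euclidean")

# Pearson correlation distance: 1 - correlation between patient profiles.
# cor() correlates columns, so transpose to compare patients.
d_pearson <- as.dist(1 - cor(t(expr), method = "pearson"))

# k-means with k = 2; nstart repeats the random initialisation.
km <- kmeans(expr, centers = 2, nstart = 10)

# Hierarchical clustering on the Euclidean distances, cut into two groups.
hc <- hclust(d_euclid, method = "average")
hc_groups <- cutree(hc, k = 2)

# Cross-tabulate the two partitions to see how far they agree.
print(table(kmeans = km$cluster, hclust = hc_groups))

# Heatmap with base R; rows are patients, columns are genes.
heatmap(expr, scale = "none")
```

Swapping d_euclid for d_pearson in the hclust call shows how the choice of distance measure can change the resulting clusters.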

You will need hucMini.R (heatmap.simple() code) and k-means.eg.R for this lecture (from GIT).

Lecture 2: Class Prediction – Classifying patient clinical outcome.

You will need naiveBayes.R for this (from GIT).

  • Centroid-based methods
  • Naive Bayes classifiers
  • Cross-validation
  • Confounding in predictions

Lecture 3: Measuring performance: Brief introduction to survival analysis

  • True/false positives and negatives
  • Accuracy; Product of Accuracy
  • Kaplan-Meier
  • log-rank test
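
The first two items above can be illustrated with a confusion matrix. The outcome labels and predictions below are invented for the example, and "poor" is treated as the positive class by assumption. Kaplan-Meier curves and the log-rank test are not shown here.

```r
# Hypothetical true outcomes and classifier predictions for 10 patients.
lv <- c("poor", "good")
truth <- factor(c("poor", "poor", "poor", "poor", "good",
                  "good", "good", "good", "good", "good"), levels = lv)
predicted <- factor(c("poor", "poor", "poor", "good", "good",
                      "good", "good", "good", "poor", "good"), levels = lv)

# Rows are predictions, columns are the true outcomes.
confusion <- table(predicted = predicted, truth = truth)
print(confusion)

tp <- confusion["poor", "poor"]   # predicted poor, truly poor
fn <- confusion["good", "poor"]   # predicted good, truly poor
fp <- confusion["poor", "good"]   # predicted poor, truly good
tn <- confusion["good", "good"]   # predicted good, truly good

accuracy    <- (tp + tn) / sum(confusion)
sensitivity <- tp / (tp + fn)     # true positive rate
specificity <- tn / (tn + fp)     # true negative rate
print(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity))
```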

Lecture 4: Pathway Analysis

  • Hypergeometric test
  • Fisher’s Exact Test
  • GSEA/MSigDB
  • Kolmogorov-Smirnov Test
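
A sketch of how the hypergeometric test and the one-sided Fisher's exact test answer the same over-representation question. All counts below (universe size, pathway size, number of differentially expressed genes, overlap) are hypothetical numbers chosen for illustration.

```r
# Hypothetical counts for a pathway over-representation question.
universe     <- 20000  # genes measured
pathway_size <- 200    # genes annotated to the pathway
de_size      <- 500    # differentially expressed (DE) genes
overlap      <- 15     # DE genes that are also in the pathway

# Hypergeometric test: P(overlap or more) when drawing de_size genes
# at random from the universe. phyper() gives P(X <= q), so use
# overlap - 1 with the upper tail.
p_hyper <- phyper(overlap - 1, pathway_size, universe - pathway_size,
                  de_size, lower.tail = FALSE)

# The same question as a 2x2 table for Fisher's exact test.
# Columns: in pathway / not in pathway. Rows: DE / not DE.
tab <- matrix(
  c(overlap,                pathway_size - overlap,
    de_size - overlap,      universe - pathway_size - de_size + overlap),
  nrow = 2,
  dimnames = list(c("DE", "notDE"), c("inPathway", "notInPathway"))
)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

print(c(hypergeometric = p_hyper, fisher = p_fisher))
```

The two p-values should agree, since the one-sided Fisher's exact test on a 2x2 table is based on the same hypergeometric distribution.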

Module 3 – DNA Level Investigations of Breast Carcinoma

Lecture 1: Cancer Genomes and Next Generation Sequencing (NGS)

Links to bioinformatics tools:

Links to relevant genomics:

Links to relevant biology and medicine:

  • Course from C Kim, K Haigis (MIT): Cancer.
  • Moncunill V et al. (2014) Comprehensive characterization of complex structural variations by directly comparing genome sequence reads. Nature Biotechnology. 32, 1106-1112. PMID:
  • Helleday T, Eshtad S, Nik-Zainal S (2014) Mechanisms underlying mutational signatures in human cancers. Nature Reviews Genetics 15, 585-598.

Lecture 2: Germline and Somatic Variations


  • MuTect
  • VarScan
  • SNVMix, JointSNVMix

Lecture 3: RNA-seq.

Lecture 4: Tumoral Heterogeneity, Clonal Complexity and Evolution.

Links to related material:

  • molecular evolution
  • tumoral evolution
  • tumoral heterogeneity
  • tumoral phylogenies

 

If you have a problem on a Mac (El Capitan) when you try to pull from BitBucket and get an error message containing the word ssh-askpass, do the following (the best solution I could figure out so far):

1. In your home directory, save the following file: ssh-askpass

2. In a terminal window on your Mac, do cd ~ (change directory into your home)

3. chmod +rx ssh-askpass

4. When you open RStudio, in the R session, type the following:

Sys.setenv(SSH_ASKPASS="/Users/yournamehere/ssh-askpass")

You have to change "yournamehere" to your user name. To find it, type whoami in the terminal window. Use plain straight quotes in the command, since R does not accept curly quotes.

 

 

On Mac El Capitan, in order to use git (which is installed on the machine by default), you might have to do the following. Open the Applications folder, choose Utilities, then open the Terminal app. At the command line in this window, enter the following command:

xcode-select --install

Now, when you type git --version in the terminal window, you should get back something like

git version 2.5.4 (Apple Git-61)

On Linux or Macs (Unix), create a directory called repos. To do this, in the terminal window type

cd ~
mkdir repos

In RStudio, under the Preferences/General menu, set your default working directory to ~/repos

Under the Preferences/Sweave menu, set “Weave Rnw files using” to knitr. This should allow you to see the course notes in RStudio.