Exploratory Analysis of Biological Data using R (2013)

Course Objectives

Before we can apply rigorous statistical tools to research data, we often need to approach our data intuitively and look for meaningful associations, surprising patterns, or irregularities in order to formulate hypotheses. This is commonly referred to as Exploratory Data Analysis (EDA). This workshop introduces the essential tools and strategies that are available through the free statistical workbench R. Participants should be able to modify the scripts and protocols we discuss for their own research tasks, identify potential problems with their own data, and define their statistical needs for cases in which expert advice is required. Case studies with common research scenarios, such as microarray data and flow cytometry, will emphasize practical skills. Writing your own R functions and analysis scripts will be introduced at the beginning of the workshop, and these skills will be built on gradually over the course of the lectures. Plotting and visualization are key elements of EDA, and we will gradually build skills from the elementary built-in routines, via their (sometimes bewildering) array of parameters, to sophisticated, publication-ready presentations.

Target Audience

Graduates, postgraduates, and PIs who need to design and execute strategies for data analysis but have little or no formal training in statistics and/or little familiarity with the R statistical workbench.
Prerequisites:
  • Your own laptop with R installed. If you do not have access to a laptop, you may borrow one from CBW. Please contact course_info@bioinformatics.ca for more information.
  • Completion of an online tutorial on the installation and basic use of R before the workshop.
Pre-Readings:
You need to complete our introductory R tutorial for the course beforehand. The tutorial is very accessible and designed for students who have never used R before. Please navigate to: http://www.biochemistry.utoronto.ca/steipe/R

Course Outline

Day 1

Module 1: The R Landscape (2013) (Faculty: Boris Steipe)
  • An overview of R's capabilities, how to expand them through large, community-contributed resources such as CRAN and Bioconductor, and how to keep abreast of best practices
  • Reading and writing data from common biological file formats, including numeric data, sequences, annotations, and networks
  • The difference between the various types of data objects in R and when each one is appropriate
  • Conditional selections and other filtering approaches
  • First experiments with writing R scripts
Module 2: Exploratory data analysis for biological data (2013) (Faculty: Boris Steipe)
In this module we will discuss the principles of Exploratory Data Analysis (EDA), how to compute descriptive statistical measures, how to smooth and transform data, and how to visualize data using R's powerful and flexible plotting routines. Topics include:
  • EDA principles
  • Descriptive statistics: mean/median and variance, quantiles, outliers
  • Transformations and smoothing techniques (e.g. Lowess)
  • Plotting in R: basics, advanced options, special packages and best practices
Module 3: Hypothesis testing for EDA (2013) (Faculty: Boris Steipe)
  • Common statistical tests and their underlying assumptions about the data
  • p-values, distributions, Z-scores and "significance"
  • False positive and false negative error rates
  • Bootstrap and resampling techniques
  • Multiple testing corrections: Bonferroni, family-wise error rate, false discovery rate
  • Non-parametric alternatives
  • Power calculation and sample size
Lab Practical: Working with your own data

Day 2

Module 4: Data reduction (2013) (Faculty: Boris Steipe)
Much of our biological data is very high-dimensional, and accordingly difficult to assess. However, powerful methods exist to simplify the problem. Topics include:
  • Visualizing multi-dimensional data
  • Data reduction with Principal Components Analysis
  • Using explicit models for data reduction
Module 5: Clustering Analysis (2013) (Faculty: Boris Steipe)
A great many clustering methods are in common use in the biological sciences, and that fact alone should warn you that none is appropriate for all data under all conditions. Topics include:
  • Calculating "distance" between (high-dimensional) data points
  • Clustering principles and methods: hierarchical, centroid-based, and information-based approaches in R
  • Assessing the quality of clustering results
  • Density estimation as an alternative
  • Outlook: classification
Module 6: Regression Analysis (2013) (Faculty: Boris Steipe)
  • Types of models for regression analysis in R
  • Linear regression
  • Calculating and plotting residuals
  • Predictions
  • Non-linear regression with arbitrary functions