Data Analysis and Visualization in R

Module IN2339

Credit: 6 ECTS.

Moodle: https://www.moodle.tum.de/enrol/index.php?id=31612

RE-EXAM:

The re-exam will be on Wednesday, April 5, 2017, from 2:00 to 4:00 pm in MI HS2 (lecture hall 2, informatics building).

When and where?

This lecture is given in the winter term:

Lectures on Tuesdays, 14:00-15:30, room 01.13.010. Starting 18.10.16.

In-class exercises on Wednesdays, 14:00-17:00, two rooms in parallel. Starting 19.10.16.

Rooms: 

  • Oct 19: 00.08.055 and 01.11.018
  • Oct 26: 00.08.055 and 02.13.010
  • Nov 2: 02.13.010; first lecture then exercise (holiday on Nov 1)
  • Nov 9 till Jan 11: 00.08.055 and 02.13.010
  • Jan 18 and Jan 25: 00.08.055 and 00.11.038
  • Feb 1 and Feb 8: 00.08.055 and 02.13.010

The main exercise room is always the second of the two rooms listed. We are sorry for the inconvenience.

Lectures and exercises are held at the TUM-Inf, Boltzmannstr. 3, 85748 Garching

Description

This module teaches methodologies and good practice of data science using R. It is aimed at students in bioinformatics and at master's students of Data Engineering and Analytics and of Biomedical Computing.

The lecture is structured into three main parts, covering the major steps of data analysis:

1. Get the data: how to fetch and manipulate real-world datasets, and how to structure them ("tidy data") so that they are convenient to work with.

2. Look at the data: basic and advanced visualization techniques (grammar of graphics, unsupervised learning) allow students to navigate large and complex datasets, identify interesting signals, and formulate hypotheses.

3. Conclude: concepts of statistical testing allow students to draw conclusions about the hypotheses they have raised. Methods from supervised learning allow them to model data and build accurate predictors.
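As a rough illustration of these three steps, here is a minimal sketch in base R. It is not course material: the table `wide`, its columns, and all values are made up for the example, and base R is used only to keep the snippet free of package dependencies.

```r
# Step 1: get the data. A small made-up "wide" table with one column
# per condition (hypothetical values, for illustration only).
wide <- data.frame(
  sample  = 1:5,
  control = c(4.1, 3.8, 4.4, 4.0, 3.9),
  treated = c(5.2, 4.9, 5.6, 5.1, 5.3)
)

# Reshape to tidy (long) form: one row per observation,
# one column per variable.
tidy <- data.frame(
  sample    = rep(wide$sample, times = 2),
  condition = rep(c("control", "treated"), each = nrow(wide)),
  value     = c(wide$control, wide$treated)
)

# Step 2: look at the data with a simple plot.
boxplot(value ~ condition, data = tidy, ylab = "value")

# Step 3: conclude with a classical test (Welch two-sample t-test).
result <- t.test(value ~ condition, data = tidy)
print(result$p.value)
```

The tidy layout is what makes steps 2 and 3 short: both the plot and the test take the same `value ~ condition` formula.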

Each week, the lecture is accompanied by hands-on exercises, done live. During the exercise class, students combine the concepts seen in the lecture to perform more involved data analysis tasks. Students generate reports that embed code and analysis. Two more advanced case studies complement the course. Many examples stem from applications in genomics, but no prior knowledge of this domain is necessary.

Required background

Programming experience in any language. The theoretical aspects of data analysis are kept to a minimum in this module. See our companion module "Statistical modelling and machine learning" to complement it.

Recommended reading

R for Data Science, by Garrett Grolemund and Hadley Wickham

Computer

Bring a laptop with RStudio installed. RStudio is a free programming interface for the R language.

Topics

R programming basics, report generation with R markdown

Importing, cleaning and organizing data (tidy data)

Basic plotting

Grammar of graphics

Unsupervised learning (hierarchical clustering, k-means, PCA)

Drawing robust interpretations (empirical testing by sampling, classical statistical tests)

Supervised learning (regression, classification, cross-validation)

Evaluation

The final exam is a two-hour written exam. The final mark is the exam mark plus bonus points for the homework and case studies.

Teaching team

This lecture is given by a team of scientists with long experience in high-dimensional data analysis in the field of genomics:

Prof. Julien Gagneur and members of his lab.

Dr. Matthias Heinig, Group leader, Institute of Computational Biology, Helmholtz center 

Dr. Jan Krumsiek, Group leader, Institute of Computational Biology, Helmholtz center