LHS 610: Exploratory Data Analysis for Health

This work is licensed under a Creative Commons Attribution 4.0 International License.

Welcome to the course page for LHS 610: Exploratory Data Analysis for Health.

Real health data is complex, often unstructured, and at times inaccurate or inconsistent; it contains missing values and is organized for clinical care rather than for analysis. Learning from health data requires a solid grasp of data operations, data visualization, statistics, and machine learning, as well as an understanding of the ethical and legal frameworks guiding health data privacy and security. Students in this course will learn foundational topics in data science focused on health data and will apply this knowledge to real health datasets through hands-on labs integrated into the lectures. The course is built around two broad themes: (a) understanding health data, and (b) making inferences based on data. Students will develop a systematic working understanding of R, one of the most widely used languages for data science, and an introductory understanding of several packages useful in analyzing health data. They will participate in a group project focused on answering a health-related question. After completing this course, students should be able to securely store a health dataset, summarize its structure, merge tables, visualize relationships, reshape and subset it to meet analytic needs, deal with missing values, apply statistical and machine learning methods to build prediction models, and evaluate the performance of these models.

Module 1: Introduction to LHS 610

Module 1 Slides (pdf)

What is exploratory data analysis?

Is LHS 610 the right course for you?

Primary versus secondary use of health data

Basics of R and RStudio

Anatomy of an R Notebook

Module 2: Data Frames

Module 2 Slides (pdf)

Overview of Content

Refresher of Introductory Content and R Tips of the Day

Data Structures and Types in R

Reading Data into R

Introducing Pipes

The Verbs of Data Science

Grouping and Combining Verbs

Module 3: Tidy Data

Overview of Content

Refresher of Data Frame Verbs

R Tips of the Day - Dealing with Dates

Combining mutate() with if_else() and case_when()

Joining Data Frames

Reshaping Data with spread() and gather()

Separating and Uniting Columns

A Challenging Case of Tidying Office Hours Data

Reviewing the Old and New Verbs of Data Science

Module 4: Formulating a Health-Related Question

What Makes a Health-Related Question Important?

What Makes a Health-Related Question Answerable?

Dealing with Confounders

Bradford Hill's Criteria of Causation

Study Designs and Bias in Observational Studies

Module 5: Telling Stories with Plots in Health Data

Overview of Content

R Tips of the Day - Saving Your Workspace and the Plus Sign

Mini-Lab - Anscombe's Quartet

Principles of Visualization

Tell the Right Story

Graphics with Grammar

ggplot - Geometric Objects and Mappings

ggplot - Position, Labels, and Facets

ggplot - Coordinates, Scales, and Themes

Exploring the Relationship Between Weight and Blood Pressure

Exploring Confounders with Plots

Module 6: Hypothesis Testing

Overview of Content

R Tips of the Day - Replacing and Assigning Missing Values

What is Hypothesis Testing?

Why a Null Hypothesis? And Interpreting a P-Value

What Common Statistical Tests Should I Know?

Which Test, Which Plot?

Multiple Hypothesis Testing

Key Lessons

Module 7: Interactive Data Analysis

Overview of Content

R Tips of the Day - The Magic of !!parse_expr()

Converting R Notebooks into R Markdown Documents

Live Coding - Converting an R Notebook into an R Markdown Document

Converting R Markdown Documents into Interactive Shiny Documents

Live Coding - Converting an R Markdown Document into an Interactive Document

Module 8: Introduction to Machine Learning

Unsupervised and Supervised Learning

Reinforcement Learning and ML vs Stats Terminology

Is a Predictive Model Needed and Should You Develop One?

Supervised Learning is a Curve-Fitting Exercise

Common Problems with Fitting and Applying Models

Step-by-Step Process for Training and Evaluating Models Using Tidymodels

Module 9: Supervised Learning Algorithms

Slides and content in this Module were developed by V.G. Vinod Vydiswaran, PhD. They are shared here with his permission.

What is Supervised Learning?

The Majority Baseline

Evaluation Measures for Classification Models

Decision Trees

Lazy Learners (K-Nearest Neighbors)

Support Vector Machines

Perceptrons and Neural Networks

Naive Bayes

Review of Supervised Learning Algorithms

Combining Classifiers

Module 10: Machine Learning and Missing Data

Review of Key ML Concepts and Performance Measures

The Missing Data Problem in a Nutshell

Why are the Data Missing?

Diagnosing Missing Values

Module 11: Analyzing Health Text Data

Reading in Text Data and Calculating Term Frequencies

Why Common Words are Not Useful (and How tf-idf Can Help)