LHS 610: Exploratory Data Analysis for Health
This work is licensed under a Creative Commons Attribution 4.0 International License.
Welcome to the course page for LHS 610: Exploratory Data Analysis for Health.
Real health data is complex and often unstructured. It can be inaccurate or inconsistent, it contains missing values, and it is organized for clinical care rather than to meet analytic needs. Learning from health data requires a solid grasp of data operations, data visualization, statistics, and machine learning, as well as an understanding of the ethical and legal frameworks guiding health data privacy and security. Students in this course will learn foundational topics in data science focused on health data and will apply this knowledge to real health datasets through hands-on labs integrated into the lectures. The course is organized around two broad themes: (a) understanding health data, and (b) making inferences based on data. Students will develop a systematic working understanding of R, one of the most widely used languages for data science, and an introductory understanding of several packages useful in analyzing health data. They will participate in a group project focused on answering a health-related question.
After completing this course, students should be able to securely store a health dataset, summarize its structure, merge tables, visualize relationships, reshape and subset the data to meet analytic needs, handle missing values, apply statistical and machine learning methods to build prediction models, and evaluate the performance of these models.
Course Materials
Tutorial
Tutorial page: http://rcode.run/lhs_610_tutorial
If the tutorial webpage is down, please let me know!
Datasets
namcs08.RData -- National Ambulatory Medical Care Survey (NAMCS)
Module 1: Introduction to LHS 610
Slides
Videos
1-1 What is exploratory data analysis? (28 mins)
1-2 Is LHS 610 the right course for you? (21 mins)
1-3 Primary versus secondary use of health data (18 mins)
1-4 Basics of R and RStudio (5 mins)
1-5 Anatomy of an R Notebook (2 mins)
Module 3: Tidy Data
Slides
Videos
3-1 Overview of Content (1 min)
3-2 Refresher of Data Frame Verbs (2 mins)
3-3 R Tips of the Day - Dealing with Dates (6 mins)
3-4 Combining mutate() with if_else() and case_when() (6 mins)
3-5 Joining Data Frames (15 mins)
3-6 Reshaping Data with spread() and gather() (21 mins)
3-7 Separating and Uniting Columns (6 mins)
3-8 A Challenging Case of Tidying Office Hours Data (13 mins)
3-9 Reviewing the Old and New Verbs of Data Science (3 mins)
Module 4: Formulating a Health-Related Question
Slides
Videos
4-1 What Makes a Health-Related Question Important? (10 mins)
4-2 What Makes a Health-Related Question Answerable? (15 mins)
4-3 Dealing with Confounders (8 mins)
4-4 Bradford Hill's Criteria of Causation (12 mins)
4-5 Study Designs and Bias in Observational Studies (4 mins)
Module 5: Telling Stories with Plots in Health Data
Slides
Videos
5-2 R Tips of the Day - Saving Your Workspace and the Plus Sign (7 mins)
5-3 Mini-Lab - Anscombe's Quartet (4 mins)
5-4 Principles of Visualization (14 mins)
5-5 Tell the Right Story (6 mins)
5-6 Graphics with Grammar (17 mins)
5-7 ggplot - Geometric Objects and Mappings (13 mins)
5-8 ggplot - Position, Labels, and Facets (12 mins)
5-9 ggplot - Coordinates, Scales, and Themes (14 mins)
5-10 Exploring the Relationship Between Weight and Blood Pressure (9 mins)
Module 6: Hypothesis Testing
Slides
Videos
6-1 Overview of Content (2 mins)
6-2 R Tips of the Day - Replacing and Assigning Missing Values (22 mins)
6-3 What is Hypothesis Testing? (10 mins)
6-4 Why a Null Hypothesis? And Interpreting a P-Value (10 mins)
6-5 What Common Statistical Tests Should I Know? (6 mins)
6-6 Which Test, Which Plot? (31 mins)
6-7 Multiple Hypothesis Testing (5 mins)
Module 7: Interactive Data Analysis
Slides
Videos
7-1 Overview of Content (4 mins)
7-2 R Tips of the Day - The magic of !!parse_expr() (9 mins)
7-3 Converting R Notebooks into R Markdown Documents (26 mins)
7-4 Live Coding - Converting an R Notebook into an R Markdown Document (11 mins)
7-5 Converting R Markdown Documents into Interactive Shiny Documents (36 mins)
7-6 Live Coding - Converting an R Markdown Document into an Interactive Document (20 mins)
Module 8: Introduction to Machine Learning
Slides
Videos
8-1 Unsupervised and Supervised Learning (13 mins)
8-2 Reinforcement Learning and ML vs Stats Terminology (9 mins)
8-3 Is a Predictive Model Needed and Should You Develop One? (15 mins)
8-4 Supervised Learning is a Curve-Fitting Exercise (23 mins)
8-5 Common Problems with Fitting and Applying Models (25 mins)
8-6 Step-by-Step Process for Training and Evaluating Models Using Tidymodels (29 mins)
Module 9: Supervised Learning Algorithms 1
Slides and content in this Module were developed by V.G. Vinod Vydiswaran, PhD. They are shared here with his permission.
Slides
Videos
9-1 What is Supervised Learning? (20 mins)
9-2 The Majority Baseline (8 mins)
Module 10: Machine Learning and Missing Data
Slides
Videos
10-1 Review of Key ML Concepts and Performance Measures (15 mins)
10-2 The Missing Data Problem in a Nutshell (5 mins)
Module 11: Supervised Learning Algorithms 2
Slides and content in this Module were developed by V.G. Vinod Vydiswaran, PhD. They are shared here with his permission.
Slides
Videos
11-1 Support Vector Machines (22 mins)
11-2 Perceptrons and Neural Networks (14 mins)
Module 12: Machine Learning in Clinical Practice
Slides
None
Videos
None
Papers
Coming soon.
Module 13: Analyzing Health Text Data
Slides
Videos
13-1 Reading in Text Data and Calculating Term Frequencies (27 mins)
13-2 Why Common Words are Not Useful (and How tf-idf Can Help) (25 mins)