LHS 610: Exploratory Data Analysis for Health

This work is licensed under a Creative Commons Attribution 4.0 International License.

Welcome to the course page for LHS 610: Exploratory Data Analysis for Health.

Real health data is complex, often unstructured, and at times inaccurate or inconsistent. It contains missing values and is organized for clinical care rather than for analysis. Learning from health data requires a solid grasp of data operations, data visualization, statistics, and machine learning, as well as an understanding of the ethical and legal frameworks guiding health data privacy and security. Students in this course will learn foundational topics in data science focused on health data and will apply this knowledge to real health datasets through hands-on labs integrated into the lectures. The course is built around two large themes: (a) understanding health data, and (b) making inferences based on data. Students will develop a systematic working understanding of R, one of the most widely used languages for data science, and an introductory understanding of several packages useful in analyzing health data. They will participate in a group project focused on answering a health-related question. After completing this course, students should be able to securely store a health dataset, summarize its structure, merge tables, visualize relationships, reshape and subset it to meet analytic needs, deal with missing values, apply statistical and machine learning methods to build prediction models, and evaluate the performance of these models.

Course Materials

Module 1: Introduction to LHS 610

Module 2: Data Frames

Module 3: Tidy Data

Module 4: Formulating a Health-Related Question

Module 5: Telling Stories with Plots in Health Data

Module 6: Hypothesis Testing

Module 7: Interactive Data Analysis

Module 8: Introduction to Machine Learning

Module 9: Supervised Learning Algorithms 1

Module 10: Machine Learning and Missing Data

Module 11: Supervised Learning Algorithms 2

Module 12: Machine Learning in Clinical Practice

Module 13: Analyzing Health Text Data

Tutorial

Tutorial page: http://rcode.run/lhs_610_tutorial

If the tutorial webpage is down, please let me know!

Datasets

namcs08.RData -- National Ambulatory Medical Care Survey (NAMCS)

Module 1: Introduction to LHS 610

Slides

Module 1 Slides

Videos

1-1 What is exploratory data analysis? (28 mins)

1-2 Is LHS 610 the right course for you? (21 mins)

1-3 Primary versus secondary use of health data (18 mins)

1-4 Basics of R and RStudio (5 mins)

1-5 Anatomy of an R Notebook (2 mins)

Note: A new video introducing students to the key functions listed on the last slide above still needs to be recorded.

Module 2: Data Frames

Slides

Module 2 Slides

Videos

2-1 Overview of Content (4 mins)

2-2 Refresher of Introductory Content and R Tips of the Day (9 mins)

2-3 Data Structures and Types in R (8 mins)

2-4 Reading Data into R (5 mins)

2-5 Introducing Pipes (10 mins)

2-6 The Verbs of Data Science (25 mins)

2-7 Grouping and Combining Verbs (15 mins)

Module 7: Interactive Data Analysis

Slides

Module 7 Slides

Videos

7-1 Overview of Content (4 mins)

7-2 R Tips of the Day - The magic of !!parse_expr() (9 mins)

7-3 Converting R Notebooks into R Markdown Documents (26 mins)

7-4 Live Coding - Converting an R Notebook into an R Markdown Document (11 mins)

7-5 Converting R Markdown Documents into Interactive Shiny Documents (36 mins)

7-6 Live Coding - Converting an R Markdown Document into an Interactive Document (20 mins)

Module 8: Introduction to Machine Learning

Slides

Module 8 Slides

Videos

8-1 Unsupervised and Supervised Learning (13 mins)

8-2 Reinforcement Learning and ML vs Stats Terminology (9 mins)

8-3 Is a Predictive Model Needed and Should You Develop One? (15 mins)

8-4 Supervised Learning is a Curve-Fitting Exercise (23 mins)

8-5 Common Problems with Fitting and Applying Models (25 mins)

8-6 Step-by-Step Process for Training and Evaluating Models Using Tidymodels (29 mins)

Module 9: Supervised Learning Algorithms 1

Slides and content in this Module were developed by V.G. Vinod Vydiswaran, PhD. They are shared here with his permission.

Slides

Module 9 Slides

Videos

9-1 What is Supervised Learning? (20 mins)

9-2 The Majority Baseline (8 mins)

9-3 Evaluation Measures for Classification Models (16 mins)

9-4 Decision Trees (35 mins)

9-5 Lazy Learners (K-Nearest Neighbors) (11 mins)

Module 10: Machine Learning and Missing Data

Slides

Module 10 Slides

Videos

10-1 Review of Key ML Concepts and Performance Measures (15 mins)

10-2 The Missing Data Problem in a Nutshell (5 mins)

10-3 Why are the Data Missing? (8 mins)

10-4 Diagnosing Missing Values (18 mins)

Module 11: Supervised Learning Algorithms 2

Slides and content in this Module were developed by V.G. Vinod Vydiswaran, PhD. They are shared here with his permission.

Slides

Module 11 Slides (Part 1)

Module 11 Slides (Part 2)

Videos

11-1 Support Vector Machines (22 mins)

11-2 Perceptrons and Neural Networks (14 mins)

11-3 Naive Bayes (28 mins)

11-4 Review of Supervised Learning Algorithms (8 mins)

11-5 Combining Classifiers (21 mins)

Module 12: Machine Learning in Clinical Practice

Slides

None

Videos

None

Papers

Coming soon.

Module 13: Analyzing Health Text Data

Slides

Module 13 Slides

Videos

13-1 Reading in Text Data and Calculating Term Frequencies (27 mins)

13-2 Why Common Words are Not Useful (and How tf-idf Can Help) (25 mins)
