Tidyverse with GitHub Copilot for Healthcare Analytics – Part 1

[This article was first published on R Works, and kindly contributed to R-bloggers.]

The role of healthcare analytics

The primary objective of healthcare analytics is to benefit health administrators, organizations, and patients, most importantly by enhancing patient experiences and improving health outcomes. The healthcare industry generates vast amounts of data, primarily from electronic medical records (EMR) and administrative data.

EMR data aims to improve the accuracy of diagnoses and treatment plans while enhancing overall care quality by making patients’ health histories readily accessible to authorized providers. In contrast, administrative data encompasses patient interaction details, such as diagnoses and hospital readmissions, which can be analyzed to evaluate healthcare delivery, its advantages and disadvantages, and its cost-effectiveness.

Using GitHub Copilot in RStudio with a health dataset

In this first post in a series of two, I will introduce a complex healthcare data set and outline the problems I propose to solve using tidyverse and GitHub Copilot, which will facilitate nuanced argument tuning in tidyverse functions and accelerate our data analysis process.

This series is designed to help you refine and improve your data analysis skills and understand the role of AI in healthcare analytics, while demonstrating the practical application of R and Copilot in real-world scenarios.

Before we proceed, a few words of caution regarding Copilot:

  1. We will use it on a task-by-task basis rather than for the entire workflow of data exploration and hypothesis investigation. This approach allows us to leverage Copilot for challenging data wrangling and cleaning tasks, saving time and enabling deeper analytical thinking. However, it is crucial to verify the code output from any AI pair programmer—always run the code to ensure it produces the desired results and manually triangulate to confirm accuracy.

  2. On the important topic of data sensitivity and the use of Copilot, take a look at some useful discussions on Posit forums – here and here. In general, it is a good idea to consult your IT teams and follow any organizational guidelines around using AI tools, especially when it concerns sensitive data.

Diabetes data from UCI

Understanding the data

I first encountered the UC Irvine (UCI) diabetes data a few years ago while exploring the types of healthcare data sets freely available to users. The UCI Machine Learning Repository offers a rich set of options; the diabetes data captivated me because of its size and the volume of information it gathers. Each row is an inpatient hospital diabetic encounter, i.e., one in which diabetes was entered as a diagnosis, lab tests were performed, and medications were administered. The original paper that used this data can be found here, and the data itself is here.

Let’s look at the data with an example of a diabetic encounter with a patient admitted for heart failure.

Show the code for reading in data
# --------------------------------------------- #
# Load necessary libraries
# --------------------------------------------- #
library(dplyr)
library(ggplot2)
library(tidyr)
library(kableExtra)
# --------------------------------------------- #
# Read in the data, show example row
# --------------------------------------------- #
D <- read.csv("diabetes_data.csv", sep = ",")

row_ex <- t(D[47, c(-1, -2, -3, -4)])
row_ex <- data.frame("Fields" = rownames(row_ex), "Values" = row_ex[, 1], row.names = NULL)

knitr::kable(
  row_ex,
  caption = "An example data row {#tbl-ex}",
  booktabs = TRUE
) %>%
  kable_styling(full_width = FALSE) %>%
  row_spec(c(44:46), background = "lightgray")
Table 1: An example data row
Fields Values
age [70-80)
weight ?
admission_type_id 3
discharge_disposition_id 5
admission_source_id 4
time_in_hospital 9
payer_code ?
medical_specialty InternalMedicine
num_lab_procedures 25
num_procedures 3
num_medications 16
number_outpatient 0
number_emergency 0
number_inpatient 2
diag_1 428
diag_2 427
diag_3 250.01
number_diagnoses 7
max_glu_serum None
A1Cresult None
metformin No
repaglinide No
nateglinide No
chlorpropamide No
glimepiride No
acetohexamide No
glipizide No
glyburide Steady
tolbutamide No
pioglitazone No
rosiglitazone No
acarbose No
miglitol No
troglitazone No
tolazamide No
examide No
citoglipton No
insulin Down
glyburide.metformin No
glipizide.metformin No
glimepiride.pioglitazone No
metformin.rosiglitazone No
metformin.pioglitazone No
change Ch
diabetesMed Yes
readmitted

Look at the richness of the encounter data. This row describes a patient aged 70 to 80 who was admitted as an elective patient, transferred from another hospital, spent 9 days in the hospital, had 25 lab procedures, and was given 16 medications. There was a primary diagnosis of a circulatory disease (specifically, heart failure) and an additional secondary diabetes diagnosis. The patient was given diabetes medications (one of which was insulin), which were adjusted/changed, and no A1c test was administered. They were discharged to another inpatient care institution and were subsequently readmitted. The 3 primary response variables that we will examine later (change, diabetesMed, and readmitted, the last three rows of Table 1) are colored in gray.

Data transformation with some help from Copilot

There are more than 100,000 rows of such encounters! To be able to use this data, we need to get to the bottom of its fields and identify which ones will tell us the story that will have implications for patient management and improved health outcomes. Let’s consider some examples.

Hospital admission and diagnoses fields like admission_type_id, discharge_disposition_id, admission_source_id, and diag_1 are defined using codes that are mapped to values in a separate data file provided with the data, which we will use to transform these fields into a usable format. We will explore how Copilot can assist with this transformation.
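As a point of comparison for the transformation that follows, a code-to-label mapping can also be applied as a join against a lookup table. The sketch below is only an illustration under assumptions: the lookup table and the toy encounters are typed in by hand (hypothetical values covering three codes), whereas in practice the lookup would be read from the mapping file supplied with the data.

```r
library(dplyr)

# Hypothetical lookup table, typed in by hand for illustration;
# only three admission type codes are shown
admission_type_lookup <- data.frame(
  admission_type_id = c(1L, 2L, 3L),
  admission_type    = c("Emergency", "Urgent", "Elective"),
  stringsAsFactors  = FALSE
)

# Toy encounters (made-up values, not rows from the diabetes data)
encounters <- data.frame(
  encounter         = c("a", "b", "c", "d"),
  admission_type_id = c(3L, 1L, 1L, 2L),
  stringsAsFactors  = FALSE
)

# left_join() keeps every encounter and attaches the matching label;
# a code with no entry in the lookup would get NA
encounters_labelled <- encounters %>%
  left_join(admission_type_lookup, by = "admission_type_id")

encounters_labelled
```

A join keeps the mapping in data rather than in code, which can be convenient when the list of codes is long; for a handful of codes, writing the conditions out explicitly is just as readable.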

Setting up Copilot in RStudio

Adding Copilot as a pair programmer in RStudio is simple, and the interface is seamless. Take a look at the documentation here.

When using Copilot directly in RStudio, it is essential to provide sufficient detail in your prompts. We will begin with admission_type_id:

Notes on Copilot
  • Take a look at the structure of the output above: the first part is my written prompt, while the ghost text is the output that Copilot generates automatically when you hit ENTER after the prompt. To accept the suggested code, all you do is hit TAB. All subsequent Copilot outputs will have this structure.
  • It’s good to see we didn’t even have to specify the use of dplyr for the transformation! Note that without Copilot, my original strategy was to use ifelse(), which makes for a slightly more unwieldy code chunk; I much prefer the use of case_when().
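To make the comparison between the two styles concrete, here is a minimal sketch on a toy vector. The codes and labels are illustrative, and 99 is a made-up code standing in for a value with no mapping.

```r
library(dplyr)

# Toy codes for illustration; 99 stands in for a code with no mapping
admission_type_id <- c(1, 2, 3, 99)

# Nested ifelse(): each extra category adds another level of nesting
with_ifelse <- ifelse(admission_type_id == 1, "Emergency",
  ifelse(admission_type_id == 2, "Urgent",
    ifelse(admission_type_id == 3, "Elective", NA_character_)
  )
)

# case_when(): one line per condition; values matching no condition become NA
with_case_when <- case_when(
  admission_type_id == 1 ~ "Emergency",
  admission_type_id == 2 ~ "Urgent",
  admission_type_id == 3 ~ "Elective"
)

# Both approaches give the same labels
identical(with_ifelse, with_case_when)
```

With three categories the difference is modest, but the nesting grows with every additional code, while case_when() stays flat.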

Now let’s handle the slightly more complicated diag_1, which defines primary diagnosis using ICD9 codes; see Table here. We will need to specify the mapping very clearly, as below.

These transformations will help us conduct a range of exploratory analyses and visualizations. For instance, primary diagnoses can significantly influence whether a patient receives the A1c test for diabetes. Let’s prepare the data to implement these in part 2.

Show the code for transformation
# --------------------------------------------- #
# Transformations --> admission_type_id, diag_1, A1c, readmitted
# --------------------------------------------- #

D <- D %>%
  mutate(
    admission_type = case_when(
      admission_type_id == 1 ~ "Emergency",
      admission_type_id == 2 ~ "Urgent",
      admission_type_id == 3 ~ "Elective",
      admission_type_id == 4 ~ "Newborn",
      admission_type_id == 5 ~ "Not Available",
      admission_type_id == 6 ~ "NULL",
      admission_type_id == 7 ~ "Trauma Center",
      admission_type_id == 8 ~ "Not Mapped"
    ), primary_diag = case_when(
      diag_1 %in% c(390:459, 785) ~ "Circulatory",
      diag_1 %in% c(460:519, 786) ~ "Respiratory",
      diag_1 %in% c(520:579, 787) ~ "Digestive",
      diag_1 %in% c(580:629, 788) ~ "Genitourinary",
      diag_1 %in% c(630:679) ~ "Pregnancy",
      diag_1 %in% c(680:709, 782) ~ "Skin",
      diag_1 %in% c(710:739) ~ "Musculoskeletal",
      diag_1 %in% c(740:759) ~ "Congenital",
      diag_1 %in% c(800:999) ~ "Injury",
      grepl("^250", diag_1) ~ "Diabetes",
      is.na(diag_1) | diag_1 == "?" ~ "Missing", # missing diagnoses are coded as "?" in this data
      TRUE ~ "Other"
    ), a1c = ifelse(A1Cresult == "None", "not measured", "measured"),
    reAdmit = ifelse(readmitted == "<30", "early readmit", "no/late readmit")
  )
Show the code for testing Copilot output
# --------------------------------------------- #
# Testing the Copilot output for diagnoses
# --------------------------------------------- #

testDiag <- D %>%
  filter(primary_diag == "Respiratory") %>%
  group_by(diag_1, primary_diag) %>%
  summarise(n = n(), .groups = "drop")
result <- testDiag %>%
  group_by(primary_diag) %>%
  summarise(diag_1 = paste(diag_1, collapse = ", ")) %>%
  ungroup()

knitr::kable(
  # list(testDiag[1:21,], testDiag[22:42,]),
  result,
  caption = "Testing Copilot output - ICD9 codes for a respiratory primary diagnosis {#tbl-val}",
  booktabs = TRUE
)
Table 2: Testing Copilot output - ICD9 codes for a respiratory primary diagnosis
primary_diag diag_1
Respiratory 461, 462, 463, 464, 465, 466, 470, 471, 473, 474, 475, 477, 478, 480, 481, 482, 483, 485, 486, 487, 490, 491, 492, 493, 494, 495, 496, 500, 501, 506, 507, 508, 510, 511, 512, 513, 514, 515, 516, 518, 519, 786

We see that the ICD9 values for a respiratory diagnosis check out in the output. Also note there is no code 460 (a common cold diagnosis).
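The same check can also be written as an assertion, so that it fails loudly if a code ever lands in the wrong category. This is a sketch on a small hand-made data frame (the values are illustrative stand-ins for the transformed data); the names toy, allowed_respiratory, observed_respiratory and unexpected are hypothetical.

```r
library(dplyr)

# Toy stand-in for the transformed data; values are illustrative only
toy <- data.frame(
  diag_1       = c("486", "786", "428", "491", "250.01"),
  primary_diag = c("Respiratory", "Respiratory", "Circulatory",
                   "Respiratory", "Diabetes"),
  stringsAsFactors = FALSE
)

# Codes the mapping allows for a respiratory primary diagnosis
allowed_respiratory <- as.character(c(460:519, 786))

# Distinct codes that were actually labelled "Respiratory"
observed_respiratory <- toy %>%
  filter(primary_diag == "Respiratory") %>%
  distinct(diag_1) %>%
  pull(diag_1)

# Any code labelled "Respiratory" that falls outside the allowed set
unexpected <- setdiff(observed_respiratory, allowed_respiratory)

# Stops with an error if the labelling disagrees with the mapping
stopifnot(length(unexpected) == 0)
```

Repeating this for each category gives a quick regression check whenever the transformation code is edited or regenerated.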

Missing values

Before proceeding to part 2, we must concede that the complexity of this data is compounded by missing values, which can further confound us by how certain fields are defined. For example:

  1. Is a value reported as “NULL” or “Not Available,” or is it simply missing? We can use a mosaic plot to explore missing values for admission_type against age for more nuanced insights.

  2. Does the absence of data in certain fields impact our analysis? For example, while diag_2 and diag_3 may be missing, having a value for diag_1 (the primary diagnosis) may mitigate their impact on our inquiries. We will ask Copilot to help us investigate these three diagnoses and their missing values.
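One way to look at the second question row by row is to count how often each combination of missing diagnoses occurs. The sketch below uses a toy data frame with made-up values and assumes, as in this data, that a missing diagnosis is recorded as "?"; the name missing_patterns is hypothetical.

```r
library(dplyr)

# Toy encounters; "?" marks a missing diagnosis
toy <- data.frame(
  diag_1 = c("428", "250.01", "486", "?", "715"),
  diag_2 = c("427", "?", "?", "?", "401"),
  diag_3 = c("250.01", "?", "401", "?", "250"),
  stringsAsFactors = FALSE
)

# Flag missing entries per field, then count each combination of flags
missing_patterns <- toy %>%
  mutate(across(everything(), ~ .x == "?", .names = "{.col}_missing")) %>%
  count(diag_1_missing, diag_2_missing, diag_3_missing, name = "encounters")

missing_patterns
```

Each row of the result is one missingness pattern, so it shows directly how many encounters lack a secondary diagnosis while still having a primary one.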

Let’s implement the above, using Copilot’s recommendations for the second question.

Show the code for plot 1
# --------------------------------------------- #
# 1) Missing value vs NULL for admission type
# --------------------------------------------- #
p <- D %>%
  group_by(admission_type, age) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n))

ggplot(p, aes(x = age, y = admission_type)) +
  geom_tile(aes(fill = n)) +
  scale_fill_gradient(low = "white", high = "blue", labels = function(n) scales::comma(abs(n))) +
  labs(title = "Plot 1: Admission type vs Age - examining missing values") +
  theme(plot.title = element_text(size = 14, color = "black", hjust = 0.5)) +
  geom_text(aes(label = scales::comma(n)), size = 3, color = "gray")

Show the code for plot 2
# --------------------------------------------- #
# 2) Missing values for diagnoses
# --------------------------------------------- #

missing_diag <- D %>%
  select(diag_1, diag_2, diag_3) %>%
  summarise(across(everything(), ~ sum(. == "?"))) %>%
  pivot_longer(cols = everything(), names_to = "Diagnosis", values_to = "Missing") %>%
  mutate(Diagnosis = factor(Diagnosis, levels = c("diag_1", "diag_2", "diag_3")))

ggplot(missing_diag, aes(x = Diagnosis, y = Missing)) +
  geom_bar(stat = "identity", fill = "#56B4E9", color = "black") +
  labs(x = "", y = "Number of Missing Values") +
  theme_minimal() +
  labs(title = "Plot 2: Diagnoses - examining missing values") +
  theme(plot.title = element_text(size = 14, color = "black", hjust = 0.5)) +
  scale_y_continuous(labels = function(n) scales::comma(abs(n)))

Here are a few notes on these plots (see the code above each plot for more details):

  1. In plot 1, it’s useful to see the missing value pattern, via NULL, Not Available and Not Mapped values. Also, by plotting admission_type against age, we can see encounters where the data was the most dense, e.g., emergency admissions for the year age group.

  2. In plot 2, we see how diag_1 has close to no missing values, which is helpful since the primary diagnosis in the encounter will be an important factor in shaping patient care response variables.

Notes on Copilot:

For the missing value analysis of the diagnosis fields in plot 2, I appreciate the utility of pivot_longer() and across(everything()), two tidyverse functions I had not previously used, here recommended by Copilot. The synergy between our analytical vision and Copilot’s assistance significantly enhances our productivity. Note that I did modify the Copilot recommendation quite a bit for the visualization, adding labels = function(n) scales::comma(abs(n)) to neatly format numbers; it’s a function I have used repeatedly for many years for an easier review of graphics that display numbers.
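For readers who, like me, are new to these two functions, the following sketch shows what each step produces on a toy data frame. The values are made up for illustration, and the names toy, wide and long are hypothetical.

```r
library(dplyr)
library(tidyr)

# Toy data; "?" marks a missing diagnosis
toy <- data.frame(
  diag_1 = c("428", "486", "250.01"),
  diag_2 = c("427", "?", "?"),
  diag_3 = c("?", "?", "?"),
  stringsAsFactors = FALSE
)

# Step 1: across(everything(), ...) applies the same summary to each column,
# giving a single wide row with one count per field
wide <- toy %>%
  summarise(across(everything(), ~ sum(.x == "?")))

# Step 2: pivot_longer() reshapes that row into one row per field,
# a convenient layout for a ggplot2 bar chart
long <- wide %>%
  pivot_longer(cols = everything(), names_to = "Diagnosis", values_to = "Missing")

long
```

The long table has one row per diagnosis field, with the field name in Diagnosis and its count of "?" entries in Missing.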

Conclusion and a glimpse into part 2

Healthcare data holds immense potential to enhance patient outcomes, making the ability to navigate complex datasets an invaluable skill for the future. In this post, we introduced a diabetes patient encounter data set, demonstrating how to preprocess it for analysis and highlighting key response variables and covariates that will be explored in part 2. While we occasionally leveraged Copilot for assistance, precise prompts were essential for effective results.

Thus, in the upcoming part 2, we will delve into three critical response variables (see Table 1): early readmission, specifically investigating the factors influencing readmission within 30 days of discharge; diabetes medication prescriptions during hospital encounters; and any change in medications. Our analysis will prioritize correlation over causation, continuing to use ggplot2 and dplyr to extract meaningful insights from the data.

We will closely examine the following relationships:

  1. A1c test and early readmission

  2. A1c test and medication prescription

  3. A1c test and medication changes

Here, the A1c test is defined as a binary indicator of whether this vital assessment was conducted. This analysis is pivotal for understanding patient management and hospital readmission, ultimately contributing to improved patient care outcomes. Stay tuned for more!

Some more tips on Copilot in RStudio

  1. After Copilot is running in your RStudio, take another look at Posit’s guide, especially where they describe the most effective way to use it while working within RStudio. For example:

Code suggestions are typically most useful when applied to a well-scoped and specific problem. When trying to solve larger problems or write longer functions, it is best to break the problem down into smaller pieces and use Copilot and your own expertise to generate code for each chunk. Similar to how a chef might use a recipe to cook each dish that makes up a larger meal, Copilot can be used to generate code for smaller pieces of a larger problem.

  2. When starting to use Copilot for the first time within RStudio, try asking it simpler questions as prompts, to get used to its autocompletions, e.g.,
  • # summarize the data
  • # plot glucose vs bmi in a scatterplot
  3. You can also begin writing a piece of code and allow Copilot to finish it via autocomplete.
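When trying out the comment-style prompts listed above, it helps to have a hand-written version to compare Copilot’s suggestion against. The sketch below is one such baseline, not Copilot output; the data frame patients is hypothetical, with column names taken from the example prompt and made-up values.

```r
library(ggplot2)

# Hypothetical data frame; the column names glucose and bmi come from
# the example prompt, and the values are made up for illustration
patients <- data.frame(
  glucose = c(85, 110, 140, 95, 160),
  bmi     = c(22.1, 27.4, 31.0, 24.3, 33.8)
)

# summarize the data
data_summary <- summary(patients)
data_summary

# plot glucose vs bmi in a scatterplot
scatter <- ggplot(patients, aes(x = bmi, y = glucose)) +
  geom_point() +
  labs(x = "BMI", y = "Glucose")

scatter
```

If Copilot’s suggestion differs from the baseline, running both on a small data frame like this one is a quick way to see whether the difference matters.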

About the Author

Vidisha writes: “I am a statistician and data science professional. For practitioners like me, ‘tidyverse’ is a household word. I’ve used it for complex data wrangling, visualization, and advanced analytics work, especially where I needed clarity on how to leverage my data so that it allows for story-telling that informs business decisions, recasting my work into actionable insights. I recently began exploring GPT-4o mini and GitHub Copilot to help me speed up my workflows, especially within Healthcare Analytics. I am excited to share this post series to help readers understand the role of RStudio and AI in Healthcare Analytics!”
