Tidyverse with GitHub Copilot for Healthcare Analytics – Part 1
The role of healthcare analytics
The primary objective of healthcare analytics is to seek benefits for health administrators, organizations, and patients, most importantly by enhancing patient experiences and improving health outcomes. The healthcare industry generates vast amounts of data, primarily from electronic medical records (EMR) and administrative data.
EMR data aims to improve the accuracy of diagnoses and treatment plans while enhancing overall care quality by making patients’ health histories readily accessible to authorized providers. In contrast, administrative data encompasses patient interaction details, such as diagnoses and hospital readmissions, which can be analyzed to evaluate healthcare delivery, advantages, disadvantages, and cost-effectiveness.
Using GitHub Copilot in RStudio with a health dataset
In this first post in a series of two, I will introduce a complex healthcare data set and outline the problems I propose to solve using `tidyverse` and GitHub Copilot, which will facilitate nuanced argument tuning in `tidyverse` functions and accelerate our data analysis process.
This series is designed to help you refine and improve your data analysis skills and understand the role of AI in healthcare analytics, while demonstrating the practical application of R and Copilot in real-world scenarios.
Before we proceed, a few words of caution regarding Copilot:
We will use it on a task-by-task basis rather than for the entire workflow of data exploration and hypothesis investigation. This approach allows us to leverage Copilot for challenging data wrangling and cleaning tasks, saving time and enabling deeper analytical thinking. However, it is crucial to verify the code output from any AI pair programmer—always run the code to ensure it produces the desired results and manually triangulate to confirm accuracy.
On the important topic of data sensitivity and the use of Copilot, take a look at some useful discussions on Posit forums – here and here. In general, it is a good idea to consult your IT teams and follow any organizational guidelines around using AI tools, especially when it concerns sensitive data.
Diabetes data from UCI
Understanding the data
I first encountered the UC Irvine (UCI) diabetes data a few years ago when I was exploring the types of healthcare data sets freely available for users. The UCI machine learning repository offers a rich set of options – the diabetes data captivated me due to its size and the volume of information it gathered, where each row is an inpatient hospital diabetic encounter, i.e., diabetes was entered as a diagnosis, where lab tests were performed and medications were administered. The original paper that used this data can be found here, and the data itself is here.
Let’s look at the data with an example of a diabetic encounter with a patient admitted for heart failure.
Code for reading in the data:

```r
# --------------------------------------------- #
# Load necessary libraries
# --------------------------------------------- #
library(dplyr)
library(ggplot2)
library(tidyr)
library(kableExtra)

# --------------------------------------------- #
# Read in the data, show example row
# --------------------------------------------- #
D <- read.csv("diabetes_data.csv", sep = ",")

# Take row 47, drop the first four columns, and transpose to a field/value layout
row_ex <- t(D[47, c(-1, -2, -3, -4)])
row_ex <- data.frame("Fields" = rownames(row_ex), "Values" = row_ex[, 1], row.names = NULL)

knitr::kable(
  row_ex,
  caption = "An example data row {#tbl-ex}",
  booktabs = TRUE
) %>%
  kable_styling(full_width = FALSE) %>%
  row_spec(c(44:46), background = "lightgray")
```
Fields | Values |
---|---|
age | [70-80) |
weight | ? |
admission_type_id | 3 |
discharge_disposition_id | 5 |
admission_source_id | 4 |
time_in_hospital | 9 |
payer_code | ? |
medical_specialty | InternalMedicine |
num_lab_procedures | 25 |
num_procedures | 3 |
num_medications | 16 |
number_outpatient | 0 |
number_emergency | 0 |
number_inpatient | 2 |
diag_1 | 428 |
diag_2 | 427 |
diag_3 | 250.01 |
number_diagnoses | 7 |
max_glu_serum | None |
A1Cresult | None |
metformin | No |
repaglinide | No |
nateglinide | No |
chlorpropamide | No |
glimepiride | No |
acetohexamide | No |
glipizide | No |
glyburide | Steady |
tolbutamide | No |
pioglitazone | No |
rosiglitazone | No |
acarbose | No |
miglitol | No |
troglitazone | No |
tolazamide | No |
examide | No |
citoglipton | No |
insulin | Down |
glyburide.metformin | No |
glipizide.metformin | No |
glimepiride.pioglitazone | No |
metformin.rosiglitazone | No |
metformin.pioglitazone | No |
change | Ch |
diabetesMed | Yes |
readmitted |
Look at the richness of the encounter data. This row describes a patient in the [70-80) year age group who was admitted as an elective patient, transferred from another hospital, spent 9 days in the hospital, had 25 lab procedures and was given 16 medications; there was a primary diagnosis of a circulatory disease (specifically, heart failure), and an additional secondary diabetes diagnosis; the patient was given diabetes medications (one of which was insulin), which were adjusted/changed; no A1c test was administered; they were discharged to another inpatient care institution, and were subsequently readmitted. The 3 primary response variables that we will examine later (`change`, `diabetesMed` and `readmitted`, the last three rows) are colored gray in the rendered table.
Data transformation with some help from Copilot
There are a great many rows of such encounters! To be able to use this data, we need to get to the bottom of its 50 fields and identify which ones will tell us the story that will have implications for patient management and improved health outcomes. Let’s consider some examples.
Hospital admission and diagnoses fields like `admission_type_id`, `discharge_disposition_id`, `admission_source_id`, and `diag_1` are defined using codes that are mapped to values in a separate data file provided with the data, which we will use to transform these fields into a usable format. We will explore how Copilot can assist with this transformation.
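As an alternative to hard-coding each code, a mapping like this could also be applied as a lookup-table join. The sketch below is a minimal illustration on toy data: `encounters` and `admission_map` are hypothetical stand-ins (the actual name and layout of the mapping file are not shown in this post), and only a few of the admission type codes are included.

```r
library(dplyr)

# Toy stand-in for a few encounter rows (values are illustrative only)
encounters <- data.frame(
  encounter = 1:4,
  admission_type_id = c(1, 3, 3, 6)
)

# Hypothetical lookup table; in practice this would be read from the
# mapping file that accompanies the data
admission_map <- data.frame(
  admission_type_id = c(1, 2, 3, 6),
  admission_type = c("Emergency", "Urgent", "Elective", "NULL")
)

# left_join() keeps every encounter and attaches the label for its code;
# a code absent from the lookup table would get NA
encounters_labelled <- encounters %>%
  left_join(admission_map, by = "admission_type_id")

print(encounters_labelled)
```

A join keeps the mapping in one place as data, which may be easier to maintain than a long list of conditions when a field has many codes.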
Adding Copilot as a pair programmer to your RStudio is simple, and has a seamless interface. Take a look at the documentation here.
When using Copilot directly in RStudio, it is essential to provide sufficient detail in your prompts. We will begin with `admission_type_id`:
- Take a look at the structure of the output above – the first part is my written prompt, while the ghost text is the output that Copilot generates automatically when you hit ENTER after the prompt. For the output code itself to manifest, all you do is hit TAB. All subsequent Copilot outputs will be of this structure.
- It’s good to see we didn’t even have to specify the use of `dplyr` for the transformation! Note that without Copilot, my original strategy was to use `ifelse()`, which makes for a slightly more unwieldy code chunk; I much prefer the use of `case_when()`.
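To make the comparison concrete, here is a small sketch on a toy vector of codes (not the real data) showing the same mapping written both ways. The code 9 is deliberately left unmapped to show what each approach returns for unmatched values.

```r
library(dplyr)

# Toy codes; 9 is deliberately unmapped
ids <- c(1, 2, 3, 9)

# Nested ifelse(): each extra code adds another level of nesting
nested <- ifelse(ids == 1, "Emergency",
  ifelse(ids == 2, "Urgent",
    ifelse(ids == 3, "Elective", NA_character_)
  )
)

# case_when(): one line per code, evaluated top to bottom;
# values that match no condition become NA
flat <- case_when(
  ids == 1 ~ "Emergency",
  ids == 2 ~ "Urgent",
  ids == 3 ~ "Elective"
)

# Both approaches give the same labels
identical(nested, flat)
```

With eight admission types, the nested version would need seven levels of nesting, which is where `case_when()` becomes noticeably easier to read.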
Now let’s handle the slightly more complicated `diag_1`, which defines primary diagnosis using ICD9 codes; see Table here. We will need to specify the mapping very clearly, as below.
These transformations will help us conduct a range of exploratory analyses and visualizations. For instance, primary diagnoses can significantly influence whether a patient receives the A1c test for diabetes. Let’s prepare the data to implement these in part 2.
Code for the transformation:

```r
# --------------------------------------------- #
# Transformations --> admission_type_id, diag_1, A1c, readmitted
# --------------------------------------------- #
D <- D %>%
  mutate(
    # Map admission type codes to their labels
    admission_type = case_when(
      admission_type_id == 1 ~ "Emergency",
      admission_type_id == 2 ~ "Urgent",
      admission_type_id == 3 ~ "Elective",
      admission_type_id == 4 ~ "Newborn",
      admission_type_id == 5 ~ "Not Available",
      admission_type_id == 6 ~ "NULL",
      admission_type_id == 7 ~ "Trauma Center",
      admission_type_id == 8 ~ "Not Mapped"
    ),
    # Group ICD9 codes of the primary diagnosis into broad categories
    primary_diag = case_when(
      diag_1 %in% c(390:459, 785) ~ "Circulatory",
      diag_1 %in% c(460:519, 786) ~ "Respiratory",
      diag_1 %in% c(520:579, 787) ~ "Digestive",
      diag_1 %in% c(580:629, 788) ~ "Genitourinary",
      diag_1 %in% c(630:679) ~ "Pregnancy",
      diag_1 %in% c(680:709, 782) ~ "Skin",
      diag_1 %in% c(710:739) ~ "Musculoskeletal",
      diag_1 %in% c(740:759) ~ "Congenital",
      diag_1 %in% c(800:999) ~ "Injury",
      grepl("^250", diag_1) ~ "Diabetes",
      # Missing diagnoses are stored as "?" in the raw file, so check for both
      is.na(diag_1) | diag_1 == "?" ~ "Missing",
      TRUE ~ "Other"
    ),
    # Binary indicators for the A1c test and for early readmission
    a1c = ifelse(A1Cresult == "None", "not measured", "measured"),
    reAdmit = ifelse(readmitted == "<30", "early readmit", "no/late readmit")
  )
```
Code for testing the Copilot output:

```r
# --------------------------------------------- #
# Testing the Copilot output for diagnoses
# --------------------------------------------- #
testDiag <- D %>%
  subset(primary_diag == "Respiratory") %>%
  group_by(diag_1, primary_diag) %>%
  summarise(n = n())

# Collapse all ICD9 codes in the category into one comma-separated string
result <- testDiag %>%
  group_by(primary_diag) %>%
  summarise(diag_1 = paste(diag_1, collapse = ", ")) %>%
  ungroup()

knitr::kable(
  # list(testDiag[1:21,], testDiag[22:42,]),
  result,
  caption = "Testing Copilot output - ICD9 codes for a respiratory primary diagnosis {#tbl-val}",
  booktabs = TRUE
)
```
primary_diag | diag_1 |
---|---|
Respiratory | 461, 462, 463, 464, 465, 466, 470, 471, 473, 474, 475, 477, 478, 480, 481, 482, 483, 485, 486, 487, 490, 491, 492, 493, 494, 495, 496, 500, 501, 506, 507, 508, 510, 511, 512, 513, 514, 515, 516, 518, 519, 786 |
We see that the ICD9 values for a respiratory diagnosis check out in the output. Also note there is no code 460 (a common cold diagnosis).
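The same check can also be made programmatically rather than by eye. The sketch below assumes the expected respiratory codes are 460 to 519 plus 786, following the mapping in the transformation code; the `observed` vector is a short toy excerpt of the codes in the table, not the full list.

```r
# Expected ICD9 codes for a respiratory primary diagnosis,
# following the mapping used in the transformation step
expected <- c(460:519, 786)

# Toy excerpt of the codes observed in the data (stored as text,
# because diag_1 also holds non-numeric codes)
observed <- c("461", "462", "486", "491", "518", "786")

# Every observed code should be one of the expected codes
all_valid <- all(observed %in% expected)

# Expected codes that never occur in this excerpt
absent <- setdiff(as.character(expected), observed)

all_valid
head(absent)
```

Run against the full list of observed codes, `absent` would show at a glance which expected codes never appear as a primary diagnosis.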
Missing values
Before proceeding to part 2, we must concede that the complexity of this data is compounded by missing values, which can further confound us by how certain fields are defined. For example:
- Is a value reported as “NULL” or “Not Available,” or is it simply missing? We can use a mosaic plot to explore missing values for `admission_type` against `age` for more nuanced insights.
- Does the absence of data in certain fields impact our analysis? For example, while `diag_2` and `diag_3` may be missing, having a value for `diag_1` (the primary diagnosis) may mitigate their impact on our inquiries. We will ask Copilot to help us investigate these three diagnoses and their missing values.
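One way to make missing values explicit before counting them is sketched below on toy data. It assumes, as in this data set, that a missing diagnosis is stored as the string "?"; the toy values themselves are made up for illustration.

```r
library(dplyr)

# Toy rows: "?" marks a missing diagnosis
toy <- data.frame(
  diag_1 = c("428", "250.01", "486", "?"),
  diag_2 = c("427", "?", "?", "401"),
  diag_3 = c("250.01", "?", "?", "?")
)

# Convert "?" to NA so that is.na() and related tools work as expected
toy_na <- toy %>%
  mutate(across(everything(), ~ na_if(.x, "?")))

# Share of missing values per diagnosis field
missing_share <- toy_na %>%
  summarise(across(everything(), ~ mean(is.na(.x))))

print(missing_share)
```

Alternatively, `read.csv()` accepts `na.strings = "?"` to do this conversion at read time, although any later code that compares values against "?" would then need to use `is.na()` instead.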
Let’s implement the above, using Copilot’s recommendations for the second question.
Code for plot 1:

```r
# --------------------------------------------- #
# 1) Missing value vs NULL for admission type
# --------------------------------------------- #
p <- D %>%
  group_by(admission_type, age) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n))

ggplot(p, aes(x = age, y = admission_type)) +
  geom_tile(aes(fill = n)) +
  scale_fill_gradient(
    low = "white", high = "blue",
    labels = function(n) scales::comma(abs(n))
  ) +
  labs(title = "Plot 1: Admission type vs Age - examining missing values") +
  theme(plot.title = element_text(size = 14, color = "black", hjust = 0.5)) +
  geom_text(aes(label = scales::comma(n)), size = 3, color = "gray")
```
Code for plot 2:

```r
# --------------------------------------------- #
# 2) Missing values for diagnoses
# --------------------------------------------- #
missing_diag <- D %>%
  select(diag_1, diag_2, diag_3) %>%
  # Missing diagnoses are recorded as "?"
  summarise(across(everything(), ~ sum(. == "?"))) %>%
  pivot_longer(cols = everything(), names_to = "Diagnosis", values_to = "Missing") %>%
  mutate(Diagnosis = factor(Diagnosis, levels = c("diag_1", "diag_2", "diag_3")))

ggplot(missing_diag, aes(x = Diagnosis, y = Missing)) +
  geom_bar(stat = "identity", fill = "#56B4E9", color = "black") +
  labs(x = "", y = "Number of Missing Values") +
  theme_minimal() +
  labs(title = "Plot 2: Diagnoses - examining missing values") +
  theme(plot.title = element_text(size = 14, color = "black", hjust = 0.5)) +
  scale_y_continuous(labels = function(n) scales::comma(abs(n)))
```
Here are a few notes on these plots (see the code above each plot for more details):
- In plot 1, it’s useful to see the missing value pattern via the `NULL`, `Not Available` and `Not Mapped` values. Also, by plotting `admission_type` against `age`, we can see where the encounter data is densest, e.g., emergency admissions in particular age groups.
- In plot 2, we see that `diag_1` has close to no missing values, which is helpful since the primary diagnosis in the encounter will be an important factor in shaping patient care response variables.
- For the missing value analysis of the diagnosis fields in plot 2, I appreciate the utility of `pivot_longer()` and `across(everything())`, two tidyverse functions I have not previously used, here recommended by Copilot. The synergy between our analytical vision and Copilot’s assistance significantly enhances our productivity. Note that I did modify the Copilot recommendation quite a bit for the visualization, adding `labels = function(n) scales::comma(abs(n))` to neatly format numbers; it’s a function I have used repeatedly for many years for an easier review of graphics that display numbers.
Conclusion and a glimpse into part 2
Healthcare data holds immense potential to enhance patient outcomes, making the ability to navigate complex datasets an invaluable skill for the future. In this post, we introduced a diabetes patient encounter data set, demonstrating how to preprocess it for analysis and highlighting key response variables and covariates that will be explored in part 2. While we occasionally leveraged Copilot for assistance, precise prompts were essential for effective results.
In the upcoming part 2, we will delve into three critical response variables (see Table 1): early readmission, specifically investigating the factors influencing readmission within 30 days of discharge; diabetes medication prescriptions during hospital encounters; and any change in medications. Our analysis will prioritize correlation over causation, continuing to use `ggplot2` and `dplyr` to extract meaningful insights from the data.
We will closely examine the following relationships:
- A1c test and early readmission
- A1c test and medication prescription
- A1c test and medication changes
Here, the A1c test is defined as a binary indicator of whether this vital assessment was conducted. This analysis is pivotal for understanding patient management and hospital readmission, ultimately contributing to improved patient care outcomes. Stay tuned for more!
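As a rough preview of how such a relationship might be tabulated, here is a minimal sketch on made-up toy rows. It reuses the `a1c` and `reAdmit` labels from the transformation step, but the counts are purely illustrative and say nothing about the real data; part 2 may well take a different approach.

```r
library(dplyr)

# Made-up toy encounters using the labels created in the transformation step
toy <- data.frame(
  a1c = c("measured", "measured", "not measured", "not measured", "not measured"),
  reAdmit = c(
    "no/late readmit", "early readmit", "early readmit",
    "no/late readmit", "no/late readmit"
  )
)

# Count encounters per combination, then convert to shares within each a1c group
readmit_by_a1c <- toy %>%
  count(a1c, reAdmit) %>%
  group_by(a1c) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

print(readmit_by_a1c)
```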
Some more tips on Copilot in RStudio
- After Copilot is running in your RStudio, take another look at Posit’s guide, especially where they describe the most effective way to use it while working within RStudio. For example:
Code suggestions are typically most useful when applied to a well-scoped and specific problem. When trying to solve larger problems or write longer functions, it is best to break the problem down into smaller pieces and use Copilot and your own expertise to generate code for each chunk. Similar to how a chef might use a recipe to cook each dish that makes up a larger meal, Copilot can be used to generate code for smaller pieces of a larger problem.
- When starting to use Copilot for the first time within RStudio, try asking it simpler questions as prompts, to get used to its autocompletions, e.g.,
# summarize the data
# plot glucose vs bmi in a scatterplot
- You can also begin writing a piece of code and allow Copilot to finish it via autocomplete.