> 80% of our teams’ time with real-world data is spent preparing data until it is an analysis-ready one-row-per-patient (ORPP) table containing results relevant to a specific use case.
>
> ~ Data science leader at a top 5 pharmaceutical company
Consider a user Katie who is tasked with characterizing a patient population from EHR data (we’ll call the dataset “Blue Lake EHR”). The patient population is one that satisfies a series of inclusion/exclusion criteria across a wide range of variables, including demographics, drugs, labs, genetics, and outcomes.
- For example, the patient population may be one where the patients satisfy a set of criteria spanning a variety of domains, from demographics to diagnoses, genetics, and treatment:
- At least 18 at diagnosis
- Has metastatic breast cancer diagnosis
- Has diabetes diagnosis
- Is HR+ at diagnosis
- Is HER2- at diagnosis
- Has taken Palbociclib in 1LOT
- Has taken Fulvestrant in 2LOT
What Katie needs in order to conduct her analysis using standard statistical software (such as R, Python, or SAS) is:
- A one-row-per-patient (ORPP) table containing key raw and derived variables
- An attrition chart showing how the total number of patients changed with the application of each inclusion/exclusion criterion.
- A data dictionary containing definitions of each column.
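To make the attrition chart concrete, here is a minimal sketch in Python/pandas. The patient counts are invented for illustration; the point is the shape of the deliverable: one row per criterion, with the number of patients remaining after each filter is applied in sequence.

```python
import pandas as pd

# Hypothetical patient counts, for illustration only: the number of
# patients remaining after each criterion is applied in sequence.
attrition = pd.DataFrame({
    "criterion": [
        "All patients in Blue Lake EHR",
        "Metastatic breast cancer diagnosis",
        "Age >= 18 at diagnosis",
        "HR+ at diagnosis",
        "HER2- at diagnosis",
        "Palbociclib in 1LOT",
        "Fulvestrant in 2LOT",
    ],
    "n_patients": [100000, 4200, 4150, 3100, 2800, 900, 310],
})

# Patients removed at each step (NaN for the starting population)
attrition["n_removed"] = (
    attrition["n_patients"].shift(1) - attrition["n_patients"]
)

print(attrition)
```

A table like this is exactly what reviewers expect to see in a CONSORT-style flow diagram, and (as discussed below) it is much better generated automatically than maintained by hand.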
If Katie is like the clients we’ve worked with, she’s in for a rude awakening when she finally gets access to the data.
Many users of real world data are in for a rude awakening when they see it for the first time.
Once Katie opens the data, she will quickly realize it doesn’t look nearly as ready for her use case as she expected.
In our experience, this is due to a conflict between Katie’s expectations and reality in four key domains: Structure, Patient-Level Summaries, Relevancy, and Documentation.
| RWD Expectation (hope?) | RWD Reality |
| --- | --- |
| Data are represented in a single table | Data are spread across multiple tables |
| Data are represented at the patient level (one-row-per-patient, “ORPP”) | Data are longitudinal and contain both ORPP and multiple-row-per-patient (“MRPP”) values |
| Only data relevant to the target use case are present | Most data are almost certainly outside the scope of any specific study |
| Documentation clearly explains how to use the data for my use case | Documentation explains the raw data well, but does not speak to how to put the pieces together for a specific use case |
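To illustrate the right-hand column, here is a toy sketch in Python/pandas of what an extract typically looks like (tables and values invented): a demographics table that really is ORPP, next to a longitudinal labs table where each patient appears once per lab result.

```python
import pandas as pd

# Toy extract, invented values. Demographics is ORPP (one row per
# patient); labs is longitudinal MRPP (one row per lab result).
# They arrive as separate tables that must be joined.
demographics = pd.DataFrame({
    "patient_id": ["P1", "P2"],
    "birth_year": [1960, 1975],
    "sex":        ["Female", "Female"],
})

labs = pd.DataFrame({
    "patient_id": ["P1", "P1", "P1", "P2", "P2"],
    "test":       ["HbA1c", "HbA1c", "LDL", "HbA1c", "LDL"],
    "result":     [6.1, 6.4, 130.0, 5.6, 110.0],
    "test_date":  pd.to_datetime([
        "2020-01-05", "2020-07-12", "2020-07-12",
        "2021-03-02", "2021-03-02",
    ]),
})

# ORPP: patient_id is unique. MRPP: patients repeat across rows.
print(demographics["patient_id"].is_unique)     # True
print(labs["patient_id"].value_counts().max())  # 3
```

Every MRPP table like `labs` must eventually be collapsed to one row per patient before it can feed a standard statistical procedure.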
Users quickly realize a lot of work is needed before they can get to the real analysis
After the rude awakening, users realize that a lot of work is needed before the data are ready to feed into a statistical procedure such as a survival analysis.
Data preparation questions
- How do I reduce multiple-row-per-patient (MRPP) values to one-row-per-patient (ORPP)? For example:
  - What was the date when the patient started 1LOT?
  - Were they tested for biomarker X within 60 days of diagnosis? If yes, what was the result?
  - When was the last known contact date for the patient?
- Which patients satisfy the I/E criteria for my study, and how do I track those who fit the criteria and those who are removed?
- How do I join results across multiple tables?
- How can I do all of these things while maintaining key metadata and provenance, such as a data dictionary?
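To ground these questions, here is one hedged sketch in Python/pandas of the biomarker question: was the patient tested for biomarker X within 60 days of diagnosis, and if so, what was the result? Table and column names are invented; the pattern is join, window-filter, then reduce MRPP to ORPP by keeping the test closest to diagnosis.

```python
import pandas as pd

# Invented MRPP biomarker results and ORPP diagnosis dates.
biomarkers = pd.DataFrame({
    "patient_id": ["P1", "P1", "P2", "P3"],
    "biomarker":  ["X", "X", "X", "X"],
    "result":     ["positive", "negative", "negative", "positive"],
    "test_date":  pd.to_datetime(
        ["2020-01-10", "2021-06-01", "2020-05-01", "2019-12-20"]
    ),
})
diagnosis = pd.DataFrame({
    "patient_id": ["P1", "P2", "P3"],
    "diagnosis_date": pd.to_datetime(
        ["2020-01-01", "2020-01-01", "2020-01-01"]
    ),
})

# 1. Join tests to each patient's diagnosis date.
merged = biomarkers.merge(diagnosis, on="patient_id")
merged["days_from_dx"] = (
    merged["test_date"] - merged["diagnosis_date"]
).dt.days

# 2. Keep only tests within +/- 60 days of diagnosis.
in_window = merged[merged["days_from_dx"].abs() <= 60]

# 3. Reduce MRPP -> ORPP: keep the test closest to diagnosis.
orpp = (
    in_window.sort_values("days_from_dx", key=lambda s: s.abs())
    .groupby("patient_id", as_index=False)
    .first()[["patient_id", "result", "test_date"]]
)
```

Here P2’s only test falls outside the window, so P2 drops out of `orpp` entirely — which is itself a decision (missing vs. excluded) that the analyst must track.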
While the data provider or the scientific literature may offer high-level guidance, none of it completely describes the needs of an individual use case.
Many teams solve the data preparation problem using custom, bespoke processes
Characteristics of inefficient analytics processes
- Custom scripts
- Copying and pasting from prior work
- Documentation, if any, exists only in code.
Why is this custom, bespoke process bad for science AND business?
- Lots of work
- Room for errors
- Lots of time to train people
- Cannot scale.
- Too much freedom! People can easily do a lot of dumb things and get overwhelmed with choice.
- If you ask 10 analysts to do the same thing, they will implement it 11 different ways.
There is a better way: ‘tidy-patient’
Characteristics of efficient analytics processes
How can we develop systems to enable tidy patient results?
- Define and leverage modular functions that have already been reviewed.
  - If someone has implemented a good way to define biomarker status at index once, the logic shouldn’t be re-written.
- Create families of related functions with similar behavior.
  - “Add functions add variables, derive functions calculate new variables, filter functions remove patients.”
- Automate whenever possible to reduce unnecessary work.
  - Removing patients based on I/E criteria should automatically update attrition.
  - Adding derived variables should update a data dictionary.
- Code should be human readable.
  - It should look like a recipe, and not contain any implementation details.
- Best practices should be encouraged through user messages, warnings, and notes.
  - If there are known best practices for supporting a specific use case, we can encourage those best practices in code.
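As a hedged sketch of the automation principle, here is a toy Python version with an invented `Analysis` class (standing in for whatever framework you build): filtering patients updates the attrition log as a side effect, so the analyst never maintains it by hand.

```python
import pandas as pd

class Analysis:
    """Toy analysis object: an ORPP table plus an attrition log.
    Invented for illustration; not a real package."""

    def __init__(self, data: pd.DataFrame):
        self.data = data
        self.attrition = [("All patients", len(data))]

    def define_inclusion(self, label: str, mask: pd.Series) -> "Analysis":
        # Removing patients automatically appends a row to the
        # attrition log -- the analyst never updates it by hand.
        self.data = self.data[mask.loc[self.data.index]]
        self.attrition.append((label, len(self.data)))
        return self  # returning self enables recipe-style chaining

patients = pd.DataFrame(
    {"age": [45, 17, 62], "sex": ["Female", "Female", "Male"]},
    index=["P1", "P2", "P3"],
)

result = (
    Analysis(patients)
    .define_inclusion("Age >= 18", patients["age"] >= 18)
    .define_inclusion("Female", patients["sex"] == "Female")
)
print(result.attrition)  # attrition log: 3 -> 2 -> 1 patients
```

The same pattern extends naturally to the data dictionary: an `add_`-style method can record each new column’s definition at the moment it creates the column.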
What can a modular approach look like?
Consider the following task: Katie needs to create an analysis dataset from the “oncology_ehr” source, satisfying criteria like those described above. With the tidy-patient approach, her code might look like this:
```r
# Tidy patient approach

# 1: Start analysis ------------------------------------------
start_analysis(source = "oncology_ehr") |>

  # 2: Add patient-level variables ---------------------------
  ## Add existing ORPP tables
  add_orpp(table = "demographics") |>
  add_orpp(table = "mortality") |>

  ## Add diagnosis information for diagnoses that match
  ## a pre-defined codelist
  add_diagnosis(
    codelist = code_breastcancer(),
    method = "earliest",
    title = "mbc_diagnosis_date"
  ) |>

  ## Set mbc_diagnosis_date to be the index date for all
  ## subsequent analyses
  define_indexdate("mbc_diagnosis_date") |>

  ## Add biomarker status at index date, considering biomarker
  ## results that take place between -180 and 15 days of index
  add_biomarker_status(
    biomarker_name = "HR",
    days_before_index = -180,
    days_after_index = 15
  ) |>

  ## Add evidence for drug episodes taking place after index
  add_drugs(drug_name = "palbociclib", days_before_index = 0, days_after_index = Inf) |>
  add_drugs(drug_name = "fulvestrant", days_before_index = 0, days_after_index = Inf) |>

  # 2b: Derive -----------------------------------------------
  ## Derive age at event
  derive_age_at_event(event_date = "mbc_diagnosis_date") |>

  # 3: Inclusion / Exclusion ---------------------------------
  ## Define inclusion / exclusion criteria
  define_inclusion(mbc_diagnosis_date >= "2016-01-01") |>
  define_inclusion(sex == "Female") |>
  define_inclusion(age_at_mbc_diagnosis_date >= 18) |>
  define_inclusion(status_biomarker_hr == "positive") |>
  define_inclusion(has_palbociclib) |>
  define_inclusion(has_fulvestrant)
```
As you can see, this code is:
- Fast: the entire analysis is about 15 pipeline steps.
- Customizable: function arguments make it easy to change codelists, windows, and criteria.
- Frugal: it contains only the high-level logic defined in the analysis plan.
How can someone implement the right approach?
- Use any language capable of creating functions (R, Python, SAS).
- Centralize the functions in a package stored in a version-controlled repository.
Wrap-Up and Next Steps
In our next post