TP: A “Tidy Patient” approach to scaleable RWD analytics

80% of our teams’ time with real-world data is spent preparing data until it is an analysis-ready one-row-per-patient (ORPP) table containing results relevant to a specific use-case. ~ Data science leader at a top 5 pharmaceutical company

Consider a user Katie who is tasked with characterizing a patient population from EHR (we’ll call the dataset “Blue Lake EHR”). The patient population is one that satisfies a series of inclusion / exclusion criteria, across a wide range of variables including demographics, drug, labs, genetics, and outcomes.

For example, the patient population may be one where the patients satisfy 8 criteria across a variety of domains, from demographics, to diagnoses, genetics and treatment.

Criteria	Type	Demographics
1	Demographics	Is Female
2	Demographics	At least 18 at diagnosis
3	Diagnosis	Has metastatic breast cancer diagnosis
4	Diagnosis	Has diabetes diagnosis
5	Genetics	Is HR+ at diagnosis
6	Genetics	Is HER2- at diagnosis
7	Treatment	Has taken Palbociclib in 1LOT
8	Treatment	Has taken Fulvestrant in 2LOT

What Katie needs in order to conduct her analysis using standard statistical software (such as R, Python, SAS) is:

A one-row-per-patient (ORPP) table containing key raw and derived variables
An attrition chart showing how the total number of patients changed with the application of each inclusion/exclusion criteria.
A data dictionary containing definitions of each column.

If Katie is like clients we’ve worked with, she’ll quickly discover she’s in for a rude awakening when she finally gets access to the data.

Many users of real world data are in for a rude awakening when they see it for the first time.

Once Katie opens the data, she’s going to quickly realize the data don’t look nearly as ready for her use case as she thought.

In our experience, this is due to a conflict between Sally’s expectations and reality in 4 key domains: Structure, Patient-Level Summaries, Relevancy, and Documentation

Domain	RWD Expectation (hope?)	RWD Reality
Structure	Data are represented in a single table	Data are spread across multiple tables
Patient-level summaries	Data are represented at the patient level (one-row-per-patient “ORPP”)	Data are longitudinal and contain both ORPP and multiple-row-per-patient “MRPP” values)
Relevancy	Only data relevant to the target use case are present	Most data are almost certainly completely outside of the scope of any specific study.
Documentation	Documentation clearly explains how to use the data for my use case.	Documentation explains raw data well, but does not speak to how to put the pieces together for a specific use-case

Users quickly realize a lot of work is needed before they can get to the real analysis

After the rude awakening, users realize that a lot of work is needed before the data will be ready to input into a statistical procedure such as a survival analysis.

Data preparation questions

How do I reduce multiple-row-per-patient (MRPP) values to one-row-per-patient (ORPP)?

What was the date when they started 1LOT?
Were they tested for biomarker X within 60 days of diagnosis? If yes, what was the result?
When was the last known contact date for the patient?

Which patients satisfy the I/E criteria for my study and how do I track those who fit the criteria and those who are removed?
How do I join results across multiple tables?
How can I do all of these things while maintaining key metadata and provenance, such as a data dictionary?

While there may be high level guidance given by the data provider or scientific literature, none completely describe the needs of the individual’s use case.

Many teams solve the data preparation problem using custom, bespoke processes

Characteristics of inefficient analytics processes

Custom scripts
Copying and pasting from prior work
Possible documentation in code.

Why is this custom, bespoke process bad for science AND business

Why is this bad?

Lots of work
Room for errors
Lots of time to train people
Cannot scale.
Too much freedom! People can easily do a lot of dumb things and get overwhelmed with choice.

If you ask 10 analysts to do the same thing, they will implement it 11 different ways.

There is a better way: ‘tidy-patient’

Characteristics of efficient analytics processes

How can we develop systems to enable tidy patient results?

Define and leverage modular functions that have already been reviewed.

If someone has implemented a good way to define biomarker status at index once, the logic shouldn’t be re-written.

Create families of related functions with similar behavior

“Add functions add variables, derive functions calculate new variables, filter functions remove patients”

Automate whenever possible to reduce unnecessary work

Removing patients based on I/E criteria should automatically update attrition
Adding derived variables should update a data dictionary

Code should be human readable

Should look like a recipe, and not contain any implementation details

Best practices should be encouraged through user messages, warnings, and notes

If there are known best practices for supporting a specific use-case, we can encourage those best practices in code

What can a modular approach look like?

Consider the following question:

I need to create an analysis from the “oncology_ehr” dataset satisfying the following criteria:

# Tidy patient approach

# 1: Start analysis
start_analysis(source = "oncology_ehr") |>

# 2: Add patient-level variables

## Add existing ORPP tables
add_orpp(table = "demographics") |>
add_orpp(table = "mortality") |>

## Add diagnosis information for diagnoses that match
##  a pre-defined codelist
add_diagnosis(codelist = code_breastcancer(),
              method = "earliest",
              title = "mbc_diagnosis_date") |>

## Set mbc_diagnosis_date to be the indexdate for all subsequent analyses
define_indexdate("mbc_diagnosis_date") |>

## Add biomarker status at index date, considering biomarker
##   results taht take place between -180 and 15 days of index
add_biomarker_status(biomarker_name = "HR",
                     days_before_index = -180,
                     days_after_index = 15) |>

## Add evidence for drug episodes taking place after index
add_drugs(drug_name = "palbociclib",
          days_before_index = 0,
          days_after_index = Inf) |>
add_drugs(drug_name = "fulvestrant",
          days_before_index = 0,
          days_after_index = Inf) |>

# 2b Derive --------------------------------------------------

## Derive age at event
derive_age_at_event(event_date = "mbc_diagnosis_date") |>

# 3: Inclusion / Exclusion -----------------------------------

# Define inclusion / exclusion crtieria
define_inclusion(mbc_diagnosis_date >= "2016-01-01") |>
define_inclusion(sex == "Female") |>
define_inclusion(age_at_mbc_diagnosis_date >= 18) |>
define_inclusion(status_biomarker_hr == "positive") |>
define_inclusion(has_palbociclib) |>
define_inclusion(has_fulvestrant)

As you can see, this code is

Fast - about 15 lines of code.
Customizable - leverages arguments to make it easy to make changes.
Frugal - contains only the high-level logic defined in the analysis plan

How can someone implement the right approach?

Use any language capable of creating functions (R, Python, SAS).
Centralize the package in a version controlled repository.

Wrap-Up and Next Steps

In our next post

TP: Applying the “Tidy Patient” approach to medicare claims

we’ll show you how the tidy patient approach works when applied to medicare data.