What do we mean by onboarding? Essentially, setting up the knowledge and technical processes that enable teams to actually start getting business value from RWD. One metric we like to use to define getting initial value from RWD is “days-to-first-insight”: the number of days that pass from receiving a RWD asset to sharing a trustworthy, actionable, business-relevant insight gained from the data with key stakeholders.

In our experience, both as providers and users of RWD, the shorter a team’s days-to-first-insight, the more likely they are to continue investing in RWD and to achieve value from it; the longer it is, the more likely they are to abandon the asset. If a first-time user of RWD reaches a first insight in under 90 days, they are highly likely to embrace the asset and continue investing in it. By contrast, if the onboarding process for a new RWD asset delays a team’s days-to-first-insight beyond 90 days, they are likely to abandon it.
What are the common hurdles teams face in their onboarding process that delay their days-to-first-insight from real-world data assets? Below, we share our top five.
Hurdles
1: Finding the data
If you work in a clinical research organization, you likely have a wealth of data available to you. This data can come from a variety of sources, including electronic health records, claims data, and clinical trials. However, simply having this data is not enough. To truly maximize its value, you need to make sure it is discoverable and usable for your organization's analysts and scientists. In our experience, organizations often over-focus on, and overspend on, acquiring access to additional data assets before fully understanding the wealth of data already available to them.
2: Understanding the data model
Understanding the data dictionary and how data are distributed and related across tables can be a daunting task. A well-structured and documented data model will enable users to quickly isolate the necessary tables and fields to support an analysis. Conversely, a poorly structured or poorly documented data model will lead to confusion and frustration.
3: Detecting and understanding data quality issues (quickly)
Real-world data are messy and complex. Even the highest-quality real-world data will often present quality issues, such as substantive missingness or unexpected values, for specific use cases. Some missingness might be entirely consistent with how the data were collected and easy to account for, while other patterns might indicate substantial data quality issues and fundamentally jeopardize the researcher’s ability to answer the desired research question (see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6718007/ for a commentary on characterizing real-world data relevance and quality). When RWD quality issues are not identified, understood, and addressed quickly, they can lead to countless hours of wasted resources.
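As an illustration, a quick profiling pass can surface many of these issues on day one. The sketch below (R, using dplyr) computes per-column missingness and flags records outside an expected capture window; the file name, column names, and date range are hypothetical and would come from your own data dictionary.

```r
library(dplyr)

# Hypothetical longitudinal table; column names are illustrative
treatments <- read.csv("treatments.csv")

# Proportion of missing values in each column
missingness <- treatments %>%
  summarise(across(everything(), ~ mean(is.na(.x))))
print(missingness)

# Flag records whose dates fall outside the expected capture window
out_of_window <- treatments %>%
  mutate(treatment_date = as.Date(treatment_date)) %>%
  filter(treatment_date < as.Date("2000-01-01") | treatment_date > Sys.Date())
nrow(out_of_window)  # anything > 0 deserves a closer look
```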
4: Generating analysis-ready data
Analysis-ready data (ARD) are datasets in a “tidy” format, with each row representing a patient and each column representing patient-level information relevant to a specific analysis use-case. Creating ARDs involves summarizing events (such as reducing multiple-row-per-patient tables to single values), joining tables, calculating derived variables (such as defining index dates and computing follow-up time from index to outcomes), and filtering results, all while preserving data provenance and metadata.
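To make those steps concrete, here is a minimal sketch in R (dplyr) that assembles a simple ARD from a hypothetical one-row-per-patient demographics table and multiple-row-per-patient treatments and outcomes tables; the table and column names are assumptions, not part of any particular data model.

```r
library(dplyr)

# Hypothetical tables: demographics is one-row-per-patient;
# treatments and outcomes are multiple-row-per-patient
demographics <- read.csv("demographics.csv")
treatments   <- read.csv("treatments.csv")
outcomes     <- read.csv("outcomes.csv")

# Summarize events: first treatment per patient defines the index date
index_dates <- treatments %>%
  group_by(patient_id) %>%
  summarise(index_date = min(as.Date(treatment_date)), .groups = "drop")

# Summarize events: first recorded outcome per patient
first_outcomes <- outcomes %>%
  group_by(patient_id) %>%
  summarise(outcome_date = min(as.Date(outcome_date)), .groups = "drop")

# Join tables, derive follow-up time, and filter to an analysis-ready dataset
ard <- demographics %>%
  left_join(index_dates, by = "patient_id") %>%
  left_join(first_outcomes, by = "patient_id") %>%
  mutate(followup_days = as.numeric(outcome_date - index_date)) %>%
  filter(!is.na(index_date))
```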
5: Capturing metadata and data provenance
Every analytic report generated from RWD should contain key metadata, such as data cut-off and capture dates, and data provenance. This includes information about when and how data were processed or filtered from the raw data to the final analysis data. When metadata and data provenance
are not captured and surfaced correctly, users may face seemingly conflicting results due to differences in how the data were processed.
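One lightweight approach, sketched below in R, is to record the key dates and processing steps in a small provenance object that travels with the analysis-ready dataset (here, the ard data frame from the previous sketch); the asset name, dates, and log entries are hypothetical.

```r
# Hypothetical provenance record shipped alongside an analysis-ready dataset
provenance <- list(
  source_asset   = "example_rwd_asset_v3",
  data_cutoff    = as.Date("2023-01-31"),
  received_date  = as.Date("2023-02-15"),
  processing_log = c(
    "2023-02-16: removed patients with missing index date",
    "2023-02-16: restricted to treatments on or after 2015-01-01"
  )
)

# Attach it to the dataset so any downstream report can surface it
attr(ard, "provenance") <- provenance
saveRDS(ard, "ard_with_provenance.rds")
```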
Solutions
1: Tools to discover existing RWD assets
Create a Centralized Data Catalog to maximize discoverability

The first step in leveraging your pre-existing clinical data assets is to make sure they are discoverable. This means that analysts and scientists should be able to easily find the appropriate data they need, without spending hours searching through different databases and systems.

One way to maximize discoverability is to create a centralized data catalog. This catalog should include information about all of the data assets available within your organization, including the data's source, format, and any relevant metadata. With a centralized catalog, analysts and scientists can quickly and easily find the data they need, without having to navigate through multiple systems and databases. There are several open-source solutions available for generating centralized data catalogs; three popular options are CKAN, Apache Atlas, and DKAN.
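Even before adopting a full platform such as CKAN, a lightweight catalog can get you started. The sketch below (R, tibble) builds a one-row-per-asset table with hypothetical example entries and writes it to a shared location.

```r
library(tibble)

# One row per data asset; the entries below are hypothetical examples
catalog <- tribble(
  ~asset_name,        ~source,                    ~format,   ~refresh,    ~contact,
  "oncology_ehr",     "EHR vendor extract",       "parquet", "monthly",   "data-team@example.org",
  "claims_2015_2022", "closed claims aggregator", "csv",     "quarterly", "data-team@example.org"
)

# Publish to a shared location where analysts and scientists can find it
write.csv(catalog, "data_catalog.csv", row.names = FALSE)
```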
Conclusion
By maximizing discoverability and minimizing onboarding effort, you can effectively leverage your pre-existing clinical data assets. This not only helps you get more value from your data, but it also makes it easier for analysts and scientists to do their work, ultimately leading to better insights and outcomes.
2: Entity Relationship Diagrams
Real-world data assets are complex. Many users won’t realize how complex they are until they actually see the data for the first time. While different RWD assets will follow different data models, they will usually be delivered as a series of tables, with each table representing a different data modality. For example, one table might represent demographic information in a one-row-per-patient (ORPP) format, while another might represent longitudinal treatment information in a multiple-row-per-patient (MRPP) format. All tables containing patient data will be linkable through a series of primary and foreign keys. A key is an attribute (or group of attributes) that can uniquely identify records in a table. When dealing with RWD, it can be difficult for users to comprehend the relationships between tables and fields without a guide.
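A quick way to confirm you have understood the keys is to test them directly. The sketch below (R, dplyr) checks that an assumed primary key is unique in a one-row-per-patient table and that an assumed foreign key in a multiple-row-per-patient table always resolves; the table and column names are illustrative.

```r
library(dplyr)

demographics <- read.csv("demographics.csv")  # assumed one-row-per-patient
treatments   <- read.csv("treatments.csv")    # assumed multiple-row-per-patient

# The assumed primary key should uniquely identify records in demographics
stopifnot(!any(duplicated(demographics$patient_id)))

# Every foreign key in treatments should resolve to a record in demographics
orphans <- anti_join(treatments, demographics, by = "patient_id")
nrow(orphans)  # anything > 0 is worth raising with the data provider
```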
Entity Relationship Diagrams (ERDs)
An entity-relationship diagram (ERD) can be an invaluable tool in this regard. An ERD is a chart that illustrates how concepts, such as fields and tables, relate to each other in a system. A well-crafted ERD offers users a clear and concise visualization of their data model, enabling them to quickly and easily identify crucial relationships and dependencies between tables and fields.
Tools
While you can create an entity-relationship diagram in any visual design tool, there are many tools built specifically to help RWD users and providers create them. Some, such as LucidChart and DBeaver, can generate a diagram automatically from the database itself. Others, such as DBDiagram, let you build one from a simple text-based specification.
Conclusion: RWD are complex. Providing users with tools to navigate that complexity will dramatically help them onboard to new RWD assets. One effective tool for helping users understand the tables underlying a RWD asset is an Entity Relationship Diagram (ERD). ERDs can be easily created using free and open-source tools.
3: Interactive Data Dictionaries
Every data provider should offer a data dictionary for users. Typically, these dictionaries come in the form of a spreadsheet or a static Word or PDF document. While these formats are helpful, there is a better option available: an interactive web app. An interactive web app with a point-and-click interface and search fields can easily be created from a flat file (such as Excel or .csv), rendered as an interactive table (through libraries such as DataTables, https://datatables.net/), and deployed to users (for example, in a Shiny application, https://shiny.rstudio.com/articles/datatables.html).

Web apps allow multiple users to quickly and easily find critical table and field information from a centralized source. This can save users a significant amount of time and frustration compared to manually searching through a static document. In addition, a web app can include links to additional documentation and analysis templates, giving users more resources to work with the data. It can also provide a low-friction way for users to submit questions or feedback on the data dictionary back to the data provider via an embedded questionnaire.
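As a minimal sketch of this idea, the R app below reads a data dictionary from a flat file and renders it as a searchable DataTables table inside a Shiny app; the file name and columns (table_name, field_name, type, description) are hypothetical.

```r
library(shiny)
library(DT)

# Hypothetical flat-file dictionary with columns such as
# table_name, field_name, type, and description
dictionary <- read.csv("data_dictionary.csv", stringsAsFactors = FALSE)

ui <- fluidPage(
  titlePanel("RWD Data Dictionary"),
  DTOutput("dict")
)

server <- function(input, output, session) {
  output$dict <- renderDT({
    datatable(
      dictionary,
      filter = "top",                  # per-column search boxes
      options = list(pageLength = 25)  # global search box is on by default
    )
  })
}

shinyApp(ui, server)
```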
4: Mock Data
Mock data closely resembles the structure and format of real data without representing any real statistical relationships, allowing users to familiarize themselves with the data structure, fields, and formats. Mock data can be extremely useful to both data engineers and data analysts in designing data ETL and analytics pipelines and in preparing for potential usability hurdles before the real data arrive.
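A mock dataset does not need to be sophisticated to be useful. The sketch below (R) generates a small one-row-per-patient table and a multiple-row-per-patient table with the right shape but no real relationships; all names and values are made up.

```r
set.seed(42)
n <- 100  # number of mock patients

# Mock one-row-per-patient demographics table (no real statistical relationships)
mock_demographics <- data.frame(
  patient_id = sprintf("PT%04d", seq_len(n)),
  birth_year = sample(1930:2000, n, replace = TRUE),
  sex        = sample(c("F", "M"), n, replace = TRUE)
)

# Mock multiple-row-per-patient treatments table
mock_treatments <- data.frame(
  patient_id     = sample(mock_demographics$patient_id, 3 * n, replace = TRUE),
  treatment_date = as.Date("2018-01-01") + sample(0:1500, 3 * n, replace = TRUE),
  drug_name      = sample(c("drug_a", "drug_b", "drug_c"), 3 * n, replace = TRUE)
)
```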
5: Use-case templates and analysis recipes
The best way to learn a new data asset is to use it to answer realistic questions. As part of a starter kit for improving the usability of RWD assets, we recommend including a set of templates that describe common use-cases, along with a set of recommended steps for solving them.
These templates should provide users with a high-level framework for accomplishing common use-cases. Ideally, they should include an end-to-end example of code that can be used to implement the use-case in at least one programming language. By providing these end-to-end examples, users can more easily understand how to apply the templates to their own analyses, and can reduce the time required to develop and test their own analysis code.
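As one example of what such a recipe might look like, the R skeleton below walks a hypothetical use-case (time from first diagnosis to first treatment) through the typical steps; the tables, columns, and cohort definition are placeholders to be replaced by the data provider's own conventions.

```r
library(dplyr)

# Recipe (hypothetical use-case): time from first diagnosis to first treatment
# Step 1: load the tables identified in the data dictionary / ERD
diagnoses  <- read.csv("diagnoses.csv")
treatments <- read.csv("treatments.csv")

# Step 2: define the cohort (here, first diagnosis per patient)
cohort <- diagnoses %>%
  group_by(patient_id) %>%
  summarise(diagnosis_date = min(as.Date(diagnosis_date)), .groups = "drop")

# Step 3: derive the variable of interest
result <- treatments %>%
  group_by(patient_id) %>%
  summarise(first_treatment = min(as.Date(treatment_date)), .groups = "drop") %>%
  inner_join(cohort, by = "patient_id") %>%
  mutate(days_to_treatment = as.numeric(first_treatment - diagnosis_date))

# Step 4: summarize and report, recording the data cut-off and filters applied
summary(result$days_to_treatment)
```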
6: Functional code and packages
Functional code, such as R and Python packages, designed specifically to help users implement common workflows with your data, can be incredibly useful for software developers. These packages provide pre-written code that can be easily integrated into new projects, allowing developers to focus on building new features rather than re-writing code that already exists. In addition, using pre-existing code can help to ensure that new projects are reliable and well-tested, since the packages they rely on have already been used by many other developers. By leveraging existing code in this way, developers can save time and effort while still producing high-quality software for their users.
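For example, a data provider might package recurring derivations as plain functions. The sketch below shows the general shape of such a helper in R; the function name, arguments, and column names are hypothetical, not part of any existing package.

```r
library(dplyr)

# Hypothetical helper of the kind a data provider might ship in an R package:
# derive follow-up time from an index date to the first recorded outcome
derive_followup <- function(index_table, outcome_table) {
  first_outcome <- outcome_table %>%
    group_by(patient_id) %>%
    summarise(outcome_date = min(as.Date(outcome_date)), .groups = "drop")

  index_table %>%
    left_join(first_outcome, by = "patient_id") %>%
    mutate(followup_days = as.numeric(outcome_date - as.Date(index_date)))
}

# Usage: followup <- derive_followup(index_dates, outcomes)
```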