Clinical Data Management

Clinical data science is the future for clinical data management.

By: Michael Phillips

Senior Director, Innovation & Informatics, ICON

Clinical Data Management (CDM) is a key service in clinical research that underpins the full value chain of data delivery from clinical trials. A clinical trial is a tightly controlled experiment involving human subjects to determine the safety and efficacy of a therapeutic drug, medical/surgical procedure, or device. To reach a valid conclusion and safeguard patients at all times, it is essential that all required data are collected faithfully and accurately and delivered in a timely fashion for ongoing safety monitoring, protocol compliance monitoring and statistical analysis, including interim analysis.

Commercial off-the-shelf clinical data management systems (CDMS) are widely used to facilitate the source collection and management of patient data. However, that is not even half the story: an increasing share of data is captured outside the CDMS, from remote devices, eSource, electronic clinical outcome assessment vendors and other specialized vendors. Add to that picture the unique characteristics and requirements of each clinical trial, and the setup, management, and delivery of a clinical trial data set becomes a complex process.

The CDM organization is responsible for designing the case report forms for the electronic data capture (EDC) system, building the EDC database for the study, programming edit checks inside the CDMS, working with third-party vendors to agree data handling processes, programming data listings outside the CDMS for discrepancy and reconciliation review, coding medical and drug terms, and preparing data to Clinical Data Interchange Standards Consortium (CDISC) standards for downstream analysis. CDM organizations also support medical safety reviews and central monitoring as part of a wider risk-based quality management (RBQM) system, in line with the International Council for Harmonization guideline for good clinical practice, ICH E6(R2).

Traditionally, the work of CDM organizations has been procedural and labor-intensive, with a strong study-specific component, not least because individual clinical trials are tailored research programs. However, CDM is on a continuous journey of improvement. As recently as thirteen years ago, much of the focus was on the digitization of paper case report forms (CRFs) into electronic CRFs, the rapid growth of online data capture through EDC systems, and the creation of integrated CDMSs with edit checks programmed at source. The digitization of clinical trials also opened the field of central monitoring and led to the emergence of RBQM as a key service, supported by new processes and technologies.

Core CDM processes have remained more conservative, but there is a strong appetite for innovation to use more advanced analytics, machine learning (ML) and natural language processing (NLP) within an RBQM framework. This drive for innovation in CDM coincides with a large increase in the number of data scientists available to apply new data techniques to the tasks of data standardization, discrepancy and reconciliation review and anomaly detection.

Data standardization underpins many CDM activities. An ongoing focus on implementing strong and consistent data standards for data collection is minimizing downstream data preparation. The residual work remains human-intensive: specialists develop detailed specifications that map the source data to a target such as the Study Data Tabulation Model (SDTM), and programmers write the mapping transformations based on those specifications. The process includes a base form-to-field mapping plus additional logic and controlled terminology mapping, usually embedded in the specification as pseudocode for the programmers to interpret. A combination of NLP and ML techniques can be applied to a set of completed mapping specifications to train a model that provides a first pass of the mapping for human completion and quality control.
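
To make the controlled terminology piece concrete, here is a minimal sketch, assuming a Python workflow, of how a value-mapping rule that would otherwise sit in a specification as pseudocode can instead be expressed as metadata and applied programmatically; the codelist and collected values shown are hypothetical examples.

```python
# Sketch: a controlled-terminology value mapping expressed as metadata
# rather than as pseudocode in a specification. Terms are hypothetical.
SEX_CODELIST = {          # collected value -> CDISC-style submission value
    "male": "M",
    "female": "F",
    "unknown": "U",
}

def map_sex(collected_value: str) -> str:
    # Fall back to a review marker instead of guessing at unmapped values.
    return SEX_CODELIST.get(collected_value.strip().lower(), "NEEDS REVIEW")

print([map_sex(v) for v in ["Male", "FEMALE", "not recorded"]])
```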

The predictive models can determine the most probable form-field mappings and provide match probability scores. They can also parse and learn from previous pseudocode to provide a standard, metadata-driven set of suggestions for logic and value mapping. Additional ML could convert well-formed specifications into actual data transformation code. There are challenges with respect to the finalization and review of code and the handling of the inevitable post-production changes in data collection, but ML-assisted automated code generation may have a part to play in streamlining the application of data standards. 
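
As a simplified illustration of first-pass mapping with match scores, the sketch below uses a plain character n-gram similarity heuristic (scikit-learn TF-IDF plus cosine similarity) rather than a model trained on completed specifications; all field labels and SDTM targets shown are hypothetical.

```python
# Sketch: first-pass form-field mapping suggestions with match scores.
# Assumes scikit-learn is available; the field labels and SDTM targets
# below are illustrative, not a real specification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

source_fields = ["Systolic Blood Pressure", "Date of Birth", "Adverse Event Term"]
sdtm_targets = ["VS.VSORRES (Systolic BP)", "DM.BRTHDTC (Birth Date)", "AE.AETERM (Reported Term)"]

# Character n-grams tolerate abbreviations and spelling variants in labels.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
vectorizer.fit(source_fields + sdtm_targets)

scores = cosine_similarity(
    vectorizer.transform(source_fields),
    vectorizer.transform(sdtm_targets),
)

for i, field in enumerate(source_fields):
    best = scores[i].argmax()
    print(f"{field!r} -> {sdtm_targets[best]!r} (score {scores[i][best]:.2f})")
```

In practice the scores would feed a human review queue rather than an automatic decision, with low-confidence matches flagged for specialist attention.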

Automating the process of data standardization speeds up the delivery of SDTM outputs and improves the governance of mapping specifications, but perhaps more significantly, it opens a path to much more routine data standardization for operational purposes, which in turn creates many new possibilities for transforming CDM with clinical data science.

Core to the concept of RBQM is the identification and management of critical data. On that basis, a strategy can be defined that clearly scopes data standardization for the purposes of best-practice CDM. Once that strategy is implemented, a standardized database for all studies can be established; this is a challenging task for contract research organizations (CROs), which provide services to a wide variety of sponsors.

The implementation of a standardized clinical data repository prompts new thinking and opens new possibilities for conducting CDM. Rather than follow the traditional path of creating listing specifications and programming outputs study by study, unsupervised ML techniques can be used to identify patterns in the data and recommend areas for follow up. The principle is that the data in a study reveal the study’s characteristics and that profile can be used to highlight anomalies or discrepancies.
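
As a minimal sketch of that principle, assuming a standardized table of numeric vital signs, an unsupervised isolation forest can let a study's own data define "normal" and flag the records that deviate from it; the column names, values and contamination rate below are illustrative assumptions.

```python
# Sketch: let the study's own data define "normal" and surface deviations.
# Column names, values and the contamination rate are illustrative.
import pandas as pd
from sklearn.ensemble import IsolationForest

vitals = pd.DataFrame({
    "SYSBP": [118, 122, 120, 240, 119, 121],   # one implausible systolic value
    "DIABP": [78, 80, 82, 130, 79, 81],
    "PULSE": [70, 72, 68, 150, 71, 69],
})

# The forest learns the study's multivariate profile without labels.
model = IsolationForest(contamination=0.2, random_state=0)
vitals["flag"] = model.fit_predict(vitals)  # -1 marks records to follow up

print(vitals[vitals["flag"] == -1])
```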

Unsupervised ML can be used to detect where a data point might be missing, expected or overdue, or where a data point deviates from the norm for that study. Some data captured in clinical data sets are inevitably free text. NLP can be used to conduct the free-text reconciliations or cross-checks currently performed by people, narrowing the focus of the human review effort to a much smaller set of true anomalies. Statistical techniques, combined with unsupervised ML, can be applied to numerical and categorical ranges to identify outliers, and thereby significant anomalies not otherwise detectable by row-level reviews. Unsupervised ML can also be used to screen EDC audit trails for anomalous behavior.
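
A minimal sketch of the free-text idea, using simple string similarity from the Python standard library in place of richer NLP, might auto-reconcile obvious matches between adverse event terms and concomitant medication indications and route only low-confidence pairs to human review; the terms and threshold are hypothetical.

```python
# Sketch: narrow human review by auto-matching free-text entries.
# Uses simple string similarity from the standard library; the terms
# and the 0.8 threshold are hypothetical, and a real pipeline would
# use richer NLP (e.g., medical coding dictionaries or embeddings).
from difflib import SequenceMatcher

adverse_events = ["Headache", "Nausea", "Elevated blood pressure"]
conmed_indications = ["headache", "hypertension", "nausea and vomiting"]

def best_match(term, candidates):
    scored = [(SequenceMatcher(None, term.lower(), c.lower()).ratio(), c)
              for c in candidates]
    return max(scored)

for ae in adverse_events:
    score, match = best_match(ae, conmed_indications)
    status = "auto-reconciled" if score >= 0.8 else "route to human review"
    print(f"{ae!r} ~ {match!r} ({score:.2f}) -> {status}")
```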

A more radical, even provocative, notion is the potential for a paradigm shift in which the current, very granular approach to CDM is replaced by a more consolidated review of key data, combining a range of techniques to deliver a focused, categorized data review report. The many individual reviews, each backed by specifications and programming, could be consolidated into a more holistic, standardized approach. It would still be necessary to create study-specific elements, but much of CDM could be significantly standardized across all studies. Digitalization is generally driven by the availability of data, the availability of data tools and expertise, and commercial pressure. All those factors are steering CDM toward greater innovation and the emergence of a new clinical data science model.


Michael Phillips, PhD, has worked for ICON for approximately 10 years, and currently leads the Innovation Data Science Team. He has over 20 years’ experience in IT, business intelligence, data analytics, and eClinical innovation, accumulating strong experience in team leadership, business partnering, solution design and customer engagement. He has broader experience in academic biomedical research, with a PhD in drug metabolism enzymology, and spent 10 years working in management roles in biomedical publishing. Michael is the author of TIBCO Spotfire – A Comprehensive Primer.
