Clinically Speaking

Data Integrity

Deconstructing the why and how to achieve it.


By: Ben Locwin

Contributing Editor, Contract Pharma

Subdivided to its root, “integrity” comes to us from the same Latin source as the word “entire.” In the pharmaceutical industry, however, data integrity takes on a breadth of meaning well beyond completeness or “entirety,” captured in the acronym ALCOA, which refers to data that are Attributable, Legible, Contemporaneous, Original, and Accurate; users of the “ALCOA+” variant also add Complete and Enduring to the list, among other terms.

Here is a deconstruction of the elements:

Attributable
The data generated must be traceable to the person, system, or piece of equipment that generated them (the source of the data), as well as to a discrete point in time.

Example error source: If a shared password login is used for manufacturing recipes, multiple individuals can use the same credentials, making the data non-attributable to a single operator.
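In electronic systems, attributability can be enforced at the record level by binding a unique operator ID and a capture timestamp to every value the moment it is created. The following Python sketch is purely illustrative (the class, field names, and IDs are hypothetical, not from any regulated system), but it shows the principle: each data point carries its own source and time of capture wherever it travels.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)  # frozen: the record cannot be silently edited after creation
class BatchEntry:
    """One attributable data point: tied to a unique operator, a source
    instrument, and a discrete point in time."""
    operator_id: str    # unique per person -- never a shared login
    equipment_id: str   # the instrument or system that produced the value
    value: float
    recorded_at: str = field(
        # timestamp captured at creation, in UTC, so it is also contemporaneous
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

# Each entry names its own source and timestamp, so it remains
# attributable even after export or archival.
entry = BatchEntry(operator_id="op-0042", equipment_id="scale-07", value=12.5)
print(entry.operator_id, entry.recorded_at)
```

Because the dataclass is frozen and the timestamp is generated at construction, an operator cannot retroactively adjust who or when; any correction must be a new record.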

Legible
Can the data be read and understood years after capture? This could refer to certain pens (ink) or printers (thermal, for example) whose output degrades over time with environmental exposure. It can also refer to colorful euphemisms and idiosyncratic shorthand that are not straightforward and tend to obscure meaning with the passage of time.

Example error source: Certain printers produce printouts that fade or become illegible within mere weeks or months. This makes the choice of data-capture method critical; otherwise you may find yourself in the uncomfortable position of holding unreadable outputs several years after the fact.

Contemporaneous
Meaning connected or together in time: data need to be captured at the same time as their associated operation. Forward-dating (or forward-timestamping) is strictly forbidden, as is back-dating, for the very reason that the data would not be contemporaneous with the record-keeping.

Example error source: Pre-populating batch record fields with information about something expected to happen in the near future is a common source of error. Perhaps the filter lot number changes, rendering the initial entry incorrect. Because of the non-contemporaneous nature of this sort of activity, it is prohibited.

Original
Original records (lab notebooks, batch records, study reports, etc.) should be kept, rather than copies. The ability to enter system “backdoors” and modify data after they have been recorded is a transgression against the Originality of data, and is forbidden.

Example error source: Writing production or lab notes informally with the intention of transcribing them later as the “original” record would be a transgression against the Originality of source data.

Accurate
Data captured should reflect the reality of what happened during a particular activity. Additionally, if changes are made, the Accuracy clause requires documentation of those changes that makes them traceable back to the original information, preserving the accurate nature of the data. Electronic data capture must occur within systems that have accuracy checks and verification controls. This is also one of the reasons that measurement equipment needs to be calibrated: to maintain an accurate source record of what is measured.

Example error source: Deleting or obliterating data in order to capture something that changed in the operations violates this clause, because the activities were not captured as conducted.

Complete
One of the “+” elements added to the original ALCOA: all data should have an audit trail that shows any changes and demonstrates that nothing was deleted or lost.

Example error source: Running a QC laboratory retest in which a secondary result has the potential to overwrite or replace the original (initial) result.
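The audit-trail requirement can be pictured as an append-only store: a retest adds a new, reasoned entry rather than overwriting the first result. The minimal sketch below is hypothetical (class and field names are invented, not taken from any validated system), and shows only the core idea that correction-by-appending preserves Completeness where correction-by-overwriting destroys it.

```python
from datetime import datetime, timezone

class AuditTrail:
    """Minimal append-only record store: corrections and retests are
    appended with a stated reason, never overwritten, so the full
    history of every result is preserved."""

    def __init__(self):
        self._events = []  # append-only list; deliberately no delete/update method

    def record(self, sample_id, result, operator_id, reason="initial result"):
        self._events.append({
            "sample_id": sample_id,
            "result": result,
            "operator_id": operator_id,  # attributable
            "reason": reason,            # why this entry exists
            "recorded_at": datetime.now(timezone.utc).isoformat(),  # contemporaneous
        })

    def history(self, sample_id):
        """All results ever recorded for a sample, oldest first."""
        return [e for e in self._events if e["sample_id"] == sample_id]

trail = AuditTrail()
trail.record("QC-101", 98.7, "op-0042")
trail.record("QC-101", 99.1, "op-0042", reason="retest per OOS investigation")

# The retest does not replace the original; both results remain visible.
print([e["result"] for e in trail.history("QC-101")])
```

A real validated system would layer on access controls, electronic signatures, and tamper-evident storage per 21 CFR Part 11; the point of the sketch is only the append-only shape of the history.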

FDA Warning Letters on Data Integrity

The past eight years have produced the highest number of data integrity warning letters in history. Some of this is due to greater scrutiny than ever before, but nevertheless these issues exist and continue to grow in number.

One particular example from industry found that analysts at an analytical laboratory had system access to delete and overwrite data. The FDA investigators found approximately 36 deleted data files or folders in the recycle bin.

Further recent FDA Warning Letter excerpts, issued to various organizations, include the following:1
  • “Your firm failed to exercise appropriate controls over computer or related systems to assure that only authorized personnel institute changes in master production and control records, or other records (21 CFR 211.68(b)).”
  • “FDA analyses of study data generated at these companies and submitted in several applications…found significant instances of misconduct and violations of federal regulations, which resulted in the submission of invalid study data to FDA.”
  • “Your failure to retain study records as required by FDA regulations significantly compromises the validity and integrity of data collected at your site.”
  • “The batch record documented that one employee performed multiple manufacturing steps, such as measuring containers and bulk reconciliation on two separate dates, and a second employee documented the verification of the activities. However, the second employee (verifier) stated to our investigator that they were not at work when these steps were documented as being performed.”
  • “Your quality system has not adequately ensured the accuracy and integrity of the data to support the safety, effectiveness, and quality of the drugs you manufacture. Without accurate records, you cannot assure appropriate decisions regarding batch release, product stability, and other matters that are fundamental to the ongoing assurance of quality.”

Our World in Data

Our world is awash in more data than ever before in the history of civilization—by a large margin. It has been estimated that the creation, storage, and retrieval of data and information represent one-third of all revenue generated in the world. But herein lies the problem: if you rapidly scale up the volume of data and its velocity, this often occurs without proper care paid to its veracity, the “truth” of the data. This fact, together with the enormity of terrible data analyses performed on tortured data sets every day across the world, gave rise to the saying that we’re “drowning in data, but thirsting for information.”

When we do analyses of data which were improperly gathered, we begin to model the error within those data, which leads to and magnifies erroneous conclusions. In many industries, this can lead to simple decision-making errors. But in healthcare and pharma, misapplied analyses of the wrong data can lead to conclusions which ultimately lead to public harm. In clinical trials, for example, non-integrious data can lead directly to the commission of a Type 1 error (a false positive) or a Type 2 error (a false negative), both of which would be invisible to the data analyst if the underlying integrity of the data were unknown.
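This invisibility is easy to demonstrate. In the hypothetical simulation below (all numbers are invented for illustration), a systematic bias in the measurements, such as an uncalibrated instrument, inflates the Type 1 error rate far above the nominal 5% significance level, and nothing in the downstream analysis itself reveals the problem.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def false_positive_rate(bias, n_trials=2000, n=50, alpha=0.05, seed=7):
    """Fraction of simulated studies that wrongly 'detect' an effect when the
    true effect is zero. `bias` models a hidden data-integrity flaw (e.g. an
    uncalibrated instrument) added to every measurement."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    rejections = 0
    for _ in range(n_trials):
        # Measurements truly come from N(0, 1); the bias corrupts them silently.
        sample = [rng.gauss(0.0, 1.0) + bias for _ in range(n)]
        z = fmean(sample) * sqrt(n)  # z-test with known sigma = 1
        if abs(z) > z_crit:
            rejections += 1          # Type 1 error: there is no real effect
    return rejections / n_trials

clean = false_positive_rate(bias=0.0)   # close to the nominal alpha
flawed = false_positive_rate(bias=0.3)  # far above alpha
print(clean, flawed)
```

The analyst sees only the test statistics; the clean and flawed runs use the identical analysis, so without knowledge of the data’s provenance, the inflated error rate is undetectable from the results alone.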

How Big is Big Data in Healthcare?

According to estimates by IDC, 64 zettabytes (64 trillion gigabytes) of data were created or replicated in 2020, up from an estimated 45 zettabytes in 2019, with 140 zettabytes forecast for 2024. Of all these mountains of data, only about 10% of the tagged data were considered useful for analysis or for being fed into AI/ML.

Now let’s contrast this with an interesting general review of business data published in Harvard Business Review by Nagle et al.: only 3% of the data reviewed in their study were rated “acceptable,” meaning the remaining 97% fell toward the other end of the scale.2 This means that AI models are frequently being trained on big data sets replete with unacceptable levels of error in their data quality.

Artificial intelligence (AI) models most often derive incorrect conclusions because of a lack of data integrity. This shouldn’t be surprising: if machine learning algorithms live and die on the quality of their data sets, flaws in the data will necessarily sink their results. As we move to more and more AI/ML/DL analytics in healthcare data—including molecular discovery and development, clinical trial analyses, and so on—the integrity of those underlying data is, and will continue to be, the key to the future of healthcare.

Data without integrity are like a giftwrapped empty box: it looks good on the outside but has no value. Non-integrious data are really just numbers, and untrustworthy ones at that. The great theoretical physicist Wolfgang Pauli once lamented, “That is not only not right, it is not even wrong.”* He was getting at two things: his frustration with careless and incorrect thinking, and the idea that there are different levels of wrongness. Being so wrong that one doesn’t even know or believe one is wrong is among the worst kinds, the type of data wrongness that is unfalsifiable. And so it is with data of unknown source or provenance that get analyzed daily across the globe: can we trust their results? That is the ultimate data integrity question faced by the FDA and other international regulatory agencies.

*Pauli’s original formulation was, “Das ist nicht nur nicht richtig, es ist nicht einmal falsch!”

References
  1. United States Food and Drug Administration. (2022). Warning Letters. https://www.fda.gov/inspections-compliance-enforcement-and-criminal-investigations/compliance-actions-and-activities/warning-letters
  2. Nagle, T., Redman, T.C., & Sammon, D. (2017). Only 3% of Companies’ Data Meets Basic Quality Standards. Harvard Business Review.


Ben Locwin
Contributing Editor

Ben Locwin is an executive and careful consumer of global data and analytics. Whether they be clinical trial analyses, drug development or production data, analytics intended for sociopolitical endeavors, or basic daily data and information forced upon us through media sources, the credibility and integrity of the data matter. We should expect more from our data, and we can help to ensure this by requiring and maintaining integrity of the sources of data and their collection methods. 
