
The Benefits of Big Data in Drug Development

Big data is changing the way healthcare is delivered and improving the way drugs are developed.

By: Joe Alea

Chief Technology Officer, Anju

Drug development is a long and risky journey. According to industry group PhRMA, it takes 10-15 years on average to develop one new medicine from initial discovery through regulatory approval. Only one in 1,000 (0.1%) of the drugs entering preclinical studies is ultimately tested in humans, and only one in five (20%) of the drugs entering human trials makes it to commercialization. While these numbers have improved, there is still considerable room to optimize the drug development process.

Drug development is a complex and challenging process that requires significant technical expertise and resources, and pharmaceutical R&D suffers from declining success rates and stagnant pipelines. Assessing compounds at the cellular and molecular levels involves varied processes and extensive permutations and combinations. The challenges span technical, regulatory, legal, clinical, and financial domains, as well as cost and time constraints. They can be difficult to overcome and can significantly delay, or even prevent, the successful launch of a new drug or therapy.

How does Big Data benefit drug development?

Big data is a term used to describe data that is so large, complex, and often unstructured that it requires specialized tools and algorithms to process and analyze. Big data has become increasingly important in the healthcare industry because it can provide insights into drug discovery, drug development, patient behavior, treatment outcomes, and disease progression.

Big data is changing the way healthcare is delivered and improving the way drugs are developed, providing more accurate and efficient results.

The McKinsey Global Institute estimates that applying big-data strategies to better inform decision-making could generate up to $100 billion in value annually across the U.S. healthcare system, by optimizing innovation, improving the efficiency of research and clinical trials, and building new tools for physicians, consumers, insurers, and regulators to meet the promise of more individualized approaches.

Big data in drug discovery is the collection of data from biological, chemical, pharmacological, and clinical sources. These datasets are characterized by rapid generation, large volume, complexity, heterogeneity, and high value, with significant commercial opportunities.

Big data can reduce the number of clinical trial cycles and enable predictive modeling during drug development. By analyzing the diversity of available molecular and clinical data, predictive models can identify potential candidate molecules with a high probability of being successfully developed into drugs that act on biological targets safely and effectively.

Big data offers numerous benefits in drug development. It can help reduce the cost and time of development by providing faster, more accurate results. It can also improve the quality of the data used in development, allowing researchers to make more informed decisions. In addition, it can reduce the risk of drug development failure, since it can identify potential side effects before they occur.

Big data can also help reduce the risk of clinical trial failure, since it can be used to identify the most effective treatments and to reduce the cost of clinical trials. Finally, it can help identify new drug targets and potential treatments for diseases that have not yet been studied.

Predictive modeling of biological processes and drugs has become significantly more sophisticated and widespread.

Using a big data platform, clinical trials can be monitored in real time to rapidly identify safety or operational signals requiring action, helping sponsors avoid significant and potentially costly issues such as adverse events and unnecessary delays. Sponsors use sophisticated business intelligence software solutions such as TA Scan, which aggregate, connect, and analyze global clinical trial data, presentation and publication data, and many other public-domain data sources in a single database.

Drug development is a long and rigorous process. By driving pharmaceutical production forward in a shorter amount of time, big data has helped reduce some of the skyrocketing costs and drawbacks associated with clinical trials. JAMA reports that the estimated research and development spend for a new drug is $1.1 billion, a figure that includes the cost of failed clinical trials.

When designing a clinical trial, there are several factors to take into consideration, the most important being recruiting the right candidates for the study. Recruitment is certainly time-consuming, but big data can help: it allows researchers to leverage genetic makeup, disease status, historical patient data, demographics, past clinical trial data, and much more to find the right participants. This saves both time and money, maximizes the study’s success rate, and, in some cases, minimizes the need for clinical trials altogether.


Figure 1. Current Drug Development Process (Source: IEEE Engineering in Medicine and Biology Society)

Emerging role of automation, analytics, and AI in drug development

Emerging deep learning techniques, together with insights from advanced analytics, provide compelling advantages for the drug development industry.

A variety of big data technologies are being used in drug development, including machine learning algorithms, natural language processing, and predictive analytics. Machine learning algorithms can be used to identify patterns in the data and generate hypotheses about the effectiveness of treatments. Natural language processing can be used to analyze unstructured data, such as patient records and medical reports.

Predictive analytics can be used to identify potential side effects and to optimize the design of drug molecules. Finally, big data technologies can be used to analyze clinical trial data and to identify patterns that can be used to improve the success rate of clinical trials.

Traditionally, drug development has been rife with failure, particularly across clinical trials. Research shows that the overall probability of clinical success is estimated to be less than 12%, and less than 10% of drug candidates make it to market following Phase I trials. Oftentimes, Phase I trial failure can be attributed to the unexpected toxicity of drug candidates.

AI and ML-based analytics, coupled with deep learning, can study compound libraries and predict molecular behavior and interactions between compounds. AI can also identify patterns and insights in an increasingly accelerated time frame. Understandably, this streamlines the initial phases of drug discovery and proactively identifies adverse reactions.

In addition, AI, ML, and data analytics can help analyze vast amounts of data to predict toxicity at the very beginning of the process, filter out molecules that are potentially toxic, and validate new drugs more efficiently. They do so by computing various permutations and combinations of molecular data and identifying less toxic combinations.
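As a toy illustration of the kind of filtering described above, the sketch below trains a classifier on previously characterized compounds, described by a few invented molecular descriptors, and uses it to screen new candidates for predicted toxicity. The descriptor names, labels, threshold, and data are all hypothetical; real pipelines use far richer chemical features and rigorously validated models.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical molecular descriptors for previously characterized compounds:
# columns = [molecular_weight, logP, polar_surface_area]
X_known = np.array([
    [320.4, 2.1,  78.0],
    [455.9, 5.6,  32.5],
    [210.2, 1.4, 110.3],
    [389.7, 4.9,  45.1],
    [298.5, 3.2,  66.8],
    [512.0, 6.3,  21.7],
])
y_known = np.array([0, 1, 0, 1, 0, 1])  # 1 = toxic in prior assays (invented labels)

# Train a simple model on the characterized compounds.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_known, y_known)

# Screen new candidate molecules and keep those below a toxicity-risk threshold.
candidates = np.array([
    [305.1, 2.5, 72.4],
    [480.3, 5.9, 28.9],
])
risk = model.predict_proba(candidates)[:, 1]  # predicted probability of toxicity
keep = candidates[risk < 0.5]
print("Predicted toxicity risk:", risk)
print("Candidates passing the filter:", keep)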

With an AI-based approach, pharma companies can go from research to a working molecular lead faster. Beyond initial discovery, AI can be used for drug target identification and validation, drug repurposing, and more targeted drug design and discovery. Moreover, leveraging AI during preclinical development could help trials run smoothly and enable researchers to predict more quickly and successfully how a drug might behave in animal models.

Analytics and AI/ML technology can also be applied to analyze existing drugs, including their effects on the body and their side effects, to inform potential drug repurposing opportunities. Through sophisticated data modeling, companies can run existing drugs through ‘AI drug repurposing platforms’ to determine new medical applications. Notably, because existing drugs have already passed through regulatory approval processes, the approval of repurposed drugs is streamlined.

This is especially applicable to the global healthcare industry’s current challenges, as pharmaceutical companies work to bring cancer vaccines to market following the success of the COVID-19 vaccines. Various studies predict that AI has the potential to provide over $70 billion in savings for the drug discovery process by 2028.

Polluted data lakes — implications for AI in pharma

Data lakes are large repositories of data collected from diverse sources. While data lakes can be extremely useful for drug discovery and development, they can also become polluted if not properly maintained. In any data lake, the vastness of life sciences data makes it impossible for individuals to analyze and interpret these data using conventional methods.

What happens when a data lake becomes “polluted”? What happens when missing or improperly entered data skews AI models, or when bad actors or incorrect data ingestion introduce malicious information into the data?

It doesn’t matter if it’s a single string in a malicious data file or inconsistencies in how data are entered—it all adds up to data lake pollution.

Data corruption can completely spoil AI models and prediction results, costing pharma companies millions of dollars and making their AI and machine learning unreliable. Flawed data quality can ultimately derail the outcome of a clinical trial. As such, it is essential to have clean, curated data before any analytics and insight engines interpret it.

These data problems have prompted software engineers to create a new category of technology known as “data observability,” and they underscore the need for observability from the very earliest stages of creating a data lake.

Data quality monitoring and data observability must be able to detect quality defects as early in the data transformation process as possible. Schemas within the data supply chain can change over time, and if those changes are not caught in advance, they can lead to downstream production issues and subtle yet fundamentally flawed analytics.
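To make this concrete, below is a minimal sketch of a schema-drift check that could run on each incoming batch before it enters downstream transformations. It assumes the batch arrives as a pandas DataFrame; the column names, data types, and the check_schema helper are illustrative choices, not part of any specific observability product.

import pandas as pd

# Illustrative expected schema for an incoming trial-data feed (hypothetical fields).
EXPECTED_SCHEMA = {
    "subject_id": "object",
    "visit_date": "datetime64[ns]",
    "dose_mg": "float64",
    "adverse_event_flag": "bool",
}

def check_schema(batch: pd.DataFrame, expected: dict = EXPECTED_SCHEMA) -> list:
    """Return a list of human-readable schema issues (empty list = no drift)."""
    issues = []
    for col, dtype in expected.items():
        if col not in batch.columns:
            issues.append(f"missing column: {col}")
        elif str(batch[col].dtype) != dtype:
            issues.append(f"type drift in {col}: expected {dtype}, got {batch[col].dtype}")
    for col in batch.columns:
        if col not in expected:
            issues.append(f"unexpected column: {col}")
    return issues

if __name__ == "__main__":
    batch = pd.DataFrame({
        "subject_id": ["S001", "S002"],
        "visit_date": pd.to_datetime(["2023-01-05", "2023-01-12"]),
        "dose_mg": ["10", "20"],   # arrived as strings: type drift
        "site_id": ["A", "B"],     # unexpected new column
    })
    for problem in check_schema(batch):
        print("SCHEMA ALERT:", problem)

A check like this is deliberately simple, flagging only missing, unexpected, or re-typed columns; richer rules such as value ranges and referential checks are left to dedicated observability tooling.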

Data observability is a broad discipline with a range of possible applications. It is vitally important to monitor data for quality metrics before that data is extracted and loaded into the data lake; if unvetted data is combined with other data during the ETL process, upstream quality defects and malicious payloads can be introduced. The more complicated the ETL process, the more difficult it becomes to detect and enforce quality standards.

One should not wait until after ETL processes have been executed to monitor data quality. At that point, the data lake is already polluted.

Be vigilant about missing data. As mentioned earlier, large datasets can contain considerable amounts of missing or null fields. When data, specifically training data for AI, starts to contain an unexpected number of missing values, the performance and outcomes of the AI model will almost certainly suffer.

Establish threshold benchmarks that flag fields whose missing-value rates fall outside a specified minimum or maximum, as in the sketch below. Minimizing how much missing data unintentionally ends up in a data lake reduces the possibility of errors when training accurate AI models.
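The following is a minimal sketch of such a threshold check, run before a batch is loaded into the lake. The 5% ceiling, the column names, and the null_report helper are illustrative assumptions rather than recommended values.

import pandas as pd

MAX_NULL_FRACTION = 0.05  # illustrative: quarantine a batch if any field exceeds 5% missing

def null_report(batch: pd.DataFrame, max_null_fraction: float = MAX_NULL_FRACTION):
    """Return (passes, per-column null fractions) for an incoming batch."""
    null_fractions = batch.isna().mean()
    offenders = null_fractions[null_fractions > max_null_fraction]
    return offenders.empty, null_fractions

if __name__ == "__main__":
    batch = pd.DataFrame({
        "subject_id": ["S001", "S002", "S003", "S004"],
        "lab_value": [4.2, None, None, 5.1],   # 50% missing: over threshold
        "visit_date": ["2023-01-05", "2023-01-12", "2023-01-19", None],
    })
    ok, fractions = null_report(batch)
    print(fractions)
    if not ok:
        print("Batch quarantined: missing-value rate exceeds threshold")

In practice the thresholds would differ by field, since some optional fields tolerate far more missingness than required identifiers or AI training features.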

AI promises to unlock discoveries in vast amounts of data that individuals alone could never manage. But if data lakes are polluted, whether unintentionally or maliciously, the promise of AI may not be realized, and the process of discovery may take a step backward. Data observability is a key element in preventing data lake pollution, and the first step to any successful implementation of AI.

Causes of Data Lake Pollution
•  Data volume deficiencies
•  Potentially corrupted data
•  Duplicate data
•  Incomplete data
•  Incorrect data imputation
•  Data arriving late, after transformation and curation are completed
•  Null data


Joe Alea is Chief Technology Officer at Anju, where he leads product development and services. He has over 30 years of clinical, healthcare, and data analytics experience. Before joining Anju, Joe served in multiple executive roles, including Global VP of Development at Oracle/Phase Forward and Chief Technology Officer of Clinical Ink.
