The problem of reproducibility, potential data problems for AI, and simple starting points for data observability.
July 1, 2022
By: David Hirko
Founder and principal of Zectonal
For at least five years, the life sciences industry has been exuberant about the potential for faster advances through the application of artificial intelligence (AI) to so-called “data lakes.” More than simply data warehouses, data lakes contain data from a wide range of sources: research, trials, and post-launch information from both doctors and patients. The vastness of such data makes it impossible for individuals to analyze and interpret it using conventional methods. This has caused an explosion in AI technology, and it is in the use of AI that excitement about the potential of data lakes in life sciences has really skyrocketed.
But what happens when a data lake becomes “polluted”? What happens when missing or improperly entered data skews AI models, or when bad actors introduce malicious information into data? It doesn’t matter whether it’s a single string in a malicious data file or just innocent inconsistencies in how data are entered; it all adds up to data lake pollution. Pollution can completely throw off how AI models information, costing companies millions of dollars and making their AI and machine learning unreliable.
The problem of reproducibility
To put a finer point on the problem of polluted data lakes, consider the example of clinical trials. Trials of emerging treatments in the pharmaceutical industry can be greatly accelerated by using larger datasets, and those datasets can be gathered in data lakes across many different research organizations. According to at least one software manufacturer, this volume of data, interpreted through AI modeling, could potentially cut development costs and time to market by as much as 30%. Imagine, however, how seriously clinical trials could be derailed by small but significant data quality problems compounded over time.
Reliability and reproducibility are essential to drawing confident conclusions from data, and incomplete data has already made this a problem in life sciences. Late in 2021, studies from cancer biologists at the Center for Open Science determined that 59% of experiments across 23 studies could not be replicated, due in large part to missing or unavailable data. Extend that problem across multiple sources of data in a data lake, and it becomes clear that the industry may have to place less confidence in AI modeling of data than the technology had initially promised.
This has prompted the software and IT industry to create a new category of technology known as “data observability.” Because this technology has emerged so quickly, however, most players in the field take a somewhat backwards-looking approach to the problem. Technology can often reveal problems with data lake quality once the data has been compiled, but this is akin to closing the barn door after the horse has already escaped. Fewer companies are able to gain deep insight into data as it is being introduced into the data lake, but that is precisely where it is important to draw conclusions, before AI modeling can be called into question.
Potential data problems for AI
There are several ways in which a data lake can become polluted. These underscore the need for data observability from the very earliest stages of creating a data lake:
Data volume deficiencies. Data pipelines and batch Extract-Transform-Load (ETL) processes typically produce a consistent number of files on an hourly, daily, or even monthly basis. Monitoring the file counts over these windows is a reasonably simple way to determine the health of your flows. Knowing whether files are missing over a measured time period makes it easier to tell whether the entire volume of expected data has been received.
If fewer files are received than had been expected, that indicates a problem upstream in the data supply chain. On the other hand, getting considerably more data than expected could mean that data has been duplicated.
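The file-count check described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's or any vendor's method: the function names, the hourly window, the expected count of 10 files, and the 20% tolerance are all assumptions chosen for the example.

```python
from collections import Counter
from datetime import datetime

def count_files_per_window(arrival_times, window_format="%Y-%m-%d %H"):
    """Count file arrivals per window (hourly by default; an assumption)."""
    return Counter(t.strftime(window_format) for t in arrival_times)

def check_volume(counts, windows, expected, tolerance=0.2):
    """Label each window 'low', 'high', or 'ok' relative to the expected count.

    'low' suggests a problem upstream in the data supply chain;
    'high' suggests that data may have been duplicated.
    The tolerance band is a hypothetical threshold, not a standard.
    """
    low = expected * (1 - tolerance)
    high = expected * (1 + tolerance)
    report = {}
    for window in windows:
        n = counts.get(window, 0)  # a window with no files counts as zero
        if n < low:
            report[window] = "low"
        elif n > high:
            report[window] = "high"
        else:
            report[window] = "ok"
    return report

# Made-up arrival timestamps: 10 files in hour 00, 6 in hour 01,
# none in hour 02, and 20 in hour 03.
arrivals = (
    [datetime(2022, 7, 1, 0, m) for m in range(10)]
    + [datetime(2022, 7, 1, 1, m) for m in range(6)]
    + [datetime(2022, 7, 1, 3, m) for m in range(20)]
)
windows = ["2022-07-01 00", "2022-07-01 01", "2022-07-01 02", "2022-07-01 03"]
report = check_volume(count_files_per_window(arrivals), windows, expected=10)
for window, status in report.items():
    print(window, status)
```

Passing the list of windows explicitly matters here: an hour in which no files arrived at all would otherwise never appear in the counts, and the most serious gap would go unreported.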