Expert’s Opinion

How Polluted Data Lakes Threaten AI in Life Sciences

The problem of reproducibility, potential data problems for AI, and simple starting points for data observability.

By: David Hirko

Founder and principal of Zectonal

For at least five years, the life sciences industry has been exuberant about the potential for faster advances through the application of artificial intelligence (AI) to so-called “data lakes.” More than simply data warehouses, data lakes contain data from a wide range of sources – from research to trials to post-launch information from both doctors and patients.
 
Of course, the sheer volume of such data makes it impossible for individuals to analyze and interpret it using conventional methods. This has fueled an explosion in AI technology, and it is the application of AI that has caused excitement about the potential of data lakes in life sciences to skyrocket.
 
But what happens when a data lake becomes “polluted”? What happens when missing or improperly entered data skews AI models, or when bad actors introduce malicious information into data? 
 
It doesn’t matter whether it’s a single string in a malicious data file or just innocent inconsistencies in how data are entered – it all adds up to data lake pollution. Pollution can completely skew how AI models the information, costing companies millions of dollars and making their AI and machine learning unreliable.
 
The problem of reproducibility
To put a finer point on the problem of polluted data lakes, let’s consider the example of clinical trials. Trials of emerging treatments in the pharmaceutical industry can be greatly accelerated by using larger datasets, and those datasets can be gathered in data lakes across many different research organizations. According to at least one software manufacturer, this volume of data, interpreted through AI modeling, could potentially cut development costs and time to market by as much as 30%.
 
Imagine, however, how seriously clinical trials could be derailed by small but significant data quality problems compounded over time. Reliability and reproducibility are essential to drawing confident conclusions about data, and that has already emerged as a problem in life sciences because of incomplete data. 

Late in 2021, studies from cancer biologists at the Center for Open Science determined that 59% of experiments across 23 studies could not be replicated, due in large part to missing or unavailable data. Extend that problem across multiple sources of data in a data lake, and it becomes clear that the industry may have to place less confidence in AI modeling of data than the technology had initially promised. 

This has prompted the software and IT industry to create a new category of technology known as “data observability.” Because this technology has emerged so quickly, however, most players in the field take a somewhat backwards-looking approach to the problem.
 
Technology can often reveal problems with data lake quality once the data has been compiled, but this is akin to closing the barn door after the horse has already escaped. Fewer companies are able to gain deep insight into data as it is being introduced into the data lake, yet that is precisely where conclusions need to be drawn – before AI modeling can be called into question.

Potential data problems for AI
There are several ways in which a data lake can become polluted. These underscore the need for data observability from the very earliest stages of creating a data lake:

Data volume deficiencies. Data pipelines and batch Extract-Transform-Load (ETL) processes are typically consistent in the number of files they produce on an hourly, daily, or even monthly basis. Monitoring file counts over these windows is a reasonably simple way to determine the health of your data flows. 

Knowing how many files are missing over a measured time period makes it easier to determine whether the full volume of expected data has been received. If fewer files arrive than expected, that indicates a problem upstream in the data supply chain. On the other hand, receiving considerably more data than expected could mean that data has been duplicated.
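To make the idea concrete, here is a minimal sketch of such a volume check in Python, assuming files land in a local directory. The directory path, file pattern, expected count, and tolerance are hypothetical placeholders to be replaced with values from your own pipeline.

```python
from pathlib import Path

# Hypothetical expectations for a daily batch window -- adjust to your pipeline.
LANDING_DIR = Path("/data/landing/2024-05-01")   # assumed landing directory
EXPECTED_FILES = 24                              # e.g., one file expected per hour
TOLERANCE = 2                                    # acceptable deviation from expected

def check_file_volume(landing_dir: Path, expected: int, tolerance: int) -> str:
    """Compare the number of files received against the expected count."""
    received = sum(1 for p in landing_dir.glob("*.csv") if p.is_file())
    if received < expected - tolerance:
        return f"ALERT: only {received}/{expected} files received; possible upstream gap"
    if received > expected + tolerance:
        return f"ALERT: {received} files received, expected ~{expected}; possible duplication"
    return f"OK: {received} files received"

if __name__ == "__main__":
    print(check_file_volume(LANDING_DIR, EXPECTED_FILES, TOLERANCE))
```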

Potentially corrupted data. Structured data schemas should be monitored for compliance. Receiving data whose schema is not as expected (such as extra columns due to formatting errors) can lead to significant operational problems.

Data quality monitoring and data observability must be able to detect quality defects as early in the data transformation process as possible. Schemas within the data supply chain can change over time, and if those changes are not caught in advance, they can lead to downstream production issues and subtly but fundamentally flawed analytics. 
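A simple header comparison illustrates the idea. The Python sketch below assumes incoming files are CSVs and that the expected column names are known in advance; the column names shown are hypothetical.

```python
import csv

# Hypothetical expected schema for an incoming results file.
EXPECTED_COLUMNS = ["subject_id", "site", "visit_date", "measurement", "unit"]

def check_schema(csv_path: str, expected: list[str]) -> list[str]:
    """Return a list of schema problems found in the file header (empty list = compliant)."""
    with open(csv_path, newline="") as f:
        header = next(csv.reader(f), [])
    problems = []
    missing = [c for c in expected if c not in header]
    extra = [c for c in header if c not in expected]
    if missing:
        problems.append(f"missing columns: {missing}")
    if extra:
        problems.append(f"unexpected columns: {extra}")
    return problems
```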

Incomplete data. Data can often be densely packed, with millions of rows and hundreds of columns – in other words, potentially billions of “data points” per file.  

Such densely packed data can contain empty or null values, which can wreak havoc on machine learning models. Hundreds of such sparsely populated datasets, each with billions of data points, will inevitably alter the models trained on them, which in turn can drastically affect analytical outcomes and predictions.

It is vital to be able to detect how many null or empty values can be expected in dense files. That requires establishing a known baseline of such problem values against which to monitor. If null values exceed a set percentage of total data content, that should be a red flag that a significant problem is occurring.
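As an illustration, a null-value check might look like the Python sketch below. The set of tokens treated as “empty” and the 5% baseline are assumptions to be replaced with values appropriate to your own data.

```python
import csv

NULL_TOKENS = {"", "na", "n/a", "null", "none"}   # assumed representations of "empty"
MAX_NULL_FRACTION = 0.05                          # hypothetical baseline: flag above 5%

def null_fraction(csv_path: str) -> float:
    """Stream a CSV and compute the fraction of cells that are null/empty."""
    total = nulls = 0
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip header row
        for row in reader:
            for cell in row:
                total += 1
                if cell.strip().lower() in NULL_TOKENS:
                    nulls += 1
    return nulls / total if total else 0.0

# Usage: flag the file if it exceeds the agreed baseline.
# if null_fraction("measurements.csv") > MAX_NULL_FRACTION:
#     print("red flag: null values exceed the expected baseline")
```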

Duplicative data. By some estimates, companies without data quality initiatives in place can suffer from data duplication rates of 10% to as much as 30%.

Storing duplicate data can be expensive, and can lead to biased outcomes when used for analysis or machine learning training initiatives. Knowing if your data is duplicative can cut operational costs dramatically.
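One rough way to measure this is to hash incoming files and count byte-identical copies, as in the Python sketch below. This is a simplified check; real deduplication would also look for record-level duplicates.

```python
import hashlib
from pathlib import Path

def duplicate_rate(landing_dir: Path, pattern: str = "*.csv") -> float:
    """Estimate the share of byte-identical files in a landing directory."""
    seen, duplicates, total = set(), 0, 0
    for path in landing_dir.glob(pattern):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        total += 1
        if digest in seen:
            duplicates += 1
        else:
            seen.add(digest)
    return duplicates / total if total else 0.0
```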

Late data. While more common in fields other than life sciences, it is important to understand whether data is being received later than expected. Late data can become a real problem if organizations have already conducted transformations, aggregations, and analysis in its absence. 
 
For ongoing training of machine learning models, late data could force retraining, which is an obviously time-consuming process that should be avoided if at all possible.
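A basic lateness check can compare file arrival times against an agreed cutoff, as in the Python sketch below; the cutoff and grace period shown are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path

# Hypothetical rule: a daily batch is "late" if it arrives more than
# two hours after the expected cutoff.
EXPECTED_CUTOFF = datetime(2024, 5, 1, 6, 0, tzinfo=timezone.utc)
GRACE_PERIOD = timedelta(hours=2)

def late_files(landing_dir: Path) -> list[str]:
    """List files whose arrival time (mtime) falls outside the expected window."""
    late = []
    for path in landing_dir.glob("*.csv"):
        arrived = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
        if arrived > EXPECTED_CUTOFF + GRACE_PERIOD:
            late.append(f"{path.name} arrived {arrived.isoformat()}")
    return late
```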

Simple starting points for data observability
So, what is to be done to limit the chances of data lake pollution? How can data observability help?
 
The technology underlying data observability is vast and has a range of possible applications. To simplify matters, two things are of utmost importance:

Start before the extraction process. As described earlier, the most fundamental process data undergoes before being ingested into a data lake is ETL. That process becomes more complex when it is used to combine multiple datasets into one.

It is important to monitor data for quality metrics (or embedded threats) before that data is extracted and loaded into the data lake. If polluted data is combined with other data during the ETL process, upstream quality defects and malicious payloads can be introduced into the lake. The more complicated the ETL process, the more difficult it becomes to detect and enforce quality standards.

Don’t wait to monitor data quality until after ETL processes have been executed; at that point, the data lake is already polluted. 
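One way to enforce this is to gate ingestion on a battery of checks that run against raw files before any extraction or loading, as in the Python sketch below. The check functions are stand-ins for whichever quality tests (volume, schema, null values) an organization adopts; each is assumed to return a list of problem descriptions for a given file.

```python
from pathlib import Path
from typing import Callable, Iterable

def gate_ingestion(csv_paths: Iterable[Path],
                   checks: Iterable[Callable[[Path], list[str]]]) -> bool:
    """Run every observability check against every file *before* extraction.

    Returns False (block the load) if any check reports a problem.
    """
    clean = True
    for path in csv_paths:
        for check in checks:
            for problem in check(path):
                print(f"{path.name}: {problem}")
                clean = False
    return clean
```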

Be vigilant about missing data. As mentioned earlier, large datasets can have considerable amounts of missing or empty fields. When data, specifically training data for AI, starts to contain an unexpected number of missing values, this will almost certainly affect the performance and outcomes of the AI model.   

Establish threshold benchmarks that flag missing values falling outside a specific minimum or maximum. Minimizing how much missing data unintentionally ends up in a data lake reduces the possibility of errors when training AI models. 
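One way to derive such a benchmark is from the history of previous batches, for example a band of the mean plus or minus a few standard deviations, as in the illustrative Python sketch below. The historical numbers are invented for illustration only.

```python
import statistics

def missing_value_band(historical_fractions: list[float], k: float = 3.0) -> tuple[float, float]:
    """Derive a min/max benchmark for missing-value rates from historical batches
    (mean +/- k standard deviations, floored at zero)."""
    mean = statistics.mean(historical_fractions)
    stdev = statistics.pstdev(historical_fractions)
    return max(0.0, mean - k * stdev), mean + k * stdev

# Usage (illustrative numbers): flag a new batch that falls outside the learned band.
history = [0.010, 0.012, 0.011, 0.013, 0.009]
low, high = missing_value_band(history)
new_batch_fraction = 0.048
if not (low <= new_batch_fraction <= high):
    print(f"missing-value rate {new_batch_fraction:.3f} outside benchmark [{low:.3f}, {high:.3f}]")
```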

AI holds the promise of unlocking discoveries in vast amounts of data that individuals alone could never manage. But if data lakes are polluted, whether unintentionally or maliciously, that promise may not be realized, and the process of discovery may actually take a step backwards. Data observability is a key element in preventing data lake pollution, and the first step in any successful implementation of AI.
 
Dave Hirko is founder and principal of Zectonal. He can be reached at dave@zectonal.com
