Nature Scientific Reports publishes novel AI technique that can automatically detect poor-quality medical data and improve healthcare AI and data privacy

A recent paper in the journal Nature Scientific Reports shows that Artificial Intelligence (AI) can automatically detect errors in poor-quality medical data, even errors that are difficult for medical experts to manually identify.

The innovative algorithm, called UDC (Untrainable Data Cleansing), was developed by AI healthcare company Presagen. It was applied to two medical problems, in radiology and IVF, and to two non-medical problems: detecting mislabeled cats and dogs, and mislabeled vehicle types. In all cases the algorithm effectively detected which individual data items had errors or quality issues. When the poor-quality data were removed, the ‘clean’ data allowed the development of AI with greater accuracy and scalability, both of which are critical for commercial AI applications.

The algorithm was one of the core technologies used to develop Presagen’s first healthcare product, Life Whisperer, which uses AI to assess images of embryos in IVF to help improve pregnancy outcomes for couples. In a published international clinical study, Life Whisperer was shown to perform 25% better than current manual embryo assessment methods. Life Whisperer is currently being used by IVF clinics globally.

Dr Michelle Perugini, Presagen Co-Founder and CEO, said, “Real world problems like healthcare are not Kaggle competitions. Medical data are inherently poor quality due to clinical subjectivity, uncertainty, and even adversarial attacks where data contributors intentionally contribute poor-quality data. It is not always possible to reliably detect errors in data, even by experts. We have seen that even 1% poor-quality data can impact AI performance. This ground-breaking technique can automatically detect poor-quality data and allows us to build robust commercial AI products that can be used reliably.”

The UDC algorithm demonstrated a range of additional benefits.

Dr Jonathan Hall, Presagen’s Co-Founder and Chief Scientist, said, “A major benefit of the algorithm is data privacy. The algorithm can automatically detect errors without the need for manual visual verification of private patient data. The UDC was also shown to detect errors in benchmark or ‘test’ data used to validate the performance of AI. Benchmark datasets can contain dormant errors, and thus testing AI on these datasets can mislead users to actual performance of the AI.”

When applied to x-ray images to detect pneumonia, the UDC found several images to be neither correctly labeled nor in error, but generally of poor quality or lacking suitable features for diagnosis. An independent radiologist who reviewed these images agreed that they were indeed difficult to diagnose.

Removing these poor-quality (difficult) images identified by UDC improved AI accuracy for diagnosing pneumonia in x-ray images by over 10%, and the AI was shown to be more scalable (generalizable). The accuracy also exceeded benchmarks set by the current literature for that public dataset.

Results suggest that the poor-quality x-ray images identified by the UDC are uninformative, counter-productive, or confusing when used to train AI. The ability to identify when a new image is poor quality is important not only to prevent an inaccurate AI clinical assessment, but also to alert the radiologist when a scan is likely to be difficult to diagnose or when a new scan should be taken.
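This announcement does not describe how UDC works internally. As a loose illustration of the general idea of flagging data that models cannot learn and then removing it before training, the Python sketch below uses a swapped-in generic technique: repeated k-fold cross-validation with a toy nearest-centroid classifier. It is not Presagen's algorithm. The function names, parameters (`k`, `repeats`, `threshold`), and the synthetic data are all hypothetical and chosen only for illustration.

```python
import random


def fit_centroids(features, labels):
    """Compute one mean vector (centroid) per class label."""
    grouped = {}
    for x, y in zip(features, labels):
        grouped.setdefault(y, []).append(x)
    return {
        y: [sum(col) / len(rows) for col in zip(*rows)]
        for y, rows in grouped.items()
    }


def predict(centroids, x):
    """Assign x to the class whose centroid is closest (squared distance)."""
    return min(
        centroids,
        key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
    )


def flag_suspect_samples(features, labels, k=5, repeats=10, threshold=1.0, seed=0):
    """Return indices whose held-out prediction disagrees with the given
    label in at least `threshold` of the repeated k-fold runs.
    Hypothetical helper: a generic noise filter, not the published UDC."""
    rng = random.Random(seed)
    n = len(features)
    misses = [0] * n
    for _ in range(repeats):
        order = list(range(n))
        rng.shuffle(order)
        folds = [order[i::k] for i in range(k)]  # each sample held out once
        for held_out in folds:
            held = set(held_out)
            train = [i for i in order if i not in held]
            centroids = fit_centroids(
                [features[i] for i in train], [labels[i] for i in train]
            )
            for i in held_out:
                if predict(centroids, features[i]) != labels[i]:
                    misses[i] += 1
    return [i for i in range(n) if misses[i] / repeats >= threshold]


# Synthetic data (assumption): two well-separated 2-D clusters of 20 points.
grid = [(0.1 * (i % 5), 0.1 * (i // 5)) for i in range(20)]
features = grid + [(5 + x, 5 + y) for x, y in grid]
labels = [0] * 20 + [1] * 20

# Deliberately corrupt two labels to simulate poor-quality data.
for i in (3, 27):
    labels[i] = 1 - labels[i]

suspects = flag_suspect_samples(features, labels)
print(suspects)  # [3, 27]

# Drop the flagged samples before training a final model.
keep = [i for i in range(len(labels)) if i not in set(suspects)]
clean_features = [features[i] for i in keep]
clean_labels = [labels[i] for i in keep]
```

Requiring a sample to be misclassified in every repeat (`threshold=1.0`) is a conservative choice in this sketch: it flags only data the models consistently fail to fit, at the cost of missing borderline cases.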



Paper Title

Automated Detection of Poor-Quality Data: Case Studies in Healthcare https://www.nature.com/articles/s41598-021-97341-0

Authors

M.A. Dakka, T.V. Nguyen, J.M.M. Hall, S.M. Diakiw, M. VerMilyea, R. Linke, M. Perugini, and D. Perugini

Paper Abstract

The detection and removal of poor-quality data in a training set is crucial to achieve high-performing AI models. In healthcare, data can be inherently poor-quality due to uncertainty or subjectivity, but as is often the case, the requirement for data privacy restricts AI practitioners from accessing raw training data, meaning manual visual verification of private patient data is not possible. Here we describe a novel method for automated identification of poor-quality data, called Untrainable Data Cleansing. This method is shown to have numerous benefits including protection of private patient data; improvement in AI generalizability; reduction in time, cost, and data needed for training; all while offering a truer reporting of AI performance itself. Additionally, results show that Untrainable Data Cleansing could be useful as a triage tool to identify difficult clinical cases that may warrant in-depth evaluation or additional testing to support a diagnosis.

About Presagen

Presagen is an AI healthcare company that is changing the way clinics, patients, and medical data from around the world are connected through AI. Its platform, The Social Network for Healthcare, connects clinics and patients globally, and enables collaboration and data sharing to create scalable AI healthcare products that are affordable and accessible for all. The decentralized network democratizes the creation of AI products, promotes collaboration through incentives, and protects data privacy and ownership. With a focus on improving Women’s Health outcomes globally, Presagen’s first product, Life Whisperer, is being used by IVF clinics globally to improve pregnancy outcomes for couples struggling with fertility. With a vision of creating the largest network of clinics, patients, and medical data from around the world, Presagen is driving the future of AI Enhanced Healthcare.

Guest User, September 10, 2021