Machine learning is taking medical diagnosis by storm. From eye disease to breast and other cancers to more amorphous neurological disorders, AI is routinely matching physician performance, if not beating it outright.
Yet how much can we take those results at face value? When it comes to life and death decisions, when can we put our full trust in enigmatic algorithms—“black boxes” that even their creators cannot fully explain or understand? The problem gets more complex as medical AI crosses multiple disciplines and developers, including both academic and industry powerhouses such as Google, Amazon, or Apple, with disparate incentives.
This week, the two sides battled it out in a heated duel in one of the most prestigious science journals, Nature. On one side are prominent AI researchers at the Princess Margaret Cancer Centre, University of Toronto, Stanford University, Johns Hopkins, Harvard, MIT, and others. On the other side is the titan Google Health.
The trigger was an explosive study by Google Health for breast cancer screening, published in January this year. The study claimed to have developed an AI system that vastly outperformed radiologists for diagnosing breast cancer, and can be generalized to populations beyond those used for training—a holy grail of sorts that’s incredibly difficult due to the lack of large medical imaging datasets. The study made waves across the media landscape, and created a buzz in the public sphere for medical AI’s “coming of age.”
The problem, the academics argued, is that the study lacked sufficient descriptions of the code and model for others to replicate. In other words, we can only take the study at its word, something that’s just not done in scientific research. Google Health, in turn, penned a polite, nuanced but assertive rebuttal arguing for their need to protect patient information and shield the AI from malicious attacks.
Academic discourse like this forms the bedrock of science, and may seem incredibly nerdy and outdated, especially because rather than online channels, the two sides resorted to a centuries-old pen-and-paper discussion. By doing so, however, they elevated a necessary debate to a broad worldwide audience, each side landing solid punches that, in turn, could lay the basis of a framework for trust and transparency in medical AI, to the benefit of all. Now if they could only rap their arguments in the vein of Hamilton and Jefferson’s Cabinet Battles in Hamilton.
Academics, You Have the Floor
It’s easy to see where the academics’ arguments come from. Science is often painted as a holy endeavor embodying objectivity and truth. But like any discipline touched by people, it’s prone to errors, poor designs, unintentional biases or, in very small numbers, conscious manipulation to skew the results. Because of this, when publishing results, scientists carefully describe their methodology so others can replicate the findings. If a conclusion, say that a vaccine protects against Covid-19, holds up in nearly every lab regardless of the scientist, the material, or the subjects, then we have stronger proof that the vaccine actually works. If not, it means that the initial study may be wrong, and scientists can then delineate why and move on. Replication is critical to healthy scientific evolution.
But AI research is shredding the dogma.
“In computational research, it’s not yet a widespread criterion for the details of an AI study to be fully accessible. This is detrimental to our progress,” said author Dr. Benjamin Haibe-Kains at Princess Margaret Cancer Centre. For example, nuances in computer code or training samples and parameters could dramatically change training and evaluation of results—aspects that can’t be easily described using text alone, as is the norm. The consequence, said the team, is that it makes trying to verify the complex computational pipeline “not possible.” (For academics, that’s the equivalent of gloves off.)
Although the academics took Google Health’s breast cancer study as an example, they acknowledged the problem is far more widespread. By examining the shortfalls of the Google Health study in terms of transparency, the team said, “we provide potential solutions with implications for the broader field.” It’s not an impossible problem. Online repositories such as GitHub, Bitbucket, and others already allow the sharing of code. Other platforms, such as ModelHub.ai, allow sharing of deep learning models, with support for frameworks such as TensorFlow, which was used by the Google Health team.
The ins and outs of AI models aside, there’s also the question of sharing the data that those models were trained on. It’s a particularly thorny problem for medical AI, because many of those datasets are under license and sharing can raise privacy concerns. Yet it’s not unheard of. For example, genomics has leveraged patient datasets for decades (essentially each person’s genetic “base code”), and extensive guidelines exist to protect patient privacy. If you’ve ever used a 23andMe ancestry spit kit and provided consent for your data to be used for large genomic studies, you’ve benefited from those guidelines. Setting up something similar for medical AI isn’t impossible.
In the end, a higher bar for transparency for medical AI will benefit the entire field, including doctors and patients. “In addition to improving accessibility and transparency, such resources can considerably accelerate model development, validation and transition into production and clinical implementation,” the authors wrote.
Google Health, Your Response
Led by Dr. Scott McKinney, Google Health did not mince words. Their general argument: “No doubt the commenters are motivated by protecting future patients as much as scientific principle. We share that sentiment.” But under current regulatory frameworks, the team said, their hands are tied when it comes to open sharing.
For example, when it comes to releasing a version of their model for others to test on different sets of medical images, the team said they simply can’t because their AI system may be classified as “medical device software,” which is subject to oversight. Unrestricted release may lead to liability issues that place patients, providers, and developers at risk.
As for sharing datasets, Google Health argued that the largest data source they used is available online to anyone who applies for access (with just a hint of sass that their organization helped to fund the resource). Other datasets, due to restrictions set by ethical boards, simply cannot be shared.
Finally, the team argued that sharing a model’s “learned parameters” (that is, the bread and butter of how the model is constructed) can inadvertently expose the training dataset and model to malicious attack or misuse. It’s certainly a concern: you may have previously heard of GPT-3, the OpenAI algorithm that writes unnervingly like a human, enough to fool Redditors for a week. But it would take a really sick individual to bastardize a breast cancer detection tool for some twisted gratification.
The Room Where It Happens
The academic-Google Health debate is just a small corner of a worldwide reckoning for medical AI. In September 2020, an international consortium of medical experts introduced a set of official standards for clinical trials that deploy AI in medicine, with the goal of separating AI snake oil from trustworthy algorithms. One point may sound familiar: how reliably a medical AI functions in the real world, away from favorable training sets or conditions in the lab. The guidelines are some of the first for medical AI, but they won’t be the last.
If this all seems abstract and high up in the ivory tower, think of it another way: you’re now witnessing the room where it happens. By publishing negotiations and discourse publicly, AI developers are inviting additional stakeholders to join in on the conversation. Like self-driving cars, medical AI seems like an inevitability. The question is how to judge and deploy it in a safe, equal manner—while inviting a hefty dose of public trust.