Databases that combine genetic and medical data enable researchers to investigate diseases and analyse how genes and environment affect disease progression. Studies built on them have supported inferences on a variety of topics, from the connection between diet and disease to the relationship between household size and COVID severity, giving researchers, physicians, and patients useful information.
Biobanks, however, are only as valuable as the volume and calibre of the data they contain. According to Stanford PhD student Lu Yang, incomplete information is a frequent problem in patient datasets. Yang explains that even though “we might know the patient has been treated for type II diabetes, for example,” the term “type II diabetes” may not appear in their records if they have never received in-patient care at a hospital. For scientists conducting disease studies and looking for trends that could lead to new discoveries, this missing information is a major hurdle.
To address this problem, Yang worked with Sheng Wang, a recent Stanford postdoctoral scholar, and Russ Altman, an associate director of Stanford HAI and professor of bioengineering, genetics, medicine, biomedical data science, and, by courtesy, computer science. Together they developed a model that can predict a complete set of diagnosis codes, also known as phenotype codes, for all the patients in the UK Biobank, which stores data on half a million UK participants, including individuals with rare disorders. The model, POPDx, is a machine learning framework for disease recognition that, according to Yang, outputs the probability that a patient has particular diseases or phenotype codes.
POPDx outperforms existing models at recognising both common and rare diseases, including diseases not represented in the training set. Altman considers this an important finding. “While the majority of deep learning algorithms require extensive training, we were pleased that our method, which relied on prior information from text and taxonomy, enabled us to identify some diseases in our test set even though we had never encountered them in training,” he says. He adds that it is crucial to develop methods that work on sparse data and are effective enough to help patients with rare diseases because, although there is a lot of data in medicine, it is not at the scale available to large technology companies.
Real-World Patient Data
Before starting this study, Yang drew on Wang’s earlier work on cell classification, in which Wang used the Cell Ontology to predict a single correct cell type for each cell in the test set. Yang wanted POPDx to take a similar approach for diseases. “I thought it would be nice to use the Human Disease Ontology’s links between diseases in a similar way to address disease recognition,” Yang says. But whereas Wang’s research was a single-label classification problem in which only one cell type was predicted per cell, Yang needed multiple labels. “One patient may have several different ailments, therefore we approached the issue as a multi-label, multi-class classification problem,” Yang says.
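The difference between predicting one label per sample and several labels per patient can be illustrated with a minimal sketch. This is not the POPDx code: the label names, the raw scores, and the 0.5 threshold are all made-up values chosen for illustration. A single-label classifier typically normalises scores with a softmax and picks the top class, while a multi-label classifier scores each disease independently with a sigmoid.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 (classes compete)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Turn one raw score into an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_single_label(logits, labels):
    """Single-label setting: exactly one class is returned per sample."""
    probs = softmax(logits)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best]

def predict_multi_label(logits, labels, threshold=0.5):
    """Multi-label setting: every label whose own probability clears the
    threshold is returned, so a patient can carry several diagnoses."""
    return [lab for z, lab in zip(logits, labels) if sigmoid(z) >= threshold]

# Hypothetical labels and raw model scores for one patient.
labels = ["type 2 diabetes", "hypertension", "asthma"]
logits = [2.0, 1.5, -3.0]

single = predict_single_label(logits, labels)
multi = predict_multi_label(logits, labels)
print(single)
print(multi)
```

With these illustrative scores, the single-label rule reports only the top-scoring disease, while the multi-label rule keeps both diseases whose individual probabilities exceed the threshold.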
The range of information that Yang used is another significant difference. The POPDx model examines a wide range of patient data, including electronic health record (EHR) data, demographic data, and patient surveys. It can even extract information from physical measurements and laboratory tests. According to her, “Before this, the majority of the current models required well-curated datasets, which meant they might not be able to look into the multitude of variables that we are able to look into with our work.” The breadth of Yang’s work was matched by the wide array of disease codes the model could predict. “Typically, a study will focus on a certain field, such as heart disease, so they’ll just look at the pertinent data or codes. In our analysis, however, we attempted to create an exhaustive profile of all UK Biobank participants,” she says.
Disease Prediction Despite Limited Data
The POPDx model looks for associations between patient data and disease knowledge, making probabilistic predictions informed by natural language processing and the Human Disease Ontology. Diseases for which there is little or no data pose the greatest challenge. “As is common knowledge, the majority of ML models require sizable datasets, but several of these diseases lack data,” explains Yang.
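One way to see how prior knowledge about diseases can help with labels that have no training patients is the toy sketch below. It is an illustration under stated assumptions, not the POPDx architecture: the three-dimensional "text embeddings", the small parent map standing in for an ontology, the averaging rule, and the patient vector are all hypothetical. The point is only that a disease with no training examples can still be scored if it has a representation derived from its description and its place in a hierarchy.

```python
import math

# Hypothetical toy vectors standing in for text-derived disease embeddings.
text_embedding = {
    "diabetes mellitus": [1.0, 0.0, 0.0],
    "type 2 diabetes": [0.9, 0.1, 0.0],
    "neonatal diabetes": [0.8, 0.0, 0.2],  # assume: no training patients
    "asthma": [0.0, 1.0, 0.0],
}

# Hypothetical fragment of a disease hierarchy (child -> parent).
parent = {
    "type 2 diabetes": "diabetes mellitus",
    "neonatal diabetes": "diabetes mellitus",
}

def label_embedding(disease):
    """Blend a disease's own vector with its parent's, so related diseases
    end up close together even when one of them is rare or unseen."""
    own = text_embedding[disease]
    par = parent.get(disease)
    if par is None:
        return own
    return [(a + b) / 2 for a, b in zip(own, text_embedding[par])]

def score(patient_vec, disease):
    """Probability-like score: sigmoid of the dot product between the
    patient representation and the disease representation."""
    emb = label_embedding(disease)
    dot = sum(p * e for p, e in zip(patient_vec, emb))
    return 1.0 / (1.0 + math.exp(-dot))

# Hypothetical patient vector, assumed to live in the same space
# as the disease embeddings.
patient_vec = [2.0, -1.0, 0.5]

scores = {d: score(patient_vec, d) for d in text_embedding}
for disease, s in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{disease}: {s:.3f}")
```

In this sketch the unseen disease receives a score close to that of its sibling because both inherit part of their representation from the same parent, which is the intuition behind using hierarchy information for sparse labels.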
POPDx nevertheless performs well with little to no data. For unseen and rare diseases, Yang increased the AUPRC (area under the precision-recall curve, a performance metric well suited to rare outcomes) by 218% and 151%, respectively. In other words, if a clinical team needs to find individuals with a low-prevalence condition, “our methodology on average will boost the likelihood of finding these positive instances,” according to Yang. Previously, they would have had to comb through a large number of Biobank patients; now they can screen a much smaller number to detect potential cases. POPDx’s capacity to identify rare diseases gives doctors and researchers who study those disorders a better place to start.
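To make the metric concrete, the sketch below computes average precision, one common estimator of AUPRC, and the relative increase between two models. The article does not say which estimator the study used, and the labels and scores here are invented, so the resulting percentage is not the study's figure.

```python
def average_precision(y_true, y_score):
    """Average precision: mean of the precision values measured at the rank
    of each true positive, with patients sorted by descending score."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positives")
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` patients
    return total / n_pos

def relative_increase(new, old):
    """Percentage increase of `new` over `old`."""
    return (new - old) / old * 100.0

# Invented example: 10 patients, 2 of whom truly have a rare condition.
y_true = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
baseline_scores = [0.9, 0.8, 0.6, 0.7, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
improved_scores = [0.8, 0.1, 0.9, 0.3, 0.05, 0.15, 0.25, 0.7, 0.2, 0.12]

ap_baseline = average_precision(y_true, baseline_scores)
ap_improved = average_precision(y_true, improved_scores)
gain = relative_increase(ap_improved, ap_baseline)
print(f"baseline AP = {ap_baseline:.3f}")
print(f"improved AP = {ap_improved:.3f}")
print(f"relative increase = {gain:.1f}%")
```

In this toy case the baseline places the two true cases at ranks 4 and 8, while the improved model places them at ranks 1 and 3, so a team reviewing patients from the top of the list would reach both cases after three reviews instead of eight.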
Yang noted that one difficulty was demographic bias in the UK Biobank, whose participants are 56% female, primarily white, and have an average age of 71. The biobank’s lack of diversity, however, has more to do with healthcare access in general than with the data itself. The issue, according to Yang, is that we do not have someone’s data if they do not have access to healthcare. The researchers addressed this by incorporating prior knowledge of the relationships and hierarchies between diseases, which improved the model’s performance on diseases it had not seen before. Yang thinks that this tactic may also have reduced bias and added some randomization to the model. Yang hopes that future infrastructure will make it easier to integrate data across many biobanks, allowing for more varied datasets.
The Future of Disease Prediction
Yang is interested in a time-series analysis of the patient data, which would consider both the likelihood that a patient develops a disease and when that might happen. Another option is to incorporate both genotype and phenotype data into the model, which would give researchers an even more thorough understanding of diseases. Whatever comes next, Yang is dedicated to creating inclusive models that work for everyone. Access to data is essential for everyone, Yang argues, including patients and researchers.