Cancer is one of the leading causes of death worldwide. For the tens of millions diagnosed annually with cancer, early detection could mean the difference between life and death. Currently, a patient’s risk of cancer is assessed using simplistic models that do not properly account for the many factors associated with cancer development. There are also no existing models that predict the time before a patient is first diagnosed with cancer; instead, many existing models focus on the possibility of cancer recurring after the first diagnosis. A study by undergraduate researcher Kien Lau (YC ’27) at the Yale Smart Medicine Lab aims to revolutionize the way early cancer is detected.
Although Lau began his first year at Yale with little foundation in computational science, through online videos, he developed not only an understanding of methods but also an interest in cancer-predicting models. Lau began his project by examining three possible models: the statistics-based Cox proportional hazards model, the machine learning-based survival decision tree model, and the machine learning-based random survival forest model. While statistical models provide an interpretable analysis of the data given, machine learning models automate the interpretation process and directly convert the data into a decision.
At first, Lau believed the two advanced machine learning models would be most promising; however, they remain inaccessible and difficult to trust for most clinicians, because their results are difficult to interpret and the data behind them lack transparency. Furthermore, the sample groups from which data were collected for these models may lack diversity, which can cause the models to underperform for certain populations. The most prominent weakness of existing models is that they do not consider variables that change with time. “A lot of existing machine learning models use static predictions such as age, sex, body mass index […] which identify only the probability of getting cancer at a moment,” Lau said. Instead of static predictions, he aimed to develop a time-to-first cancer diagnosis model that accounts for the changing variables of an individual through their clinical visits.
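One common way to feed changing measurements into a survival model is to split each person's follow-up into intervals, one per clinical visit, with the measurements held fixed inside each interval. The sketch below illustrates that general idea only; the function name, the record layout, and the example values are hypothetical and are not taken from Lau's study.

```python
def visits_to_intervals(visits, end_time, diagnosed):
    """Turn one person's visit history into (start, stop] intervals.

    visits: list of (visit_time, covariates_dict), assumed sorted by time.
    end_time: time of diagnosis or of last follow-up, assumed later than
        the final visit.
    diagnosed: True if follow-up ended with a first cancer diagnosis.
    """
    rows = []
    for k, (start, covariates) in enumerate(visits):
        is_last = k == len(visits) - 1
        # Each interval runs until the next visit, or until follow-up ends.
        stop = end_time if is_last else visits[k + 1][0]
        # Only the final interval can carry the diagnosis event.
        event = int(diagnosed and is_last)
        rows.append({"start": start, "stop": stop, "event": event, **covariates})
    return rows


# Hypothetical person with three visits (time in years) and a changing BMI.
example_rows = visits_to_intervals(
    [(0, {"bmi": 27.0}), (2, {"bmi": 25.5}), (5, {"bmi": 24.0})],
    end_time=6,
    diagnosed=True,
)
```

Each resulting row describes a stretch of time during which the person's measurements are treated as constant, which is the form that time-varying survival models typically expect.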
Unexpectedly, the Cox model became the ideal foundation for building a new time-to-event prediction model. As a statistical model, Cox is more traditional, but it is well suited to incorporating time-dependent variables leading up to a diagnosis. Lau modified the Cox model to include elastic-net regularization, which combines two penalty terms to shrink less informative variables and keep the model stable when variables are correlated with one another. With other modifications, Lau created a new model, which he began training.
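To make the idea of elastic-net regularization concrete, here is a minimal sketch of the quantity such a model might minimize: the Cox negative log partial likelihood plus a blend of an L1 and an L2 penalty. The function names, the scaling by sample size, and the default penalty settings are assumptions chosen for illustration, not details reported from Lau's model.

```python
import math


def cox_neg_log_partial_likelihood(beta, X, times, events):
    """Negative log partial likelihood of a Cox model (no special tie handling)."""
    # Linear risk score for each person: beta . x
    scores = [sum(b * x for b, x in zip(beta, row)) for row in X]
    total = 0.0
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        if not e_i:
            continue  # censored people contribute only through risk sets
        # Risk set: everyone still under observation at time t_i.
        risk_sum = sum(math.exp(scores[j]) for j, t_j in enumerate(times) if t_j >= t_i)
        total += scores[i] - math.log(risk_sum)
    return -total


def elastic_net_penalty(beta, lam, l1_ratio):
    """Blend of L1 (sparsity) and L2 (stability under correlation) penalties."""
    l1 = sum(abs(b) for b in beta)
    l2 = sum(b * b for b in beta)
    return lam * (l1_ratio * l1 + 0.5 * (1.0 - l1_ratio) * l2)


def penalized_objective(beta, X, times, events, lam=0.1, l1_ratio=0.5):
    """Quantity to minimize; dividing by n is one common convention (assumed here)."""
    n = len(times)
    fit = cox_neg_log_partial_likelihood(beta, X, times, events) / n
    return fit + elastic_net_penalty(beta, lam, l1_ratio)
```

Setting `l1_ratio` closer to 1 pushes more coefficients to exactly zero, while values closer to 0 spread weight across correlated variables; a real analysis would tune both `lam` and `l1_ratio` rather than use these placeholder defaults.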
The new model accounted for forty-six features identified from the Prostate, Lung, Colorectal, and Ovarian (PLCO) Cancer Screening Trial, which included data from more than one hundred thousand participants. Separate Cox models were created for each type of cancer. The model was then validated by testing it on the UK Biobank, a different data set that includes the health, lifestyle, and demographic information of more than half a million participants.
To assess the reliability of the model and its ability to predict the time until the first cancer diagnosis, three metrics were used. The C-index compares two individuals and measures how likely the model is to correctly predict which one is diagnosed with cancer first. The area under the receiver operating characteristic curve tracks the ability of the model to predict diagnoses as time passes and checks that the model remains predictive as new information is added. The lift metric identifies how much more likely individuals with certain characteristics are to develop cancer compared to the average individual.
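Two of these metrics are simple enough to sketch in a few lines. The code below shows one standard way to compute a C-index over comparable pairs, plus one possible definition of lift: the diagnosis rate among the highest-risk fraction of people divided by the overall rate. The function names and the top-fraction definition of lift are assumptions made for illustration; the study's exact definitions may differ.

```python
def concordance_index(times, events, risks):
    """Fraction of comparable pairs in which the higher-risk person is diagnosed first."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is comparable only if the earlier time is a diagnosis
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def lift_at_top_fraction(risks, outcomes, fraction=0.1):
    """Diagnosis rate in the top-risk group relative to the overall rate."""
    n = len(risks)
    k = max(1, int(n * fraction))
    # Rank people from highest to lowest predicted risk.
    ranked = sorted(range(n), key=lambda idx: risks[idx], reverse=True)
    top_rate = sum(outcomes[idx] for idx in ranked[:k]) / k
    base_rate = sum(outcomes) / n
    if base_rate == 0:
        raise ValueError("no diagnoses in the data")
    return top_rate / base_rate
```

A C-index of 0.5 corresponds to random ordering and 1.0 to perfect ordering, while a lift above 1 means the flagged group is diagnosed more often than the population as a whole.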
Lau’s modified Cox model outperformed the machine learning models and many existing models. Compared to other prediction models, the new model is notably more reliable at predicting lung cancer. The model also revealed some unexpected correlations. Specifically, it identified that individuals with a lower body mass index (BMI) are more likely to develop lung cancer, even though higher BMIs are usually associated with many other cancers. However, smoking could be the variable that explains this relationship, as smoking leads to lower BMIs but an increased risk of cancer. Overall, the model considers how variables could affect each other, not just how variables relate to the development of cancer.
“This specialized model could be easily implemented in the medical system to improve the detection of early cancers,” Lau said. His work provides a reliable time-to-first cancer diagnosis model that clinicians can easily interpret. As cancer remains unpredictable, this new prediction model may help save the lives of millions.