Circa Diem: Opening the AI Black Box

Art courtesy of Kassi Correia

The fact that the Earth rotates around its axis once every 86,400 seconds seems like a faraway explanation for the passage of time, but what if this simple concept actually relates to the most important physiological and behavioral processes in our bodies? Our internal circadian rhythm is a twenty-four-hour biological clock that influences everything from our sleep cycle and metabolism to our immune system and susceptibility to disease. Understanding the gene expression that underlies such a fundamental adaptation for life poses many challenges for scientists, but modern artificial intelligence (AI) algorithms and machine learning (ML) models provide new avenues for exploring such scientific questions. A team of researchers at the Earlham Institute in Norwich, England recently conducted a study to increase the transparency of how ML systems work, while also shedding light on the most advanced computational system we know of: the human brain.

Circadian rhythms depend on many factors, including environmental stimuli like light and temperature. This is one of the reasons why changing time zones can cause jet lag, a misalignment between our body’s expectation of the day-night cycle and the changing cues presented by a new geographical location. Experiments have shown that these circadian rhythms are controlled by the expression of specific genes that oscillate between on and off states over each twenty-four-hour interval. However, past efforts to detect this circadian rhythmicity have required long, high-resolution time-series datasets, which are expensive and time-consuming to generate. To ease that burden, the researchers took a new approach, combining AI and ML algorithms to predict circadian gene expression.

Hussien Mohsen, a researcher in the Gerstein Lab at Yale who was not involved in the study, further explained the intersection between artificial intelligence and gene expression research. Mohsen focuses on interpretable machine learning for cancer genomics—a field where, as in the circadian rhythm field, there has been increasing interest in deep learning algorithms (a subset of machine learning) in recent years. According to Mohsen, this is particularly due to technological advancements, which allow us to generate the immense archive of data that lies at the heart of deep learning. “Interpretability of machine learning has become way more popular with deep learning for that particular reason: because you have enormous amounts of data,” Mohsen said. “The models become so incredibly complex that we need to simplify them—our human cognition can’t really follow what’s going on.”

When it comes to applying these data analysis tools to the field of biology, scientists must ensure that AI techniques are simultaneously efficient and reliable so that the results generated can be applied to the whole population being studied. In computing, the “black box” refers to systems that are considered only in terms of their inputs and outputs, with no real understanding of their inner workings. As powerful as AI algorithms are for navigating increasingly complex issues, this lack of transparency raises concerns for future research: how is the model transforming data into results? How are the ML algorithms making decisions based only on pattern identification? And if there are any issues, how would we know? 

To this end, in their study of circadian rhythms, the Earlham Institute researchers formulated an approach involving three key elements: 1) developing ML models that quantify the best transcriptomic timepoints for sampling large gene sequencing datasets while reducing the overall number of timepoints required; 2) predicting circadian regulation using only DNA sequence features rather than transcriptomic timepoint information; and 3) decoding the “black box” of ML models to explain how the models predict circadian clock function.

In order to effectively analyze the expression of circadian rhythms, the researchers chose the small flowering plant Arabidopsis thaliana as a model organism. Arabidopsis was the first plant to have its entire genome sequenced, and because some of its regulatory elements were already known, the researchers used that pre-existing knowledge to validate their ML predictions. This allowed them to understand how their ML model was reaching its predictions, thereby decoding the mystery of the AI black box.

When there are tens of thousands, even millions, of data points, how do we understand that data and extract their patterns and trends? Mohsen explained that we learn by finding parameters that capture what patterns exist—the more sophisticated the data, the more parameters we need. But using more parameters necessitates a greater understanding of what each does. “There are multiple approaches and even definitions of what interpretability is,” he said. Fundamentally, though, “it is just learning how the prediction process works or which input features are corresponding to a specific prediction.”
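One simple way to ask which input features correspond to a specific prediction is ablation: reset one feature at a time to a baseline value and record how much the model's output changes. The sketch below is only an illustration of that idea on a toy linear model. The feature names, weights, and values are invented, and this is not the method used by Mohsen or by the Earlham Institute researchers.

```python
def predict_score(features, weights, bias=0.0):
    """Toy linear model: weighted sum of the input features plus a bias."""
    return bias + sum(weights[name] * value for name, value in features.items())


def ablation_attribution(features, weights, bias=0.0, baseline=0.0):
    """Change in the score when each feature is reset to a baseline value."""
    full_score = predict_score(features, weights, bias)
    attributions = {}
    for name in features:
        ablated = dict(features)      # copy so the original input is untouched
        ablated[name] = baseline      # "remove" this one feature
        attributions[name] = full_score - predict_score(ablated, weights, bias)
    return attributions


# Hypothetical weights and inputs, invented purely for illustration.
weights = {"motif_a": 2.0, "motif_b": -1.0, "motif_c": 0.5}
features = {"motif_a": 3.0, "motif_b": 1.0, "motif_c": 0.0}

attributions = ablation_attribution(features, weights)
# Rank features by how strongly they moved this particular prediction.
ranked = sorted(attributions, key=lambda name: abs(attributions[name]), reverse=True)
print(ranked)
```

For a linear model the attribution is just the weight times the feature value. Deep learning models need more elaborate techniques, but the question being asked is the same.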

The Earlham Institute researchers used MetaCycle, a tool for detecting circadian signals in transcriptomic data, to analyze a dataset of Arabidopsis genomic transcripts. Using this information, the researchers trained a series of ML classifiers to predict whether a transcript was circadian or non-circadian. They found that the models relied not just on gene expression levels, but also on timepoints, for their predictions. However, these predictions were not always one hundred percent accurate, and the researchers thus set out to ascertain the optimal sampling strategy and number of timepoints needed.

Circadian gene expression rhythms follow diverse patterns, but all share a twenty-four-hour periodicity. Having fewer timepoints is more efficient, but raises concerns about loss of information and accuracy. The researchers aimed to find the optimal balance between a low number of transcriptomic timepoints and high accuracy, so they started with a twelve-timepoint ML model and sequentially reduced it to three timepoints.
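To get a feel for why a rhythm with a known twenty-four-hour period can survive sparse sampling, consider a minimal sketch that fits a single cosine curve to twelve timepoints and then to three. This is a toy cosine fit, not the ML models from the study. The simulated transcript, its parameters, and the function names are all assumptions made for illustration, and the fit assumes evenly spaced timepoints that span exactly one cycle.

```python
import math

PERIOD_HOURS = 24.0


def fit_24h_cosine(times, values):
    """Estimate mean level, amplitude and peak time of a 24-hour cosine.

    Assumes evenly spaced timepoints that together span exactly one
    24-hour cycle (for example 0, 8 and 16 hours), with at least three points.
    """
    n = len(times)
    if n < 3:
        raise ValueError("need at least three timepoints")
    omega = 2.0 * math.pi / PERIOD_HOURS
    mean = sum(values) / n
    # Project the series onto the cosine and sine of the 24-hour frequency.
    a = 2.0 / n * sum(v * math.cos(omega * t) for t, v in zip(times, values))
    b = 2.0 / n * sum(v * math.sin(omega * t) for t, v in zip(times, values))
    amplitude = math.hypot(a, b)
    peak_time = (math.atan2(b, a) / omega) % PERIOD_HOURS
    return mean, amplitude, peak_time


def simulated_transcript(t, mean=10.0, amplitude=4.0, peak_time=6.0):
    # Hypothetical noise-free expression profile, for illustration only.
    return mean + amplitude * math.cos(2.0 * math.pi * (t - peak_time) / PERIOD_HOURS)


twelve_times = [2.0 * i for i in range(12)]  # one sample every 2 hours
three_times = twelve_times[::4]              # keep only 0, 8 and 16 hours

fit_12 = fit_24h_cosine(twelve_times, [simulated_transcript(t) for t in twelve_times])
fit_3 = fit_24h_cosine(three_times, [simulated_transcript(t) for t in three_times])
print(fit_12)
print(fit_3)
```

A cosine with an unknown mean, amplitude, and peak time has three free parameters, so three is the smallest number of timepoints that can pin it down in this idealized setting. Real expression data is noisy and not perfectly sinusoidal, which is where the concern about lost accuracy comes in.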

The explainability aspect of their model comes from understanding how the model was making its predictions. The researchers needed to see which k-mers (short sequences of DNA) were the most influential in the ML model’s predictions, and found that the most accurate predictions resulted from a k-mer length of six.
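For readers unfamiliar with k-mers, the sketch below shows one common way to turn a DNA sequence into k-mer counts by sliding a window of length six along it. The example sequence is made up, and the function is an illustration of the general idea rather than the researchers' actual feature pipeline.

```python
from collections import Counter


def count_kmers(sequence: str, k: int = 6) -> Counter:
    """Count every overlapping k-mer in a DNA sequence."""
    seq = sequence.upper()
    counts = Counter()
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        # Skip windows that contain ambiguous bases such as N.
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


# Hypothetical sequence fragment, invented for illustration only.
example = "AAAATATCTAAAATATCT"
kmer_counts = count_kmers(example, k=6)
print(kmer_counts.most_common(3))
```

Each distinct k-mer can then serve as one input feature for a classifier, with its count as the feature value.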

Overall, the study showed that the number of transcriptomic timepoints can be reduced while still maintaining accuracy in predicting circadian rhythmicity. Since creating datasets takes significant time and resources, a reduction in sampling could make this kind of research considerably more efficient in the long term.

The findings of this study have major implications for the future of biomedical science and AI: recent studies have shown that disruption of clock genes is associated with sleep disorders, heightened susceptibility to infections, Alzheimer’s disease, and metabolic syndrome. “[Machine learning] has already reshaped a significant part of how we study the biology of disease,” Mohsen said. “I very much see AI playing a larger role in drug development and in terms of the way we study biology.”

More recently, Mohsen and the Earlham Institute researchers have shifted to a new focus: advancing the clarity of how and why these powerful algorithms are providing the predictions that they do. As scientists explore foundational questions of how human physiology works, understanding the powerful tools used in probing those questions is just as crucial. According to Mohsen, having unexplainable AI poses “a huge risk in medicine and elsewhere” due to its prevalence in everyday life, including face recognition, surveillance, and biohealth. 

In illuminating the “black box” of ML models that predict circadian rhythms, research merging transparent AI and genomics opens possibilities for understanding the rapidly developing technology in our hands. Ultimately, this has implications for precision medicine, novel drug development, and decoding the genetic basis of disease in the future.