Predicting Inaccessible Information

By Sydney Hirsch

November 18, 2021


In scientific experimentation, some information is easier to obtain than other information because of how it must be collected. For instance, clinicians can easily gather blood pressure readings and laboratory values; aptly, this type of data is called easy-to-obtain information (EI). Other data, however, may be too expensive or too time-consuming to collect at scale. Data from flow cytometry, a laser-based technique for measuring the chemical and physical properties of cells, are an example of this kind of hard-to-obtain information (HI). To circumvent these limitations, a team of Yale researchers, including graduate student Matthew Amodio and associate professor Smita Krishnaswamy, developed a model called the Feature Mapping Generative Adversarial Network (FMGAN) that allows for the accurate prediction of HI given EI. Their methodology is novel: Krishnaswamy’s lab pioneered all of the frameworks used throughout the study, even those used as comparators to the FMGAN.

The most recent study applied the neural network model in two contexts. The first generated RNA sequencing data (HI) for cells perturbed with drugs, starting from the chemical structure of each compound (EI). The second predicted the flow cytometry data (HI) of COVID-19 patients from their clinical measurements (EI).

The FMGAN’s predictive capabilities come from the addition of a condition-embedding network. This network transforms the EI into representations called manifolds, which are easier to visualize and less redundant, and which therefore simplify data extrapolation. “The condition-embedding network translates the data from how it exists naturally to a form more easily used by our model, which it gradually learns how to do,” Amodio said. The manifold structure is preferable to the alternative, representing data in ambient space, because its smooth structure produces outputs that move uniformly with changes in input. This point is especially relevant to Krishnaswamy’s work with chemical structure and RNA sequencing: small modifications to certain portions of the input can determine molecular function, so it is important to keep the magnitudes of these movements consistent.

Further, the scientists introduced stochastic mapping, which builds randomness into the model so that a single input can yield a range of outputs. “The drugs do not produce a single result every time,” Amodio said. “The cell measurements change even in applying the same drug to the same system. There are lots of sources of randomness with respect to the data we looked at. Thus, it makes sense to use models that include randomness to accurately represent that.” In other words, stochastic mapping was another deliberate addition to the neural network that further increased prediction reliability.
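To make the two ideas above concrete, here is a minimal sketch in plain Python of a conditional generator that combines a condition embedding with random noise. It is not the FMGAN itself: the weights, dimensions, and the function names `embed_condition` and `generate` are all hypothetical, and a real model would learn its weights through adversarial training rather than use fixed values.

```python
import math
import random

# Illustrative weights only (assumption); a trained model would learn these.
EMBED_W = [[0.5, -0.2, 0.1],
           [0.3, 0.8, -0.5]]        # maps 3 EI features to a 2-D embedding
GEN_W = [[1.0, 0.0, 0.4, 0.0],
         [0.0, 1.0, 0.0, 0.4],
         [0.5, 0.5, 0.2, -0.2]]     # maps [embedding, noise] to 3 HI features


def linear(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def embed_condition(condition):
    """Deterministically map raw EI to a smooth, low-dimensional embedding."""
    return [math.tanh(v) for v in linear(EMBED_W, condition)]


def generate(condition, rng, noise_dim=2):
    """Draw one HI sample: the same condition gives a different output per draw."""
    embedding = embed_condition(condition)
    noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return linear(GEN_W, embedding + noise)


rng = random.Random(0)
condition = [1.0, 2.0, 3.0]          # made-up EI values
samples = [generate(condition, rng) for _ in range(3)]
for sample in samples:
    print([round(v, 3) for v in sample])
```

Calling `generate` repeatedly with the same condition produces a spread of outputs rather than a single point, which mirrors the observation that the same drug applied to the same system does not give the same measurement every time.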

In applying the FMGAN to predict the RNA sequencing data of cells treated with drugs, the team performed four experiments. In the first two, they provided the model with preprocessed data and good manifold coordinates; the purpose was simply to show that the model could generate the target information from the data. After demonstrating the FMGAN’s success under these conditions, the researchers ran two more challenging experiments that required the full capabilities of the network, including creating its own manifold coordinates. One supplied the drug’s chemical structure as the condition in the form of simplified molecular-input line-entry system (SMILES) strings, a specialized text notation. The other instead used image representations of the same chemical structures. The latter performed better than the former, likely because the architecture available for images is more advanced than that available for strings. Both, however, demonstrated the efficacy of the FMGAN and its condition-embedding network.
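For readers unfamiliar with SMILES, the sketch below shows one simple way a SMILES string could be turned into numbers before being fed to a network. The article does not describe how the team encoded the strings, so the character-level one-hot scheme and the helper names `build_vocab` and `one_hot_encode` are assumptions made purely for illustration; character-level splitting also ignores multi-character atoms such as Cl and Br.

```python
def build_vocab(smiles_list):
    """Assign an integer index to every distinct character seen in the strings."""
    chars = sorted({ch for smiles in smiles_list for ch in smiles})
    return {ch: index for index, ch in enumerate(chars)}


def one_hot_encode(smiles, vocab):
    """Return one row per character, with a single 1 at that character's index."""
    rows = []
    for ch in smiles:
        row = [0] * len(vocab)
        row[vocab[ch]] = 1
        rows.append(row)
    return rows


# Ethanol and acetic acid written as SMILES strings.
examples = ["CCO", "CC(=O)O"]
vocab = build_vocab(examples)
encoded = one_hot_encode("CCO", vocab)

print(vocab)
for row in encoded:
    print(row)
```

Each molecule becomes a small matrix with one row per character, a form a sequence model can consume directly.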

To demonstrate the breadth of the FMGAN’s applications, the researchers also tested its ability to predict future flow cytometry information from the clinical measurements taken when COVID-19 patients entered the ICU. During experimentation, researchers took both clinical measurements and flow cytometry measurements from all study participants. They held out the data of 14 patients and trained the neural network model on the remaining 115. Ultimately, the FMGAN was able to use clinical measurements to generate flow cytometry predictions for the never-before-seen patients. In practice, flow cytometry data give clinicians insight into a patient’s immune function and help predict mortality. Instantaneous and accurate determination of this HI would allow physicians to craft optimal courses of treatment.
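The held-out evaluation described above can be sketched as a simple split of patient identifiers. The counts (14 held out, 115 for training) come from the study as reported here; the random selection, the seed, and the function name `split_patients` are assumptions, since the article does not say how the 14 patients were chosen.

```python
import random


def split_patients(patient_ids, n_held_out, seed=0):
    """Shuffle the IDs and set aside n_held_out of them for testing."""
    rng = random.Random(seed)
    shuffled = list(patient_ids)
    rng.shuffle(shuffled)
    held_out = shuffled[:n_held_out]
    training = shuffled[n_held_out:]
    return training, held_out


patient_ids = list(range(129))       # placeholder IDs: 115 + 14 patients
train_ids, test_ids = split_patients(patient_ids, n_held_out=14)

print(len(train_ids), len(test_ids))  # prints "115 14"
```

Because the model never sees the held-out patients during training, its predictions for them indicate how well it might generalize to new patients.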

Through this set of experiments, Krishnaswamy and her team demonstrated the efficacy of their novel FMGAN neural network model in drug discovery and clinical inference. However, the FMGAN is not limited to these spaces: its architecture is not hardwired to address these data structures specifically and can be generalized to other data. This area of quantitative computational biology is underexplored, but breakthroughs have the potential to transform how scientists leverage the information they have readily available.

