Tracking Down Lethal Mosquitoes: Machine learning to map the genetic connectivity of A. aegypti in the southern United States


Art courtesy of Noora Said.

With global temperatures reaching record highs over the last few years, climate change has expanded the ranges of disease-carrying animals. Even tiny insects such as Aedes aegypti mosquitoes, which are native to Africa, are not exempt from this trend. A. aegypti is highly invasive, and recent increases in global temperatures have allowed its range to keep expanding worldwide, throughout tropical and temperate regions. In the United States, the mosquitoes can be found in the southern states, most notably Texas, Florida, and California. They prefer warm, humid areas close to humans, who serve as sources of blood meals for female mosquitoes. Feeding on our blood not only nourishes A. aegypti but also allows them to transmit infectious diseases like yellow fever, Zika, dengue, and chikungunya.

Current climate trends could result in the exposure of one billion additional people to these diseases. With the exception of yellow fever, there are no reliable and widely used vaccines for these diseases, which creates an urgent need for improved tracking and management of A. aegypti populations. Two scientists at the Yale School of the Environment collaborated on a recent study to tackle this pressing environmental issue: Evlyn Pless, a postdoctoral researcher at UC Davis who completed her graduate studies in ecology and evolutionary biology at Yale, and Giuseppe Amatulli, a research scientist in geocomputation and spatial science. Their goal was to map North American landscape connectivity for A. aegypti mosquitoes.

Old Limitations in Landscape Genetics

Landscape genetics, which is the study of organisms’ population genetic data alongside landscape data from their habitats, provides scientists with helpful information that could be used to control invasive species, such as A. aegypti. Classical models of population genetics relate increased genetic differentiation to increased geographical distance, but these models do not account for environmental limitations on dispersal, such as geographic barriers, or for landscape variables that facilitate connectivity, like favorable climate.

A common approach to incorporating environmental data into a model of genetic connectivity is “resistance surface mapping.” In this technique, researchers create a map of pixels in which each pixel represents the hypothesized resistance of the landscape to an organism’s movement, based on the environmental variables at that location. Although the resulting resistance surface allows genetic distributions to be extrapolated across the region, the method involves substantial subjectivity, as it relies on the hypothesized effect of each environmental factor on population mobility.
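To make the subjectivity concrete, here is a minimal sketch of a resistance surface in Python. Everything in it is an assumption made for illustration: the tiny grids, the weights, and the "ideal" temperature are invented and do not come from the study.

```python
# Hypothetical environmental layers on a tiny 3x3 grid (all values invented).
max_temp_c = [
    [30.0, 32.0, 35.0],
    [28.0, 31.0, 34.0],
    [25.0, 27.0, 30.0],
]
slope_deg = [
    [1.0, 2.0, 10.0],
    [0.0, 5.0, 20.0],
    [0.0, 1.0, 15.0],
]


def resistance_surface(temp, slope, temp_weight, slope_weight, ideal_temp=30.0):
    """Return a grid in which each cell holds a hypothesized movement cost.

    The weights and the ideal temperature are the analyst's guesses,
    which is exactly where the subjectivity enters.
    """
    surface = []
    for temp_row, slope_row in zip(temp, slope):
        row = []
        for t, s in zip(temp_row, slope_row):
            cost = 1.0 + temp_weight * abs(t - ideal_temp) + slope_weight * s
            row.append(cost)
        surface.append(row)
    return surface


# Two analysts with different hypotheses get different maps from the same data.
surface_a = resistance_surface(max_temp_c, slope_deg, temp_weight=0.5, slope_weight=0.1)
surface_b = resistance_surface(max_temp_c, slope_deg, temp_weight=0.1, slope_weight=0.5)
```

The same landscape yields two different resistance maps depending only on which weights the analyst believes in, which is the weakness the article describes.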

Prior studies attempted to circumvent the subjectivity of resistance surface mapping by modeling genetic connectivity directly from environmental data and iteratively refining the model using least cost path analysis, a mathematical technique that estimates the least resource-intensive route along which an organism could travel. However, according to Amatulli and Pless, this model was limited because it employed a mathematical method known as maximum likelihood estimation, which establishes a relationship between environmental variables and genetic distances before the model is built. The model produced by this methodology can therefore “overfit” the data: while it may have been optimized to match the data, it does not necessarily make accurate ecological estimations.

A New Approach to Modeling

To improve upon these previous models, Pless and Amatulli took a groundbreaking approach to determining the relationship between landscape variables and genetic distance. First, they sampled mosquitoes from thirty-eight sites across North America and calculated the genetic differences between the mosquitoes at those sites. Open-source data for twenty-nine different environmental factors, such as daily temperature ranges and accessibility to major cities, served as variables meant to explain and predict genetic distances. For every pair of sites in the set of thirty-eight sampling sites, the researchers calculated the average value of each of the twenty-nine environmental factors across the landscape that lay in between.
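The pairwise averaging step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the raster, the site locations, and the choice to sample along a straight line between sites are all hypothetical, and the function name is made up for this example.

```python
from itertools import combinations

# Hypothetical raster of one environmental variable (values invented).
raster = [
    [10.0, 12.0, 14.0, 16.0],
    [11.0, 13.0, 15.0, 17.0],
    [12.0, 14.0, 16.0, 18.0],
]

# Hypothetical sampling sites, given as (row, col) grid cells.
sites = {"A": (0, 0), "B": (2, 3), "C": (0, 3)}


def mean_along_line(grid, start, end, n_samples=25):
    """Average the grid values at evenly spaced points on the line start -> end.

    Each sample point is snapped to the nearest grid cell. Requires n_samples >= 2.
    """
    (r0, c0), (r1, c1) = start, end
    total = 0.0
    for i in range(n_samples):
        frac = i / (n_samples - 1)
        row = round(r0 + frac * (r1 - r0))
        col = round(c0 + frac * (c1 - c0))
        total += grid[row][col]
    return total / n_samples


# One averaged value per pair of sites; with 38 sites there would be 703 pairs.
pair_means = {
    (a, b): mean_along_line(raster, sites[a], sites[b])
    for a, b in combinations(sorted(sites), 2)
}
```

Repeating this for each environmental layer would give one row of predictor values per site pair, which can then be matched with that pair's genetic distance.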

Using those values for genetic distances and environmental factors, Pless and Amatulli employed a fascinating strategy to predict the effects of landscape conditions on genetic connectivity: they used a machine learning method called random forest (RF), a predictive model that aggregates the outputs of many decision trees, each capturing possible effects of different factors. From the RF predictions, they generated a map of landscape resistance, which measured the difficulty of mosquito migration at any particular point based on all twenty-nine environmental factors combined.
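For readers curious about what a random forest does internally, the following is a toy, from-scratch sketch in Python. It is not the researchers' implementation (they ran the machine learning in R), and the two predictors and all data values are invented for illustration. It shows the three core ideas: bootstrap resampling, a random subset of features at each split, and averaging over many trees.

```python
import random


def sum_sq(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(rows, targets, feature_ids):
    """Return the (feature, threshold) with the lowest squared error, or None."""
    best, best_err = None, float("inf")
    for f in feature_ids:
        # Excluding the largest value guarantees both sides are non-empty.
        for threshold in sorted({row[f] for row in rows})[:-1]:
            left = [y for row, y in zip(rows, targets) if row[f] <= threshold]
            right = [y for row, y in zip(rows, targets) if row[f] > threshold]
            err = sum_sq(left) + sum_sq(right)
            if err < best_err:
                best_err, best = err, (f, threshold)
    return best


def grow_tree(rows, targets, rng, depth=0, max_depth=3):
    """Grow a regression tree; a leaf is a float, a node is a 4-tuple."""
    if depth >= max_depth or len(rows) < 4:
        return sum(targets) / len(targets)
    n_features = len(rows[0])
    # Consider only a random subset of features at each split.
    feature_ids = rng.sample(range(n_features), max(1, n_features // 2))
    split = best_split(rows, targets, feature_ids)
    if split is None:
        return sum(targets) / len(targets)
    f, threshold = split
    left = [(r, y) for r, y in zip(rows, targets) if r[f] <= threshold]
    right = [(r, y) for r, y in zip(rows, targets) if r[f] > threshold]
    return (
        f,
        threshold,
        grow_tree([r for r, _ in left], [y for _, y in left], rng, depth + 1, max_depth),
        grow_tree([r for r, _ in right], [y for _, y in right], rng, depth + 1, max_depth),
    )


def predict_tree(tree, row):
    """Follow the splits down to a leaf and return its value."""
    while isinstance(tree, tuple):
        f, threshold, left, right = tree
        tree = left if row[f] <= threshold else right
    return tree


def fit_forest(rows, targets, n_trees=50, seed=0):
    """Train each tree on a bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(grow_tree([rows[i] for i in idx], [targets[i] for i in idx], rng))
    return forest


def predict_forest(forest, row):
    """Average the predictions of all trees."""
    return sum(predict_tree(tree, row) for tree in forest) / len(forest)


# Invented toy data: each row is [mean max temperature, mean slope] for a site pair.
features = [[25 + i % 5, float(i)] for i in range(20)]
genetic_distance = [0.05 + 0.01 * row[1] for row in features]

forest = fit_forest(features, genetic_distance)
low = predict_forest(forest, [27.0, 2.0])    # gentle terrain between the sites
high = predict_forest(forest, [27.0, 18.0])  # steep terrain between the sites
```

In this invented data set the genetic distance depends only on slope, so the forest predicts a larger distance for the steeper pair. Because no relationship between predictors and genetic distance is specified in advance, the model learns it from the data.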

Using the resistance map built by RF, the researchers drew least cost paths between each pair of the thirty-eight sampling sites. These paths describe the least energetically expensive routes that mosquitoes could take between sites over a particular landscape. The researchers then repeated the process over multiple iterations, generating new resistance maps via RF and re-calculating the least cost paths each time. As a result, the model refined itself with each iteration, producing increasingly accurate predictions of mosquito genetic connectivity based on landscape data.
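A least cost path over a resistance map can be found with Dijkstra's shortest-path algorithm. The sketch below is a generic Python illustration, not the GRASS workflow the researchers used; the grid values are invented, and restricting movement to four neighbors (up, down, left, right) is an assumption made to keep the example short.

```python
import heapq


def least_cost_path(resistance, start, goal):
    """Dijkstra's algorithm on a grid; stepping into a cell costs its resistance."""
    n_rows, n_cols = len(resistance), len(resistance[0])
    best_cost = {start: 0.0}
    previous = {}
    queue = [(0.0, start)]
    while queue:
        cost, cell = heapq.heappop(queue)
        if cell == goal:
            break
        if cost > best_cost[cell]:
            continue  # stale queue entry; a cheaper route was already found
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < n_rows and 0 <= nc < n_cols:
                new_cost = cost + resistance[nr][nc]
                if new_cost < best_cost.get((nr, nc), float("inf")):
                    best_cost[(nr, nc)] = new_cost
                    previous[(nr, nc)] = cell
                    heapq.heappush(queue, (new_cost, (nr, nc)))
    # Walk backwards from the goal to recover the route.
    path = [goal]
    while path[-1] != start:
        path.append(previous[path[-1]])
    path.reverse()
    return best_cost[goal], path


# Invented resistance map: 9 marks costly terrain, 1 marks easy terrain.
resistance = [
    [1, 9, 1, 1],
    [1, 9, 1, 9],
    [1, 1, 1, 9],
]
cost, path = least_cost_path(resistance, (0, 0), (0, 3))
```

Here the cheapest route detours around the costly cells rather than crossing them directly. In an iterative scheme like the one described, the environmental values along each new path would feed the next round of model fitting.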

Compared to previous landscape genetic models, one distinct advantage of RF is its resistance to overfitting: the model produced by RF is not overly sensitive to noise in the data that it was trained on. To test whether their model overfit the data, Pless and Amatulli conducted “leave-one-out cross-validation,” meaning that the model was run thirty-eight times, each time leaving out one of the thirty-eight genetic data points.
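The logic of leave-one-out cross-validation is easy to show in Python. In this sketch a simple straight-line fit stands in for the random forest purely to keep the example short, and the data values are invented; both are assumptions for illustration only.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def leave_one_out_rmse(xs, ys):
    """Hold out each point once, refit on the rest, and score the held-out prediction."""
    squared_errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        intercept, slope = fit_line(train_x, train_y)
        prediction = intercept + slope * xs[i]
        squared_errors.append((prediction - ys[i]) ** 2)
    return (sum(squared_errors) / len(squared_errors)) ** 0.5


# Invented data: a landscape cost (x) and a genetic distance (y) per observation.
landscape_cost = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
genetic_distance = [0.11, 0.19, 0.32, 0.41, 0.48, 0.62]
cv_rmse = leave_one_out_rmse(landscape_cost, genetic_distance)
```

If the held-out error is close to the error of a model fit on the full data set, the model is generalizing rather than memorizing noise, which is the comparison the researchers made.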

Results generated by this cross-validation process were comparable to the results from a model based on a full data set, demonstrating that the model was not overfitting the data. In fact, leave-two-out cross-validation was also conducted, which further demonstrated that the model did not overfit the data and could accurately predict genetic distances.

Although this novel iterative approach combined with RF proved effective, it was initially time-consuming. To address this, the researchers decided to use GRASS, a geographic information system (GIS) software package, in addition to R, a statistical computing environment. “We wrote everything first in R, but it was not fast enough to do all the iterations, so then we used GRASS to build cost paths analysis… and we conducted the machine learning part in R, allowing us to speed up the process from twenty-four hours to one or two,” Amatulli said.

The researchers had successfully constructed a model that predicted genetic distances based on landscape data more accurately than previous models. This meant that they could track the movements of A. aegypti mosquitoes based on the idea that landscapes facilitating the animal’s movements allow for greater genetic connectivity. Conversely, they could also consider the possibility that landscape barriers result in greater genetic distances due to reduced movement. Interestingly, of the twenty-nine environmental variables tested, the researchers discovered that maximum temperature was the most important in predicting genetic connectivity, followed by slope, barren land cover, and human density.

A Tool for Ecological Intervention and Protection

The novel model of A. aegypti landscape genetics is important because it accurately predicts the genetic connectivity—and thus the mobility—of mosquitoes in regions where samples were not taken and genotyped. 

The broader implications of tracking mosquitoes with such accuracy involve recently developed methods of disease control. A prime example is the release of genetically modified mosquitoes that prevent wild mosquito populations from reproducing. This kind of model provides valuable knowledge about where the mosquitoes should be released and how far the intervention will spread. Another potential application is predicting the effects of pesticide use, which drives natural selection for mosquitoes with pesticide-resistance genes. In theory, the model could be used to predict where those genes would prevail.

Beyond mosquitoes, this study—with its iterative RF methods—is an innovative step forward in the field of landscape genetics. The novel strategies that were employed could help protect corridors for vulnerable animal populations. “We hope this paper will be inspiring to people, more broadly, who are trying to control invasive species or who are trying to protect endangered species,” Pless said.

Acknowledgements: The author would like to thank Evlyn Pless and Giuseppe Amatulli for their time and enthusiasm about their research.

Elizabeth Wu is a freshman in Saybrook College. In addition to YSM, she is an intern for the New Haven Public School Advocates through Yale’s First Years in Support of New Haven.


Pless, E., Saarman, N. P., Powell, J. R., Caccone, A., & Amatulli, G. (2021). A machine-learning approach to map landscape connectivity in A. aegypti with genetic and environmental data. Proceedings of the National Academy of Sciences of the United States of America, 118(9), e2003201118.

Extra Reading: 

Bishop, A., Amatulli, G., Hyseni, C., Pless, E., Bateta, R., Okeyo, W. A., … & Saarman, N. P. A machine learning approach to integrating genetic and ecological data in tsetse flies (Glossina pallidipes) for spatially explicit vector control planning. Evolutionary Applications. 

Zhao, N., Charland, K., Carabali, M., Nsoesie, E. O., Maheu-Giroux, M., Rees, E., et al. (2020). Machine learning and dengue forecasting: Comparing random forests and artificial neural networks for predicting dengue burden at national and sub-national scales in Colombia. PLoS Neglected Tropical Diseases, 14(9), e0008056.