Image courtesy of Flickr.
Imagine you’re at First Year Formal. You’re talking in a circle, and your belly is full of mini hot dogs. When the room gets crowded, your friends instinctively shuffle closer together to form a smaller circle, without even a hitch in the conversation.
Not that exciting, right? But upon deeper thought, that non-verbal group movement and maintenance of a shared focus of attention during a conversation are crucial aspects of human social awareness—aspects that robots have yet to master. In robotics, this field of research is called conversational group detection, and it is precisely what the Vazquez lab at Yale University and the Savarese lab at Stanford University are working on.
“There are all these tiny social cues that we as humans are conditioned to understand, but robots don’t come with these baked in; you have to teach them,” said Sydney Thompson, a Yale Ph.D. student. “The way that I see it, group detection is one of the fundamental social skills, and without it, you can’t really have a conversation; you can’t really be a social agent.”
The conversational group detection method proposed by Thompson, Marynel Vazquez—an assistant professor of computer science at Yale—and other researchers at the Vazquez and Savarese labs is a novel neural network called Deep Affinity Network for clustering conversational interactants, or DANTE.
DANTE receives visual observations of social scenes as input: each person is labeled with a unique identifier and a feature vector (a mathematical representation of an individual’s characteristics) containing spatial information such as their 2D position and body orientation in the room. The scene is then represented as a graph, with people as nodes connected by edges. DANTE’s main job is to predict an affinity score for each edge: how likely it is that the two people connected by that edge are engaged in the same conversation.
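To make this representation concrete, here is a minimal sketch of how a social scene might be encoded before being fed to a model like DANTE. The specific feature layout (2D position plus an orientation encoded as cosine and sine) and the example values are illustrative assumptions, not taken from the paper’s code:

```python
import numpy as np
from itertools import combinations

# Hypothetical scene: each person gets an identifier and a feature
# vector [x, y, cos(theta), sin(theta)] for position and orientation.
people = {
    "A": np.array([1.0, 2.0, np.cos(0.5), np.sin(0.5)]),
    "B": np.array([1.5, 2.2, np.cos(2.8), np.sin(2.8)]),
    "C": np.array([4.0, 0.5, np.cos(1.0), np.sin(1.0)]),
}

# The scene graph is fully connected: one edge per pair of people.
# The model's job is to predict an affinity score for each edge.
edges = list(combinations(people, 2))
print(edges)  # [('A', 'B'), ('A', 'C'), ('B', 'C')]
```

Once every edge has a predicted affinity, a clustering step over the graph recovers the conversational groups.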
While previous approaches have required brittle heuristics and subsequent ad-hoc steps to verify the detected groups and account for context, DANTE is primarily data-driven and avoids these issues, making it more easily generalizable to future applications. DANTE’s accuracy largely stems from how it reasons over two types of information: not only the spatial information of the dyad (two individuals) of interest connected by an edge, but also the global spatial information of the other people in the room. The final affinity score is computed by concatenating (combining) the dyad features with the context features.
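The two-branch idea can be sketched in a few lines. This is a toy stand-in for DANTE’s learned neural branches, under assumed simplifications: the dyad branch just stacks the two interactants’ features, the context branch summarizes everyone else with an order-invariant max-pool, and a single linear layer plus a sigmoid replaces the trained network:

```python
import numpy as np

def affinity(dyad_feats, context_feats, w):
    """Toy affinity score from dyad + context features (not DANTE's actual net)."""
    # Dyad branch: concatenate the two interactants' feature vectors.
    dyad = np.concatenate(dyad_feats)                   # shape (2*d,)
    # Context branch: order-invariant max-pool over everyone else.
    context = np.max(np.stack(context_feats), axis=0)   # shape (d,)
    # Combine both branches, then score with a linear layer + sigmoid,
    # yielding a value in (0, 1): the probability the dyad shares a conversation.
    z = np.concatenate([dyad, context]) @ w
    return 1.0 / (1.0 + np.exp(-z))
```

The max-pool matters because the number of bystanders varies from scene to scene; pooling gives a fixed-size context summary regardless of how many other people are present.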
However, there is still a lot more work to be done. “When you start working on a research project, the more you know, the more you know what you don’t know,” Vazquez said. Researchers at the Vazquez lab are interested in introducing temporal dependencies between the visual inputs to DANTE, addressing how physical obstacles and occlusions may disrupt DANTE’s function, and even using this approach to study how people’s spatial interactions may change post-COVID. Vazquez is hopeful that someday, granting robots this social intelligence will allow them to integrate into our world seamlessly, as effective collaborators and social agents in our increasingly complex, dynamic environment.
Citation: Swofford, Mason, et al. “Improving Social Awareness Through DANTE: Deep Affinity Network for Clustering Conversational Interactants.” Proceedings of the ACM on Human-Computer Interaction, vol. 4, no. CSCW1, May 2020, pp. 1–23, doi:10.1145/3392824.