SARA : A Socially-Aware Robot Assistant


SARA is a Socially-Aware Robot Assistant developed at Carnegie Mellon University’s ArticuLab. She interacts with people in a new way, personalizing the interaction and improving task performance by drawing on information about the relationship between the human user and the virtual assistant. Rather than taking the place of people, SARA is programmed to collaborate with her human users. Rather than ignoring the socio-emotional bonds that form the fabric of society, SARA depends on those bonds to improve her collaboration skills.

My Contribution

  • Multi-modal interaction research, design, and visualization

  • Interaction data collection, analysis, and evaluation (quantitative)

  • Work Duration: 6 months

  • Research paper writing


Technology Overview

In terms of detection, SARA can recognize visual aspects of a user’s behavior (body language, using algorithms we developed along with the capabilities of OpenFace), vocal aspects (acoustic features of speech, such as intonation or loudness, again using our own algorithms along with the capabilities of openSMILE), and verbal aspects (linguistic features of the interaction, such as conversational strategies, using models and binary classifiers we developed).

We leverage recurrent neural networks (a deep learning technique) and L2-regularized logistic regression (a discriminative machine learning model), fed with multimodal information from both the user and SARA (speech, acoustic voice quality, and the conversational strategies described above), to learn the fine-grained temporal relationships among these modalities and their contextual information. SARA uses those sources of input to estimate the rapport between user and agent in real time. We call this social intention recognition, by analogy with the classic natural language processing and AI task of “(task) intention recognition.”
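As a rough illustration of the discriminative half of this pipeline, here is a minimal sketch of L2-regularized logistic regression trained by batch gradient descent. The feature names and toy data are assumptions for illustration only, not our actual features or model:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lam=0.01, lr=0.5, epochs=200):
    """Batch gradient descent for L2-regularized logistic regression."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        gw = [lam * wj for wj in w]   # gradient of the L2 penalty
        gb = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi
            for j in range(d):
                gw[j] += err * xi[j] / n
            gb += err / n
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b

# Hypothetical multimodal features: [smile_ratio, loudness, self_disclosure_count]
X = [[0.9, 0.4, 2], [0.8, 0.5, 3], [0.1, 0.9, 0], [0.2, 0.8, 0]]
y = [1, 1, 0, 0]   # 1 = high rapport, 0 = low rapport
w, b = train_logreg(X, y)
pred = [1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) > 0.5 else 0
        for x in X]
```

In the real system the temporal dynamics are handled by the recurrent network; this sketch only shows the regularized discriminative step.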


"This relationship between Sara and her human user is the social infrastructure for improved performance."

SARA at World Economic Forum



SARA is designed to build interpersonal closeness, or rapport, over the course of a conversation by understanding and generating visual, vocal, and verbal behaviors.

  1. The computational model of rapport: The computational model is the first to explain how humans in dyadic interactions build, maintain, and destroy rapport through specific conversational strategies that fulfill specific social goals and are instantiated in particular verbal and nonverbal behaviors.

  2. Conversational strategy classification: The conversational strategy classifier can recognize high-level language strategies closely associated with social goals, through training on linguistic features associated with those conversational strategies in a training corpus.

  3. Rapport level estimation: The rapport estimator estimates the current rapport level between the user and the agent using temporal association rules.

  4. Social and task reasoning: The social reasoner outputs the conversational strategy that the system should adopt in the current turn. The reasoner is modeled as a spreading activation network.

  5. Natural language and nonverbal behavior generation: The natural language generation module expresses conversational strategies in specific language and associated nonverbal behaviors, which are then performed by a virtual human.

Computational Model of SARA

Interaction Flow of SARA


SARA’s Understanding Process


Conversational Strategy Classifier


We have implemented a conversational strategy classifier to automatically recognize the user’s conversational strategies – particular ways of talking that contribute to building, maintaining, or sometimes destroying a budding relationship. By including rich contextual features drawn from the verbal, visual, and vocal modalities of the speaker and interlocutor in the current and previous turns, we can recognize these dialogue phenomena with an accuracy of over 80% and a kappa of over 60%.
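The two evaluation numbers above can be computed from gold labels and classifier output as follows. The strategy labels (SD = self-disclosure, PR = praise, VSN = social norm violation) and the toy predictions are illustrative, not our actual data:

```python
from collections import Counter

def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def cohens_kappa(gold, pred):
    """Chance-corrected agreement between gold labels and classifier output."""
    n = len(gold)
    po = accuracy(gold, pred)                       # observed agreement
    gc, pc = Counter(gold), Counter(pred)
    pe = sum(gc[k] * pc[k] for k in gc) / (n * n)   # agreement expected by chance
    return (po - pe) / (1 - pe)

gold = ["SD", "SD", "PR", "VSN", "NONE", "NONE", "SD", "PR", "NONE", "NONE"]
pred = ["SD", "SD", "PR", "NONE", "NONE", "NONE", "SD", "VSN", "NONE", "NONE"]
acc = accuracy(gold, pred)        # 0.8 on this toy data
kappa = cohens_kappa(gold, pred)
```

Kappa matters here because "NONE" dominates real conversations, so raw accuracy alone would overstate performance.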


Rapport Estimator


We use the framework of temporal association rule learning to perform a fine-grained investigation into how sequences of interlocutor behaviors signal high and low interpersonal rapport. The behaviors analyzed include visual behaviors such as eye gaze and smiles, and verbal conversational strategies, such as self-disclosure, shared experience, social norm violation, praise, and back-channels.
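A minimal sketch of the idea behind temporal association rule mining: count how often one behavior follows another within a short window, separately for high- and low-rapport segments. The annotated slices, behavior timings, and window size below are made up for illustration:

```python
from collections import Counter
from itertools import combinations

def mine_temporal_rules(slices, window=3.0):
    """Count ordered behavior pairs (a, then b within `window` seconds),
    separately for high- and low-rapport slices."""
    counts = {"high": Counter(), "low": Counter()}
    for label, events in slices:            # events: [(time, behavior), ...]
        events = sorted(events)
        for (t1, a), (t2, b) in combinations(events, 2):
            if 0 < t2 - t1 <= window:
                counts[label][(a, b)] += 1
    return counts

# Hypothetical annotated 30-second slices
slices = [
    ("high", [(1.0, "smile"), (2.5, "self_disclosure"), (8.0, "back_channel")]),
    ("high", [(0.5, "smile"), (2.0, "self_disclosure")]),
    ("low",  [(1.0, "gaze_away"), (3.5, "silence")]),
]
counts = mine_temporal_rules(slices)
```

A pair that appears far more often in high-rapport slices than low-rapport ones (here, a smile followed by self-disclosure) becomes a candidate rule signaling high rapport.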


SARA’s Reasoning Process


Social Reasoner


The social reasoner is designed as a spreading activation model – a behavior network consisting of activation rules that govern which conversational strategy the system should adopt next. Taking as inputs the system’s phase (e.g., “recommendation”), the system’s intentions (e.g., “elicit_goals”, “recommend_session”), the history of the user’s conversational strategies, selected non-verbal behaviors (e.g., head nods and smiles), and the current rapport level, the network updates its activation energies.
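The mechanics can be sketched as follows: context nodes inject activation, weighted links spread it through the network over a few rounds, and the strategy node with the highest activation wins. The node names, link weights, decay, and round count are all illustrative assumptions, not the actual network:

```python
def select_strategy(links, inputs, strategies, rounds=3, decay=0.5):
    """Toy spreading-activation pass: context nodes inject energy, weighted
    links spread it, and the highest-activation strategy is adopted."""
    nodes = set(inputs) | {n for edge in links for n in edge} | set(strategies)
    act = {n: inputs.get(n, 0.0) for n in nodes}
    for _ in range(rounds):
        nxt = {n: act[n] * decay for n in nodes}   # residual, decayed activation
        for (a, b), w in links.items():
            nxt[b] += act[a] * w                   # spread along weighted links
        act = nxt
    return max(strategies, key=lambda s: act[s])

# Hypothetical context: recommendation phase, user smiled, rapport currently low
inputs = {"phase_recommendation": 1.0, "user_smile": 1.0, "rapport_low": 1.0}
links = {
    ("phase_recommendation", "praise"): 0.3,
    ("user_smile", "self_disclosure"): 0.4,
    ("rapport_low", "self_disclosure"): 0.6,  # low rapport favors rapport-building talk
    ("rapport_low", "praise"): 0.2,
}
strategy = select_strategy(links, inputs, ["self_disclosure", "praise", "none"])
```

Because activation accumulates over rounds, strategies supported by several converging context signals (here, self-disclosure) dominate ones supported by a single signal.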

Task Reasoner

Based on the WoZ personal assistant corpus we collected, the task reasoner was designed as a finite state machine whose transitions are governed by an expressive set of rules. The module uses the user’s intention (identified by the NLU), the current state of the dialog (which it maintains) and other contextual information (e.g., how many sessions it has recommended) to transition to a new state, and generate the system intent associated with that state.
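The state-machine idea can be sketched as a transition table mapping (dialog state, user intent) to (next state, system intent), plus the contextual bookkeeping the paragraph mentions. The state names, intents, and rules below are hypothetical, not the actual rule set:

```python
# Hypothetical transition rules: (state, user_intent) -> (next_state, system_intent)
RULES = {
    ("greeting", "greet"):           ("elicit_goals", "ask_goals"),
    ("elicit_goals", "state_goals"): ("recommend", "recommend_session"),
    ("recommend", "accept"):         ("recommend", "recommend_session"),
    ("recommend", "reject"):         ("recommend", "recommend_person"),
    ("recommend", "done"):           ("farewell", "say_goodbye"),
}

class TaskReasoner:
    """Finite-state task reasoner: maps the NLU's user intent and the current
    dialog state to a new state and a system intent, with simple context."""
    def __init__(self):
        self.state = "greeting"
        self.n_recommended = 0          # contextual info the module maintains

    def step(self, user_intent):
        self.state, system_intent = RULES[(self.state, user_intent)]
        if system_intent.startswith("recommend"):
            self.n_recommended += 1
        return system_intent

tr = TaskReasoner()
intents = [tr.step(i) for i in ["greet", "state_goals", "accept", "done"]]
```

The counter stands in for the richer contextual checks described above (e.g., capping how many sessions get recommended before wrapping up).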


SARA’s Generating Process

Verbal and Nonverbal Language Generation


Given the system’s intention (which includes the current conversational phase, the system intent, and the conversational strategy), these modules generate sentence and behavior plans. The Natural Language Generator (NLG) selects a syntactic template associated with the system’s intention from the sentence database. The generated sentence plan is sent to BEAT (the Behavior Expression Animation Toolkit), which generates a behavior plan in BML (Behavior Markup Language) form.
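Template selection can be sketched as a lookup keyed by the three components of the system’s intention, followed by slot filling. The templates and keys below are invented examples, not entries from the actual sentence database:

```python
# Hypothetical sentence database keyed by (phase, system_intent, strategy)
TEMPLATES = {
    ("recommendation", "recommend_session", "praise"):
        "Great choice of interests! You might enjoy the session on {topic}.",
    ("recommendation", "recommend_session", "none"):
        "There is a session on {topic} that matches your goals.",
}

def generate(phase, system_intent, strategy, slots):
    """Pick the template for the system's intention and fill its slots;
    the resulting sentence plan would then go to BEAT for BML behaviors."""
    template = TEMPLATES[(phase, system_intent, strategy)]
    return template.format(**slots)

sentence = generate("recommendation", "recommend_session", "praise",
                    {"topic": "AI and society"})
```

Keying templates on the conversational strategy is what lets the same task intent surface as praise, self-disclosure, or plain information delivery.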


Field Study Analysis

Field Study

SARA was presented at the World Economic Forum (WEF) Annual Meeting in Davos (January 17-20, 2017). The SARA booth was located right in the middle of the main corridor of the Davos Congress Center and was, in fact, the only demo in the Congress Center.


SARA had access to the WEF database of sessions being presented, participants attending, demos being shown in the Loft across the street, and places to get food in the Congress Center (she also knew about some private parties – information she was willing to share if asked nicely!). SARA was programmed to use this information to act as a virtual personal assistant. She assisted the global leaders attending Davos by finding out about their interests and goals in attending the WEF and then recommending sessions and people who were relevant to their interests and goals.


Research Question

Research Question: “How does the task performance of a personal assistant affect the dynamics of rapport over the course of an interaction?“
  • Hypothesis 1: Attendees with high rapport scores are more likely to accept recommendations from SARA.

  • Hypothesis 2: Good recommendations increase the rapport score.

Data Reconstruction


During each interaction, attendees’ video and audio were recorded using a camera and a microphone. SARA’s animations, for their part, were recorded separately in a log file. The audio recordings were used to obtain text transcriptions of both the attendee’s and SARA’s utterances via a third-party transcription service. These transcriptions contained turn-taking information such as speaker ID and starting and ending timestamps for each turn. Because rapport is a dyadic phenomenon, we reconstructed the interactions so that both attendee and SARA were present in the same video before annotating them.

Annotation Methods


We chose the 30-second-slice method and had Amazon Mechanical Turk (AMT) workers annotate all the interaction segments. To protect attendees’ privacy, we blurred their faces and the information SARA displayed, while tracking facial expressions with facial markers to support the annotation. To make sure the blurring did not affect the annotation scores, we ran an internal annotation experiment and computed the inter-rater reliability (IRR).

Analysis Results


After the attendees entered the booth, SARA first introduced herself and asked several questions about the attendees’ current feelings and mood. Then, the attendees were asked about their occupation as well as their interests and goals for attending the conference. SARA would then cycle through several rounds of people and/or session recommendations, showing information about the recommendation on the virtual board behind her. The attendees were able to request as many recommendations as desired and were able to leave the booth anytime they wanted. Finally, SARA proposed to take a “selfie” with the attendees before saying farewell. Our corpus contains data from 69 of these completed interactions, including both the attendees’ and SARA’s video, audio, and textual speech transcriptions, which combined account for more than 5 hours of interaction (total time = 21055 seconds, mean session duration = 305.15 seconds, SD = 65.00 seconds). Of these 69 attendees, 29 were women and 40 were men. We did not gather any information about the attendees’ age or nationality.

From these interactions we then extracted a range of features and metrics, and applied different data mining methods to uncover useful relationships and trends.


Conclusion & Discussion


1. Modeling entrainment to increase coordination

Indeed, we found that word count balance was negatively correlated with both rapport and task performance, meaning that a mismatched linguistic style may lead to lower rapport and task performance. One solution to this potential issue is to design an ECA able to adapt its linguistic style to that of the user.
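The correlation itself is a standard Pearson coefficient between a per-session balance measure and the rapport score. The balance definition (normalized absolute difference in word counts) and the toy session values below are assumptions for illustration, not our actual data:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical sessions: |user words - SARA words| / total words vs. rapport score
imbalance = [0.6, 0.5, 0.4, 0.2, 0.1]
rapport = [2.0, 3.0, 3.5, 5.0, 6.0]
r = pearson(imbalance, rapport)   # strongly negative on this toy data
```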

2. Deeper preference elicitation to increase task efficiency

One potential solution would be to explicitly confirm the attendee’s interests before moving to the recommendation phases: only 9% of recommendations were relevant when attendees asked for a more precise or specific recommendation within one relevant domain.

3. Incremental architecture to increase mutual attentiveness

Through this field study, we also saw the importance of reducing the ECA’s response time, as we noticed that long response times were a possible cause of low rapport scores.

4. Explanations to increase trust

One simple solution is for SARA to ask for an explanation whenever the attendee refuses a recommendation without giving any information.

5. Combining task and social reasoning to increase naturalness of the interaction

One direction here is a modification of the rapport computational model itself.