Publications
2025
Abstract
In my work, I study how robots can infer and adapt to human intentions in dynamic environments. My prior work focuses on three key methods for adapting robot actions based on inferences of human intent: passive observation, active influence, and high-level verbal commands. Moving forward, I aim to improve how robots anticipate and respond to evolving human behavior by addressing the challenges of predicting changes in human intentions and quantifying human adaptability.
Adapting Robot Actions to Human Intentions in Dynamic Shared Environments
Debasmita Ghose
ACM International Conference on Human-Robot Interaction - Pioneers Workshop (HRI Pioneers, 2025), Melbourne, Australia
(Poster)
Abstract
Visual contrastive learning aims to learn representations by contrasting similar (positive) and dissimilar (negative) pairs of data samples. The design of these pairs significantly impacts representation quality, training efficiency, and computational cost. A well-curated set of pairs leads to stronger representations and faster convergence. As contrastive pre-training sees wider adoption for solving downstream tasks, data curation becomes essential for optimizing its effectiveness. In this survey, we attempt to create a taxonomy of existing techniques for positive and negative pair curation in contrastive learning, and describe them in detail. We also examine the trade-offs and open research questions in data curation for contrastive learning.
A Survey on Data Curation for Visual Contrastive Learning: Why Crafting Effective Positive and Negative Pairs Matters
Shasvat Desai, Debasmita Ghose, Deep Chakraborty
arXiv Preprint
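For readers less familiar with the setup the survey covers, here is a minimal InfoNCE-style contrastive loss sketch showing where curated positive and negative pairs enter the computation. This is a generic illustration, not code from the survey, and all names are illustrative.

```python
# Minimal InfoNCE-style contrastive loss, illustrating where positive/negative
# pair curation enters the computation (generic sketch, not code from the survey).
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, temperature=0.1):
    """anchor, positive: (d,) embeddings; negatives: (k, d) embeddings."""
    anchor = F.normalize(anchor, dim=0)
    positive = F.normalize(positive, dim=0)
    negatives = F.normalize(negatives, dim=1)
    pos_sim = anchor @ positive / temperature      # similarity to the curated positive
    neg_sim = negatives @ anchor / temperature     # similarities to the curated negatives
    logits = torch.cat([pos_sim.unsqueeze(0), neg_sim])
    # The loss pulls the anchor toward its positive and away from its negatives,
    # so which samples are chosen as pairs directly shapes the learned representation.
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```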
2024
Abstract
Stress detection in real-world settings presents significant challenges due to the complexity of human emotional expression influenced by biological, psychological, and social factors. While traditional methods like EEG, ECG, and EDA sensors provide direct measures of physiological responses, they are unsuitable for everyday environments due to their intrusive nature. Therefore, using non-contact, commonly available sensors like cameras and microphones to detect stress would be helpful. In this work, we use stress indicators from four key affective modalities extracted from audio-visual data: facial expressions, vocal prosody, textual sentiment, and physical fidgeting. To achieve this, we first labeled 353 video clips featuring individuals in monologue scenarios discussing personal experiences, indicating whether or not the individual is stressed based on our four modalities. Then, to effectively integrate signals from the four modalities, we extract stress signals from our audio-visual data using unimodal classifiers. Finally, to explore how the different modalities would interact to predict if a person is stressed, we compare the performance of three multimodal fusion methods: intermediate fusion, voting-based late fusion, and learning-based late fusion. Results indicate that combining multiple modes of information can effectively leverage the strengths of different modalities and achieve an F1 score of 0.85 for binary stress detection. Moreover, an ablation study shows that the more modalities are integrated, the higher the F1 score for detecting stress across all fusion techniques, demonstrating that our selected modalities possess complementary stress indicators.
Integrating Multimodal Affective Signals for Stress Detection from Audio-Visual Data
Debasmita Ghose*, Oz Gitelson*, Brian Scassellati
ACM International Conference on Multimodal Interaction, 2024 (ICMI 2024), San Jose, Costa Rica
(Poster)
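As a rough illustration of the voting-based late fusion compared in this paper, the sketch below majority-votes the outputs of the four unimodal stress classifiers; the function and modality names are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical voting-based late fusion over four unimodal stress classifiers
# (face, voice, text, fidgeting); a sketch of the general idea, not the paper's code.
from typing import Dict

def late_fusion_vote(unimodal_probs: Dict[str, float], threshold: float = 0.5) -> bool:
    """unimodal_probs maps each modality name to its predicted P(stressed)."""
    votes = [p >= threshold for p in unimodal_probs.values()]
    # Majority vote: the clip is labeled "stressed" if most modalities agree.
    return sum(votes) > len(votes) / 2

print(late_fusion_vote({"face": 0.8, "voice": 0.6, "text": 0.4, "fidget": 0.7}))  # True
```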
Abstract
To enable sophisticated interactions between humans and robots in a shared environment, robots must infer the intentions and strategies of their human counterparts. This inference can provide a competitive edge to the robot or enhance human-robot collaboration by reducing the necessity for explicit communication about task decisions. In this work, we identify specific states within the shared environment, which we refer to as Critical Decision Points, where the actions of a human would be especially indicative of their high-level strategy. A robot can significantly reduce uncertainty regarding the human's strategy by observing actions at these points. To demonstrate the practical value of Critical Decision Points, we propose a Receding Horizon Planning (RHP) approach for the robot to influence the movement of a human opponent in a competitive game of hide-and-seek in a partially observable setting. The human plays as the hider and the robot plays as the seeker. We show that the seeker can influence the hider to move towards Critical Decision Points, and this can facilitate a more accurate estimation of the hider's strategy. In turn, this helps the seeker catch the hider faster than estimating the hider's strategy whenever the hider is visible or when the seeker only optimizes for minimizing its distance to the hider.
Planning with Critical Decision Points: Robots that Influence Humans to Infer Their Strategy
Debasmita Ghose*, Michal Lewkowicz*, David Dong, Andy Cheng, Tran Doan, Emma Adams, Marynel Vázquez and Brian Scassellati
IEEE International Conference on Robot & Human Interactive Communication, 2024 (RO-MAN 2024), Pasadena, California
(Oral)
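A much-simplified sketch of the general ingredients described above: a belief over opponent strategies updated from observed actions, and a receding-horizon plan selection step. The function names and scoring interface are assumptions, not the paper's planner.

```python
# Generic receding-horizon planning loop with a belief over opponent strategies;
# a simplified illustration of the setup above, not the paper's implementation.
import numpy as np

def update_belief(belief, likelihoods):
    """Bayes update: P(strategy | action) is proportional to P(action | strategy) * P(strategy)."""
    posterior = belief * likelihoods
    return posterior / posterior.sum()

def receding_horizon_step(belief, candidate_plans, score_fn, horizon=5):
    """Pick the plan whose first `horizon` steps score best under the current belief."""
    scores = [score_fn(plan[:horizon], belief) for plan in candidate_plans]
    return candidate_plans[int(np.argmax(scores))]
```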
2023
Abstract
One important aspect of effective human-robot collaborations is the ability for robots to adapt quickly to the needs of humans. While techniques like deep reinforcement learning have demonstrated success as sophisticated tools for learning robot policies, the fluency of human-robot collaborations is often limited by these policies' inability to integrate changes to a user's preferences for the task. To address these shortcomings, we propose a novel approach that can modify learned policies at execution time via symbolic if-this-then-that rules corresponding to a modular and superimposable set of low-level constraints on the robot's policy. These rules, which we call Transparent Matrix Overlays, function not only as succinct and explainable descriptions of the robot's current strategy but also as an interface by which a human collaborator can easily alter a robot's policy via verbal commands. We demonstrate the efficacy of this approach on a series of proof-of-concept cooking tasks performed in simulation and on a physical robot.
Interactive Policy Shaping for Human-Robot Collaboration with Transparent Matrix Overlays
Jake Brawer, Debasmita Ghose, Kate Candon, Meiying Qin, Alessandro Roncone, Marynel Vazquez, Brian Scassellati
ACM/IEEE International Conference on Human-Robot Interaction, 2023 (HRI 2023), Stockholm, Sweden
(Oral)
Best Paper Award Winner (Technical)
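To make the overlay idea concrete, the sketch below applies symbolic if-this-then-that rules as superimposable masks over a learned policy's action scores; the data structures and names are assumptions for illustration, not the paper's API.

```python
# Sketch of superimposable rule "overlays" masking a learned policy's action scores
# (illustrative only; names and structure are assumptions, not the paper's API).
import numpy as np

def apply_overlays(action_scores, state, overlays):
    """Each overlay is an if-this-then-that rule: (condition(state), action_mask)."""
    masked = action_scores.copy()
    for condition, mask in overlays:
        if condition(state):
            masked = masked * mask  # zero out (or down-weight) disallowed actions
    return masked / masked.sum()    # renormalize into an action distribution

# e.g. "if the pan is hot, never choose action 2" (hypothetical rule)
overlays = [(lambda s: s["pan_hot"], np.array([1.0, 1.0, 0.0, 1.0]))]
print(apply_overlays(np.array([0.1, 0.4, 0.3, 0.2]), {"pan_hot": True}, overlays))
```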
2022
Abstract
Robots are well-suited to alleviate the burden of repetitive and tedious manipulation tasks. In many applications, though, a robot may be asked to interact with a wide variety of objects, making it hard or even impossible to pre-program visual object classifiers suitable for the task of interest. In this work, we study the problem of learning a classifier for visual objects based on a few examples provided by humans. We frame this problem from the perspective of learning a suitable visual object representation that allows us to distinguish the desired object category from others. Our proposed approach integrates human supervision into the representation learning process by combining contrastive learning with an additional loss function that brings the representations of human examples close to each other in the latent space. Our experiments show that our proposed method performs better than self-supervised and fully supervised learning methods in offline evaluations and can also be used in real-time by a robot in a simplified recycling domain, where recycling streams contain a variety of objects.
Tailoring Visual Object Representations to Human Requirements: A Case Study with a Recycling Robot
Debasmita Ghose, Michal Lewkowicz, Kaleb Gezahegn, Julian Lee*, Timothy Adamson*, Marynel Vazquez, Brian Scassellati
Conference on Robot Learning, 2022 (CoRL 2022), Auckland, New Zealand
(Poster)
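A simplified reading of the loss described above: a standard contrastive term plus an extra term that pulls the embeddings of human-provided examples together in latent space. This is an illustrative sketch, not the authors' exact formulation.

```python
# Sketch of combining a contrastive loss with a term that draws human-provided
# example embeddings together (simplified illustration, not the authors' exact loss).
import torch
import torch.nn.functional as F

def human_example_loss(human_embeddings):
    """Penalize dissimilarity between every pair of human-labeled example embeddings."""
    z = F.normalize(human_embeddings, dim=1)          # (n, d)
    pairwise_sim = z @ z.T                            # cosine similarities
    n = z.shape[0]
    off_diag = pairwise_sim[~torch.eye(n, dtype=torch.bool)]
    return (1.0 - off_diag).mean()                    # 0 when all examples coincide

def total_loss(contrastive_loss, human_embeddings, weight=1.0):
    # Contrastive term shapes the space; the extra term clusters the human examples.
    return contrastive_loss + weight * human_example_loss(human_embeddings)
```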
Abstract
Regular exercise provides many mental and physical health benefits. However, when exercises are done incorrectly, it can lead to injuries. Because the COVID-19 pandemic made it challenging to exercise in communal spaces, the growth of virtual fitness programs accelerated, putting people at risk of sustaining exercise-related injuries as they received little to no feedback on their exercise technique. Co-located robots could be one potential enhancement to virtual training programs, as they can lead to higher learning gains, greater compliance, and more enjoyment than non-co-located robots. In this study, we compare the effects of a physically present robot by having a person exercise either with a robot (robot condition) or with a video of a robot displayed on a tablet (tablet condition). Participants (N=25) had an exercise system in their homes for two weeks. Participants who exercised with the co-located robot made fewer mistakes than those who exercised with the video-displayed robot. Furthermore, participants in the robot condition reported a higher fitness increase and more motivation to exercise than participants in the tablet condition.
The Impact of an In-Home Co-Located Robotic Coach in Helping People Make Fewer Exercise Mistakes
Nicole Salomons*, Tom Wallenstein*, Debasmita Ghose*, Brian Scassellati
IEEE International Conference on Robot & Human Interactive Communication, 2022 (RO-MAN 2022), Naples, Italy
(Oral)
Abstract
Remote sensing data is crucial for applications ranging from monitoring forest fires and deforestation to tracking urbanization. Most of these tasks require dense pixel-level annotations for the model to parse visual information from limited labeled data available for these satellite images. Due to the dearth of high-quality labeled training data in this domain, there is a need to focus on semi-supervised techniques. These techniques generate pseudo-labels from a small set of labeled examples which are used to augment the labeled training set. This makes it necessary to have a highly representative and diverse labeled training set. Therefore, we propose to use an active learning-based sampling strategy to select a highly representative set of labeled training data. We demonstrate our proposed method's effectiveness on two existing semantic segmentation datasets containing satellite images: UC Merced Land Use Classification Dataset and DeepGlobe Land Cover Classification Dataset. We report a 27% improvement in mIoU with as little as 2% labeled data using active learning sampling strategies over randomly sampling the small set of labeled training data.
Active Learning for Improved Semi-Supervised Semantic Segmentation in Satellite Images
Shasvat Desai*, Debasmita Ghose*
IEEE/CVF Winter Conference on Applications of Computer Vision, 2022 (WACV 2022), Waikoloa, HI
(Oral)
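As one example of the sampling strategies this line of work relies on, the sketch below performs entropy-based selection of the most uncertain unlabeled images for annotation; it is a generic active-learning baseline, not necessarily the exact strategy proposed in the paper.

```python
# Entropy-based sampling, a common active-learning strategy for picking which
# images to label (generic sketch; the paper evaluates its own set of strategies).
import numpy as np

def entropy_sampling(prob_maps, budget):
    """prob_maps: (N, C, H, W) softmax outputs for N unlabeled images.
    Returns indices of the `budget` images with the highest mean pixel entropy."""
    eps = 1e-8
    entropy = -(prob_maps * np.log(prob_maps + eps)).sum(axis=1)  # (N, H, W)
    scores = entropy.mean(axis=(1, 2))                            # per-image uncertainty
    return np.argsort(scores)[-budget:]
```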
2021
Abstract
In this paper, we argue in favor of creating robots that both teach and learn. We propose a methodology for building robots that can learn a skill from an expert, perform the skill independently or collaboratively with the expert, and then teach the same skill to a novice. This requires combining insights from learning from demonstration, human-robot collaboration, and intelligent tutoring systems to develop knowledge representations that can be shared across all three components. As a case study for our methodology, we developed a glockenspiel-playing robot. The robot begins as a novice, learns how to play musical harmonies from an expert, collaborates with the expert to complete harmonies, and then teaches the harmonies to novice users. This methodology allows for new evaluation metrics that provide a thorough understanding of how well the robot has learned and enables a robot to act as an efficient facilitator for teaching across temporal and geographic separation.
Why We Should Build Robots that both Teach and Learn
Timothy Adamson*, Debasmita Ghose*, Shannon Yasuda, Lucas Shepard, Michal Lewkowicz, Joyce Duan, Brian Scassellati
ACM/IEEE International Conference on Human-Robot Interaction, 2021 (HRI 2021), Boulder, CO (virtual due to COVID)
(Oral)
2019
Abstract
Thermal images are mainly used to detect the presence of people at night or in bad lighting conditions, but perform poorly during the daytime. To solve this problem, most state-of-the-art techniques employ a fusion network that uses features from paired thermal and color images. Instead, we propose to augment thermal images with their saliency maps, to serve as an attention mechanism for the pedestrian detector, especially during daytime. We investigate how such an approach results in improved performance for pedestrian detection using only thermal images, eliminating the need for paired color images. For our experiments, we train Faster R-CNN for pedestrian detection and report the added effect of saliency maps generated using static and deep methods (PiCA-Net and R3-Net). Our best-performing model results in an absolute reduction of miss rate by 13.4% and 19.4% over the baseline in day and night images, respectively. We also annotate and release pixel-level masks of pedestrians on a subset of the KAIST Multispectral Pedestrian Detection dataset, which is the first publicly available dataset for salient pedestrian detection.
Pedestrian Detection in Thermal Images Using Saliency Maps
Debasmita Ghose*, Shasvat Desai*, Sneha Bhattacharya*, Deep Chakraborty*, Madalina Fiterau, Tauhidur Rahman
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2019 (CVPR 2019), Long Beach, CA
(Oral - Spotlight Talk)
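To illustrate the augmentation idea, the sketch below stacks a thermal image with its saliency map into a multi-channel detector input; the channel layout here is an assumption for illustration, not the preprocessing used in the paper.

```python
# Sketch of augmenting a single-channel thermal image with its saliency map before
# passing it to a detector (illustrative preprocessing; not the authors' pipeline code).
import numpy as np

def augment_with_saliency(thermal, saliency):
    """thermal: (H, W) image; saliency: (H, W) map in [0, 1].
    Stacks them so the detector can attend to salient (likely pedestrian) regions."""
    thermal = thermal.astype(np.float32) / 255.0
    return np.stack([thermal, thermal * saliency, saliency], axis=-1)  # (H, W, 3) input
```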