Challenge

Challenge Description

The workshop hosts a challenge on scene-aware referential gesture generation, built on the MM-Conv dataset, which captures multimodal conversational interactions in 3D environments with synchronized speech, motion capture, and 3D scene representations. Its annotations for referential gestures provide a unique testbed for evaluating whether generated motion is temporally aligned with speech and spatially grounded in the environment.

Task: Given a spoken utterance, conversational context, and a virtual 3D scene, participants must generate co-speech referential gestures that are both communicatively appropriate and spatially consistent with the referenced objects. This task requires jointly reasoning about language semantics, scene geometry, and motion dynamics.

Evaluation: Submissions will be evaluated on motion quality (Fréchet Gesture Distance, FGD, and diversity) and on spatial grounding accuracy, which measures whether generated referential gestures correctly indicate the target objects in the scene. Baselines and evaluation code will be released alongside the challenge.
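To make the idea of spatial grounding concrete, here is a minimal sketch of one way such a check could be computed. It is not the official challenge metric. It assumes a pointing ray from a hypothetical origin joint (for example the shoulder) through a hypothetical tip joint (for example the fingertip), and it counts a gesture as a hit when the angle between that ray and the direction to the target object's center falls below a threshold. The function names, the choice of joints, and the 15-degree threshold are all illustrative assumptions. The released evaluation code defines the actual protocol.

```python
import math


def angular_error_deg(origin, tip, target):
    """Angle in degrees between the pointing ray (origin -> tip) and the
    direction from origin to the target object's center.

    All arguments are 3D points given as (x, y, z) sequences. Which joints
    serve as origin and tip is an assumption of this sketch.
    """
    ray = [t - o for o, t in zip(origin, tip)]
    to_target = [g - o for o, g in zip(origin, target)]
    ray_norm = math.sqrt(sum(c * c for c in ray))
    target_norm = math.sqrt(sum(c * c for c in to_target))
    if ray_norm == 0.0 or target_norm == 0.0:
        raise ValueError("degenerate pointing ray or target direction")
    cos_angle = sum(a * b for a, b in zip(ray, to_target)) / (ray_norm * target_norm)
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.degrees(math.acos(cos_angle))


def grounding_accuracy(samples, threshold_deg=15.0):
    """Fraction of (origin, tip, target) samples whose angular error is
    at most threshold_deg. The threshold is a hypothetical value."""
    if not samples:
        raise ValueError("no samples to evaluate")
    hits = sum(
        1
        for origin, tip, target in samples
        if angular_error_deg(origin, tip, target) <= threshold_deg
    )
    return hits / len(samples)


# Two toy gestures: one aimed straight at its target, one aimed 90 degrees off.
samples = [
    ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)),
    ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)),
]
accuracy = grounding_accuracy(samples)
```

An angular criterion is only one option. A ray-to-bounding-box intersection test or a nearest-object ranking would be equally plausible ways to decide whether a gesture indicates its target.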

For full details, see the challenge paper below and the HuggingFace page.
The dataset is available on the HuggingFace dataset page.

Submission deadline: July 26, 2026
Notification: August 5, 2026

Interactive visualizer:

Explore pointing gesture samples from the dataset. Each clip shows the motion around the gesture's peak frame. Full motion sequences and annotations are available on the HuggingFace dataset page.

Read the challenge paper:

hsi_challenge_release_v1.pdf
