CHAMPAGNE: Learning Real-world Conversation from Large-Scale Web Videos

March 20, 2023

We provide a large-scale dataset of 18M video-based dialogues and introduce a generative model of conversations that learns from these dialogues so that it accounts for visual contexts.

Paper Link

Visual information is central to conversation: body gestures and facial expressions, for example, contribute to meaning that transcends words alone. To date, however, most neural conversational models are limited to just text. We introduce CHAMPAGNE, a generative model of conversations that can account for visual contexts. To train CHAMPAGNE, we collect and release YTD-18M, a large-scale corpus of 18M video-based dialogues. YTD-18M is constructed from web videos: crucial to our data collection pipeline is a pretrained language model that converts error-prone automatic transcripts to a cleaner dialogue format while maintaining meaning.

Human evaluation reveals that the dialogues in YTD-18M are more sensible and specific than those in prior resources (MMDialog, 1M dialogues), while maintaining visual-groundedness. Experiments demonstrate that 1) CHAMPAGNE learns to conduct conversation from YTD-18M; and 2) when fine-tuned, it achieves state-of-the-art results on four vision-language tasks focused on real-world conversations.

Will be updated! Stay tuned!

If the paper inspires you, please cite us:

  title={CHAMPAGNE: Learning Real-world Conversation from Large-Scale Web Videos},
  author={Seungju Han and Jack Hessel and Nouha Dziri and Yejin Choi and Youngjae Yu},