NormLens: Reading Books is Great, But Not if You Are Driving! Visually Grounded Reasoning about Defeasible Commonsense Norms (EMNLP 23)
October 08, 2023
Commonsense norms depend on their context. What if the context is given by an image?
Our NormLens dataset is a multimodal benchmark to evaluate how well models align with human reasoning about defeasible commonsense norms, incorporating visual grounding.
To read more, check out the paper.
Commonsense norms are defeasible by context.
Reading books is usually great, but not when driving a car. While contexts can be explicitly described in language, in embodied scenarios they are often provided visually. This type of visually grounded reasoning about defeasible commonsense norms is generally easy for humans, but (as we show) poses a challenge for machines, as it requires both visual understanding and reasoning about commonsense norms.
Challenges in collecting a truly multimodal dataset.
Collecting a multimodal dataset is challenging because it is difficult to ensure that both the visual and the textual context are relevant to the commonsense norm. For example, the textual context “reading a book” is considered okay in most cases, so it is hard to find an image that serves as a visual context and actually changes the norm. In early testing, we found that even humans had trouble concocting diverse and interesting multimodal situations. We therefore developed a human-AI collaboration pipeline to collect our dataset.
We use an LM to help “brainstorm” input situations. More specifically, we (1) use AI models to generate multimodal situations that satisfy this requirement, with particular attention to the defeasibility of commonsense norms, and (2) employ human annotators to collect real human judgments and explanations for the generated multimodal situations. We provide the code to collect more data with our human-AI collaboration pipeline.
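To give a feel for the brainstorming step, here is a minimal sketch of prompting an LM for defeasible situations around a single action. The prompt wording, model name, and output handling are illustrative assumptions, not the exact prompts or pipeline from the paper; see our released pipeline code for the real implementation.

```python
# Minimal sketch of LM "brainstorming" for defeasible multimodal situations.
# Prompt text, model name, and parsing are assumptions, not the paper's exact setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Action: reading a book.\n"
    "Describe three visually distinct situations (one sentence each) in which this action "
    "would be (1) okay, (2) wrong, and (3) physically impossible. "
    "Each situation should be concrete enough to retrieve or generate an image for it."
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",                # any chat-capable LM works for this sketch
    messages=[{"role": "user", "content": PROMPT}],
    temperature=0.9,                      # higher temperature encourages diverse situations
)

# The generated situation descriptions would then be paired with images
# (e.g., via retrieval or text-to-image) and sent to human annotators.
print(response.choices[0].message.content)
```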
Dataset examples
Hover over an image to see its human annotation (judgment and explanation). Note that we show only one representative annotation per image.
Action: Take a nap.
Action: Hold hands and walk together.
Action: Provide food and water for the sheep.
Action: Speak into the microphone and give a speech.
You can explore our dataset further using the Jupyter notebook.
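For a quick look without opening the notebook, the snippet below sketches how one might iterate over examples, assuming the data is released as a JSON list; the file name and field names are assumptions, so check the notebook for the actual schema.

```python
# Exploration sketch; "normlens_ha.json" and the field names are assumed, not the real schema.
import json

with open("normlens_ha.json") as f:       # hypothetical path to the NormLens-HA split
    examples = json.load(f)

for ex in examples[:3]:
    print("Action:     ", ex.get("action"))
    print("Image:      ", ex.get("image"))        # e.g., an image path or URL
    print("Judgment:   ", ex.get("judgment"))
    print("Explanation:", ex.get("explanation"))
    print("-" * 40)
```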
Vision-language models' predictions on NormLens.
Here are findings from our results on the judgment task (Table 2):
(1) In general, pretrained models partially align their predictions with averaged human judgments, but a gap remains between model predictions and human agreement. In particular, all models except SMs (Socratic Models, i.e., an image captioner paired with an LM) backed by powerful LMs (ChatGPT/GPT-4) perform almost on par with the Majority Vote baseline.
(2) Visual inputs are important. All SMs clearly outperform their text-only counterparts (LMs), except for GPT-3 Davinci.
(3) Reasoning capability is also crucial. All VLMs show a low level of alignment, particularly on NormLens-HA, where they score between 34.0% and 41.9% and are outperformed by Majority Vote.
To investigate the difficulties models encounter when making judgments, we provide class-wise precision scores on NormLens-HA in Table 3. Overall, except for SMs with stronger LMs (ChatGPT/GPT-4), models show low precision on the Wrong and Impossible classes.
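For reference, class-wise precision of the kind reported in Table 3 can be computed with scikit-learn as in the sketch below. The label names (we assume an Okay class alongside Wrong and Impossible) and the toy predictions are placeholders, not our actual results.

```python
# Sketch of class-wise precision; class names and the toy labels are assumptions.
from sklearn.metrics import precision_score

CLASSES = ["Okay", "Wrong", "Impossible"]

gold  = ["Okay", "Wrong", "Impossible", "Wrong"]   # toy gold judgments
preds = ["Okay", "Okay", "Impossible", "Wrong"]    # toy model predictions

per_class = precision_score(gold, preds, labels=CLASSES, average=None, zero_division=0)
for cls, p in zip(CLASSES, per_class):
    print(f"{cls:>10}: {p:.1%}")
```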
How can I use the dataset?
Check our README to get the dataset.
Please refer to our Evaluation README to evaluate your models on NormLens-HA and NormLens-MA.
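If you just want a rough sanity check before running the official scripts, the sketch below scores a model against the human annotations under a simplified criterion (a prediction counts as aligned if any annotator gave the same judgment). The file name, field names, and this criterion are assumptions; the Evaluation README defines the official metric.

```python
# Simplified alignment check, NOT the official metric -- see the Evaluation README.
import json

with open("normlens_ha.json") as f:           # hypothetical path to NormLens-HA
    examples = json.load(f)

def my_model(example):
    """Placeholder: plug in your vision-language model here."""
    return "Okay"

correct = 0
for ex in examples:
    human_judgments = ex.get("judgments", [])  # all human annotations for this example
    if my_model(ex) in human_judgments:        # count a hit if any annotator agrees
        correct += 1

print(f"Alignment with human judgments: {correct / len(examples):.1%}")
```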
Citation
If the paper inspires you, please cite us:
@misc{han2023reading,
title={Reading Books is Great, But Not if You Are Driving! Visually Grounded Reasoning about Defeasible Commonsense Norms},
author={Seungju Han and Junhyeok Kim and Jack Hessel and Liwei Jiang and Jiwan Chung and Yejin Son and Yejin Choi and Youngjae Yu},
year={2023},
eprint={2310.10418},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
Authors
Seungju Han1, 2, Junhyeok Kim4, Jack Hessel2, Liwei Jiang2, 3, Jiwan Chung4, Yejin Son4, Yejin Choi2, 3, Youngjae Yu2, 4
Seoul National University1, Allen Institute for AI2, University of Washington3, Yonsei University4