This is a linkpost for https://arxiv.org/abs/1904.09728

Background Note: Benchmark Study is a blog post series recording my study of benchmark papers. I am developing a new LLM evaluation framework that offers more flexibility than EleutherAI's LM Harness, and for the initial release I'm only adding benchmarks that I've studied. Each study note is meant to be read within 10 minutes. I receive GPT assistance here and there while writing these posts. I'm sharing the notes publicly partly to keep myself going and partly to help anyone who hasn't read the paper yet.

@inproceedings{sap2019social,
  title={Social IQa: Commonsense Reasoning about Social Interactions},
  author={Sap, Maarten and Rashkin, Hannah and Chen, Derek and Le Bras, Ronan and Choi, Yejin},
  booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)},
  pages={4463--4473},
  year={2019}
}

TL;DR

  • Dataset Overview: 38,000 multiple-choice questions to assess emotional and social intelligence in everyday contexts
  • Trainable Task: The learning curve for BERT-large shows substantial accuracy gains as training data grows, suggesting that large-scale benchmarks are necessary for advanced social reasoning tasks.
  • Transferable Property: Sequential finetuning yielded substantial performance increases on downstream tasks (Winograd, COPA), suggesting that SOCIAL IQA provides commonsense knowledge beneficial to other challenges.

Timeline Note: Everything below is written from the perspective of 2019, when "Social IQa: Commonsense Reasoning about Social Interactions" was published.


Section: Abstract

  • A large-scale benchmark for commonsense reasoning about social situations, featuring 38,000 multiple-choice questions to assess emotional and social intelligence in everyday contexts.
  • Crowdsourced: Commonsense questions are collected alongside correct and incorrect answers about social interactions, utilizing a novel approach to reduce stylistic biases in incorrect answers by having workers provide correct answers to related, but different, questions.
  • As a challenge for pretrained models: Demonstrates that the benchmark poses a significant challenge to question-answering models built on pretrained language models, which trail human accuracy by more than 20 percentage points.
  • As a source for transfer learning capability: Establishes SOCIAL IQA not only as a testing ground but also as a valuable resource for enhancing commonsense knowledge in AI systems, evidenced by state-of-the-art performance improvements on multiple commonsense reasoning tasks such as Winograd Schemas and COPA.

Section: Introduction

SOCIAL IQA

  • Purpose: To introduce a large-scale benchmark for assessing social and emotional intelligence in AI models through commonsense reasoning about social situations.
  • The challenge for AI: Highlights the difficulty AI models face in understanding social nuances, partly due to limitations in training data and inherent biases in language models.
  • Contributions: Presents SOCIAL IQA as a novel dataset containing 38,000 multiple-choice questions focused on everyday social events, designed to minimize annotation artifacts and improve AI's social reasoning capabilities.

Significance of Social and Emotional Intelligence

  • Human Ability: Describes humans' innate ability to understand others' mental states and predict their actions, a fundamental aspect of navigating social interactions.
  • AI Challenge: Points out the gap in AI's ability to replicate human-level social and emotional intelligence, despite advances in language model pretraining.

SOCIAL IQA as a Resource

  • Dataset Overview: Provides a comprehensive dataset with multiple-choice questions aimed at probing AI systems on social scenarios, requiring an understanding of motivations, emotional reactions, and likely actions.
  • Methodology: Outlines a crowdsourcing framework for generating questions and answers that address social commonsense reasoning, incorporating strategies to reduce bias in incorrect answer choices.
  • AI Performance Gap: Notes that current AI systems, including those based on BERT-large, significantly underperform compared to human benchmarks on this dataset.

Transfer Learning and Performance Improvement

  • Transfer Learning Capability: Demonstrates SOCIAL IQA's effectiveness as a tool for enhancing AI models' performance on other commonsense reasoning tasks through sequential finetuning.
  • Achievements: Achieves state-of-the-art results on challenging commonsense reasoning benchmarks, including COPA and Winograd Schemas, by leveraging knowledge gained from SOCIAL IQA.

Contributions and Innovations

  • Novel Dataset: Establishes the first extensive QA dataset specifically designed to test and improve social and emotional intelligence in AI, filling a significant gap in available resources.
  • Question-Switching Technique: Introduces an innovative method for collecting diverse and unbiased incorrect answers, aiming to reduce cognitive biases from annotators.
  • Benchmark Setting: Sets new performance benchmarks for AI systems on commonsense reasoning tasks, illustrating the potential of SOCIAL IQA for advancing AI research in understanding social contexts.

Section: Task Description

Overview

  • Purpose: To assess the social and emotional intelligence of AI models using a multiple-choice question-answering format.
  • Design: Questions are designed to require inferential reasoning about social situations, mirroring the intelligence needed for AI-human interaction, such as assisting users in real-life scenarios.
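The task format itself is simple to sketch: given a context, a question, and three candidate answers, a model scores each candidate and picks the argmax. The toy word-overlap scorer and the example below are my own inventions for illustration, not the paper's model or data.

```python
# Minimal sketch of SOCIAL IQA's three-way multiple-choice format.
# The overlap scorer is a toy stand-in for a real model.
def overlap_score(context: str, question: str, answer: str) -> int:
    """Count answer words that also appear in the context + question."""
    prompt = set((context + " " + question).lower().split())
    return sum(1 for w in answer.lower().split() if w in prompt)

def predict(context: str, question: str, answers: list) -> int:
    """Return the index of the highest-scoring answer (ties -> first)."""
    scores = [overlap_score(context, question, a) for a in answers]
    return scores.index(max(scores))

# Invented example in the dataset's style:
ctx = "Jordan drove Alex to the airport before sunrise."
q = "How would Alex feel afterwards?"
answers = ["grateful to Jordan", "angry about the weather", "hungry for breakfast"]
predict(ctx, q, answers)  # → 0
```

A lexical baseline like this is exactly the kind of shortcut the dataset's bias-reduction strategies (below) are designed to defeat.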

Importance of Theory of Mind

  • Definition: Refers to the human ability to understand others' mental states, motivations, and needs, crucial for navigating social interactions.
  • Goal for AI: Emphasizes the aim to endow AI systems with a form of Theory of Mind to improve their understanding of social contexts and human behavior.

Utilization of ATOMIC Knowledge Graph

  • Foundation: SOCIAL IQA leverages ATOMIC, a knowledge graph with inferential knowledge about social interactions, as a basis for task creation.
  • Content: ATOMIC includes 24k event phrases categorized into nine inference dimensions, covering causes and effects on agents and others involved in events.
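The nine inference dimensions can be made concrete; the names below are ATOMIC's actual dimension labels ("x" is the event's agent, "o" is other participants), while the question glosses are my paraphrases:

```python
# The nine ATOMIC inference dimensions underlying SOCIAL IQA question types.
ATOMIC_DIMENSIONS = {
    "xIntent": "why does X cause the event?",
    "xNeed":   "what does X need to do before the event?",
    "xAttr":   "how would X be described?",
    "xEffect": "what effect does the event have on X?",
    "xWant":   "what would X likely want to do after the event?",
    "xReact":  "how does X feel after the event?",
    "oEffect": "what effect does the event have on others?",
    "oWant":   "what would others likely want to do after the event?",
    "oReact":  "how do others feel after the event?",
}
```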

Creation of Contexts and Questions

  • Methodology: Generates natural language contexts from ATOMIC events and formulates questions that require commonsense reasoning to answer.
  • Diversity: Ensures a wide range of motivations, reactions, and actions are covered, mirroring the complexity of real-world social interactions.

Section: Dataset Creation

Event Rewriting for Context Creation

  • Purpose: To encompass a broad range of social situations using ATOMIC events as prompts.
  • Process: Workers on MTurk convert ATOMIC events into detailed sentences by adding names, correcting grammar, and filling placeholders.

Context, Question, and Answer Generation

  • Method: Crowdsourcing tasks generate context-question-answer triples based on ATOMIC's nine inference dimensions.
  • Detailing: Workers expand event sentences into detailed contexts and propose two potential correct answers.

Collection of Negative Answers

  • Dual Approach: Incorporates Handwritten Incorrect Answers (HIA) and Question-Switching Answers (QSA) to create adversarial incorrect options.
  • Objective: To minimize annotation biases and make it challenging for models by ensuring incorrect answers are stylistically similar to correct ones.
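The QSA idea is worth pinning down, since it is the paper's main bias-reduction trick: a distractor for one question is the *correct* answer to a different question about the same context, so it matches the right answer in style and topic but not in meaning. The function and field names below are illustrative, not the paper's pipeline.

```python
def build_qsa_example(context, target_question, correct_answer, other_qas):
    """Question-Switching Answers (QSA), sketched: distractors are correct
    answers to *different* questions about the same context.
    `other_qas` is a list of (question, correct_answer) pairs."""
    distractors = [a for q, a in other_qas if q != target_question][:2]
    return {"context": context,
            "question": target_question,
            "choices": [correct_answer] + distractors,
            "label": 0}
```

Because every choice was written as a *correct* answer to some question, stylistic cues (length, sentiment, fluency) no longer separate right from wrong.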

QA Tuple Creation and Validation

  • Aggregation: Combines data into three-way multiple-choice questions, selecting one correct answer and the least entailed incorrect answers.
  • Quality Assurance: Validates QA tuples via crowdsourcing, applying adversarial filtering to enhance challenge level.
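Adversarial filtering (in the spirit of Zellers et al., 2018) can be sketched as a loop that swaps out any distractor a weak adversary model finds too easy to reject. This is a simplified toy version; the predicate and candidate pool are placeholders for a trained adversary and a real answer pool.

```python
import random

def adversarial_filter(examples, too_easy, candidate_pool, rounds=3, rng=None):
    """Simplified adversarial filtering: distractors the adversary rejects
    too easily are replaced with fresh candidates, so only distractors
    that fool the adversary survive. `too_easy(example, distractor)` is a
    caller-supplied predicate standing in for a trained model."""
    rng = rng or random.Random(0)
    for _ in range(rounds):
        for ex in examples:
            ex["distractors"] = [
                rng.choice(candidate_pool) if too_easy(ex, d) else d
                for d in ex["distractors"]
            ]
    return examples
```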

Data Statistics and Partitioning

  • Distribution: Separates contexts into training, development, and test sets based on their originating ATOMIC event.
  • Content Analysis: Provides statistics on word count, vocabulary, and answer frequency, noting that incorrect answers outnumber correct ones (each question pairs one correct answer with two distractors).
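Splitting by originating ATOMIC event matters because multiple contexts are derived from the same event; a naive random split would leak near-duplicates across train and test. A minimal sketch of such a group-based split, with illustrative field names and a hash-based assignment (not the paper's exact procedure):

```python
import hashlib

def split_by_event(examples, dev_pct=10, test_pct=10):
    """Partition examples so every context derived from the same
    originating ATOMIC event lands in the same split, preventing
    train/dev/test leakage. `ex["event"]` is an illustrative field."""
    splits = {"train": [], "dev": [], "test": []}
    for ex in examples:
        # Deterministic bucket in [0, 100) from the event string.
        bucket = int(hashlib.md5(ex["event"].encode()).hexdigest(), 16) % 100
        if bucket < test_pct:
            splits["test"].append(ex)
        elif bucket < test_pct + dev_pct:
            splits["dev"].append(ex)
        else:
            splits["train"].append(ex)
    return splits
```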

Inference Dimensions and Question Types

  • Variety: SOCIAL IQA questions cover different types of inferential reasoning derived from ATOMIC dimensions.
  • Common Themes: Questions frequently address reactions to events and motivations, with less emphasis on involuntary effects or descriptive inquiries.

Section: Experiments

Experimental Setup for Evaluating Models on SOCIAL IQA

  • Training Details: Models were trained on 33k instances from the SOCIAL IQA training set, with hyperparameters selected based on performance on the development set.
  • Models and Implementation: Utilized OpenAI-GPT and BERT models of varying sizes, employing the HuggingFace PyTorch implementation for training.
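The multiple-choice setup roughly works by scoring each (context + question, answer) pair separately and taking a softmax over the three scores. The formatting sketch below uses BERT's conventional `[CLS]`/`[SEP]` markers; it illustrates the encoding scheme rather than reproducing the paper's exact pipeline.

```python
def format_mc_inputs(context, question, answers):
    """Format a three-way example as BERT-style sequence pairs: context and
    question form segment A, each candidate answer forms segment B. A model
    scores each string and a softmax over the three scores gives the
    prediction."""
    return [f"[CLS] {context} {question} [SEP] {answer} [SEP]"
            for answer in answers]
```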

Results and Model Performance

  • Performance Gap: BERT-large, the best-performing model, achieved significantly lower accuracy than human performance, highlighting the challenge SOCIAL IQA presents to current AI systems.
  • Importance of Context and Question: Ablation studies confirm that both context and question are crucial for the model's reasoning process, as removing them drastically reduces performance.

Learning Curve Analysis

  • Dataset Scale Impact: The learning curve for BERT-large indicates substantial improvement with more training data, suggesting the necessity of large-scale benchmarks for advanced reasoning tasks.

Error Analysis and Observations

  • Question Type Difficulty: Models found questions about pre-conditions less challenging compared to those on involuntary effects, motivations, and future actions.
  • Lexical Association Limitations: Errors suggest models might be relying on simple lexical associations rather than performing complex reasoning, leading to incorrect timing and participant-related mistakes.

Implications for AI Reasoning

  • Challenges in Social Reasoning: Current models struggle with reasoning about social situations, a gap that could potentially be addressed with models capable of more complex reasoning or explicitly equipped with commonsense knowledge.
     

Section: SOCIAL IQA for Transfer Learning

Utilization of SOCIAL IQA for Transfer Learning

  • Purpose: Demonstrates how sequential finetuning on SOCIAL IQA improves performance on commonsense reasoning tasks, specifically the Winograd Schema Challenge (WSC) and the Choice of Plausible Alternatives (COPA).
  • Performance Gains: Achieved state-of-the-art results on both WSC and COPA by finetuning models first on SOCIAL IQA, showing significant improvements over models not pre-finetuned on SOCIAL IQA.

Sequential Finetuning Process

  • Methodology: BERT-large was finetuned on SOCIAL IQA before being further finetuned on task-specific datasets for COPA and WSC.
  • Results: Sequential finetuning yielded substantial performance increases on downstream tasks, suggesting that SOCIAL IQA provides beneficial commonsense knowledge for other challenges.
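Schematically, sequential finetuning is just training the same weights in stages, so the target-task stage starts from socially-informed parameters rather than raw pretrained ones. The helper below is a deliberately generic sketch; `train_fn` stands in for any real finetuning routine.

```python
def sequential_finetune(model, datasets, train_fn):
    """Sequential finetuning, schematically: finetune on SOCIAL IQA first,
    then on the small target task (COPA or WSC), reusing the weights
    between stages. `train_fn(model, dataset)` returns the updated model."""
    for dataset in datasets:
        model = train_fn(model, dataset)
    return model
```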

Impact of Dataset Scale and Knowledge Type

  • Scale Impact: Improved performance on COPA with increased training data from SOCIAL IQA, indicating the beneficial effect of large-scale finetuning.
  • Knowledge Type Impact: Using SOCIAL IQA for finetuning resulted in better performance on COPA compared to using SWAG, highlighting the importance of social and emotional knowledge contained in SOCIAL IQA.