Test scores of AI systems on various capabilities relative to human performance

About this data

Human performance, as the benchmark, is set to zero. The capability of each AI system is normalized to an initial performance of -1.
Source
Kiela et al. (2023) – with minor processing by Our World in Data
Last updated
April 2, 2024
Next expected update
April 2025
Date range
1998–2023
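
The normalization described above (human performance at 0, each system's initial performance at -1) can be read as a linear rescaling. The page does not state the exact formula, so the sketch below is an assumption about how such a rescaling could work; the function name `normalize_score` and the example numbers are hypothetical and are not taken from the dataset.

```python
def normalize_score(score: float, initial: float, human: float) -> float:
    """Linearly rescale a raw benchmark score (assumed formula).

    After rescaling, the first recorded AI result (`initial`) maps to -1
    and the human baseline (`human`) maps to 0. Values above 0 mean the
    system scores better than the human baseline.
    """
    if human == initial:
        raise ValueError("human and initial performance must differ")
    return (score - human) / (human - initial)


# Hypothetical raw accuracies for one benchmark, in percent.
initial_accuracy = 60.0   # first recorded AI result (made-up value)
human_accuracy = 90.0     # human baseline (made-up value)

print(normalize_score(60.0, initial_accuracy, human_accuracy))  # -1.0
print(normalize_score(75.0, initial_accuracy, human_accuracy))  # -0.5
print(normalize_score(90.0, initial_accuracy, human_accuracy))  # 0.0
```

Under this reading, a value of -0.5 means the system has closed half of the gap between its starting point and human performance.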

Sources and processing

This data is based on the following sources

This dataset captures the progression of AI evaluation benchmarks, reflecting their adaptation to the rapid advancements in AI technology. The benchmarks cover a wide range of tasks, from language understanding to image processing, and are designed to test AI models' capabilities in various domains. The dataset includes performance metrics for each benchmark, providing insights into AI models' proficiency in different areas of machine learning research.

  • BBH (BIG-Bench Hard): This benchmark serves as a rigorous evaluation framework for advanced language models, targeting their capacity for complex reasoning and problem-solving. It identifies tasks where AI models traditionally underperform compared to human benchmarks, emphasizing the enhancement of AI reasoning through innovative prompting methods like Chain-of-Thought.
  • GLUE (General Language Understanding Evaluation): GLUE is a comprehensive benchmark suite designed to assess the breadth of an AI model's language understanding capabilities across a variety of tasks, including sentiment analysis, textual entailment, and question answering. It aims to advance the field towards more generalized models of language comprehension.
  • GSM8K: This dataset challenges AI models with a collection of grade-school math word problems, designed to test computational and reasoning abilities. By requiring models to perform a sequence of arithmetic operations, GSM8K evaluates the AI's capacity for multi-step mathematical problem-solving.
  • HellaSwag: HellaSwag assesses AI models on their ability to predict the continuation of scenarios, demanding a nuanced understanding of context and narrative. This benchmark pushes the boundaries of predictive modeling and contextual comprehension within AI systems.
  • HumanEval: Targeting the intersection of AI and software development, HumanEval presents programming challenges to evaluate the code generation capabilities of AI models. This benchmark tests models' understanding of coding logic and their ability to produce functional code solutions.
  • ImageNet: A cornerstone in the field of computer vision, ImageNet provides a large-scale dataset for object recognition and classification tasks. It benchmarks the ability of AI models to accurately identify and categorize images, serving as a foundational tool for visual AI research.
  • MMLU (Massive Multitask Language Understanding): MMLU offers a diverse set of language understanding challenges, testing AI models across a broad spectrum of domains and task types. It aims to evaluate and promote the development of AI systems with comprehensive and adaptable language capabilities.
  • MNIST: As a fundamental benchmark in image processing and computer vision, MNIST tests AI models on their ability to recognize handwritten digits. This dataset is pivotal in assessing the basic perceptual and pattern recognition capabilities of AI systems.
  • SQuAD 1.1 and 2.0 (Stanford Question Answering Dataset): These benchmarks evaluate the reading comprehension abilities of AI models, requiring them to extract or infer answers from textual passages. SQuAD 2.0 further introduces the challenge of discerning unanswerable questions, adding a layer of complexity in judgment and inference.
  • SuperGLUE: An extension of GLUE, SuperGLUE presents a set of more demanding language understanding tasks, designed to test the limits of AI models' reasoning, comprehension, and inference capabilities. It serves as a metric for cutting-edge advancements in natural language processing.
  • Switchboard: This benchmark focuses on the processing and understanding of conversational speech, testing AI models on their ability to navigate the complexities of human dialogue. It highlights the challenges in speech recognition and natural language understanding within spontaneous communication.
Retrieved on
April 2, 2024
Citation
This is the citation of the original data obtained from the source, prior to any processing or adaptation by Our World in Data. To cite data downloaded from this page, please use the suggested citation given in Reuse This Work below.
Kiela, D., Thrush, T., Ethayarajh, K., & Singh, A. (2023) 'Plotting Progress in AI', Contextual AI Blog. Available at: https://contextual.ai/blog/plotting-progress (Accessed: 02 April 2024).

How we process data at Our World in Data

All data and visualizations on Our World in Data rely on data sourced from one or several original data providers. Preparing this original data involves several processing steps. Depending on the data, this can include standardizing country names and world region definitions, converting units, calculating derived indicators such as per capita measures, as well as adding or adapting metadata such as the name or the description given to an indicator.

At the link below you can find a detailed description of the structure of our data pipeline, including links to all the code used to prepare data across Our World in Data.

Read about our data pipeline
Notes on our processing step for this indicator

We mapped the benchmarks to their respective domains based on a review of each benchmark's primary focus and the specific capabilities it tests within AI systems:

  • MNIST was mapped to "Handwriting recognition", as it tests AI systems' ability to recognize and classify handwritten digits, a fundamental task in the domain of digital image processing.
  • GLUE was categorized under "Language understanding" due to its assessment of models across a variety of linguistic tasks, highlighting the general capabilities of AI in understanding human language.
  • ImageNet was categorized as "Image recognition", focusing on the ability of AI systems to accurately identify and categorize images into predefined classes, showcasing the advancements in visual perception.
  • SQuAD 1.1 and SQuAD 2.0 were distinguished as "Reading comprehension" and "Reading comprehension with unanswerable questions" respectively. While both benchmarks evaluate reading comprehension, SQuAD 2.0 adds an extra layer of complexity with the introduction of unanswerable questions, demanding deeper understanding and reasoning from AI models.
  • BBH was aligned with "Complex reasoning", as it challenges AI with tasks that require not just logical reasoning but also creative thinking, simulating complex problem-solving scenarios.
  • Switchboard was associated with "Speech recognition" due to its focus on transcribing and understanding human speech within a conversational context, evaluating AI's ability to process and respond to spoken language.
  • MMLU was placed in "General knowledge tests", given its assessment across multiple disciplines and topics, requiring a broad and comprehensive understanding of language.
  • HellaSwag was mapped to "Predictive reasoning" for its evaluation of AI's ability to predict logical continuations within given contexts, testing commonsense reasoning and understanding.
  • HumanEval was categorized under "Code generation", focusing on AI's capability to understand programming languages and generate code that solves specific problems, highlighting skills in logical thinking and algorithmic problem-solving.
  • SuperGLUE was designated as "Nuanced language interpretation" due to its advanced set of linguistic tasks that require deep understanding, reasoning, and interpretation of text, pushing the boundaries of what AI can comprehend.
  • GSM8K was mapped to "Math problem-solving", as it tests AI on solving mathematical problems that involve reasoning and logical deduction, reflecting capabilities in numerical understanding and problem-solving.
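
The benchmark-to-domain mapping listed above can be written as a simple lookup table. This is an illustrative sketch only, not code from the Our World in Data pipeline; the names `BENCHMARK_DOMAINS` and `domain_for` are hypothetical, and the keys assume the benchmark names are spelled as in the source list (for example "GSM8K").

```python
# Domain labels as given in the processing notes above.
BENCHMARK_DOMAINS = {
    "MNIST": "Handwriting recognition",
    "GLUE": "Language understanding",
    "ImageNet": "Image recognition",
    "SQuAD 1.1": "Reading comprehension",
    "SQuAD 2.0": "Reading comprehension with unanswerable questions",
    "BBH": "Complex reasoning",
    "Switchboard": "Speech recognition",
    "MMLU": "General knowledge tests",
    "HellaSwag": "Predictive reasoning",
    "HumanEval": "Code generation",
    "SuperGLUE": "Nuanced language interpretation",
    "GSM8K": "Math problem-solving",
}


def domain_for(benchmark: str) -> str:
    """Return the domain label for a benchmark, failing loudly if unmapped."""
    try:
        return BENCHMARK_DOMAINS[benchmark]
    except KeyError:
        raise KeyError(f"No domain mapping for benchmark {benchmark!r}") from None


print(domain_for("SQuAD 2.0"))
```

Raising an error for unknown names, rather than returning a default label, makes a misspelled benchmark name visible instead of silently dropping it into a catch-all category.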

Reuse this work

  • All data produced by third-party providers and made available by Our World in Data are subject to the license terms from the original providers. Our work would not be possible without the data providers we rely on, so we ask you to always cite them appropriately (see below). This is crucial to allow data providers to continue doing their work, enhancing, maintaining and updating valuable data.
  • All data, visualizations, and code produced by Our World in Data are completely open access under the Creative Commons BY license. You have the permission to use, distribute, and reproduce these in any medium, provided the source and authors are credited.

Citations

How to cite this page

To cite this page overall, including any descriptions, FAQs or explanations of the data authored by Our World in Data, please use the following citation:

“Data Page: Test scores of AI systems on various capabilities relative to human performance”, part of the following publication: Charlie Giattino, Edouard Mathieu, Veronika Samborska and Max Roser (2023) - “Artificial Intelligence”. Data adapted from Kiela et al. Retrieved from https://ourworldindata.org/grapher/test-scores-ai-capabilities-relative-human-performance [online resource]
How to cite this data

In-line citation

If you have limited space (e.g. in data visualizations), you can use this abbreviated in-line citation:

Kiela et al. (2023) – with minor processing by Our World in Data

Full citation

Kiela et al. (2023) – with minor processing by Our World in Data. “Test scores of AI systems on various capabilities relative to human performance” [dataset]. Kiela et al., “Dynabench: Rethinking Benchmarking in NLP” [original data]. Retrieved November 23, 2024 from https://ourworldindata.org/grapher/test-scores-ai-capabilities-relative-human-performance