Datapoints used to train notable artificial intelligence systems

See all data and research on:

What you should know about this indicator

  • Training data size refers to the volume of data employed to train an artificial intelligence (AI) model effectively. It's a representation of the number of examples that the model learns from during its training process. It is a fundamental measure of the scope of the data used in the model's learning phase.
  • To grasp the concept of training data size, imagine teaching a friend the art of distinguishing different types of birds. In this analogy, each bird picture presented to your friend corresponds to an individual piece of training data. If you showed them 100 unique bird photos, then the training data size in this scenario would be quantified as 100.
  • Training data size is an essential indicator in AI and machine learning. First and foremost, it directly impacts the depth of learning achieved by the model. The more extensive the dataset, the more profound and comprehensive the model's understanding of the subject matter becomes. Additionally, a large training data size contributes significantly to improved recognition capabilities. By exposing the model to a diverse array of examples, it becomes adept at identifying subtle nuances, much like how it becomes skilled at distinguishing various bird species through exposure to a large variety of bird images.
Datapoints used to train notable artificial intelligence systems
The number of examples provided to train an AI model. Typically, more data results in a more comprehensive understanding by the model.
Epoch (2024) – with major processing by Our World in Data
Last updated
June 3, 2024
Next expected update
August 2024

Sources and processing

This data is based on the following sources

Retrieved on
June 3, 2024
This is the citation of the original data obtained from the source, prior to any processing or adaptation by Our World in Data. To cite data downloaded from this page, please use the suggested citation given in Reuse This Work below.
Epoch AI, ‘Parameter, Compute and Data Trends in Machine Learning’. Published online at Retrieved from: ‘’ [online resource]

How we process data at Our World in Data

All data and visualizations on Our World in Data rely on data sourced from one or several original data providers. Preparing this original data involves several processing steps. Depending on the data, this can include standardizing country names and world region definitions, converting units, calculating derived indicators such as per capita measures, as well as adding or adapting metadata such as the name or the description given to an indicator.

At the link below you can find a detailed description of the structure of our data pipeline, including links to all the code used to prepare data across Our World in Data.

Read about our data pipeline

Reuse this work

  • All data produced by third-party providers and made available by Our World in Data are subject to the license terms from the original providers. Our work would not be possible without the data providers we rely on, so we ask you to always cite them appropriately (see below). This is crucial to allow data providers to continue doing their work, enhancing, maintaining and updating valuable data.
  • All data, visualizations, and code produced by Our World in Data are completely open access under the Creative Commons BY license. You have the permission to use, distribute, and reproduce these in any medium, provided the source and authors are credited.


How to cite this page

To cite this page overall, including any descriptions, FAQs or explanations of the data authored by Our World in Data, please use the following citation:

“Data Page: Datapoints used to train notable artificial intelligence systems”, part of the following publication: Charlie Giattino, Edouard Mathieu, Veronika Samborska and Max Roser (2023) - “Artificial Intelligence”. Data adapted from Epoch. Retrieved from [online resource]
How to cite this data

In-line citationIf you have limited space (e.g. in data visualizations), you can use this abbreviated in-line citation:

Epoch (2024) – with major processing by Our World in Data

Full citation

Epoch (2024) – with major processing by Our World in Data. “Datapoints used to train notable artificial intelligence systems” [dataset]. Epoch, “Parameter, Compute and Data Trends in Machine Learning” [original data]. Retrieved July 15, 2024 from