Training Data
The dataset used to teach a machine learning model: the collection of examples from which the model learns the patterns it will later recognize and reproduce.
Why it matters
Training data determines what a model knows and what biases it carries. The adage "garbage in, garbage out" applies directly: a model cannot learn patterns that are absent from its data, and it will reproduce whatever biases that data contains.
What goes in
For large language models, training data typically includes web pages, books, academic papers, code repositories, and other text sources — often measured in trillions of tokens. For specialized models, training data might be medical records, legal documents, or engineering specifications.
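To make the scale concrete, here is a minimal sketch of how a training mixture might be budgeted. The source names, proportions, and token budget are purely illustrative assumptions, not figures from any real model:

```python
# Hypothetical training-data mixture for a language model.
# All proportions and the token budget below are illustrative assumptions.
mixture = {
    "web_pages": 0.60,
    "books": 0.15,
    "academic_papers": 0.10,
    "code": 0.10,
    "other": 0.05,
}

total_tokens = 2_000_000_000_000  # e.g. a 2-trillion-token budget

# Allocate the budget across sources according to the mixture weights.
tokens_per_source = {
    source: int(fraction * total_tokens) for source, fraction in mixture.items()
}

# Sanity check: the mixture weights should sum to 1.
assert abs(sum(mixture.values()) - 1.0) < 1e-9
```

In practice these weights are a major design decision: upweighting code or academic text changes what the resulting model is good at.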
Quality over quantity
Early AI research emphasized dataset size. Current research increasingly emphasizes data quality and curation. A smaller, carefully curated dataset often produces better models than a larger, noisy one. This is why companies invest heavily in data filtering, deduplication, and quality scoring.
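As a sketch of what one of those curation steps can look like, here is a minimal exact-deduplication pass that hashes normalized text. This is a hypothetical helper, not any specific pipeline's code; production systems add near-duplicate detection (e.g. MinHash) and quality scoring on top:

```python
import hashlib

def dedupe_exact(documents):
    """Drop exact duplicates by hashing normalized text.

    A minimal sketch: real pipelines also detect near-duplicates and
    score document quality, which this deliberately omits.
    """
    seen = set()
    kept = []
    for doc in documents:
        # Normalize whitespace and case so trivial variants hash the same.
        key = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)  # keep the first copy of each duplicate
    return kept

docs = ["The cat sat.", "the cat sat.  ", "A different sentence."]
deduped = dedupe_exact(docs)
```

Here the second document is dropped because it differs from the first only in case and trailing whitespace.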
Key considerations
- Bias — training data reflects the biases of its sources. If the data overrepresents certain viewpoints or demographics, the model will too.
- Freshness — models have a knowledge cutoff. They cannot know about events that happened after their training data was collected.
- Copyright and consent — using copyrighted content for training is legally contested. Some datasets now emphasize openly licensed or consensually contributed content.
- Contamination — if evaluation benchmarks appear in training data, the model may score well on tests without genuinely understanding the material.
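The contamination concern above is often checked mechanically by looking for n-gram overlap between benchmark items and the training corpus. A minimal sketch of that idea (a small n and toy strings for illustration; real checks use longer n-grams over much larger corpora):

```python
def ngrams(text, n=3):
    """Return the set of word n-grams in a lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(benchmark_items, training_ngrams, n=3):
    """Fraction of benchmark items sharing at least one n-gram with training data."""
    if not benchmark_items:
        return 0.0
    hits = sum(1 for item in benchmark_items if ngrams(item, n) & training_ngrams)
    return hits / len(benchmark_items)

# Toy example: one benchmark item overlaps the training text, one does not.
training_text = "the quick brown fox jumps over the lazy dog near the river"
training_ngrams = ngrams(training_text, n=3)
benchmark = ["the quick brown fox jumps", "completely unrelated question here"]
rate = contamination_rate(benchmark, training_ngrams, n=3)
```

A nonzero rate flags benchmark items whose phrasing appears verbatim in the training data, which is a reason to distrust scores on those items.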