How AI works · what it learnt from
Training data
30-second gist
Training data is the enormous pile of text an AI was fed during its early life. Most modern AIs are trained on a mix of: large parts of the open internet, scanned books, news, scientific papers, code repositories, and Wikipedia.
Whatever's in that pile is the air the AI breathes. Biases in the pile become biases in the AI. Gaps in the pile become blind spots. Stale data becomes outdated answers.
If you want more
What's typically in there?
Most big AIs are trained on a base of Common Crawl (a free archive of web pages), Wikipedia, large book collections (some licensed, some contested), code from GitHub, news archives, and academic papers. On top of that base, the model is refined with carefully curated human conversations to teach it to be helpful and not harmful.
Crucially, training has a cutoff date. Anything after that — events, prices, new people, new science — the AI did not see. Some chatbots can now look things up live, but the underlying knowledge is still frozen.
Why it matters to you
The AI inherits the internet's lopsidedness: more English than Tamil, more code than poetry, more men than women, more recent than historical. Asking the AI for a recommendation is asking for the average opinion of the open web — sometimes wise, sometimes not.
If the answer matters, ask yourself: was this likely in the training data? If yes, the AI probably knows it. If it's a recent event, a private document, or a niche topic, treat the answer as a guess.
A real lawsuit
In December 2023 the New York Times sued OpenAI and Microsoft, alleging that millions of NYT articles were used to train ChatGPT without permission. The case is ongoing. Similar suits have followed from authors, publishers, and image makers — all centred on whether training counts as fair use.