The nuclear detonations of the mid-1940s spread radioactive particles across the entire planet, blanketing everything in radionuclides.
One consequence was that steel produced after the blasts absorbed these radionuclides from the atmosphere during manufacturing, making it unusable for highly radiation-sensitive instruments such as Geiger counters and space-borne sensors.
As a result, searches began for so-called “low-background steel” produced before the nuclear era. People went to great lengths to find it, such as raising battleships sunk during World Wars I and II.
The release of ChatGPT in 2022 is having an analogous impact on content created before the emergence of generative AI.
The new technology has flooded the internet with machine-generated text, forever changing the landscape of online content.
Instead of uncontaminated steel, what will become rare is human-written content created before the era of generative AI.
These "low-background tokens," untouched by AI's influence, will be vital for training future large language models (LLMs): a pristine dataset free from the echo chamber of AI's own voice.
As these pristine datasets vanish, models risk being trained on their own recycled outputs, which can degrade quality over successive generations. Data scarcity thus poses a significant challenge.
Synthetic data is now the deciding factor in which domains LLMs will improve the most.
Domains with structured outputs and cheap automated verifiers are particularly well suited to generating effectively unlimited high-quality synthetic data.
These domains include coding (verified via compilation and test execution), math and formal logic (verified via tools such as Prover9, Lean, or Isabelle), and controlled natural language systems (e.g., Attempto Controlled English), whose strict syntax and semantics automated reasoners can check.
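To make "cheap automated verifier" concrete for the coding case, here is a minimal Python sketch: execute a generated snippet in a child process and accept it only if it passes every test within a timeout. The entry-point name `solve` and the test-case format are assumptions chosen for illustration; a production system would sandbox far more aggressively.

```python
import multiprocessing

def _run_candidate(source, test_cases, result_queue):
    """Child process: exec the generated source and run it against tests."""
    try:
        namespace = {}
        exec(source, namespace)              # syntax errors fail here ("compiler" step)
        fn = namespace["solve"]              # hypothetical entry-point name
        ok = all(fn(*args) == expected for args, expected in test_cases)
        result_queue.put(ok)
    except Exception:
        result_queue.put(False)

def verify(source, test_cases, timeout=2.0):
    """Return True only if the candidate passes every test within the timeout."""
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_run_candidate,
                                   args=(source, test_cases, queue))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():                      # runaway sample: kill and reject
        proc.terminate()
        proc.join()
        return False
    return queue.get() if not queue.empty() else False
```

For example, `verify("def solve(a, b): return a + b", [((1, 2), 3), ((0, 0), 0)])` returns `True`, while a snippet that loops forever is terminated at the timeout and rejected.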
In these domains, we can generate a billion code samples or math proofs, run them through a verifier, and discard the 95% that turn out to be junk. The 5% that pass verification can be used as high-quality training data.
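The generate-and-filter loop is then only a few lines. In this sketch, `generate_sample` is a hypothetical wrapper around whatever model produces the candidates, each exposing assumed `source` and `test_cases` attributes, and `verify` is the checker from the previous sketch:

```python
def build_synthetic_dataset(generate_sample, verify, target_size):
    """Keep generating until we have target_size verified samples.

    Even if ~95% of raw generations fail verification, the survivors
    form a high-quality dataset; the cost is extra generation compute.
    """
    dataset = []
    while len(dataset) < target_size:
        candidate = generate_sample()        # model-produced code or proof
        if verify(candidate.source, candidate.test_cases):
            dataset.append(candidate)        # the ~5% that pass the verifier
    return dataset
```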
We can also refine generation strategies over time to produce harder samples as the model improves.
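One simple way to do this is to track the verification pass rate and raise a difficulty knob whenever the model passes too easily. The sketch below assumes a hypothetical `generate_at(difficulty)` generator and reuses `verify` from above; the 0.8 threshold is an arbitrary choice for illustration.

```python
def curriculum_generate(generate_at, verify, rounds, batch_size=1000):
    """Raise the difficulty whenever verification becomes too easy.

    generate_at(difficulty) is a hypothetical generator that accepts a
    difficulty level; a high pass rate signals the current level no
    longer challenges the model, so we move up a level.
    """
    difficulty, dataset = 1, []
    for _ in range(rounds):
        candidates = [generate_at(difficulty) for _ in range(batch_size)]
        passed = [c for c in candidates if verify(c.source, c.test_cases)]
        dataset.extend(passed)
        if len(passed) / batch_size > 0.8:   # too easy: harder problems next round
            difficulty += 1
    return dataset, difficulty
```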
We should expect to see unprecedented progress in these domains in the near future.