Scaling LLMs Part 5: When will we run out of “low-background tokens”?

Let’s start by talking about “low-background steel”, whose story begins in the shadow of World War II, a silent witness to a seismic shift in human history.

This unique form of steel was produced before the first nuclear detonations. The Trinity test in July 1945, followed by atmospheric weapons testing through the 1950s, spread nuclear fallout worldwide, introducing radionuclides like cobalt-60 into the atmosphere. Because steelmaking draws large volumes of atmospheric air into the process, these particles contaminated virtually all steel produced after the nuclear era began.

The significance of low-background steel arises from its scarcity and its critical role in scientific research, especially in fields where precise detection of radiation is paramount.

It became an irreplaceable resource for constructing sensitive instruments, such as Geiger counters and equipment for photonics research. Because it predates the fallout, it lacks the radioactive contamination that post-war steel carries, enabling more accurate readings in radiation detection and research.

The dwindling supply of low-background steel has led to intriguing stories, such as salvagers seeking out World War II-era shipwrecks, which are speculated to be targeted precisely because their hulls predate the fallout.

Now let’s talk about “low-background tokens”. These tokens are not a cryptocurrency, but the input data used to train Large Language Models (LLMs). For data, the release of OpenAI’s ChatGPT in November 2022 can be likened to a nuclear event: ever since, the digital atmosphere has burgeoned with AI-generated content.

The tokens from before this proliferation are untainted by the recursive feedback loop of AI creations—they are the low-background tokens of our time, a pristine dataset free from the echo of AI's own voice.

The significance? As AI-generated content spreads across the web, the undisturbed datasets from before this 'detonation' are a fixed, and therefore dwindling, resource. Their value lies in their untouched provenance, offering a baseline for training future AIs with original, unaltered human input.
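To make this concrete, here is a minimal sketch, under stated assumptions, of how such a baseline might be carved out in practice: partition a crawled corpus by timestamp so that documents predating ChatGPT's release form a "low-background" subset. The `crawl_date` field, the corpus format, and the exact cutoff date are illustrative assumptions, not a standard schema.

```python
from datetime import datetime, timezone

# Hypothetical cutoff: the public release of ChatGPT (30 November 2022).
# Documents reliably dated before this are treated as "low-background",
# i.e. very unlikely to contain LLM-generated text.
CHATGPT_RELEASE = datetime(2022, 11, 30, tzinfo=timezone.utc)


def is_low_background(doc: dict) -> bool:
    """Return True if the document predates the ChatGPT release.

    `doc` is assumed to carry a `crawl_date` field as an ISO-8601 string
    (e.g. from a web-crawl snapshot). Documents without a parseable date
    are conservatively treated as contaminated.
    """
    raw = doc.get("crawl_date")
    if not raw:
        return False
    try:
        stamped = datetime.fromisoformat(raw)
    except ValueError:
        return False
    if stamped.tzinfo is None:
        stamped = stamped.replace(tzinfo=timezone.utc)
    return stamped < CHATGPT_RELEASE


# Example: partition a tiny corpus into pre- and post-"detonation" subsets.
corpus = [
    {"text": "An old forum post.", "crawl_date": "2019-06-01T12:00:00+00:00"},
    {"text": "A fresh blog article.", "crawl_date": "2024-02-10T08:30:00+00:00"},
]
low_background = [d for d in corpus if is_low_background(d)]
post_detonation = [d for d in corpus if not is_low_background(d)]
```

A hard date cutoff is only a coarse proxy, since older pages can be edited after the cutoff and timestamps can be wrong, so in practice it would be one signal among several about a document's provenance.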

As we gradually lose access to these tokens, we face a potential future in which AI training reflects not just original human thought, but a recursive, self-referential loop of digital creation. The impact could be profound: models would increasingly learn how to understand and interact with the world from their own generated output rather than from purely human output.

Our challenge is to recognize the worth of these low-background tokens and to use them wisely. They are the bedrock of authenticity in a sea of synthesized information, enabling us to train AIs that genuinely comprehend human values: untainted and true.