Flirting with Models S7E2: text2quant

I had the pleasure of speaking with Corey Hoffstein, the co-founder and Chief Investment Officer of Newfound Research, on the Flirting with Models podcast:

In this episode I speak with Bin Ren, founder of SigTech, a financial technology platform providing quantitative researchers with access to a state-of-the-art analysis engine.

This conversation is really broken into two parts. In the first half, we discuss Bin’s views on designing and developing a state-of-the-art backtesting engine. This includes concepts around monolithic versus modular design, how tightly coupled the engine and data should be, and the blurred line between where a strategy definition ends and the backtest engine begins.

In the second half of the conversation we discuss the significant pivot SigTech has undergone this year to incorporate large language models into its process. Or, perhaps more accurately, allow large language models to be a client to its data and services. Here Bin shares his thoughts on both the technical ramifications of integrating with LLMs as well as his philosophical views as to how the role of a quant researcher will change over time as AI becomes more prevalent.

You can listen to it here, on Spotify or Apple Podcasts.

Scaling LLMs Part 5: When will we run out of “low-background tokens”?

Let’s start with “low-background steel”, a material whose story began in the shadow of World War II, a silent witness to a seismic shift in human history.

This unique form of steel was produced before the advent of nuclear weapons in the mid-1940s. The detonation of the first nuclear bomb in July 1945 marked the beginning of widespread nuclear fallout, which introduced radionuclides like cobalt-60 into the atmosphere. Because steelmaking blows atmospheric air through the molten metal, these airborne particles contaminated steel produced anywhere in the post-nuclear era.

The significance of low-background steel arises from its scarcity and its critical role in scientific research, especially in fields where precise detection of radiation is paramount.

It became an irreplaceable resource for constructing sensitive instruments, such as Geiger counters and particle detectors. Because it predates the nuclear era, it lacks the radioactive contamination that post-war steel carries, enabling more accurate readings in radiation detection and research.

The dwindling supply of low-background steel has led to intriguing stories, such as salvagers stripping World War II-era shipwrecks, which are speculated to be targeted precisely for their low-background steel.

Now let’s talk about “low-background tokens”. These tokens are not a cryptocurrency, but the input data used to train Large Language Models (LLMs). The release of OpenAI’s ChatGPT in November 2022 can be likened to a nuclear event for data: ever since, the digital atmosphere has burgeoned with AI-generated content.

The tokens from before this proliferation are untainted by the recursive feedback loop of AI creations—they are the low-background tokens of our time, a pristine dataset free from the echo of AI's own voice.

The significance? With the exponential growth of AI, the undisturbed datasets from before this 'detonation' of AI-generated content are a dwindling resource. Their value lies in their untouched provenance, offering a baseline for training future AIs with original, unaltered human input.

As we gradually lose access to these tokens, we face a potential future where AI's training is a reflection not just of original human thought, but of a recursive, self-referential digital creation. The impact could be profound, influencing how AI understands and interacts with the world, based increasingly on its own generated output rather than pure human output.

Our challenge is to recognize the worth of these low-background tokens and to strategize their use wisely. They are the bedrock of authenticity in a sea of synthesized information, enabling us to train AIs on human values that are untainted and true.

Why was Sam Altman ousted? A Story of AGI and Governance

A recent leadership shift at OpenAI raises pivotal questions about Artificial General Intelligence (AGI) and its governance. Sam Altman's departure as CEO follows diverging views within OpenAI's board on what AGI truly means and how to handle its potential arrival.

According to OpenAI, AGI is a "highly autonomous system that outperforms humans at most economically valuable work”. However, in a recent talk at the Cambridge Union in England, Altman set a higher bar for AGI, one that includes the "discovery of new types of physics".

This definition matters. Under OpenAI's partnership with Microsoft, the tech giant has access only to AI models below the AGI threshold.

The broader Altman's definition of AGI, the more advanced the AI models that fall within Microsoft's ambit, since they remain classified as pre-AGI.

Internal developments suggest OpenAI might be closer to AGI than expected, leading to a strategic divide.

While Altman seemed inclined towards broader distribution, including Microsoft's utilization, others, like Ilya Sutskever, who led OpenAI's work on AGI alignment, favored a more cautious approach.

This clash of visions — between unleashing potential AGI advancements and ensuring rigorous safety and alignment — might have catalyzed Altman's exit.

The OpenAI board, committed to "safe AGI that is broadly beneficial”, had to make a tough call.

Stay tuned as events continue to unfold.

Scaling LLMs Part 3: A Review of the Major Breakthroughs in Reinforcement Learning by Google DeepMind

Welcome back to our series on scaling Large Language Models (LLMs)! Following our discussion on how compute trumps human heuristics, let's delve into Reinforcement Learning (RL), where this principle is vividly exemplified. RL is a method of training AI by rewarding desired behaviors and learning from interaction.

1. Success of AlphaGo:

AlphaGo's triumph was primarily due to combining deep neural networks with Monte Carlo tree search (MCTS). It first learned from human game records and then improved through self-play, producing superhuman Go strategies.

In March 2016, AlphaGo beat Lee Sedol 4-1 in a five-game match, the first time a computer Go program had beaten a 9-dan professional without a handicap.
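To make the search loop concrete, here is a minimal Python sketch of policy-value-guided MCTS using the PUCT selection rule that AlphaGo popularized. The toy game and the "network" are stand-ins (a stub returning uniform priors and random values), not DeepMind's implementation, and the sign alternation needed for two-player games is omitted:

```python
import math
import random

class Node:
    def __init__(self, prior):
        self.prior = prior          # P(s, a) from the policy network
        self.visits = 0             # N(s, a)
        self.value_sum = 0.0        # W(s, a)
        self.children = {}          # action -> Node

    def value(self):                # Q(s, a)
        return self.value_sum / self.visits if self.visits else 0.0

def legal_actions(state):
    return [0, 1, 2]                # toy game: three moves per position

def apply(state, action):
    return state + (action,)        # toy transition: append the move

def policy_value_stub(state):
    """Stand-in for the deep neural network: uniform policy, random value."""
    actions = legal_actions(state)
    return {a: 1.0 / len(actions) for a in actions}, random.uniform(-1, 1)

def select_child(node, c_puct=1.5):
    """PUCT: balance exploitation Q against a prior-weighted exploration bonus."""
    total = sum(ch.visits for ch in node.children.values())
    return max(
        node.children.items(),
        key=lambda kv: kv[1].value()
        + c_puct * kv[1].prior * math.sqrt(total + 1) / (1 + kv[1].visits),
    )

def mcts(root_state, num_simulations=100):
    root = Node(prior=1.0)
    for _ in range(num_simulations):
        node, state, path = root, root_state, []
        while node.children:                      # 1. select down the tree
            action, node = select_child(node)
            state = apply(state, action)
            path.append(node)
        priors, value = policy_value_stub(state)  # 2. expand and evaluate
        node.children = {a: Node(p) for a, p in priors.items()}
        for n in [root] + path:                   # 3. back up the value
            n.visits += 1
            n.value_sum += value
    # play the most-visited move, as AlphaGo does
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]

print(mcts(root_state=()))
```

Self-play, as in AlphaZero below, amounts to running this search for both sides and retraining the network on the games it produces.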

2. From AlphaGo to AlphaZero:

AlphaZero represented a major leap, learning entirely through self-play without prior human game data. This approach was applied not only to Go but also to chess and shogi (Japanese chess), showcasing a shift from specialized to generalized AI learning systems.

3. From AlphaZero to MuZero:

MuZero extended AlphaZero's capabilities to games with unknown dynamics, learning from environmental interactions without needing the game rules. It combined a learned world model with deep learning and MCTS, planning in a latent space and advancing towards AI that understands and interacts with complex environments.
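As a rough sketch of what "learning the dynamics" means: MuZero learns three functions, called representation, dynamics, and prediction in the paper, and plans by rolling the dynamics function forward in latent space. The toy dimensions and untrained networks below are illustrative assumptions; only the three-function structure comes from MuZero:

```python
import torch
import torch.nn as nn

OBS, HID, ACTIONS = 8, 32, 3  # toy sizes, chosen for illustration

representation = nn.Sequential(nn.Linear(OBS, HID), nn.ReLU())  # h: observation -> latent state
dynamics = nn.Linear(HID + ACTIONS, HID + 1)                    # g: (state, action) -> next state, reward
prediction = nn.Linear(HID, ACTIONS + 1)                        # f: state -> policy logits, value

def rollout(obs, actions):
    """Plan entirely in latent space: no game rules, no real environment."""
    state, reward = representation(obs), torch.tensor(0.0)
    for a in actions:
        one_hot = torch.eye(ACTIONS)[a]
        out = dynamics(torch.cat([state, one_hot]))
        state, reward = out[:-1], out[-1]
    out = prediction(state)
    return out[:-1], out[-1], reward  # policy logits, value, last reward

logits, value, reward = rollout(torch.randn(OBS), actions=[0, 2, 1])
```

Inside MCTS, each hypothetical move is simulated by one such latent-space step instead of a call to a game engine.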

4. RL Plays a Central Role in LLM Alignment:

RL is crucial in LLM alignment, particularly in Reinforcement Learning from Human Feedback (RLHF), where LLMs are refined using human preference data so that their outputs align with human values and guidelines.
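Concretely, RLHF starts by training a reward model on pairs of responses ranked by humans. Below is a minimal sketch of that pairwise (Bradley-Terry) loss, with the reward model itself reduced to a stub for illustration:

```python
import torch
import torch.nn.functional as F

def reward_model(response_embedding: torch.Tensor) -> torch.Tensor:
    """Stub: maps a response embedding to a scalar reward."""
    return response_embedding.sum(dim=-1)

def preference_loss(chosen: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    """Train the reward model so the human-preferred response scores higher.

    loss = -log sigmoid(r(chosen) - r(rejected))
    """
    margin = reward_model(chosen) - reward_model(rejected)
    return -F.logsigmoid(margin).mean()

# One batch of (preferred, rejected) response pairs, as embeddings.
chosen, rejected = torch.randn(4, 16), torch.randn(4, 16)
print(preference_loss(chosen, rejected))
```

The trained reward model then scores the LLM's outputs during a policy-optimization step (typically PPO), closing the RL loop.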

Looking ahead, RL could significantly enhance LLM training and inference, a topic we'll explore in our next post. RL might guide LLMs towards more accurate, context-aware, and ethically aligned outputs, heralding a new era in AI innovation.

Scaling LLMs Part 4: How Q* by OpenAI works and why it dramatically accelerates the push toward AGI

Welcome to our exploration of Q*, a groundbreaking development in AI, particularly in scaling Large Language Models (LLMs).

In short, Q* makes LLMs PONDER.

1. What Q* Represents:

Q* stands for a new technique that leverages reinforcement learning (RL), akin to MuZero-style RL by Google DeepMind, to enhance LLMs' capabilities. It's an approach where the model learns optimal behaviors through a process similar to trial and error.

2. How Q* Works:

Q* adopts MuZero-style RL, using Monte Carlo Tree Search (MCTS) combined with a learned reward function. This allows Q* to simulate a vast array of potential actions (token predictions) and learn from the best outcomes, greatly improving its multi-step decision-making.

For example, an LLM can sample each next token 100 times and then learn which sequence of tokens ultimately leads to the best outcome (the highest reward).
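Since nothing about Q* has been published, the sketch below only illustrates this sample-and-score idea; both the "LLM" and the reward function are stubs, and this is emphatically not OpenAI's actual method:

```python
import random

VOCAB = ["2", "3", "4", "+", "="]

def sample_next_token(prefix: list) -> str:
    """Stub for an LLM's sampling step."""
    return random.choice(VOCAB)

def reward(sequence: list) -> float:
    """Stub reward: e.g. 1.0 if a maths checker accepts the answer."""
    return 1.0 if sequence.count("=") == 1 else 0.0

def best_of_n(prefix: list, n_samples: int = 100, length: int = 5) -> list:
    """Sample many candidate continuations, keep the highest-reward one."""
    best_seq, best_reward = None, float("-inf")
    for _ in range(n_samples):                 # predict many times...
        seq = list(prefix)
        for _ in range(length):
            seq.append(sample_next_token(seq))
        r = reward(seq)
        if r > best_reward:                    # ...keep the best outcome
            best_seq, best_reward = seq, r
    return best_seq

print(best_of_n(["2", "+", "2"]))
```

A full tree search would also reuse statistics across shared prefixes, as in MCTS, rather than sampling each continuation independently.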

3. Mathematics as a String Manipulation Game:

In the realm of formal mathematics, systems like ZFC are built on "starting strings" (axioms), "string generators" (axiom schemas), and "string manipulation rules" (rules of inference).

The goal is to use these rules on the starting strings, or on strings generated from the schemas, to produce a target string, known as a theorem.

This framework aligns well with how Q* approaches mathematical problem-solving, treating math as a game of string manipulations.
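To see the "string game" in miniature, here is a toy formal system in Python (in the style of Hofstadter's MU puzzle, not ZFC) where proving a theorem is literally a search over string rewrites; the axiom, rules, and target are all illustrative:

```python
from collections import deque

AXIOM = "MI"
RULES = [
    lambda s: [s + "U"] if s.endswith("I") else [],       # xI  -> xIU
    lambda s: [s + s[1:]] if s.startswith("M") else [],   # Mx  -> Mxx
    lambda s: [s[:i] + "U" + s[i + 3:]                    # III -> U
               for i in range(len(s) - 2) if s[i:i + 3] == "III"],
]

def prove(target: str, max_len: int = 10):
    """Breadth-first search for a derivation from AXIOM to target."""
    queue, seen = deque([[AXIOM]]), {AXIOM}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path                       # the derivation = the proof
        for rule in RULES:
            for nxt in rule(path[-1]):
                if nxt not in seen and len(nxt) <= max_len:
                    seen.add(nxt)
                    queue.append(path + [nxt])
    return None

print(prove("MIUIU"))  # ['MI', 'MIU', 'MIUIU']
```

Blind breadth-first search explodes combinatorially on real mathematics; the promise of a Q*-like method is to learn which branches of this search tree are worth expanding.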

4. Q* and Mathematical Problem Solving:

Q* generates a 'tree of thoughts' by predicting the next tokens multiple times, akin to exploring different branches in MCTS. Each branch represents a potential solution path, and the model learns to navigate this tree effectively using feedback on the correctness of each solution.

5. Trading Off Inference Time for Quality:

Q* trades increased inference time for enhanced quality of outcomes. By spending more time analyzing each decision (e.g. predicting the next token 100x), Q* achieves a level of inference quality that rivals much larger models.

6. Small Model, Big Impact:

This strategy enables a smaller model to deliver the performance of a significantly larger one. It's an efficient way to scale up the capabilities of LLMs without proportionally increasing their model size, data size and compute.

7. Overcoming Data Scarcity with Synthetic Data:

Intriguingly, Q*'s approach of learning from its best predictions is akin to training with self-generated synthetic data. This method effectively addresses one of the major challenges in scaling LLMs: data scarcity.

By generating and learning from its own predictions, Q* paves the way for more efficient and scalable AI models, marking a new era in AI research and applications.

OpenAI’s newly announced “document retrieval” doesn’t really work for finance?!

All LLMs today need Retrieval-Augmented Generation (RAG). Why? To access private knowledge and to work around the small context window.

GPT-4 is the state-of-the-art reasoning engine that was trained on a vast public corpus of human knowledge across numerous domains.

But it has NO access to our private knowledge in the form of proprietary documents, internal chat history, video conference transcripts, customer information, commercial contracts, etc. It's also RARELY possible to inject the entire body of private knowledge into the context window, which is limited to 128k tokens.

For a given query, RAG works by doing a semantic search over the corpus of private knowledge and retrieving only the relevant parts as context for an LLM.
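Here is a minimal sketch of that retrieval step, with a toy corpus and a stub embed() function standing in for a real embedding model:

```python
import numpy as np

CORPUS = [
    "Q3 revenue grew 12% year-on-year.",
    "The fund's FX hedging policy was updated in May.",
    "Office lease renewals are due in January.",
]

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Stub embedding: a unit vector seeded by the text (deterministic
    within a run). A real system would call an embedding model instead."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def retrieve(query: str, k: int = 2) -> list:
    """Return the k corpus chunks most similar to the query (cosine)."""
    q = embed(query)
    scores = [(float(embed(doc) @ q), doc) for doc in CORPUS]
    return [doc for _, doc in sorted(scores, reverse=True)[:k]]

# Stuff only the retrieved chunks, not the whole corpus, into the prompt.
context = "\n".join(retrieve("How did revenue change last quarter?"))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: ..."
```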

However, the newly announced RAG in GPT-4 Turbo doesn't work out of the box for most applications in finance because:

1) One has to manually upload up to 20 files, with each file limited to 512MB.

2) Only a few file types are supported (e.g. pdf, csv, txt).

3) There is no connectivity to any data source (e.g. database, data lakes, document stores).

4) The cost of storing these files is very steep: $0.20/GB/agent/day, i.e. about $6/GB/month, roughly 260x AWS S3 standard pricing of $0.023/GB/month.

Also, LLMs in general (even with RAG) are fundamentally terrible at time series analysis.

At SigTech we combine state-of-the-art tools and data sets in finance with LLMs to maximise the productivity of our human users.

Scaling LLMs Part 2: The Triumph of Compute Over Human Heuristics

Welcome back to our series on scaling Large Language Models (LLMs)! Following our exploration of multi-modality's impact on enhancing LLM learning, let's dive into another pivotal aspect: the role of compute.

- The Dominance of Generic Training Methods:

Picture a scenario where you're trying to optimize a complex system. Initially, you might apply specialized rules based on human understanding, akin to using a detailed map for navigation.

However, as computational power grows, a more effective approach emerges: generic training methods that leverage this compute power. It's like switching from using a map to a GPS that continuously learns and updates the best routes.

In AI, this principle holds true: Generic training methods with more compute always trump human-crafted heuristics.

- Embracing Complexity in AI Development:

A key insight from our exploration of compute is recognizing the immense complexity of human cognition. Unlike simple models of space, objects, or agents, human thought processes are deeply intricate.

In scaling LLMs, we aim not to encode these complexities directly but to develop meta-methods that enable AI to discover and navigate this complexity on its own. It’s about equipping AI to find patterns and approximations in data as humans do, rather than pre-loading it with our existing knowledge.

This approach allows AI to evolve and adapt in ways that mimic human discovery and learning.

- A Real-World Example: From AlphaGo to AlphaZero by Google DeepMind:

Initially, AlphaGo learned from human players, much like a student learning from a textbook. But AlphaZero changed the game. It learned by playing against itself, akin to a student who learns not from books, but by experimenting and discovering new knowledge independently.

This shift from human-guided learning to self-exploration and self-improvement showcases the power of computation in AI development.

- The Future of AI:

Envision a world where AI can not only learn from what it's been taught but can also innovate and discover new ideas, much like an artist who evolves from imitating others to creating their unique style.

This future of AI, where originality and creativity flourish, is powered by the relentless growth of computational capabilities.

Scaling LLMs Part 1: Why multi-modality is key to accelerating LLM learning: a journey from text to image and video

- Predicting the next token is understanding:

Large Language Models (LLMs) like GPT-4 don't just learn languages; they learn about the world. By predicting the next token in a vast array of texts, these models gradually build a 'world model'.

This means they're not only understanding language structure but also grasping the complex web of human knowledge, behavior, and societal norms. Essentially, they learn how the world works, one word at a time.

Now, imagine you're deeply immersed in a detective novel, rich with clues, complex characters, and twisted plots. The story builds to a climax where the detective declares, "Now I'm going to reveal the name of the murderer, and it is ___".

If an AI model can correctly predict the next word that fills this blank, it's showing an understanding that extends far beyond language. To accurately complete this sentence, the AI must understand the entire novel - every plot twist, character arc, and subtle hint.

This analogy vividly demonstrates how predicting the next word in a sequence requires and reflects a profound understanding of the context.

- Enhancing AI's Perception with Visual Data:

Over 50% of our brain's cortex is dedicated to visual processing, highlighting the importance of visual information in human understanding. Similarly, when AI incorporates visual data, it undergoes a transformative shift in comprehension.

A compelling example is how LLMs, despite not having 'seen' a single photon, gradually develop an understanding of colors. This information is indirectly 'leaked' into the AI's learning through the vast textual data it processes.

This process mirrors how human understanding of concepts like color can be shaped through descriptions, even without direct visual experience.

But by integrating visual data, AI's learning can be dramatically accelerated, much like adding a powerful new sense to its existing capabilities.

- Videos: The Next Frontier:

Incorporating videos into AI's learning process adds the dimension of time and motion. Videos help AI understand how objects and entities interact and change over time, following the physical laws of our universe.

It's the difference between a static picture of a bird and a video showing the fluid motion of its flight. By learning from videos, AI not only recognizes but understands the dynamics of the world around us, completing its transition from static observer to dynamic participant.

In this way, AI's progression from text to imagery, and eventually to videos, reflects a deepening and broadening of its understanding, paralleling the multi-faceted way humans perceive and interact with the world around us.

Oliver Wyman Innovators' Exchange: Potential Applications Of Generative AI In Quant Trading

I had the pleasure of speaking with Hiten Patel, partner and global head of financial infrastructure, technology, and services at Oliver Wyman, on the Innovators' Exchange podcast:

In the latest episode of Innovators' Exchange, Hiten Patel speaks with Bin Ren, the CEO and co-founder of SigTech, a leading provider of quant technologies. The discussion centers around democratizing access to quant trading strategies, the surge of retail investing, and the profound implications of generative AI (Gen AI) in the financial markets.

Bin unveils the story behind SigTech's mission to accelerate the idea-to-market process in capital markets. Their focus on reducing the lifecycle of ideas from months to seconds empowers traders and portfolio managers, democratizing the landscape for both professionals and retail traders. The company's cutting-edge technology, featuring Gen AI, enables quick generation, testing, and deployment of trading ideas, contributing to a more efficient and accessible market.

You can listen to it here, or on Spotify and Apple Podcasts.