Synthetic Data Decides LLM Progress

Nuclear blasts in the mid-1940s spread radioactive particles across the entire planet, blanketing everything in radionuclides.

One consequence was that steel produced after the blasts was contaminated and could no longer be used for highly sensitive instruments, such as Geiger counters and space sensors.

As a result, searches began for so-called “low-background steel” produced before the nuclear era. People went to great lengths to find it, such as salvaging battleships sunk during World Wars I and II.

The release of ChatGPT in 2022 is having an analogous impact on content created before the emergence of generative AI.

The new technology has flooded the internet with machine-generated text, forever changing the landscape of online content.

Instead of uncontaminated steel, what's becoming rare is human-written content created before the era of generative AI.

These "low-background tokens" which have been untouched by AI's influence will be vital for training future Large Language Models (LLMs). It’s a pristine dataset free from the echo chamber of AI's own voice.

As these datasets vanish, AI risks recycling its own outputs. Data scarcity thus poses a significant challenge.

Synthetic data is now the deciding factor in which domains LLMs will improve the most.

Domains with structured outputs and cheap automated verifiers are particularly well-suited for generating infinite high-quality synthetic data.

These domains include coding (verification via compilation and test execution), math and formal logic (verification via Prover9, Lean, or Isabelle), and controlled natural language systems (e.g. Attempto Controlled English) whose strict syntax and semantics automated reasoners can verify.

In these domains, we can generate a billion code samples or math proofs, run them through a verifier, and toss out the, say, 95% that are junk. The 5% that remain can be used as high-quality training data.
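As a minimal sketch of what such a verify-and-filter step might look like in the coding domain (the function names and the (code, tests) sample format are my own illustration, not any lab's published pipeline):

```python
import subprocess
import sys
import tempfile

def verify_sample(code: str, tests: str, timeout: float = 5.0) -> bool:
    """Run a candidate solution against its unit tests in a subprocess.

    A sample survives only if the tests pass (exit code 0) within the
    time limit; crashes, failures, and infinite loops are all rejected.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + tests)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def filter_synthetic_data(samples):
    """Keep only the (code, tests) pairs that pass automated verification."""
    return [s for s in samples if verify_sample(*s)]
```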

We can also refine generation strategies over time to produce harder samples as the model improves.

We should expect to see unprecedented progress in these domains in the near future.

A Memorable Visit to AlUla, Saudi Arabia

I recently spent a few days in AlUla, Saudi Arabia, my first visit to the country. 

It’s a culture shock, in part because the country only opened its borders to tourists six years ago and got off to a slow start due to Covid.

The government is hoping to take advantage of desert landscapes and ancient rock formations to turn AlUla into a high-end tourist destination. One attraction is the Maraya Concert Hall with its all-mirror design. 

Nearby, there is Hegra, Saudi Arabia’s first UNESCO World Heritage site. Hegra contains tombs and carvings and is often compared with Petra in Jordan. 

The government’s investments highlight a commitment to reshaping the region. Visiting offers a unique opportunity to witness a country in transition, where contrasts abound.

On the one hand, it’s a conservative culture. On the other hand, all the people I met were open and curious about foreigners. I was struck by how friendly everyone was and how that warmth helped bridge cultural and linguistic divides.

Construction machines and vehicles dominate the landscape, reflecting the scale of development. Expatriates perform 95% of the work in AlUla, with the men coming from India, Pakistan, and African countries, and the women mainly from the Philippines.

There are brand-new roads crisscrossing the desert that link AlUla to nearby projects, including NEOM, the futuristic city along Saudi Arabia’s Red Sea coastline. It’s a massive effort aimed at promoting innovation and sustainability.

If NEOM reflects the vision for the future, Hegra preserves the country’s historical roots.

Some of the challenges of rapid development are evident in the contrast between AlUla’s modern infrastructure and the lack of services. Visitors sometimes struggle to find taxis, for example. 

AlUla’s streets feel safe, making exploration easy. You tend to see Japanese and South Korean cars alongside American muscle cars. I was surprised to see some Saudi women working in visible roles with obvious pride.

The combination of innovation and tradition makes AlUla unique, offering travellers a glimpse into a country balancing its past with ambitions for the future.

It’s a rare opportunity to experience a place in the midst of monumental change. 

I highly recommend a visit.

China Mass Produces Intelligence

The announcement that a team of Chinese researchers had created an open-source large language model caused deep declines in tech shares traded on the Nasdaq.

On the surface, it’s understandable. The new DeepSeek model delivers high-quality results at a fraction of the cost. Investors are concerned that DeepSeek’s ability to deliver results 10-30 times more cheaply could pop the AI bubble.

I expect the opposite to happen.

If anything, DeepSeek's success shows that we can expect AI models to deliver extraordinary results at an affordable price faster than ever before.

I expect every tech firm to copy the "efficiency gains" championed by DeepSeek. It will mean that every $1 billion investment in AI in the U.S. will result in an extra 10-30-fold gain in intelligence output.

We’ve already read reports that Meta now has four "war rooms" working on replicating the efficiency techniques published by DeepSeek, and I expect to see the benefits in its upcoming Llama 4 or 5, with a dramatic jump in performance.

Short-term, this may be negative for Nvidia. That’s because people won’t need to buy as much hardware as originally planned. Instead, they can turn to engineering optimization.

Longer-term, I am bullish on the ability of LLM enhancements to boost productivity and encourage new AI-fuelled innovation.

Here is how I expect many of the tech giants to be affected: 

⬆️Apple benefits from lessons learnt in optimizing for memory bandwidth in “edge inference”

⬆️Meta benefits because it wants to deploy AI services everywhere

⬆️Microsoft benefits from selling cheaper Azure LLM services

⬆️Amazon benefits because it didn’t build an LLM and can now host frontier open source models

⬇️Google will be hurt as its advantage in TPU hardware is eroded and new AI competition emerges for Google services such as search

⬇️AI labs that don't open source will suffer from brutal price competition, e.g. OpenAI, Anthropic and Cohere

There’s certainly an irony that China, known as a country that mass-produces consumer goods, has emerged as the source of innovation that may mass-produce intelligence!

It’s a good indication that the intense focus of capital and attention on training LLMs is coming to an end and will shift to products and services built on top of the technology.

Intelligence Has Never Been This Cheap

The Chinese AI lab DeepSeek recently released open-source large language models, v3 and R1, that shocked the world.

This time, the emphasis was on matching the performance of OpenAI’s 4o and o1 models while being more than 10 times cheaper to train and deploy.

I expect dramatic declines in costs to trigger a Cambrian explosion of new GenAI applications that start to encroach on traditional software, particularly SaaS products and a wide range of tools for human knowledge work. 

Steep declines in costs will likely shift the focus of the GenAI industry in 2025 to applications, particularly AI agents. Like the early days of the internet, it will be open season for a new generation of GenAI applications to move beyond "chatbots".

The Chinese lab spent an estimated $6 million to train DeepSeek v3 compared with $60-100 million for OpenAI GPT-4.

The inference cost of DeepSeek v3 is 10x lower than that of the equivalent OpenAI 4o. The inference cost of the DeepSeek R1 reasoning model is 27x lower than that of the equivalent OpenAI o1, even with additional novel features.

If anything, I expect the curve of improvement to get much steeper.

Further improvement in transformer architecture and training algorithms could lead to another 10-fold improvement. New hardware from Nvidia, Cerebras, and Groq could deliver a comparable boost. Combined, that would be a 100-fold decline in costs per unit of performance.

The Chinese open-source LLMs can be hosted on all cloud providers, resulting in brutal competition for AI labs that train foundational models.

With Meta and Chinese AI labs such as Alibaba’s Qwen team and DeepSeek releasing state-of-the-art LLMs, I expect AI labs like Mistral and Cohere will find it hard to remain independent.

The large decline in LLM costs and the increase in speed give SigTech the ability to 1) run multiple versions of solutions in parallel, 2) tackle increasingly complex jobs via a large team of agents at affordable cost, and 3) fine-tune an open-source LLM to create domain-specific SigTech LLMs.

We believe these developments will transform research in capital markets. 

The moat in research will be access to private, proprietary data and analytics.

Any research team that cannot beat the benchmark set by these cheap, capable models will struggle to survive.

Chinese DeepSeek v3 Passes CFA Exams Level-1

A Chinese AI lab just released the best open-source LLM: DeepSeek v3. I ran it through my private CFA Level 1 exams to compare it against OpenAI o1, gpt-4o and Anthropic claude-sonnet.

The CFA exams were run without access to the CFA curriculum and with zero-shot inference.
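Concretely, zero-shot here means each question is sent to the model exactly once, with no curriculum excerpts, worked examples, or retries. A minimal sketch of such a harness against an OpenAI-compatible API (the prompt wording, question format, and scoring helper are my own illustration, not the exact setup I used):

```python
from openai import OpenAI

client = OpenAI()  # assumes the API key is set in the environment

PROMPT = (
    "Answer this CFA Level 1 multiple-choice question. "
    "Reply with only the letter of the correct choice.\n\n{question}"
)

def answer_zero_shot(question: str, model: str) -> str:
    """Send one question, once, with no examples or retries."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(question=question)}],
    )
    return response.choices[0].message.content.strip().upper()

def score(exam: list[tuple[str, str]], model: str) -> float:
    """Fraction of (question, correct_letter) pairs answered correctly."""
    correct = sum(answer_zero_shot(q, model) == a for q, a in exam)
    return correct / len(exam)
```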

Overall, DeepSeek v3 delivered solid GPT-4-class performance. It even outperformed both gpt-4o and claude-sonnet in the “Quantitative Methods” and “Corporate Finance” chapters.

These are amazing results because DeepSeek v3 is more than 10x cheaper to train and deploy.

DeepSeek v3 was trained on a mere 2,048 Nvidia H800s at a cost of $5.576M. This is in stark contrast to the 16k GPUs used to train Meta Llama 3.1, the 20k GPUs used to train Grok 2 and the 100k GPUs used to train Grok 3.

It was estimated that GPT-4 was first trained at a cost of $60-70M.

The inference costs of DeepSeek v3 are $0.27 per 1M input tokens and $1.1 per 1M output tokens. These are ~9x cheaper than gpt-4o ($2.5 per 1M input tokens, $10 per 1M output tokens) and ~13x cheaper than claude-sonnet ($3 per 1M input tokens, $15 per 1M output tokens).
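As a quick back-of-the-envelope check of those ratios (prices as quoted above; the 50/50 input/output token mix is my own assumption):

```python
# Price in $ per 1M tokens (input, output), as quoted above.
PRICES = {
    "deepseek-v3": (0.27, 1.10),
    "gpt-4o": (2.50, 10.00),
    "claude-sonnet": (3.00, 15.00),
}

def blended_cost(model: str, input_share: float = 0.5) -> float:
    """Blended $ per 1M tokens for a given input/output mix."""
    inp, out = PRICES[model]
    return input_share * inp + (1.0 - input_share) * out

base = blended_cost("deepseek-v3")
for model in ("gpt-4o", "claude-sonnet"):
    ratio = blended_cost(model) / base
    print(f"{model}: {ratio:.1f}x the cost of DeepSeek v3")
# gpt-4o: ~9.1x, claude-sonnet: ~13.1x -- in line with the ~9x and ~13x above
```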

We should expect the same Chinese AI lab to release a competitor to OpenAI o1 in the near future. 2025 will be a watershed moment in AI development.

Charlie Munger lives on in LLMs

I never got to meet Charlie Munger in person, but I’ve sought to channel his wisdom for decades.

Munger was content to remain in the shadows compared with his more famous partner, Warren Buffett. Together they ran Berkshire Hathaway and made countless investors wealthy.

I was one of the people who benefited, having owned shares since 2008. 

In 2010, I tried to attend the annual Berkshire shareholder meeting in Omaha but ran into difficulties getting the tourist visa I needed to travel from London.

The folks in the American embassy found it impossible to believe anyone would travel to Omaha, Nebraska on holiday. They started a long background check and I never made it.

Unable to see Munger in person, I devoured his writings. I bought two autographed copies of Poor Charlie’s Almanack, a collection of Charlie’s speeches compiled by Peter Kaufman. I gave one to a friend who later founded a hedge fund in Hong Kong based on the philosophy of value investing. The other copy still sits in my library.

First published in 2005, the book remains as relevant today as ever. Munger’s multidisciplinary thinking and intellectual rigour haven’t aged a day. The book isn’t just about investing. It’s about learning how to think, how to reason, and how to understand the world.

Most of Munger’s famous quotes, even when they are about making money, transcend investing and seem just as applicable to other pursuits in life. For example: “The big money is not in the buying and selling, but in the waiting.” Or my favourite: “The best way to get what you want is to deserve what you want.”

Today we are wrestling with generative AI, a transformational technology. Since Munger passed away in November 2023 at the age of 99, we will never know his thoughts about it.

He was famously skeptical and cautious about new technologies, often stressing that “strange things can happen” when change occurs so quickly.

But I tend to think he would have found large language models useful. LLMs have already surpassed humans in many benchmarks. They are, by design, multidisciplinary, trained on vast amounts of knowledge in multiple languages. Munger would have appreciated the breadth.

He often talked about the psychology of human misjudgment. His list of 25 standard causes of human errors was clever and counterintuitive. He knew the consequence of combining these biases to produce what he called a “lollapalooza” effect.

In a way, LLMs sidestep these pitfalls. They have no hidden motives, no self-interest, no job worries. They don’t get offended. They don’t scheme. They are patient, objective critics. With proper guidance, an LLM can help us identify our blind spots, much like Munger did.

LLMs may provide us with the kind of straight-thinking advice Munger would have given.

For someone often labeled an anti-tech investor, Munger’s spirit endures in LLMs. 

I think he would appreciate the irony.

Build for AGI in 2025

As soon as I gained access to the OpenAI o1 model via the API, I ran the latest o1, gpt-4o and claude-sonnet LLMs through my private CFA Level 1 exams.

Now o1 has officially scored in the 99th percentile with a correct rate of 90.7%, without any access to CFA textbooks or mock exams, just a calculator and good prompts.

| CFA Level 1 Chapter | Total Questions | Correct (gpt-4o) | Correct (sonnet) | Correct (o1) |
|---|---|---|---|---|
| Ethical and Professional Standards | 126 | 74.60% | 73.81% | 82.54% |
| Quantitative Methods | 91 | 75.82% | 75.82% | 94.51% |
| Economics | 84 | 88.10% | 90.48% | 94.05% |
| Financial Reporting and Analysis | 160 | 78.13% | 78.75% | 91.88% |
| Corporate Finance | 67 | 79.10% | 71.64% | 89.55% |
| Equity Investments | 85 | 74.12% | 69.41% | 94.12% |
| Derivatives | 43 | 74.42% | 83.72% | 95.35% |
| Fixed Income | 89 | 73.03% | 71.91% | 88.76% |
| Alternative Investments | 37 | 83.78% | 81.08% | 97.30% |
| Portfolio Management | 57 | 77.19% | 75.44% | 85.96% |
| Total | 839 | 77.47% | 76.76% | 90.70% |

Because CFA materials are proprietary, they are not part of a model's pre-training data set. These benchmarks are therefore out of sample.

It's worth noting that o1 scored particularly well in the investment-related chapters, which are the most relevant to the target users of SigTech.

I expect o1's performance in "Portfolio Management" and "Fixed Income" to improve significantly if it's given access to our quant framework as tools.

I also expect the incoming o3 model to exceed a 95% correct rate, i.e. the 99.9th percentile.

I define AGI as an LLM that's able to do almost all knowledge work at the 99.9th percentile. I fully believe it will be here by the end of 2025.

It will be expensive to start with, but costs are on an exponential curve, declining 90% every 6-12 months.
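To make that curve concrete, a 90% decline per period compounds fast; here is the implied projection (the 9-month period is my own midpoint assumption within the stated 6-12 month range):

```python
def projected_cost(initial_cost: float, months: float, period: float = 9.0) -> float:
    """Cost after `months`, assuming a 90% decline every `period` months."""
    return initial_cost * 0.1 ** (months / period)

print(projected_cost(100.0, 9))   # ~10: one period,  10x cheaper
print(projected_cost(100.0, 18))  # ~1:  two periods, 100x cheaper
```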

In 2025, our entire business, including our product MAGIC and the organisation itself, must be built for the arrival of AGI. There is no playbook, so we have to think creatively and from first principles.

I hope everybody has a good holiday to rest and reflect. See you in 2025.

AI reviews EU AI Act

I used the OpenAI o1 reasoning LLM to review the EU AI Act.

It took o1 only 18 seconds to think. It took me 2 minutes to convert the PDFs to nicely formatted text.

My prompt was: “Give me the top 3 logical contradictions in the EU AI Act I've pasted here: TEXT”
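For anyone who wants to reproduce this, a minimal sketch of the call via the OpenAI Python SDK (the model identifier and the text file name are assumptions on my part):

```python
from openai import OpenAI

client = OpenAI()  # assumes the API key is set in the environment

# The EU AI Act PDFs, converted to plain text beforehand.
with open("eu_ai_act.txt") as f:
    act_text = f.read()

prompt = (
    "Give me the top 3 logical contradictions in the EU AI Act "
    f"I've pasted here: {act_text}"
)

response = client.chat.completions.create(
    model="o1",  # assumed identifier for the o1 reasoning model
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```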

Here is the o1 response:

1. A "risk-based" approach that isn’t always risk-based:
The proposal emphasizes a proportionate, risk-based regulatory model, claiming that requirements scale with the potential harm of an AI system. However, the classification of high-risk AI systems is often triggered by the sector or intended use listed in Annex III—regardless of actual, demonstrated risk levels. This means certain AI applications are subject to stringent obligations purely because they fall into a predefined category, contradicting the idea of truly tailoring requirements to the real and context-specific risks posed by the technology.

2. Encouraging innovation while imposing costly compliance burdens:
The Act states it aims to foster innovation and support small and medium-sized enterprises by providing regulatory sandboxes and other guidance. At the same time, it imposes stringent conformity assessments, continuous monitoring, and record-keeping obligations for high-risk AI systems. These demands can become a heavy financial and administrative load, especially for smaller providers. This tension between claiming to nurture innovation and simultaneously mandating complex, resource-intensive compliance contradicts the stated objective of reducing barriers and encouraging market entrants.

3. Harmonizing the single market but layering multiple standards and bodies:
The proposal seeks to create a uniform framework to prevent market fragmentation, ensuring that AI systems can freely circulate throughout the EU. Yet, it references and depends on numerous existing and future sectoral regulations, the involvement of multiple notified bodies, and potential reliance on a patchwork of technical standards and common specifications. As a result, developers and users may navigate a complex web of overlapping rules and authorities. This complexity undercuts the promise of a clear and harmonized single market for AI, creating, in practice, a contradictory situation where the pursuit of uniformity may instead lead to confusion and uneven application.

I’d love to run this prompt through o1 on all government documents.

Share a Laugh

One of the best things I did this year was to take a 10-week stand-up comedy course taught by comedian Logan Murray.

It culminated in a live performance before 100 people at the Museum of Comedy, which is located next to the British Museum in London. 

When I invited close friends and colleagues to the show, most were shocked. “That is so scary. Why would you do that?” they wondered. 

I didn’t do it for the reasons many do. I wasn’t trying to improve my public speaking skills or reduce stage fright. I certainly wasn’t preparing for a side hustle or career transition. 

It was more philosophical. Humour is so fundamental to human nature. I wanted to understand what makes a joke funny and why people laugh. How is humour related to curiosity and inquisitiveness? 

I knew my friends would understand once they saw the show. 

They are all curious, intelligent people. What I love most about them is that they share a sense of humour. They don’t take themselves too seriously. Over the years, we’ve bonded over laughter far more often than over food, drinks or holidays.

There is something about “sharing a laugh” that’s both primitive and profound. 

I took the stand-up comedy course to learn about this eternal form of human bonding.

Over 10 weeks, I studied with a group of total strangers. There were three semi-pro comedians, an ex-talk-show host, a former concert singer, a private equity analyst, an entrepreneur, a former professor, a bodyguard and me.

We were taught key comedic concepts of status, spontaneity and storytelling. We critiqued each other’s performances. Everyone had a distinct style. We rarely talked about what we did in “real life.”

It felt strange to know so much about each other and yet nothing at all.

But it was a wonderful bonding experience and a beautiful memory.

Like the memory of visiting Greenland with total strangers in 2016.

Know the Game

Mathematical analysis was my nemesis.

I had nightmares of “infinitesimal quantities”.

One of the biggest shocks of my life 25 years ago set in motion the chain of events that led me to start a financial AI company, SigTech.

It starts with my roots, growing up in Shanghai, China's largest city with 25 million inhabitants.

It’s a vibrant place filled with entrepreneurial energy. In Beijing, when people go out, they gossip about politics. In Shanghai, people talk about money. 

When you grow up in China, you get used to the idea that whatever you do, a million other people are trying to do the same thing. You are constantly under pressure to compete because it’s drilled into your brain that no one's going to give you anything. You have to earn it.

You have to be an excellent student and go to excellent schools to get an excellent job. That’s the way it works. So, I worked hard to get good grades and won national Olympiads in math and physics. Being a top student was my identity.

When I was given the chance to choose any major in any university in China, I chose to join a class of 20 students of “special talent” assembled at Shanghai Jiaotong University, one of the world’s top universities.

I quickly realized I was not going to be the best. In the mathematical analysis class I took, I was ranked in the bottom half.

It was shocking!

Even worse, no matter how much effort I put in, I simply lacked intuition for “epsilon”.

At the time, it seemed inconceivable. It challenged my self-identity. If I can't be the best in math, then who am I? What will I do with my life?

It was a critical period and it prompted me to start to think more about what I wanted, instead of what I was supposed to want. It’s not a common thing in China.

The reflection prompted me to take a different path from the one I had planned, eventually applying to Cambridge University for a doctorate, and afterwards working in finance in London.

Those lessons helped me, as a computer scientist, transition into broader roles, first at Barclays Capital and then at Brevan Howard.

I developed a sense of independence, fostered by the variety of jobs. When people told me to do something, I was less likely to blindly say yes. Instead, I asked more questions.

Looking back, the thing that probably contributed most to my professional development was being forced to sit in that room with 19 other students at university and acknowledge I wasn't ever going to be at the top of the class.

Sometimes it takes realizing you aren’t as good at the game as you thought, to change your mind about the game you want to play.

Like a hunting cheetah in the Kruger National Park, South Africa, 2018