Why are LLMs so good at everything?

Steve Crossan
3 min read · Mar 23, 2023

LLMs are incredible chatbots. But they are also super impressive at a wide range of tasks, from writing code to predicting the properties of novel molecules. Why is that?

Let’s start with why “just make it big” turned out to be such a good bet for question-answering systems. After all, it wasn’t so long ago that insiders were dismissing scaling as a dead end. Is brute force (masses of data, masses of parameters and masses of training time) all you need?

This is the story of “double descent” — models that get better with more parameters, then get worse as they overfit the training data, but then get better again as you push forward into yet more parameters.

It turns out that, at least in some domains, as you increase the number of model parameters towards the size of your training set, you do indeed get the familiar “overfitting”: a model that’s very good at predicting within its training set but generalises very poorly.

But as you go beyond this regime with yet more parameters, the model starts to get better at generalising again (‘double descent’ refers to the way the error initially reduces, then goes up as you overfit, then goes down again as you increase the number of parameters even further).

The intuition here is that with a huge number of parameters (orders of magnitude more than training examples) there are many models that fit the training data perfectly, so what distinguishes them is how well they interpolate between training points to unseen data.
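
This is easy to reproduce in a toy setting. Below is a minimal sketch (my own example, not from any particular paper) using random-feature regression: the test error of a minimum-norm least-squares fit typically falls, spikes near the interpolation threshold (where the parameter count matches the training-set size), then falls again as you keep adding parameters.

```python
import numpy as np

# Toy double-descent experiment: least-squares regression on random
# ReLU features, sweeping the number of features p through the
# interpolation threshold p == n_train. Illustrative setup only.
rng = np.random.default_rng(0)
d = 20                                   # input dimension
w_true = rng.normal(size=d)              # hidden ground-truth linear function

def make_data(n, noise=0.5):
    X = rng.normal(size=(n, d))
    y = X @ w_true + noise * rng.normal(size=n)
    return X, y

X_train, y_train = make_data(100)
X_test, y_test = make_data(2000)

for p in [10, 50, 90, 100, 110, 200, 1000, 5000]:
    W = rng.normal(size=(d, p)) / np.sqrt(d)   # fixed random projection
    phi = lambda X: np.maximum(X @ W, 0.0)     # random ReLU features
    # lstsq returns the minimum-norm solution, which exactly
    # interpolates the training set once p >= n_train.
    coef, *_ = np.linalg.lstsq(phi(X_train), y_train, rcond=None)
    test_mse = np.mean((phi(X_test) @ coef - y_test) ** 2)
    print(f"p = {p:5d}   test MSE = {test_mse:8.3f}")
```

Run it and you should see the error dip, peak around p = 100 (the size of the training set), and then descend again for much larger p.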

That makes LLMs very good chatbots, but also very good general software tools. Why? I think it’s because as a model gets very good at extrapolating from seen to unseen (usually never-seen) text, the most efficient way to do so is to embed an implicit representation of the domain in the model. So a model that’s only been trained on text (including some text about chemistry) can answer relatively complex research questions about novel molecules, because it has effectively had to build an implicit physico-chemical model to perform well at the language task it was trained on.

There’s a repeated pattern in which the first model to conquer a particular milestone (GPT-3, DALL-E, AlphaFold) is quickly followed by models that cost orders of magnitude less to train but have similar or near-similar performance. Partly that’s people making different quality-cost tradeoffs, but often it happens because the expensive model has revealed some structure in the domain which we can then build into the model architecture as an inductive bias (e.g. what we know about physics and chemistry).
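
A rough illustration of how much an inductive bias can save (the numbers below are hypothetical, and this is my example rather than one from the models above): a convolution hard-codes the assumption that image statistics are translation-invariant, so it needs vastly fewer parameters than a fully-connected layer over the same input.

```python
# Hypothetical parameter-count comparison: building in translation
# invariance (a convolution) vs. learning everything from scratch
# (a dense layer). Numbers are illustrative only.
H = W = 224                 # input image height and width
C_in, C_out, k = 3, 64, 3   # input/output channels, conv kernel size

# Dense layer: every input pixel connects to every output unit.
dense_params = (H * W * C_in) * (H * W * C_out)

# Conv layer: one small k x k filter per (input, output) channel pair,
# shared across all spatial positions.
conv_params = (k * k * C_in) * C_out

print(f"dense: {dense_params:,} parameters")   # ~4.8e11
print(f"conv:  {conv_params:,} parameters")    # 1,728
```

The same knowledge either has to be learned (expensively, from data) or can be built in (cheaply, as architecture), which is one way the follow-on models get so much cheaper.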

So an interesting question: will most domains be dominated by LLMs that have implicitly learned that domain (will the teams with the biggest training budgets “win” in every domain)?

