Thursday, April 16, 2026

Machine Learning and Complexity

At Oxford I focused my research and discussions on how we can use the tools of computational complexity to help us understand the power and limitations of machine learning. Last week I posted my paper How Does Machine Learning Manage Complexity?, a first step in this direction. Let me give a broad overview of the paper. Please refer to the paper for more technical details.

Instead of focusing on machine learning concepts like neural nets, transformers, etc., I wanted to abstract out a model defined in terms of complexity: time-efficient non-uniform computable distributions with minimum probabilities. Let's unpack this abstraction.

Neural nets as next-token predictors don't always give the same next token; rather, they give a distribution over tokens, which leads to a distribution on text of length up to the size of the context window. Because the probabilities are computed via a softmax function, every output can occur with at least an exponentially small probability, the "minimum probability" in the abstraction.
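A minimal sketch of that point (the logit values here are made up for illustration): softmax exponentiates and normalizes, so every token, even one whose logit is far below the rest, keeps a strictly positive probability.

```python
import math

def softmax(logits):
    """Map raw logits to a probability distribution over tokens."""
    m = max(logits)  # subtract the max to stabilize the exponentials
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Even the token with logit -10 gets a nonzero — though exponentially
# small — probability, since exp() is strictly positive.
probs = softmax([10.0, 0.0, -10.0])
```

Here the third probability is roughly \(e^{-20}\), tiny but never zero.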

In computational complexity, we study two main kinds of distributions. A distribution is sampleable if there is an algorithm that takes in uniformly random bits and outputs text according to that distribution. A distribution is computable if we can compute not only the probability that a piece of text occurs, but also the cumulative distribution function, the sum of the probabilities of all outputs lexicographically less than or equal to that text. Every efficiently computable distribution is efficiently sampleable, but likely not the other way around.
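To see why computable implies sampleable, here is a minimal Python sketch with a toy four-outcome distribution (my example, not the paper's): the CDF lets you invert a single uniform draw.

```python
import random

# Toy computable distribution over two-bit strings, standing in for a
# model's distribution over text (the table values are illustrative).
probs = {"00": 0.1, "01": 0.2, "10": 0.3, "11": 0.4}
order = sorted(probs)  # lexicographic order, as the CDF requires

def cdf(x):
    """F(x) = total probability of outcomes lexicographically <= x."""
    return sum(probs[y] for y in order if y <= x)

def sample(rng=random.random):
    """Inverse-CDF sampling: draw u uniform in [0,1) and return the
    first outcome whose cumulative probability exceeds u."""
    u = rng()
    for x in order:
        if u < cdf(x):
            return x
    return order[-1]  # guard against floating-point round-off

random.seed(0)
counts = {x: 0 for x in order}
for _ in range(20000):
    counts[sample()] += 1
```

The empirical frequencies track the table: about 10% of samples are "00", 40% are "11". The converse direction fails because a sampler gives you draws, not a way to evaluate probabilities.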

Neural nets as next-token predictors turn out to be equivalent to computable distributions. We need these kinds of distributions both for how neural nets are trained and for how we use them. Computable distributions allow for conditional sampling, which lets us use a large language model to answer questions or have a conversation. You can't, in general, get conditional sampling from a merely sampleable distribution.
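To make the conditional-sampling point concrete, here is a Python sketch on a toy distribution (mine, not the paper's construction): with CDF access, conditioning on a prefix just means restricting the uniform draw to that prefix's interval of cumulative mass.

```python
import random

# Toy distribution over two-bit strings with its CDF (values illustrative).
probs = {"00": 0.1, "01": 0.2, "10": 0.3, "11": 0.4}
order = sorted(probs)

def cdf(x):
    """F(x) = total probability of outcomes lexicographically <= x."""
    return sum(probs[y] for y in order if y <= x)

def sample_given_prefix(prefix, rng=random.random):
    """Sample conditioned on the output starting with `prefix`.
    The prefix's outcomes form a contiguous lexicographic block, so
    their mass is an interval [lo, hi) of cumulative probability;
    draw u there and invert as usual."""
    block = [x for x in order if x.startswith(prefix)]
    lo = cdf(block[0]) - probs[block[0]]  # mass strictly before the block
    hi = cdf(block[-1])                   # mass through the end of the block
    u = lo + rng() * (hi - lo)
    for x in block:
        if u < cdf(x):
            return x
    return block[-1]  # floating-point guard
```

Conditioned on the prefix "1", samples split between "10" and "11" with the conditional probabilities 0.3/0.7 and 0.4/0.7. A black-box sampler offers no such interval to restrict to, which is why conditioning fails there.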

Neural nets have a limited number of layers, typically about a hundred, which prevents them from handling full time-efficient (polynomial-time) computations. They can get around this restriction either with reasoning models that allow a model to talk to itself, or by directly writing and executing code, which can run efficient algorithms of any kind.

Most of the algorithms we use, think Dijkstra or matching, are uniform; that is, the same algorithm runs on graphs of twenty nodes as on twenty million. But neural nets change their weights based on the distributions they train on. These weights encode a compressed view of that data and the extra computation needed to process it. So we get different algorithms as our technology and data sources improve. I wrote more about this connection between AI and nonuniformity last fall.

What does this abstraction buy us?

Restricting to computable distributions forces machine learning algorithms to treat complex behavior as random behavior, much like we treat a coin flip as a random event because it is too complicated to work out all the physical interactions that would determine which side it lands on.

We illustrate this point with our main result. Let D be the distribution of outputs of a cryptographic pseudorandom generator and U the uniform distribution on words of the same length. If \(\mu\) is a time-efficient non-uniform computable distribution with minimum probabilities, and \(\mu\) does as good a job as or better than U at modeling D, then \(\mu\) and U are essentially the same distribution. Machine learning models the complex PRG as a uniform distribution, simplifying what we can't directly compute by using randomness.
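A toy illustration of the intuition, with iterated SHA-256 standing in for a pseudorandom generator (my choice for the sketch; the result concerns cryptographic PRGs in general, not this construction): an efficient observer running a simple statistical test sees the PRG's output behave just like uniform bits.

```python
import hashlib

def prg(seed: int, n_bytes: int) -> bytes:
    """Stand-in pseudorandom generator: expand an 8-byte seed by
    iterating SHA-256 (illustrative only, not a proof of anything)."""
    out, state = b"", seed.to_bytes(8, "big")
    while len(out) < n_bytes:
        state = hashlib.sha256(state).digest()
        out += state
    return out[:n_bytes]

# A simple efficient test — bit frequency — can't tell this output from
# uniform: ones occur about half the time, as they would for fair coins.
data = prg(42, 100_000)
ones = sum(bin(b).count("1") for b in data)
frac = ones / (8 * len(data))
```

The deterministic, seed-determined output passes the test a truly random string would pass, which is the sense in which a complexity-bounded model is pushed to treat the PRG as uniform randomness.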

More in the paper and future blog posts.
