Wednesday, October 29, 2025

AI and the Power of Nonuniform Circuits

The advances in artificial intelligence have changed the way we think about computing. For today's post: how nonuniformity plays a much larger role than I previously believed.

First, some background. Circuit complexity gives a different, more combinatorial, way to look at computation. Instead of viewing algorithms as step-by-step programs, circuit complexity captures computation as fixed hardware, gates wired to compute a function of their inputs.

Every polynomial-time computable language L can be computed by a family \(C_0,C_1,\ldots\) of polynomial-size circuits, where \(C_n\) has n inputs, the number of gates of \(C_n\) is bounded by \(n^c\) for some constant \(c\), and \(x=x_1\ldots x_n\) is in L iff \(C_n(x_1,\ldots,x_n)\) outputs 1.

The converse isn't true: there are languages L that have polynomial-size circuits but that you can't compute in polynomial time, because each \(C_n\) might not have any relation to \(C_m\) for \(n\neq m\). We can get the equivalence by adding a uniformity condition: that there is some easily computable function \(f\) with \(f(1^n)=C_n\). Or we can keep the nonuniformity and define a larger class called P/poly.
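
To make the distinction concrete, here is a toy sketch in Python (circuits are just Python functions here and the advice table is a hypothetical stand-in, so this is only a cartoon): a uniform family is one an algorithm can generate from \(1^n\), while a nonuniform family may hardwire an arbitrary bit for each input length, which is why even some undecidable unary languages land in P/poly.

```python
# Toy sketch: "circuits" are plain Python functions, one per input length.

def uniform_parity_family(n):
    """A uniform family: an easy algorithm maps 1^n to the n-th circuit."""
    def C_n(bits):
        assert len(bits) == n
        return sum(bits) % 2            # parity of the n input bits
    return C_n

def nonuniform_family(advice_bits):
    """A nonuniform family: one hardwired bit per input length.
    Nothing forces these bits to be computable, so even an undecidable
    unary language (say, {1^n : program n halts}) gets trivial circuits."""
    def circuit_for_length(n):
        def C_n(bits):
            assert len(bits) == n
            return advice_bits[n]       # ignores the input entirely
        return C_n
    return circuit_for_length

# Hypothetical advice table, for illustration only.
family = nonuniform_family({3: 1, 4: 0})
print(uniform_parity_family(3)([1, 0, 1]))   # -> 0
print(family(3)([1, 0, 1]))                  # -> 1, whatever the input bits are
```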

In practice, nonuniformity rarely helps. We can hide randomness in the nonuniformity, but we could likely use pseudorandom generators to the same effect. Nonuniformity was more a technicality we had to deal with than something that seemed to give circuits significantly more power.

Until we reached this new age of AI. A machine learning model like a neural net goes through a lengthy training process to learn its weights, with the training optimized offline. The learned weights become the nonuniform circuits; as technology improves and we create larger retrained circuits, the weights change as well.

We typically use the complexity class TC\(^0\), languages accepted by poly-size, constant-depth circuits with threshold gates, as a rough theoretical model for neural nets. For nonuniform TC\(^0\) we cannot prove any significant lower bounds; we can't even show that NEXP, nondeterministic exponential time, cannot be computed by nonuniform TC\(^0\) circuits. And while computing NEXP or even P in TC\(^0\) would be incredibly surprising, these advances in artificial intelligence tell us these nonuniform circuits are far more powerful than we would ever have expected.
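
As a rough illustration of why threshold circuits serve as the model here, a minimal sketch: a threshold gate is just a neuron with a hard threshold in place of a ReLU or sigmoid, and a single such gate with unit weights already computes MAJORITY, the canonical TC\(^0\) function; a TC\(^0\) circuit stacks constantly many layers of such gates.

```python
# Minimal sketch of a threshold gate, the building block of TC^0 circuits
# and a hard-threshold cousin of the artificial neuron.

def threshold_gate(weights, threshold, inputs):
    """Fire (output 1) iff the weighted sum of the inputs reaches the threshold."""
    return int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)

def majority(bits):
    """MAJORITY from a single gate with unit weights: 1 iff more than half the bits are 1."""
    return threshold_gate([1] * len(bits), len(bits) // 2 + 1, bits)

print(majority([1, 1, 0]))      # -> 1
print(majority([1, 0, 0, 0]))   # -> 0
```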

19 comments:

  1. Why constant depth? My understanding is that each new, larger frontier model uses more layers than the previous. The big labs stopped publishing their exact layer counts but from publicly available data, it seems layers increased roughly linearly with parameters. Though I'm not sure how parameters have scaled with context-window, which is the "input size n".

    Replies
    1. It's admittedly hard to match asymptotics with real-world implementations. I get the impression that we're seeing a recent trend towards shallower, wider neural nets, as they can be trained more efficiently in parallel. Shallow is still hundreds of layers.

    2. "Until this new age of AI. A machine learning model like a neural net goes through a lengthy training process to learn its weights, the training optimized offline. The learned weights become the nonuniform circuits, as technology improves and we create larger retrained circuits, the weights change as well."

      "Until this new age of AI."??.. Aren't we doing this type of training in the age of AI too?

    3. The wording was confusing. I changed it to "Until we reached this new age of AI".

    4. Deep neural networks might also be a way to empirically compare how difficult problems are. For instance, detecting cliques and counting them mod 2 are both functions from undirected graphs to one bit of output. My vague understanding from Toda's theorem is that the counting-type problems are expected to be harder.

      One could code up both of these functions and compare how well deep learning does on each (for a given network size, activation function, optimizer, etc., and maybe across different input graph densities as well), then see how well the same NN architecture does on the two problems (see the sketch below).

      (Searching a tiny bit, I'm finding some tests of using NNs to detect/find cliques, but not as many things about counting cliques mod 2; but I didn't search much.)
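
      For concreteness, a rough sketch of the label generation (triangles stand in for the cliques; make_example, the graph sizes, and densities are placeholder choices, and the NN training loop is omitted):

      ```python
      # Generate (input, detect-label, parity-label) triples with networkx.
      import networkx as nx

      def make_example(n=20, p=0.2, seed=None):
          G = nx.gnp_random_graph(n, p, seed=seed)
          num_triangles = sum(nx.triangles(G).values()) // 3  # each triangle is counted at 3 nodes
          x = nx.to_numpy_array(G).flatten()                  # adjacency matrix as the input features
          return x, int(num_triangles > 0), num_triangles % 2

      dataset = [make_example(seed=s) for s in range(1000)]
      ```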

  2. They are more powerful compared to our earlier, low expectations, but I think that so far they can only solve the same things humans can, so I would rather describe what we are experiencing as: if there is a TC0 circuit for some problem, then there is an algorithm to find one, or a good approximation of it. Is there some theory studying such problems?

  3. Ramon Bejar Torres, 5:50 PM, October 29, 2025

    But neural networks, as functions, are more like approximations, as they do not always give the right answer.

    Do we have approximation complexity results for nonuniform circuits of constant depth?

  4. I am not sure why this is the power of non-uniformity. Data structures seem like a better lens for what is happening. I think non-uniformity is interesting only insofar as the circuit behaves drastically differently for different input sizes.

    Replies
    1. I would argue that GPT 4 behaves "drastically differently" than GPT 3, in some way I probably could not formalize. More drastic I think than the difference between "These two TC0 networks both just multiply integers, but the bigger one can multiply bigger integers." And the jury is still out on whether merely scaling to larger networks is enough to get to (or beyond) AGI, which would certainly count as "drastically different" from GPT 3.

      Of course, one could be pedantic and claim that a uniform Turing machine produced all these networks: on input 1^n, run gradient descent on a transformer network of size poly(n) using poly(n) FLOPS of compute. (Is it a polynomial dependence? I don't know if there's enough publicly available data since the labs stopped publishing their numbers.) But the extra work in going from 10^19 to 10^21 to ... to 10^26 FLOPS of training compute (https://epoch.ai/gradient-updates/why-gpt5-used-less-training-compute-than-gpt45-but-gpt6-probably-wont) sure feels like significantly more work than "extend this integer-multiplying circuit to handle integers with more bits", which is the "typical" sort of uniform family of circuits one wants to model with classes like TC.

    2. I think the analogy with nonuniformity is a fair one for now. My toy model for nonuniformity is that there is an all-powerful supervisor who, however, does not offer me all his attention (or answer all my requests) but only gives me sparse advice in the form of a generic advice string, one for each shell of a problem (each size n being one shell). This is what the training is. We could ask the LLM provider our question directly and they could spend trillions of dollars to compute the best answer. Or they spend those trillions once and for all to generate general advice. Underlying this is the assumption (or personal opinion) that the training is either not efficient, or at least not practically efficient.

  5. I've been musing about this for a decade or so, and I must say I am not in complete alignment with Lance's view.

    When we think of non-uniform circuits, we're usually talking about a whole bunch of "advice" for each input size. This seems markedly different from all the developments in deep learning or AI; here it seems like a large finite blob of valuable information is learned during training / post-training, and that powers valuable inferences (either for a single problem or a class of problems, all up to some large-ish finite instance size). That "large finite blob" has enough "knowledge" to generate *programs* that have loops in them -- thus it's an odd kind of object from a complexity-theoretic perspective. It seems not to fit into the "non-uniform" computing or even things like "P-uniform" or "logspace-uniform" computing regimes. In particular, it appears that there isn't any type of *problem-specific* and *instance-size-specific* non-uniformity / intelligence learned in these things.

  6. https://cstheory.stackexchange.com/questions/22354/the-unreasonable-power-of-non-uniformity#22392 :
    "I have the impression that the real issue here is the unreasonable heavy burden of proof, not the unreasonable power of non-uniformity. As the answers by chazisop and András Salamon already stress, undecidable languages become computable even in very restricted non-uniform languages, because the burden of proof has been completely waived."

    But I also see a new element here, namely that parallelization takes different forms:
    0) hardware circuits that implement stuff like floating point and integer arithmetic
    1) vectorization via SIMD primitives
    2) GPU acceleration via SIMT paradigm, grouping threads into 32-wide bundles called warps.
    3) ... other reasons why GPU (or specialized hardware) can accelerate things ...
    4) multi-core CPU parallelization

    For 0-2 to be fast, the burden of ensuring that things work out has to be outsourced to some preprocessing. Each individual processing unit just has to believe that things will work seamlessly together in the grand scheme of things; everything would start to break apart if it tried to check (i.e. prove) things on its own.

    So it makes "perfect sense" that circuits have to be non-uniform to achieve their full power, or at least that the uniformity of the circuits "should" come from a much more powerful complexity class, like EXPTIME.

  7. @Jakito I love this quote "the burden of proof has been completely waived"; the same holds for all these AI hijinks (a human in the loop ultimately owns responsibility for establishing the correctness of AI-generated content).

    We are glossing over the fact that the unreplicable power and fault-tolerance of living systems arises from the vast and extensive close-range and long-range communication networks among autonomous microscopic agents.
    This communication spans the full gamut of 1-to-1, 1-to-many and many-to-1 connections, with connects and disconnects happening constantly and non-deterministically.
    While we claim to approximate the molecular configurations as distinct 'states' and model their state transitions, the communication modes span a continuous space of electro-chemical potential differences.
    One example in this domain is the snowflake crystal, where the hexagonal symmetry is established by very long-range (relative to the size of water molecules) electrostatic forces across all molecules within the crystal; of course I am not saying there is anything autonomous inside a snowflake, but the communication is evident in the end macroscopic result (just as in a bilaterally symmetric leaf or animal).

  8. The explanation I've heard for "why snowflakes are symmetric" is that, as a snowflake falls, it goes through different atmospheric conditions (humidity, temperature, wind, etc.), which are basically the same for all parts of the snowflake. The crystals that form (analogously to Conway's "Game of Life", except with continuously changing "rules") play out in the same way, in six- or twelve-fold symmetry.

    That's my understanding of the conventional wisdom. Disclaimer: I haven't looked into this much.

    snowcrystals.com has some relevant description, and lots of pictures of snowflakes (some with surprising shapes).

    Replies
    1. If it were only due to uniform atmospheric conditions, I'd expect a perfect sphere as the falling crystal aggregates water molecules (just like lead shot or steel ball bearings obtained by pouring molten metal into water), but here we always get flattened six-fold symmetry. The explanation, I believe, is that the water molecules already in the crystal force the locations where new molecules are aggregated, and these forces are truly long-range, spanning the entire crystal (as Feynman aptly illustrated, electrostatic forces are tremendously powerful despite the inverse-square dropoff).

  9. This is a fantastic post. The `TC^0/poly` model isn't just a clever analogy; it's a literal, structural description of modern AI architectures. It also perfectly frames the "Great Inversion" of our new paradigm.

    For 50 years, we’ve assumed "intelligence" lived in the *uniform algorithm*—the class of problems solvable in polynomial time, or **`P`**, as your post defines it. Now, we're in a "Software 2.0" world where the `P` part is a "dumb" commodity (a `MatMul + ReLU` inference engine), and all the "magic" has been pushed into the non-uniform advice string: the `/poly` part, which is the massive file of trained weights.

    This brings up the second paradox: how is this "magic" advice created? It's not crafted by a genius; it's "cooked" by one of the "dumbest" algorithms we know, gradient descent.[^1] This is a classic story of emergence: a simple, local rule, scaled up by massive repetition, creates a complex, intelligent global structure. It's the same "magic" as a swarm of "dumb" ants following a simple pheromone rule to "discover" the optimal shortest path, or evolution's "dumb" rules of crossover and mutation "discovering" a solution as complex as the human eye. In the LLM, a trillion "dumb" parameters, each following the local rule "nudge your value 0.001% to be less wrong," self-organize into what we perceive as intelligence.

    This leads to the third and final paradox: *why* does this "dumb" process create "magic"? The answer is that this process is, in disguise, the most powerful *compression algorithm* ever built. The model is given a "pedestrian" goal: "You must store the entire internet, but your hard drive is *tiny*." It is *forced* to find shortcuts. The "stupid" way is to memorize. The "genius" way is to discover the underlying rules. It's "cheaper" (a better compression) to invent a statistical concept of gravity than to memorize every instance of a falling apple. "Intelligence" is simply the name we give to a really, really good lossy compression algorithm.

    This also highlights the most disruptive "contribution" AI is forcing on our field: a massive paradigm shift from **worst-case** to **average-case** analysis. Our entire theoretical framework, especially the `P` vs. `NP` question, is a fortress built on the bedrock of *worst-case guarantees*. But the LLM, our `TC^0/poly` circuit, is not a creature of logic; it's a creature of *statistical optimization*. It is provably brittle and will fail on a carefully constructed "worst-case" adversarial input. And yet, it is devastatingly effective on the *average-case* distribution of real-world data it was trained on. This proves that our traditional worst-case models are the wrong map for measuring the practical power of these new, non-uniform, statistically-discovered "advice" strings.

    This confirms the post's thesis entirely: the power of non-uniformity, when unlocked by massive search, is the biggest (and most surprising) story in computing.

    ---
    *[^1] This is, of course, a simplification. Calling "training" just "gradient descent" ignores the vast engineering of optimizers, RLHF, and data curation. Just as boiling "evolution" down to "mutate" ignores the messy, complex reality of epigenetics and viruses. But the core principle holds: the "magic" emerges from the simple, local rule at the heart of the process, not from a top-down, intelligent design.*

  10. comment was written with the help of Gemini Pro 2.5

  11. Let me challenge the claims a bit.

    First, we have to be clearer about what model architecture and system architecture we are talking about.

    LLMs by themselves still can't multiply or add two 20-digit numbers, even the best ones.

    So I think that claim is pretty misleading.

    RNNs are more like for loops; they are not fixed-depth circuits. LLMs are RNNs, so modeling them as TC0 is misleading as well.

    A lot of what these systems do uses, in addition to the model, a lot of guided brute-force search heuristics plus deterministic classical verifiers.

    LLMs are also mainly doing interpolation of their training datasets. They cannot make out-of-distribution predictions. They fail at the tails of the training data.

    Language is a very low entropy distribution. It is very different from numerical or graph problems which have very high entropy.

    From a theory perspective, am I surprised that a guided brute-force heuristic with a lot of compute power can solve some tasks from a distribution that likely covers the easy cases of a low-entropy problem? I am not.

    I think lots of folks miss that the neural net is just a part of current systems and by itself doesn't do much more than generate within-distribution outputs that look like human generation.

    So I have yet to see evidence that these LLM networks actually solve even AC0 problems without using tools like calculators or Python interpreters.

    DeepSeek models are available as OSS: you can download V3 and R1 and play with them. On a desktop with a good NVidia GPU you might get 1 token/second (or you can rent from cloud providers; with an H100 you can get maybe 100 tokens/second).

    The real power of these models is their ability to be used for semantic search over data, plus cross-language translation, synthesizing the results given a context. The rest is search databases (knowledge graphs), tools (a Python interpreter), loops, and brute force with classical verifiers. These combined are what we are seeing, not the LLMs themselves.

  12. Just having access to a Python code interpreter as a tool makes these systems Turing-complete.
