Wednesday, October 29, 2025

AI and the Power of Nonuniform Circuits

The advances in artificial intelligence have changed the way we think about computing. For today's post: how nonuniformity plays a much larger role than I previously believed.

First, some background. Circuit complexity gives a different, more combinatorial, way to look at computation. Instead of viewing algorithms as step-by-step programs, circuit complexity captures computation as fixed hardware, gates wired to compute a function of their inputs.

Every polynomial-time computable language L can be computed by a family \(C_0,C_1,\ldots\) of polynomial-size circuits, where \(C_n\) has n inputs, the number of gates of \(C_n\) is bounded by \(n^c\) for some constant \(c\), and \(x=x_1\ldots x_n\) is in L iff \(C_n(x_1,\ldots,x_n)\) outputs 1.

The converse isn't true: there are languages L that have polynomial-size circuits but that you can't compute in polynomial time, because each \(C_n\) might not have any relation to \(C_m\) for \(n\neq m\). We can get the equivalence by adding a uniformity condition: that there is some easily computable function \(f\) with \(f(1^n)=C_n\). Or we can keep the nonuniformity and define a larger class called P/poly.
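
To make the distinction concrete, here is a toy sketch in Python (circuits are just Python functions here and the advice table is a hypothetical stand-in, so this is only a cartoon): a uniform family is one an algorithm can generate from \(1^n\), while a nonuniform family may hardwire an arbitrary bit for each input length, which is why even some undecidable unary languages land in P/poly.

```python
# Toy sketch: "circuits" are plain Python functions, one per input length.

def uniform_parity_family(n):
    """A uniform family: an easy algorithm maps 1^n to the n-th circuit."""
    def C_n(bits):
        assert len(bits) == n
        return sum(bits) % 2            # parity of the n input bits
    return C_n

def nonuniform_family(advice_bits):
    """A nonuniform family: one hardwired bit per input length.
    Nothing forces these bits to be computable, so even an undecidable
    unary language (say, {1^n : program n halts}) gets trivial circuits."""
    def circuit_for_length(n):
        def C_n(bits):
            assert len(bits) == n
            return advice_bits[n]       # ignores the input entirely
        return C_n
    return circuit_for_length

# Hypothetical advice table, for illustration only.
family = nonuniform_family({3: 1, 4: 0})
print(uniform_parity_family(3)([1, 0, 1]))   # -> 0
print(family(3)([1, 0, 1]))                  # -> 1, whatever the input bits are
```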

In practice, nonuniformity rarely helps. We can hide randomness in the nonuniformity, but we could likely use pseudorandom generators to the same effect. Nonuniformity was more a technicality we had to deal with than something that seemed to give circuits significantly more power.

Until we reached this new age of AI. A machine learning model like a neural net goes through a lengthy training process to learn its weights, with the training optimized offline. The learned weights become the nonuniform circuits; as technology improves and we create larger retrained circuits, the weights change as well.

We typically use the complexity class TC\(^0\), languages accepted by poly-size, constant-depth circuits with threshold gates, as a rough theoretical model for neural nets. For nonuniform TC\(^0\) we cannot prove any significant lower bounds; we can't even show that NEXP, nondeterministic exponential time, cannot be computed by nonuniform TC\(^0\) circuits. And while computing NEXP or even P in TC\(^0\) would be incredibly surprising, these advances in artificial intelligence tell us these nonuniform circuits are far more powerful than we would ever have expected.
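
As a rough illustration of why threshold circuits serve as the model here, a minimal sketch: a threshold gate is just a neuron with a hard threshold in place of a ReLU or sigmoid, and a single such gate with unit weights already computes MAJORITY, the canonical TC\(^0\) function; a TC\(^0\) circuit stacks constantly many layers of such gates.

```python
# Minimal sketch of a threshold gate, the building block of TC^0 circuits
# and a hard-threshold cousin of the artificial neuron.

def threshold_gate(weights, threshold, inputs):
    """Fire (output 1) iff the weighted sum of the inputs reaches the threshold."""
    return int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)

def majority(bits):
    """MAJORITY from a single gate with unit weights: 1 iff more than half the bits are 1."""
    return threshold_gate([1] * len(bits), len(bits) // 2 + 1, bits)

print(majority([1, 1, 0]))      # -> 1
print(majority([1, 0, 0, 0]))   # -> 0
```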

19 comments:

  1. Why constant depth? My understanding is that each new, larger frontier model uses more layers than the previous. The big labs stopped publishing their exact layer counts but from publicly available data, it seems layers increased roughly linearly with parameters. Though I'm not sure how parameters have scaled with context-window, which is the "input size n".

    Replies
    1. It's admittedly hard to match asymptotics with real-world implementations. I get the impression that we're seeing a recent trend towards shallower, wider neural nets, as they can be trained more efficiently in parallel. Shallow is still hundreds of layers.

    2. "Until this new age of AI. A machine learning model like a neural net goes through a lengthy training process to learn its weights, the training optimized offline. The learned weights become the nonuniform circuits, as technology improves and we create larger retrained circuits, the weights change as well."

      "Until this new age of AI."??.. Aren't we doing this type of training in the age of AI too?

    3. The wording was confusing. I changed it to "Until we reached this new age of AI".

    4. Deep neural networks might also be a way to empirically compare how difficult problems are. For instance, detecting cliques and counting them mod 2 are both functions from undirected graphs to one bit of output. My vague understanding from Toda's theorem is that the counting-type problems are expected to be harder.

      One could code up both of these functions and compare how well deep learning does on each (for a given network size, activation function, optimizer, etc., and maybe across different input graph densities as well), then see how well the same NN architecture does on the two problems (see the sketch below).

      (Searching a tiny bit, I'm finding some tests of using NNs to detect/find cliques, but not as many things about counting cliques mod 2; but I didn't search much.)
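
      For concreteness, a rough sketch of the label generation (triangles stand in for the cliques; make_example, the graph sizes, and densities are placeholder choices, and the NN training loop is omitted):

      ```python
      # Generate (input, detect-label, parity-label) triples with networkx.
      import networkx as nx

      def make_example(n=20, p=0.2, seed=None):
          G = nx.gnp_random_graph(n, p, seed=seed)
          num_triangles = sum(nx.triangles(G).values()) // 3  # each triangle is counted at 3 nodes
          x = nx.to_numpy_array(G).flatten()                  # adjacency matrix as the input features
          return x, int(num_triangles > 0), num_triangles % 2

      dataset = [make_example(seed=s) for s in range(1000)]
      ```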

  2. They are more powerful compared to our earlier, low expectations, but I think that so far they can only solve the same things humans can, so I would rather describe what we are experiencing as: if there is a TC0 circuit for some problem, then there is an algorithm to find one, or a good approximation of it. Is there some theory studying such problems?

  3. Ramon Bejar Torres, 5:50 PM, October 29, 2025

    But neural networks, as functions, are more like approximations, as they do not always give the right answer.

    Do we have approximation complexity results for nonuniform circuits of constant depth?

  4. I am not sure why this is the power of non-uniformity. Data structures seem like a better lens for what is happening. I think non-uniformity is interesting only insofar as the circuit behaves drastically differently for different input sizes.

    Replies
    1. I would argue that GPT 4 behaves "drastically differently" than GPT 3, in some way I probably could not formalize. More drastic I think than the difference between "These two TC0 networks both just multiply integers, but the bigger one can multiply bigger integers." And the jury is still out on whether merely scaling to larger networks is enough to get to (or beyond) AGI, which would certainly count as "drastically different" from GPT 3.

      Of course, one could be pedantic and claim that a uniform Turing machine produced all these networks: on input 1^n, run gradient descent on a transformer network of size poly(n) using poly(n) FLOPS of compute. (Is it a polynomial dependence? I don't know if there's enough publicly available data since the labs stopped publishing their numbers.) But the extra work in going from 10^19 to 10^21 to ... to 10^26 FLOPS of training compute (https://epoch.ai/gradient-updates/why-gpt5-used-less-training-compute-than-gpt45-but-gpt6-probably-wont) sure feels like significantly more work than "extend this integer-multiplying circuit to handle integers with more bits", which is the "typical" sort of uniform family of circuits one wants to model with classes like TC.

    2. I think the analogy with nonuniformity is a fair one for now. My toy model for nonuniformity is that there is an all-powerful supervisor who, however, does not offer me all his attention (or answer all my requests) but only gives me sparse advice in the form of a generic advice string, one for each shell of a problem (each size n being one shell). This is what the training is. We could ask the LLM provider our question directly and they could spend trillions of dollars to compute the best answer. Or they spend those trillions once and for all to generate general advice. Underlying this is the assumption (or personal opinion) that the training is either not efficient, or at least not practically efficient.

  5. I've been musing about this for a decade or so, and I must say I am not in complete alignment with Lance's view.

    When we think of non-uniform circuits, we're usually talking about a whole bunch of "advice" for each input size. This seems markedly different from all the developments in deep learning or AI; here it seems like a large finite blob of valuable information is learned during training / post-training, and that powers valuable inferences (either for a single problem or a class of problems, all up to some large-ish finite instance size). That "large finite blob" has enough "knowledge" to generate *programs* that have loops in them -- thus it's an odd kind of object from a complexity-theoretic perspective. It seems not to fit into the "non-uniform" computing or even things like "P-uniform" or "logspace-uniform" computing regimes. In particular, it appears that there isn't any type of *problem-specific* and *instance-size-specific* non-uniformity / intelligence learned in these things.

  6. https://cstheory.stackexchange.com/questions/22354/the-unreasonable-power-of-non-uniformity#22392 :
    "I have the impression that the real issue here is the unreasonable heavy burden of proof, not the unreasonable power of non-uniformity. As the answers by chazisop and András Salamon already stress, undecidable languages become computable even in very restricted non-uniform languages, because the burden of proof has been completely waived."

    But I also see a new element here, namely that parallelization takes different forms:
    0) hardware circuits that implement stuff like floating point and integer arithmetic
    1) vectorization via SIMD primitives
    2) GPU acceleration via SIMT paradigm, grouping threads into 32-wide bundles called warps.
    3) ... other reasons why GPU (or specialized hardware) can accelerate things ...
    4) multi-core CPU parallelization

    For 0-2 to be fast, the burden of ensuring that things work out has to be outsourced to some preprocessing. Each individual processing unit just has to believe that things will work seamlessly together in the grand scheme of things; everything would start to break apart if it tried to check (i.e. prove) things on its own.

    So it makes "perfect sense" that circuits have to be non-uniform to achieve their full power, or at least that the uniformity of the circuits "should" come from a much more powerful complexity class, like EXPTIME.

  7. @Jakito I love this quote "the burden of proof has been completely waived"; the same holds for all these AI hijinks (a human in the loop ultimately owns responsibility for establishing the correctness of AI-generated content).

    We are glossing over the fact that the unreplicable power and fault-tolerance of living systems arises from the vast and extensive close-range and long-range communication networks among autonomous microscopic agents.
    This communication spans the full gamut of 1-to-1, 1-to-many and many-to-1 connections, with connects and disconnects happening constantly and non-deterministically.
    While we claim to approximate the molecular configurations as distinct 'states' and model their state transitions, the communication modes span a continuous space of electro-chemical potential differences.
    One example in this domain is the snowflake crystal, where the hexagonal symmetry is established by very long-range (relative to the size of water molecules) electrostatic forces across all molecules within the crystal; of course I am not saying there is anything autonomous inside a snowflake, but the communication is evident in the end macroscopic result (just as in a bilaterally symmetric leaf or animal).

  8. The explanation I've heard for "why snowflakes are symmetric" is that, as a snowflake falls, it goes through different atmospheric conditions (humidity, temperature, wind, etc.), which are basically the same for all parts of the snowflake. The crystals that form (analogously to Conway's "Game of Life", except with continuously changing "rules") play out in the same way, in six- or twelve-fold symmetry.

    That's my understanding of the conventional wisdom. Disclaimer: I haven't looked into this much.

    snowcrystals.com has some relevant description, and lots of pictures of snowflakes (some with surprising shapes).

    Replies
    1. If it were only due to uniform atmospheric conditions, I'd expect a perfect sphere as the falling crystal aggregates water molecules (just like lead shot or steel ball bearings obtained by pouring molten metal into water), but here we always get flattened six-fold symmetry. The explanation, I believe, is that the water molecules already in the crystal force the locations where new molecules are aggregated, and these forces are truly long-range, spanning the entire crystal (as Feynman aptly illustrated, electrostatic forces are tremendously powerful despite the inverse-square dropoff).

  9. This is a fantastic post. The `TC^0/poly` model isn't just a clever analogy; it's a literal, structural description of modern AI architectures. It also perfectly frames the "Great Inversion" of our new paradigm.

    For 50 years, we’ve assumed "intelligence" lived in the *uniform algorithm*—the class of problems solvable in polynomial time, or **`P`**, as your post defines it. Now, we're in a "Software 2.0" world where the `P` part is a "dumb" commodity (a `MatMul + ReLU` inference engine), and all the "magic" has been pushed into the non-uniform advice string: the `/poly` part, which is the massive file of trained weights.

    This brings up the second paradox: how is this "magic" advice created? It's not crafted by a genius; it's "cooked" by one of the "dumbest" algorithms we know, gradient descent.[^1] This is a classic story of emergence: a simple, local rule, scaled up by massive repetition, creates a complex, intelligent global structure. It's the same "magic" as a swarm of "dumb" ants following a simple pheromone rule to "discover" the optimal shortest path, or evolution's "dumb" rules of crossover and mutation "discovering" a solution as complex as the human eye. In the LLM, a trillion "dumb" parameters, each following the local rule "nudge your value 0.001% to be less wrong," self-organize into what we perceive as intelligence.

    This leads to the third and final paradox: *why* does this "dumb" process create "magic"? The answer is that this process is, in disguise, the most powerful *compression algorithm* ever built. The model is given a "pedestrian" goal: "You must store the entire internet, but your hard drive is *tiny*." It is *forced* to find shortcuts. The "stupid" way is to memorize. The "genius" way is to discover the underlying rules. It's "cheaper" (a better compression) to invent a statistical concept of gravity than to memorize every instance of a falling apple. "Intelligence" is simply the name we give to a really, really good lossy compression algorithm.

    This also highlights the most disruptive "contribution" AI is forcing on our field: a massive paradigm shift from **worst-case** to **average-case** analysis. Our entire theoretical framework, especially the `P` vs. `NP` question, is a fortress built on the bedrock of *worst-case guarantees*. But the LLM, our `TC^0/poly` circuit, is not a creature of logic; it's a creature of *statistical optimization*. It is provably brittle and will fail on a carefully constructed "worst-case" adversarial input. And yet, it is devastatingly effective on the *average-case* distribution of real-world data it was trained on. This proves that our traditional worst-case models are the wrong map for measuring the practical power of these new, non-uniform, statistically-discovered "advice" strings.

    This confirms the post's thesis entirely: the power of non-uniformity, when unlocked by massive search, is the biggest (and most surprising) story in computing.

    ---
    *[^1] This is, of course, a simplification. Calling "training" just "gradient descent" ignores the vast engineering of optimizers, RLHF, and data curation. Just as boiling "evolution" down to "mutate" ignores the messy, complex reality of epigenetics and viruses. But the core principle holds: the "magic" emerges from the simple, local rule at the heart of the process, not from a top-down, intelligent design.*

  10. comment was written with the help of Gemini Pro 2.5

  11. Let me challenge the claims a bit.

    First, we have to be clearer about what model architecture and system architecture we are talking about.

    LLMs by themselves still can't multiply or add two 20-digit numbers, even the best ones.

    So I think that claim is pretty misleading.

    RNNs are more like for loops; they are not fixed-depth circuits. LLMs are RNNs, so modeling them as TC0 is misleading as well.

    A lot of what these systems do uses, in addition to the model, a lot of guided brute-force search heuristics plus deterministic classical verifiers.

    LLMs are also mainly doing interpolation of their training datasets. They cannot make out-of-distribution predictions. They fail at the tails of the training data.

    Language is a very low entropy distribution. It is very different from numerical or graph problems which have very high entropy.

    From a theory perspective, am I surprised that a guided brute-force heuristic with a lot of compute power can solve some tasks from a distribution that likely covers the easy cases of a low-entropy problem? I am not.

    I think lots of folks miss that the neural net is just a part of current systems and by itself doesn't do much more than generate within-distribution outputs that look like human generation.

    So I have yet to see evidence that these LLM networks actually solve even AC0 problems without using tools like calculators or Python interpreters.

    DeepSeek models are available as OSS: you can download V3 and R1 and play with them. On a desktop with a good NVidia GPU you might get 1 token/second (or you can rent from cloud providers; with an H100 you can get maybe 100 tokens/second).

    The real power of these models is their ability to be used for semantic search over data, plus cross-language translation, synthesizing the results given a context. The rest is search databases (knowledge graphs), tools (a Python interpreter), loops, and brute force with classical verifiers. These combined are what we are seeing, not the LLMs themselves.

  12. Just having access to a Python code interpreter as a tool makes these systems Turing-complete.
