Wednesday, October 29, 2025

AI and the Power of Nonuniform Circuits

The advances in artificial intelligence have changed the way we think about computing. For today's post: how nonuniformity plays a much larger role than I previously believed.

First, some background. Circuit complexity gives a different, more combinatorial, way to look at computation. Instead of viewing algorithms as step-by-step programs, circuit complexity captures computation as fixed hardware, gates wired to compute a function of their inputs.

Every polynomial-time computable language L can be computed by a family \(C_0,C_1,\ldots\) of polynomial-size circuits, where \(C_n\) has \(n\) inputs, the number of gates of \(C_n\) is bounded by \(n^c\) for some constant \(c\), and \(x=x_1\ldots x_n\) is in L iff \(C_n(x_1,\ldots,x_n)\) outputs 1.
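
To make the definition concrete, here is a minimal sketch (my own illustration, not anything from the post) of a circuit family for PARITY, the language of strings with an odd number of 1s, which is certainly in P: the circuit \(C_n\) is just a chain of \(n-1\) XOR gates, so the family has linear, and in particular polynomial, size.

# A toy circuit family: build_parity_circuit(n) returns C_n as a list of
# gates (op, a, b), where a and b index earlier wires (inputs are wires
# 0..n-1, gate i produces wire n+i). Evaluating C_n on x decides PARITY.

def build_parity_circuit(n):
    gates = []
    prev = 0                          # start from input wire 0
    for i in range(1, n):
        gates.append(("XOR", prev, i))
        prev = n + len(gates) - 1     # wire produced by the gate just added
    return gates

def eval_circuit(gates, x):
    wires = list(x)                   # wires 0..n-1 are the input bits
    for op, a, b in gates:
        assert op == "XOR"            # the only gate type in this toy family
        wires.append(wires[a] ^ wires[b])
    return wires[-1]                  # the last wire is the output

x = [1, 0, 1, 1]
C_n = build_parity_circuit(len(x))
print(eval_circuit(C_n, x))           # prints 1: x has an odd number of 1s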

The converse isn't true: there are languages L that have polynomial-size circuits but that you can't compute in polynomial time, because each \(C_n\) might not have any relation to \(C_m\) for \(n\neq m\). (For example, any unary language, even an undecidable one, has trivial circuits: \(C_n\) just outputs the right constant.) We can get the equivalence back by adding a uniformity condition: that there is some easily computable function \(f\) with \(f(1^n)=C_n\). Or we can keep the nonuniformity and define a larger class called P/poly.
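
Here is that unary example in miniature (a sketch with a hypothetical advice table; for an undecidable L no program could produce the table, which is exactly the point): each \(C_n\) is a constant gate that ignores its input bits, so the whole family is trivially polynomial size.

# Hypothetical advice: advice[n] = 1 iff the string 1^n is in L. For an
# undecidable unary L this table exists but cannot be computed; the few
# values below are placeholders purely for illustration.
advice = {0: 0, 1: 1, 2: 0, 3: 1, 4: 0}

def C(n):
    """The 'circuit' for input length n: a constant gate given by advice[n]."""
    bit = advice[n]
    return lambda x: bit              # ignores all n input bits

x = [1, 1, 1]
print(C(len(x))(x))                   # 1 here, because advice[3] = 1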

In practice, nonuniformity rarely helps. We can hide randomness in the nonuniformity, but we could likely also use pseudorandom generators to the same effect. Nonuniformity was more a technicality we had to deal with than something that seemed to give circuits significantly more power.
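
For the record, the standard argument behind the randomness remark is Adleman's theorem that \(\mathrm{BPP}\subseteq\mathrm{P/poly}\) (the post doesn't spell it out): amplify a randomized algorithm \(A\)'s error probability below \(2^{-(n+1)}\); then by a union bound over all inputs of length \(n\),

\[
\Pr_r\bigl[\exists x\in\{0,1\}^n:\ A(x,r)\text{ errs}\bigr]\;\le\;2^n\cdot 2^{-(n+1)}\;=\;\tfrac{1}{2}\;<\;1,
\]

so some single string \(r_n\) is correct on every input of length \(n\) and can be hardwired into the circuit \(C_n\) as advice.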

Until we reached this new age of AI. A machine learning model like a neural net goes through a lengthy training process to learn its weights, the training optimized offline. The learned weights become the nonuniform circuits; as technology improves and we create larger retrained circuits, the weights change as well.

We typically use the complexity class TC\(^0\), languages accepted by polynomial-size, constant-depth threshold circuits, as a rough theoretical model for neural nets. For nonuniform TC\(^0\) we cannot prove any significant lower bounds; we can't even show that NEXP, nondeterministic exponential time, cannot be computed by nonuniform TC\(^0\) circuits. And while computing NEXP or even P in TC\(^0\) would be incredibly surprising, the advances in artificial intelligence tell us that these nonuniform circuits are far more powerful than we would ever have expected.
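
For a concrete feel of what lives in TC\(^0\), here is a minimal sketch (my own illustration, not a model of any actual network): a threshold gate fires iff a weighted sum of its inputs reaches a threshold, which is exactly a neuron with a step activation, and a constant number of layers of such gates already computes functions like MAJORITY and "exactly \(k\) ones".

# Threshold gates and a depth-2 threshold circuit, the building blocks of
# TC^0. All gates and examples here are made up for illustration.

def threshold_gate(weights, threshold, inputs):
    """Fire (output 1) iff the weighted sum of the inputs meets the threshold."""
    return int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)

def majority(bits):
    """MAJORITY of n bits: a single threshold gate with unit weights."""
    n = len(bits)
    return threshold_gate([1] * n, (n + 1) // 2, bits)

def exactly_k(bits, k):
    """EXACTLY-k ones as a depth-2 threshold circuit:
    layer 1: a = [at least k ones], b = [at most k ones]
    layer 2: AND(a, b), which is itself a threshold gate."""
    n = len(bits)
    a = threshold_gate([1] * n, k, bits)       # sum(bits) >= k
    b = threshold_gate([-1] * n, -k, bits)     # sum(bits) <= k
    return threshold_gate([1, 1], 2, [a, b])   # a AND b

print(majority([1, 0, 1]))          # 1
print(exactly_k([1, 1, 0, 0], 2))   # 1
print(exactly_k([1, 1, 1, 0], 2))   # 0

Real networks replace the step activation with smooth nonlinearities and add attention, but the polynomial-size, constant-depth threshold circuit is the coarse abstraction the post is leaning on.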

13 comments:

  1. Why constant depth? My understanding is that each new, larger frontier model uses more layers than the previous. The big labs stopped publishing their exact layer counts but from publicly available data, it seems layers increased roughly linearly with parameters. Though I'm not sure how parameters have scaled with context-window, which is the "input size n".

    1. It's admittedly hard to match asymptotics with real-world implementations. I get the impression that we're seeing a recent trend towards more shallow, wider neural nets, as they can be trained more efficiently in parallel. Shallow is still hundreds of layers.

    2. "Until this new age of AI. A machine learning model like a neural net goes through a lengthy training process to learn its weights, the training optimized offline. The learned weights become the nonuniform circuits, as technology improves and we create larger retrained circuits, the weights change as well."

      "Until this new age of AI."??.. Aren't we doing this type of training in the age of AI too?

    3. The wording was confusing. I changed it to "Until we reached this new age of AI".

  2. They are more powerful compared to our earlier, low expectations, but I think that so far they can only solve the same things as humans. So I would rather say that what we are experiencing is: if there is a TC0 circuit for some problem, then there is an algorithm to find one, or a good approximation of it. Is there some theory studying such problems?

  3. Ramon Bejar Torres, 5:50 PM, October 29, 2025

    But neural networks, as functions, are more like approximation functions, as they do not always give the right answer.

    Do we have approximation complexity results for nonuniform circuits of constant depth?

  4. I am not sure why this is the power of non-uniformity. Data structures seem like a better lens for what is happening. I think non-uniformity is interesting only insofar as the circuit behaves drastically differently for different input sizes.

    1. I would argue that GPT 4 behaves "drastically differently" than GPT 3, in some way I probably could not formalize. More drastic I think than the difference between "These two TC0 networks both just multiply integers, but the bigger one can multiply bigger integers." And the jury is still out on whether merely scaling to larger networks is enough to get to (or beyond) AGI, which would certainly count as "drastically different" from GPT 3.

      Of course, one could be pedantic and claim that a uniform Turing machine produced all these networks: on input 1^n, run gradient descent on a transformer network of size poly(n) using poly(n) FLOPS of compute. (Is it a polynomial dependence? I don't know if there's enough publicly available data since the labs stopped publishing their numbers.) But the extra work in going from 10^19 to 10^21 to ... to 10^26 FLOPS of training compute (https://epoch.ai/gradient-updates/why-gpt5-used-less-training-compute-than-gpt45-but-gpt6-probably-wont) sure feels like significantly more work than "extend this integer-multiplying circuit to handle integers with more bits", which is the "typical" sort of uniform family of circuits one wants to model with classes like TC.

  5. I've been musing about this for a decade or so, and I must say I am not in complete alignment with Lance's view.

    When we think of non-uniform circuits, we're usually talking about a whole bunch of "advice" for each input size. This seems markedly different from all the developments in deep learning or AI; here it seems like a large finite blob of valuable information is learned during training / post-training, and that powers valuable inferences (either for a single problem or a class of problems, all up to some large-ish finite instance size). That "large finite blob" has enough "knowledge" to generate *programs* that have loops in them -- thus it's an odd kind of object from a complexity-theoretic perspective. It seems not to fit into the "non-uniform" computing or even things like "P-uniform" or "logspace-uniform" computing regimes. In particular, it appears that there isn't any type of *problem-specific* and *instance-size-specific* non-uniformity / intelligence learned in these things.

  6. https://cstheory.stackexchange.com/questions/22354/the-unreasonable-power-of-non-uniformity#22392 :
    "I have the impression that the real issue here is the unreasonable heavy burden of proof, not the unreasonable power of non-uniformity. As the answers by chazisop and András Salamon already stress, undecidable languages become computable even in very restricted non-uniform languages, because the burden of proof has been completely waived."

    But I also see a new element here, namely that parallelization takes different forms:
    0) hardware circuits that implement stuff like floating point and integer arithmetic
    1) vectorization via SIMD primitives
    2) GPU acceleration via SIMT paradigm, grouping threads into 32-wide bundles called warps.
    3) ... other reasons why GPU (or specialized hardware) can accelerate things ...
    4) multi-core CPU parallelization

    For 0-2 to be fast, the burden of ensuring that things work out has to be outsourced to some preprocessing. Each individual processing unit just has to believe that things will work seamlessly together in the great scheme of things; everything would start to break apart if it tried to check (i.e. prove) things on its own.

    So it makes "perfect sense" that circuits have to be non-uniform to achieve their full power, or at least that the uniformity of the circuits "should" come from a much more powerful complexity class, like EXPTIME.

  7. @Jakito I love this quote "the burden of proof has been completely waived"; the same holds for all these AI hijinks (human in the loop ultimately owns responsibility for establishing correctness of AI generated content).

    We are eliding the fact that the unreplicable power and fault-tolerance of living systems arises from the vast and extensive close-range and long-range communication networks among autonomous microscopic agents.
    This communication spans the full gamut of 1-to-1, 1-to-many and many-to-1 connections, with connects and disconnects happening constantly and non-deterministically.
    While we claim to approximate the molecular configurations as distinct 'states' and model their state-transitions, the communication modes span a continuous space of electro-chemical potential differences.
    One example in this domain is the snow-flake crystal, where the hexagonal symmetry is established by very long-range (relative to the size of water molecules) electrostatic forces across all molecules within the crystal; of course I am not saying there is anything autonomous inside a snow-flake, but the communication is evident in the end macroscopic result (just as in a bilaterally symmetric leaf or animal).

  8. The explanation I've heard for "why snowflakes are symmetric" is that as a snowflake falls, it goes through different atmospheric conditions (humidity, temperature, wind, etc.), which are basically the same for all of the snowflake. The crystals which form (analogously to Conway's "Game of Life", except with continuously-changing "rules") play out in the same way, in six- or twelve-fold symmetry.

    That's my understanding of the conventional wisdom. Disclaimer: I haven't looked into this much.

    snowcrystals.com has some relevant description, and lots of pictures of snowflakes (some with surprising shapes).

    1. If it's only due to uniform atmospheric conditions, I'd expect a perfect sphere as the crystal falls aggregating water molecules (just like lead shot or steel ball bearings obtained by pouring molten metal into water), but here we always get flattened six-fold symmetry. The explanation, I believe, is that the water molecules already in the crystal force the location where new molecules are aggregated, and these forces are truly long-range, spanning across the entire crystal (as Feynman aptly illustrated, electrostatic forces are tremendously powerful despite the inverse-square-law dropoff).
