Sunday, April 26, 2026

LEAPing into the Future of Coding

A few months ago in Oxford, Bernard Sufrin, an emeritus fellow, mentioned he was looking to hire a student to implement LEAP (Logic Engine for Argument by Pointing), a system for teaching logic by having students prove basic theorems via pointing and clicking. Rahul Santhanam said why not give it to AI. Bernard said AI couldn't handle the task. I thought, why not give it a try?

I gave Claude a single prompt that Bernard helped me formulate:

Architecture of a system to support proof by pointing in first order logic using LEAP

About an hour later, without giving additional details, we had a working prototype. Give it a try. I put the code and some more details on GitHub.

Watching Claude work was amazing. Claude created an architecture based on the paper Proof by Pointing by Yves Bertot, Gilles Kahn and Laurent Théry.

Claude then asked me:

Would you like me to proceed with implementing any of these layers?

Why not? So I told Claude to go ahead.

It started coding and it created 78 test cases and kept debugging itself until all those test cases worked out. It took me longer to get the program working on the web than Claude took to program it. I sent the link to Bernard who responded "Wow! I take it back if the program was all synthesized from the prompt." 

When I asked Bernard a month later if I could post about this program, he agreed, though he had some additional comments.

The consolation for me is that the outcome, though surprising and I suppose delightful (though it didn't manage rules for quantifiers), didn't refute my conjecture that Claude et al cannot do "architecture". The architecture (MVC) is stock; the "proof by pointing" hint led it to a paper of the same name which gave details of how to derive terms/formulae/goals in the (unfinished) proof from screen coordinates. Maybe you could give a substantively different prompt.

It may be of interest to know that the Bornat/Sufrin JAPE program, still on GitHub, was eventually a lot more ambitious when it came to selecting things: terms, subterms, goals. But it took more than 2 hours to build!

Bernard has a point. It wasn't just the fifteen-word prompt alone. Claude leaned on the Bertot et al. paper to guide its design and implementation. Still, that Claude did architect, build and debug the system from the prompt and the paper is truly impressive. We've crossed a threshold for coding, far beyond what would have been possible just a few months earlier.

19 comments:

  1. I typed into the system p | ~p which is p OR not(p) and it said

    Proof Tree o |- p vee ~p

    I typed into the system p & ~p which is p AND not(p) and it said

    Proof Tree o |- p wedge ~p

    So it can prove true AND false statements! That can't be good.

    ReplyDelete
    Replies
    1. You can't complete the proof of
      p and not p

      Delete
    2. but the output did not tell me that. It should have told me that.

      Delete
    3. Prove both the input and NOT the input formula; whichever succeeds, you're done. Why does the AI need to be told to do what's obvious to a human?

      Delete
  2. so i don't quite understand how this works. the previous commenter said you can enter p & ~p, which should evaluate to false ... but i guess that's not the point of this gadget. i see.

    What version/product of Claude was used for this? I suspect this must be the paid version, but which model?

    ReplyDelete
    Replies
    1. I did use the paid version, Opus 4.6. Instructions are in the README.

      Delete
  3. I tried it out as follows: I typed in
    (( p->q) & (q->r)) -> (p->r)
    and it just put that statement in the proof box and then wrote it again as the top of the proof tree, but that's it. Not in the slightest enlightening or interesting. What is it supposed to output?

    ReplyDelete
  4. I have an interesting idea: what if we model P problems as neurons and NP as their connections in a network? When the weight of P and NP are balanced, the answer is correct. When slightly unbalanced, the answer is approximate. When very unbalanced, it's wrong. This creates a spectrum instead of a binary yes/no. The brain and AI both work this way — approximating answers through balance. Has this approach been explored in complexity theory?

    ReplyDelete
  5. Claude / ChatGPT / Gemini and others are all very good advancements in the NLP realm. However, you (not necessarily you, but the generic you) just have to be careful because they are also tuned to be very flattering. As an example, a user could come to the LLM with some "New Math" thinking they solved the Collatz conjecture, and the model is just as likely to call it groundbreaking as to reject it.

    As far as the coding part, there is still the risk of hallucinations. What I mean by that is the LLMs linking you to packages that do not exist. (say a spaCy version or pandas or numpy, etc). Naively, we simply run it.

    What happens is that these LLMs recommend the same non-existent packages over and over again. So hackers see this as an opportunity to insert malware into these packages. So they will create a package that "looks" like said package but with added vulnerabilities.

    This is a growing problem because more and more people are vibe coding and not checking.


    This is a paper that talks about it and analyzes the code generation process by different LLMs.
    https://arxiv.org/html/2406.10279v3

    ReplyDelete
  6. here is a thought on the current state of AI coding:

    it is a very useful tool, but there is also a lot of hype.

    take, e.g., Anthropic's own software: I keep finding trivial bugs in it that I would not expect from any serious software.

    or take their Mythos announcement. isn't it amazing that they got hacked for access to that model? weren't they using it to catch their own security bugs?

    in short, these AI coding tools still have a long way to go.

    the real problem the industry is dealing with is that it is now much easier to create low-quality code, so there is much more work for humans to review code and catch mistakes. but people are lazy, so they do not actually check the huge amount of code that is generated, the same way people often do not verify the incorrect statements chatbots return before posting them.

    this is a security nightmare; we are going back to the early Internet days, before we got serious about cybersecurity.

    and I am not even talking about the risks from adversarial inputs through software supply chain.

    I have yet to see any serious non-MVP software that is created without a lot of human involvement.

    according to Google's own internal numbers, their engineers are not even 2x as productive. the overall productivity improvement in software development is mostly marginal at this point, maybe 5 to 20 percent.

    it makes me really sad that reputable scientists repeat AI hype simply based on toy MVP coding projects, and without actually looking at hard software engineering productivity numbers.

    it is the job of academia to scrutinize the AI hype of AI labs and many are failing to do so.

    ReplyDelete
  7. to the last anon poster: there is some sentiment there that I subscribe to. but truth be told, a lot of code these days is generated by these llms. and it's substantially improved since 3 years ago, to the point of being able to actually write correct code that runs and provides output that seems at least reasonably correct.
    is it truly correct? come on, i don't know, and most of us probably won't look that deeply into it after seeing that it works for the test banks ... the same applies to ordinary code you probably come up with ... isn't proving code correct in general equivalent to solving the Halting Problem? and what about other undecidability issues?

    ReplyDelete
    Replies
    1. This is the risk and the reward. 90% of the code may be better, but how many edge cases does it take to show a proof collapses? Just one.

      What's going on now is that even with this abundance of code, we're seeing more security risks from this code. Not everywhere, but in enough places that we're all vulnerable.

      There's a saying that you never argue with a fool. Arguing with an LLM is that way because they are confidently wrong. But these models are all over social media confusing words, mis-remembering passwords, deleting the wrong files, etc. Yet this is supposed to be the new age of coding.

      Delete
    2. there is a lot of code that is generated using these llms. but for software that ships, you have to measure the actual productivity. in that you have to take into account the number of times the llm generates code and the programmer needs to reject it, rewrite it, or tell the llm to change it, ...

      there are a lot of software developer productivity metrics that would be interesting to know regarding these AI coding systems, but the big companies do not publish them. they publish useless metrics like some percentage of code being generated by AI. it is a bit like big tobacco companies not publishing their own health risk research publicly.

      they are also laying off people. if you are not into these topics you might think it is because of productivity gains. in reality, they are quietly cancelling projects internally to move people around, because they are actually short of the number of engineers they need. the layoffs are mainly to create financial space for the massive capital investment they are making in AI data centers. if these systems were really that good, you would expect these companies to clear out the massive number of open small bugs they internally track but never find the time and resources to fix, not cancel projects because of a shortage of engineers.

      from what I hear, a large number of projects are silently being cancelled at these big tech companies to reduce operational cost by laying off people, and to move engineers to work on more critical projects.

      have these coding AI agents improved? yes. are they good enough to do the job of a junior software engineer? not yet. have they led to a massive improvement in engineering productivity? not yet; the internal estimates I hear are in the range of 5 to 20 percent.

      now 5 to 20 percent improvement is not bad at all, but the hype would make you believe that the software engineering productivity improvement from these AI coding tools is 2x or 10x or 100x, which is false.

      can they get to 2x productivity? probably, in a few years. but these major AI labs keep over promising and under delivering and it is good to keep that in mind.

      Delete
    3. it is really ironic that Anthropic's announcement about their super duper security model that would find and fix security issues was followed very soon by an announcement from Anthropic that someone had hacked Anthropic and gained unauthorized access to that model.

      if it is really so good, why does their own software get hacked so easily?

      if they cannot protect their most sensitive model from unauthorized access, do you think they can protect your sensitive data, or that the software generated by their models will not lead to massive security incidents?

      Delete
  8. I took a look at the code and the architecture document. Both look poorly written to me. I didn't try the code. But, even assuming it works, that's a low bar.

    ReplyDelete
  9. As most everyone is aware, the situation for the human project is far more stark than just this AI fad.
    The only aspects in which we excel (relative to other life) are the prodigious rate at which we've been going through whatever non-renewable resource we can get our hands on AND being the only ones with the ability to read the past and foretell the future quite accurately.
    A 42-gallon barrel of crude oil represents ~8.5 years of an adult human's caloric input (I don't know the vast numbers of, and the duration over which, marine phytoplankton sequestered this quantity of concentrated energy and organic matter); it is no surprise we're addicted to this and other irreplaceable fossilized energy milkshakes.
    The several-trillion-dollar valuations of our commercial enterprises are the thing we're all clamoring to prop up, or else ... as we've always done, wage wars to prop up the economy.
    Oh, there is nothing to worry about, doomsayers spouting nonsense; AI is going to unlock nuclear fusion or invent some other magical physics in only a matter of years.

    In response to anon above:
    To God obedient, O Muse,
    Demand no wreath, fear no abuse.
    Remain to praise and slander cool,
    And do not argue with a fool. -A.Pushkin.

    ReplyDelete
  10. I'm seeing more code right now, but I question the quality, or really how it compares with what a good developer could produce by going to sites like codereview.stackexchange.com and just browsing the wealth of archives. That, along with GitHub and communities on Reddit and Quora, has existed for decades and allowed us to quickly combine functions into working programs.

    All these LLMs are doing is speeding up that process, but adding many extra layers. I say extra layers because their design and my design can be totally different and it is very difficult to explain to an llm what I see or how I want it to look.

    Then there's the added frustration of me saying "add this button at the bottom of the page" and the llm then thinking it needs to redesign the entire concept, which we have been arguing over for hours.

    I think it's great that people are interested in coding, especially mathematics. However, as mathematicians who are interested in proof and academia, more than lay coders doing this as a hobby, we cannot simply trust an overconfident LLM.

    ReplyDelete
  11. This recent talk by Andrej Karpathy has some interesting points about the state of AI coding:

    https://youtu.be/96jN2OCOfLs

    ReplyDelete
  12. Consider what LLMs do well - predicting the next word, to an extent. There is a more advanced model behind this, but this is why coding and conversation work so well for them.

    You know what doesn't work well for them? Ciphers and encryption. Because they're bad at math. And with primality-based encryption, being off by even a little messes up your whole system.

    One of the reasons that LLMs are so good at coding is that all the code online is parsed. Dating back to the ancient days of BASIC and COBOL and FORTRAN, we always had a start and stop for segments of code. That made it easily computer- and human-readable. The same goes for C/C++/Java/Python (a bit different with tabs) / JavaScript, etc. It's all online and easily parsable into segments.

    Just think: if that same code that is open on GitHub had been run through a scrambler first, we would never have seen this age of AI agents.

    There is definitely a "sweet spot" for AI coding, but there is a point that these agents cannot reach (Gödel, Cook, Turing).

    It feels like magic right now because much of what we do right now is repetition. We have to copy and paste a lot of old code and these models help with that. But the moment it becomes time (for the model only) to invent a new algorithm or formula or procedure, that is where we see the hallucinations.

    ReplyDelete