Sunday, April 26, 2026

LEAPing into the Future of Coding

A few months ago in Oxford, Bernard Sufrin, an emeritus fellow, said he was looking to hire a student to implement LEAP (Logic Engine for Argument by Pointing), a tool to teach logic by having students prove basic theorems via pointing and clicking. Rahul Santhanam suggested giving the task to an AI. Bernard said AI couldn't handle it. I thought, why not give it a try?

I gave Claude a single prompt that Bernard helped me formulate:

Architecture of a system to support proof by pointing in first order logic using LEAP

About an hour later, without my giving any additional details, we had a working prototype. Give it a try. I put the code and some more details on GitHub.

Watching Claude work was amazing. Claude created an architecture based on the paper Proof by Pointing by Yves Bertot, Gilles Kahn and Laurent Théry.
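
The paper's central idea is simple to state: a click on a subformula determines the next proof steps, because the connective at the clicked position, together with which side of the turnstile it sits on, picks out the rule to apply. Here is a minimal sketch of that dispatch in Python; the AST, function names, and rule labels are my own illustration of the technique, not Claude's actual code.

# A minimal sketch of proof-by-pointing rule selection, after Bertot,
# Kahn and Théry. The AST and rule labels here are my own illustration;
# this is not the code Claude generated.

from dataclasses import dataclass

@dataclass
class Formula:
    op: str               # 'and', 'or', 'implies', 'not', or 'atom'
    args: tuple = ()      # subformulas (empty for atoms)
    name: str = ''        # atom name when op == 'atom'

def rules_for_click(goal: Formula, path: list) -> list:
    """A click selects a subformula, identified by 'path' (a list of
    argument indices). Proof by pointing turns that click into the
    sequence of rules that decompose the goal down to the clicked
    occurrence. The same connective calls for a left or right rule
    depending on which side of the turnstile it currently sits on."""
    side, rules = 'right', []
    for i in path:
        rules.append(f"{goal.op}-{side}")
        # descending into a negation, or into the antecedent of an
        # implication, moves the subformula across the turnstile
        if goal.op == 'not' or (goal.op == 'implies' and i == 0):
            side = 'left' if side == 'right' else 'right'
        goal = goal.args[i]
    return rules

p, q, r = (Formula('atom', name=x) for x in 'pqr')
# Clicking on p inside the goal (p & q) -> r suggests implies-right
# (p & q becomes a hypothesis), then and-left on that hypothesis:
print(rules_for_click(Formula('implies', (Formula('and', (p, q)), r)),
                      path=[0, 0]))   # ['implies-right', 'and-left']

Clicking deeper into a formula yields the whole sequence of decomposition rules at once, which is what makes pointing feel like proving.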

Claude then asked me:

Would you like me to proceed with implementing any of these layers?

Why not? So I told Claude to go ahead.

It started coding, created 78 test cases, and kept debugging itself until all of them passed. It took me longer to get the program working on the web than it took Claude to write it. I sent the link to Bernard, who responded, "Wow! I take it back if the program was all synthesized from the prompt."

When I asked Bernard a month later if I could post about this program, he agreed, though he had some additional comments.

The consolation for me is that the outcome, though surprising and I suppose delightful (though it didn't manage rules for quantifiers), didn't refute my conjecture that Claude et al cannot do "architecture". The architecture (MVC) is stock; the "proof by pointing" hint led it to a paper of the same name which gave details of how to derive terms/formulae/goals in the (unfinished) proof from screen coordinates. Maybe you could give a substantively different prompt.

It may be of interest to know that the Bornat/Sufrin JAPE program, still on GitHub, was eventually a lot more ambitious when it came to selecting things: terms, subterms, goals. But it took more than 2 hours to build!

Bernard has a point. It wasn't the fifteen-word prompt alone: Claude leaned on the Bertot et al. paper to guide its design and implementation. Still, that Claude architected, built, and debugged the system from just the prompt and the paper is truly impressive. We've crossed a threshold in coding, far beyond what was possible only a few months earlier.

13 comments:

  1. I typed into the system p | ~p, which is p OR not(p), and it said

    Proof Tree o ⊢ p ∨ ~p

    I typed into the system p & ~p, which is p AND not(p), and it said

    Proof Tree o ⊢ p ∧ ~p

    So it can prove true AND false statements! That can't be good.

    ReplyDelete
    Replies
    1. You can't complete the proof of
      p and not p

      Delete
    2. But the output did not tell me that. It should have.

      Delete
    3. Prove both the input and the negation of the input; whichever succeeds, you're done. Why does the AI need to be told to do what's obvious to a human?

      Delete
  2. so I don't quite understand how this works. The previous commenter said you can enter p & ~p, which should evaluate to false ... but I guess that's not the point of this gadget. I see.

    What version/product of Claude was used for this? I suspect it must be the paid version, but which model?

    ReplyDelete
    Replies
    1. I used the paid version, Opus 4.6. Instructions are in the README.

      Delete
  3. I tried it out as follows: I typed in
    (( p->q) & (q->r)) -> (p->r)
    and it just put that statement in the proof box and then wrote it again as the top of the proof tree, but that's it. Not in the slightest enlightening or interesting. What is it supposed to output?

    ReplyDelete
  4. I have an interesting idea: what if we model P problems as neurons and NP as their connections in a network? When the weight of P and NP are balanced, the answer is correct. When slightly unbalanced, the answer is approximate. When very unbalanced, it's wrong. This creates a spectrum instead of a binary yes/no. The brain and AI both work this way — approximating answers through balance. Has this approach been explored in complexity theory?

    ReplyDelete
  5. Claude / ChatGPT / Gemini and others are all very good advancements in the NLP realm. However, you (not necessarily you, but the generic you) just have to be careful, because they are also programmed to be very agreeable. As an example, a user could come to the LLM with some "new math" thinking they solved the Collatz conjecture, and the model is just as likely to call it groundbreaking as to reject it.

    As for the coding part, there is still the risk of hallucinations. What I mean by that is LLMs pointing you to packages that do not exist (say a spaCy version, or pandas, or numpy, etc.). Naively, we simply run the code.

    What happens is that these LLMs recommend the same nonexistent packages over and over again. So hackers see this as an opportunity to insert malware: they create a package that "looks" like the recommended one but with added vulnerabilities.

    This is a growing problem because more and more people are vibe coding and not checking the output. (See the sketch after the comments below.)


    Here is a paper that talks about this and analyzes the code generation process of different LLMs:
    https://arxiv.org/html/2406.10279v3

    ReplyDelete
  6. here is a thought on the current state of AI coding:

    it is a very useful tool, but there is also a lot of hype.

    take e.g. Anthropic's own software: I keep finding trivial bugs in it that I would not expect from any serious software.

    or take their Mythos announcement. Isn't it amazing that they got hacked for access to that model? weren't they using it to catch their own security bugs?

    in short, these AI coding tools still have a long way to go.

    the real problem the industry is dealing with is that it is now much easier to create low-quality code, so there is much more work for humans to review code and catch mistakes. but people are lazy, so they do not actually check the huge amount of code that is generated, the same way people often do not verify the incorrect statements chatbots return before posting them.

    this is a security nightmare; we are going back to the early Internet days, before we got serious about cybersecurity.

    and I am not even talking about the risks from adversarial inputs through the software supply chain.

    I have yet to see any serious non-MVP software that is created without a lot of human involvement.

    according to Google's own internal numbers, their engineers are not even 2x as productive. the overall productivity improvement in software development is mostly marginal at this point, maybe 5 to 20 percent.

    it makes me really sad that reputable scientists repeat AI hype based simply on toy MVP coding projects, without actually looking at hard software-engineering productivity numbers.

    it is the job of academia to scrutinize the AI hype of AI labs and many are failing to do so.

    ReplyDelete
  7. to the last anon poster: there is some sentiment there that I subscribe to. but truth be told, a lot of code these days is generated by these LLMs, and it has improved substantially since 3 years ago, to the point of actually writing correct code that runs and produces output that seems at least reasonably correct.
    is it truly correct? come on, I don't know, and most of us probably won't look that deeply into it after seeing that it works on the test banks ... the same applies to ordinary code you probably come up with ... isn't proving code correct, in general, equivalent to solving the Halting Problem? and what about other undecidability issues?

    ReplyDelete
    Replies
    1. This is the risk and the reward. 90% of the code may be better, but how many edge cases does it take to show a proof collapses? Just one.

      What's going on now is that even with this abundance of code, we're seeing more security risks from this code. Not everywhere, but in enough places that we're all vulnerable.

      There's a saying that you should never argue with a fool. Arguing with an LLM is like that, because they are confidently wrong. But these models are all over social media confusing words, misremembering passwords, deleting the wrong files, etc. Yet this is supposed to be the new age of coding.

      Delete
  8. I took a look at the code and the architecture document. Both look poorly written to me. I didn't try the code. But, even assuming it works, that's a low bar.

    ReplyDelete
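
A footnote on the hallucinated-package risk raised in comment 5 (the sketch referenced there): a cheap defensive habit is to check that a recommended package actually exists on PyPI, and to glance at its release history, before installing it. This sketch uses only the Python standard library and PyPI's public JSON endpoint; the existence check is illustrative and is not, by itself, a defense against typosquatted lookalikes.

# Sketch: check that package names exist on PyPI before installing them,
# a minimal guard against hallucinated (and typosquatted) package names.
# Standard library only; illustrative, not a complete defense.

import json
import sys
from urllib.error import HTTPError
from urllib.request import urlopen

def pypi_info(name):
    """Return PyPI metadata for 'name', or None if no such package exists."""
    try:
        with urlopen(f"https://pypi.org/pypi/{name}/json") as resp:
            return json.load(resp)
    except HTTPError as e:
        if e.code == 404:
            return None        # the package name does not exist on PyPI
        raise

if __name__ == "__main__":
    for name in sys.argv[1:]:
        info = pypi_info(name)
        if info is None:
            print(f"{name}: not on PyPI -- likely hallucinated, do not install")
        else:
            n = len(info.get("releases", {}))
            print(f"{name}: exists with {n} releases -- still check author and history")

A nonexistent name is a strong hallucination signal; an existing name with a thin or very recent release history still deserves scrutiny.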