Sunday, April 26, 2026

LEAPing into the Future of Coding

A few months ago in Oxford, Bernard Sufrin, an emeritus fellow, mentioned he was looking to hire a student to implement LEAP (Logic Engine for Argument by Pointing), a system for teaching logic by having students prove basic theorems via pointing and clicking. Rahul Santhanam said why not give it to AI. Bernard said AI couldn't handle the task. I thought, why not give it a try?

I gave Claude a single prompt that Bernard helped me formulate:

Architecture of a system to support proof by pointing in first order logic using LEAP

About an hour later, without giving additional details, we had a working prototype. Give it a try. I put the code and some more details on GitHub.

Watching Claude work was amazing. Claude created an architecture based on the paper Proof by Pointing by Yves Bertot, Gilles Kahn and Laurent Théry.

Claude then asked me:

Would you like me to proceed with implementing any of these layers?

Why not? So I told Claude to go ahead.

It started coding and it created 78 test cases and kept debugging itself until all those test cases worked out. It took me longer to get the program working on the web than Claude took to program it. I sent the link to Bernard who responded "Wow! I take it back if the program was all synthesized from the prompt." 

When I asked Bernard a month later if I could post about this program, he agreed, though he had some additional comments.

The consolation for me is that the outcome, though surprising and I suppose delightful (though it didn't manage rules for quantifiers), didn't refute my conjecture that Claude et al cannot do "architecture". The architecture (MVC) is stock; the "proof by pointing" hint led it to a paper of the same name which gave details of how to derive terms/formulae/goals in the (unfinished) proof from screen coordinates. Maybe you could give a substantively different prompt.

It may be of interest to know that the Bornat/Sufrin JAPE program, still on GitHub, was eventually a lot more ambitious when it came to selecting things: terms, subterms, goals. But it took more than 2 hours to build!

Bernard has a point. It wasn't just the fifteen-word prompt alone. Claude leaned on the Bertot et al. paper to guide its design and implementation. Still, that Claude did architect, build and debug the system from the prompt and the paper is truly impressive. We've crossed a threshold for coding, far beyond what would have been possible just a few months earlier.

19 comments:

  1. I typed into the system p | ~p which is p OR not(p) and it said

    Proof Tree o |- p vee ~p

    I typed into the system p & ~p which is p AND not(p) and it said

    Proof Tree o |- p wedge ~p

    So it can prove true AND false statements! That can't be good.

    ReplyDelete
    Replies
    1. You can't complete the proof of
      p and not p

      Delete
    2. but the output did not tell me that. It should have told me that.

      Delete
    3. Prove both the input and NOT the input formula; whichever succeeds, you're done. Why does the AI need to be told to do what's obvious to a human?

      Delete
  2. so i don't quite understand how this works. the previous commenter said you can enter p & ~p, which should evaluate to false ... but i guess that's not the point of this gadget. i see.

    What version/product of Claude was used for this? I suspect this must be the paid version, but which model?

    ReplyDelete
    Replies
    1. I did use the paid version, Opus 4.6. Instructions are in the README.

      Delete
  3. I tried it out as follows: I typed in
    (( p->q) & (q->r)) -> (p->r)
    and it just put that statement in the proof box and then wrote it again as the top of the proof tree, but that's it. Not in the slightest enlightening or interesting. What is it supposed to output?

    ReplyDelete
  4. I have an interesting idea: what if we model P problems as neurons and NP as their connections in a network? When the weight of P and NP are balanced, the answer is correct. When slightly unbalanced, the answer is approximate. When very unbalanced, it's wrong. This creates a spectrum instead of a binary yes/no. The brain and AI both work this way — approximating answers through balance. Has this approach been explored in complexity theory?

    ReplyDelete
  5. Claude / ChatGPT / Gemini and others are all very good advancements in the NLP realm. However, you (not necessarily you, but the generic you) just have to be careful because they are also tuned to be very flattering. As an example, a user could come to the LLM with some "New Math" thinking they solved the Collatz conjecture, and the model is just as likely to call it groundbreaking as to reject it.

    As far as the coding part, there is still the risk of hallucinations. What I mean by that is the LLMs linking you to packages that do not exist. (say a spaCy version or pandas or numpy, etc). Naively, we simply run it.

    What happens is that these LLMs recommend the same non-existent packages over and over again. So hackers see this as an opportunity to insert malware into these packages. So they will create a package that "looks" like said package but with added vulnerabilities.

    This is a growing problem because more and more people are vibe coding and not checking.


    This is a paper that talks about it and analyzes the code generation process by different LLMs.
    https://arxiv.org/html/2406.10279v3

    ReplyDelete
  6. here is a thought on the current state of AI coding:

    it is a very useful tool, but there is also a lot of hype.

    take, e.g., Anthropic's own software: I keep finding trivial bugs in it that I would not expect from any serious software.

    or take their Mythos announcement. isn't it amazing that they got hacked for access to that model? weren't they using it to catch their own security bugs?

    in short, these AI coding tools still have a long way to go.

    the real problem the industry is dealing with is that it is now much easier to create low-quality code, so there is much more work for humans to review code and catch mistakes. but people are lazy, so they do not actually check the huge amount of code that is generated, the same way people often do not verify the incorrect statements chatbots return before posting them.

    this is a security nightmare; we are going back to the early Internet days, before we got serious about cybersecurity.

    and I am not even talking about the risks from adversarial inputs through software supply chain.

    I have yet to see any serious non-MVP software that is created without a lot of human involvement.

    according to Google's own internal numbers, their engineers are not even 2x as productive. the overall productivity improvement in software development is mostly marginal at this point, maybe 5 to 20 percent.

    it makes me really sad that reputable scientists repeat AI hype simply based on toy MVP coding projects, and without actually looking at hard software engineering productivity numbers.

    it is the job of academia to scrutinize the AI hype of AI labs and many are failing to do so.

    ReplyDelete
  7. to the last anon poster: there is some sentiment there that I subscribe to. but truth be told, a lot of code these days is generated by these llms. and it's substantially improved since 3 years ago, to the point of being able to actually write correct code that runs and provides output that seems at least reasonably correct.
    is it truly correct? come on, i don't know, and most of us probably won't look that deeply into it after seeing that it works for the test banks ... the same applies to ordinary code you probably come up with ... isn't proving code correct in general equivalent to solving the Halting Problem? and what about other undecidability issues?

    ReplyDelete
    Replies
    1. This is the risk and the reward. 90% of the code may be better, but how many edge cases does it take to show a proof collapses? Just one.

      What's going on now is that even with this abundance of code, we're seeing more security risks from this code. Not everywhere, but in enough places that we're all vulnerable.

      There's a saying that you never argue with a fool. Arguing with an LLM is that way because they are confidently wrong. But these models are all over social media confusing words, mis-remembering passwords, deleting the wrong files, etc. Yet this is supposed to be the new age of coding.

      Delete
    2. there is a lot of code that is generated using these llms. but for software that ships, you have to measure the actual productivity. in that you have to take into account the number of times the llm generates code and the programmer needs to reject it, rewrite it, or tell the llm to change it, ...

      there are a lot of software developer productivity metrics that would be interesting to know regarding these AI coding systems, but the big companies do not publish them. they publish useless metrics like some percentage of code being generated by AI. it is a bit like big tobacco companies not publishing their own health risk research publicly.

      they are also laying off people. if you are not into these topics you might think it is because of productivity gains. in reality, they are quietly cancelling projects internally to move people around, because they are actually short of the number of engineers they need. the layoffs are mainly to create financial space for the massive capital investment they are making in AI data centers. if these systems were really that good, you would expect these companies to clear out the massive number of open small bugs they internally track but never find the time and resources to fix, not cancel projects because of a shortage of engineers.

      from what I hear, a large number of projects are silently being cancelled at these big tech companies to reduce operational cost by laying off people, and to move engineers to work on more critical projects.

      have these coding AI agents improved? yes. are they good enough to do the job of a junior software engineer? not yet. have they led to a massive improvement in engineering productivity? not yet; the internal estimates I hear are in the range of 5 to 20 percent.

      now 5 to 20 percent improvement is not bad at all, but the hype would make you believe that the software engineering productivity improvement from these AI coding tools is 2x or 10x or 100x, which is false.

      can they get to 2x productivity? probably, in a few years. but these major AI labs keep over promising and under delivering and it is good to keep that in mind.

      Delete
    3. it is really ironic that Anthropic's announcement about their super duper security model that would find and fix security issues was followed very soon by an announcement from Anthropic that someone had hacked Anthropic and gained unauthorized access to that model.

      if it is really so good, why does their own software get hacked so easily?

      if they cannot protect their most sensitive model from unauthorized access, do you think they can protect your sensitive data, or that the software generated by their models will not lead to massive security incidents?

      Delete
  8. I took a look at the code and the architecture document. Both look poorly written to me. I didn't try the code. But, even assuming it works, that's a low bar.

    ReplyDelete
  9. As most everyone is aware, the situation for the human project is far more stark than just this AI fad.
    The only aspects in which we excel (relative to other life) are the prodigious rate at which we've been going through whatever non-renewable resource we can get our hands on AND being the only ones with the ability to read the past and foretell the future quite accurately.
    A 42-gallon barrel of crude oil represents ~8.5 years of an adult human's caloric input (I don't know the vast numbers of, and the duration over which, marine phytoplankton sequestered this quantity of concentrated energy and organic matter); it is no surprise we're addicted to this and other irreplaceable fossilized energy milkshakes.
    The several-trillion-dollar valuations of our commercial enterprises are the thing we're all clamoring to prop up, or else ... as we've always done, wage wars to prop up the economy.
    Oh, there is nothing to worry about, doomsayers spouting nonsense; AI is going to unlock nuclear fusion or invent some other magical physics in only a matter of years.

    In response to anon above:
    To God obedient, O Muse,
    Demand no wreath, fear no abuse.
    Remain to praise and slander cool,
    And do not argue with a fool. -A.Pushkin.

    ReplyDelete
  10. I'm seeing more code right now, but I question the quality, or really how it compares with what a good developer could produce by going to sites like codereview.stackexchange.com and just browsing the wealth of archives. That, along with GitHub and communities on Reddit and Quora, has existed for decades and allowed us to quickly combine functions into working programs.

    All these LLMs are doing is speeding up that process, but adding many extra layers. I say extra layers because their design and my design can be totally different and it is very difficult to explain to an llm what I see or how I want it to look.

    Then there's the added frustration of me saying "add this button at the bottom of the page" and the llm then thinking it needs to redesign the entire concept, which we have been arguing over for hours.

    I think it's great that people are interested in coding, especially mathematics. However, as mathematicians who are interested in proof and academia, more than lay coders doing this as a hobby, we cannot simply trust an overconfident LLM.

    ReplyDelete
  11. This recent talk by Andrej Karpathy has some interesting points about the state of AI coding:

    https://youtu.be/96jN2OCOfLs

    ReplyDelete
  12. Consider what LLMs do well - predicting the next word, to an extent. There is a more advanced model behind this, but this is why coding and conversation work so well for them.

    You know what doesn't work well for them? Ciphers and encryption. Because they're bad at math. And with primality-based encryption, being off by even a little messes up your whole system.

    One of the reasons that LLMs are so good at coding is that all the code online is parsed. Dating back to the ancient days of BASIC and COBOL and FORTRAN, we always had a start and stop for segments of code. That made it easily computer- and human-readable. The same goes for C/C++/Java/Python (a bit different with tabs) / JavaScript, etc. It's all online and easily parsable into segments.

    Just think: if that same code that is open on GitHub had been run through a scrambler first, we would never have seen this age of AI agents.

    There is definitely a "sweet spot" for AI coding, but there is a point that these agents cannot reach (Gödel, Cook, Turing).

    It feels like magic right now because much of what we do right now is repetition. We have to copy and paste a lot of old code and these models help with that. But the moment it becomes time (for the model only) to invent a new algorithm or formula or procedure, that is where we see the hallucinations.

    ReplyDelete