Sunday, February 22, 2026

ChatGPT gets an easy math problem wrong (I got it right). How is that possible?

A commenter on this post asked me (or anyone) to solve the problem without AI:

A, B, C, D, E are digits (the poster said A could be 0, but I took A to be nonzero) such that

ABCDE + BCDE + CDE + DE + E = 20320.

I solved it completely by hand. You can try it yourself or look at my solution which is here. I found seven solutions. 
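Expanding the positional notation, the left-hand side telescopes to 10000A + 2000B + 300C + 40D + 5E, so the whole search space is only 10^5 candidates. Here is a minimal brute-force sketch (my own check, not ChatGPT's program) that confirms the count of seven:

```python
from itertools import product

# ABCDE + BCDE + CDE + DE + E = 20320, where A..E are digits.
# The sum telescopes to 10000A + 2000B + 300C + 40D + 5E.
solutions = []
for a, b, c, d, e in product(range(10), repeat=5):
    if a == 0:
        continue  # leading digit taken to be nonzero, per the post
    if 10000*a + 2000*b + 300*c + 40*d + 5*e == 20320:
        solutions.append((a, b, c, d, e))

print(len(solutions))  # 7
for sol in solutions:
    print(sol)
```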

I THEN asked ChatGPT to give me all solutions to see if I missed any. 
 
I had it backwards. ChatGPT missed some solutions.  The entire exchange between chatty and me is here.

I asked it how it could get it wrong and how I can trust it. It responded to that and follow-up questions intelligently. 

Note that the problem is NOT a Putnam problem or anything of the sort. But I've read that AI can solve Putnam problems. So, without an ax to grind, I am curious: how come ChatGPT got the ABCDE problem wrong?

Speculative answers

1) The statement that AI has solved IMO problems refers to an AI that was trained for competition problems, not the free ChatGPT. For more issues with the AI-IMO results, see Terry Tao's comments here.

2) ChatGPT is really good when the answer to the question is on the web someplace or can even be reconstructed from what's on the web. But if a problem, even an easy one, is new to the web, it can hallucinate (it didn't do that on my problem but it did on muffin problems) or miss some cases (it did that on my problem). 

3) It's only human. (It pretty much says this.)

4) The next version, or even the paid version, is better! Lance ran it on his paid ChatGPT; it wrote a program to brute force the problem and got all 7 solutions. 

5) I said ChatGPT got the problem wrong. If a student had submitted that solution, it would get lots of partial credit, since it took the right approach and only missed a few cases. So should I judge ChatGPT more harshly than a student? Yes. 

The question still stands: how come ChatGPT could not do this well-defined, simple math problem?




11 comments:

  1. What model did you select? Because GPT-5.2 Thinking got it right for me.

    ReplyDelete
    Replies
    1. Upon getting your comment I brought up the free ChatGPT that I use and asked it "what version of ChatGPT are you?" It said it was GPT-5.2. However, what I described in my post I did a while back, so I suspect it was 5.1. That is good news: it's getting better!

      Delete
    2. The non-reasoning versions are pretty much useless for math. You can compare it to asking a human to answer your math questions but with the restriction that they need to answer immediately based on their gut feelings.

      The models used by actual mathematicians cost $250 per month and can take 30 minutes or longer to answer a question. There are even better non-commercial models.

      Delete
    3. which models are we talking about?

      Delete
  2. This article by professor of mathematics Daniel Litt gives a great overview of the current state of AI for mathematics: https://www.daniellitt.com/blog/2026/2/20/mathematics-in-the-library-of-babel

    ReplyDelete
  3. > How come ChatGPT could not do this well defined simple math problem.

    Because ChatGPT (like all LLMs) does not understand (math or anything else).

    It is like the student in college who turns in homework they have copied from their friends. Sometimes their homework will be correct.

    See the YouTube video "That time we asked every #ai if we should walk to the car wash".

    ReplyDelete
    Replies
    1. Me: "I need to wash my car, and the car wash is 100 meters away. Should I walk or drive?"

      Gemini: "Unless you’ve been hitting the gym hard enough to carry a 4,000-pound vehicle on your back, you should probably drive."

      Delete
    2. Indeed, copying your homework means you sometimes get the homework right. But, you don't learn or understand.

      Delete
  4. There used to be a time, up until a year ago, when LLM experts thought these models would scale to solve anything.

    Nowadays the consensus has shifted: they keep making mistakes, and there is no way to fix this by scaling alone.

    They are probabilistic output generators; they don't think. That is combined with a recursive self-reflection loop on top. Still, they make errors.

    The tools, like Python code, are a different matter. In that case you are not really asking it to solve the problem but to translate it into a Python program. These systems are very good at translation. So no surprise there. 

    If you really want to test them hard, you should tell them not to use tools like the Python code interpreter. That will show you what reasoning the model itself is capable of.

    The ABCDE problem is a small problem. If you turn off tools and ask it to solve a more difficult version, say 10 letters, even with pen and paper it will likely fail repeatedly, where a typical K-12 student with good math skills would not.

    This is not surprising at all if you have a sense of how LLMs actually work. There are good explanations by people like Yann LeCun of why LLMs keep making mistakes on simple cases. I highly recommend checking them out if you have not seen the argument.

    The IMO one is a much more complicated system; it is not comparable even to the most expensive version of ChatGPT. It is more like AlphaGo combined with a proof checker, branching, backtracking, and tree search with lots of hand-crafted heuristics.

    ReplyDelete
  5. Last week Claude repeatedly confused the unit 1-disk and the unit interval, in different sessions. And yet it has proved nontrivial theorems for me. These tend to be things that involve lots of bookkeeping. It also lies on most linear algebra problems that have resisted my attempts.

    ReplyDelete
  6. I'm hard-pressed to accept that brute-forcing through 10^5 candidates to fit an equation is a measure of intelligence (I initially thought {A, ..., E} were bona fide integers and was like wow, this sure looks like a breakthrough result; then I nearly fell out of my chair when I realized they're just digits!).

    What then fits the bill? It's pretty easy to illustrate using Liu Hui's π algorithm (https://en.wikipedia.org/wiki/Liu_Hui%27s_%CF%80_algorithm): the idea of approximating circles with regular polygons was well established long before his time, but working around the requirement to calculate square roots of irrationals (extremely tedious in his time, and even today with pen and paper) to achieve 3.141024 < π < 3.142704 was recognized as a landmark achievement, whose ideas were further developed by subsequent mathematicians.

    I wonder how amazed Liu Hui would be by a machine's capability to compute 105 trillion digits; he might scratch his head over the point of the actual digits themselves, but he would certainly appreciate the algorithms, techniques, and technology doing the computation (all developed by humans without AI).

    ReplyDelete