A, B, C, D, E are digits (the poster said A could be 0 but I took A to be nonzero) such that
ABCDE + BCDE + CDE + DE + E = 20320.
(CLARIFICATION ADDED LATER: We allow two letters to map to the same digit.)
I solved it completely by hand. You can try it yourself or look at my solution which is here. I found seven solutions.
I THEN asked ChatGPT to give me all solutions to see if I missed any.
I had it backwards. ChatGPT missed some solutions. The entire exchange between chatty and me is here.
I asked it how it could get it wrong and how I can trust it. It responded to that and follow-up questions intelligently.
Note that the problem is NOT a Putnam problem or anything of the sort. But I've read that AI can solve Putnam problems. SO, without an ax to grind, I am curious: how come ChatGPT got the ABCDE problem wrong?
Speculative answers
1) The statement that AI has solved IMO problems refers to an AI that was specially trained for such competition problems, not the free ChatGPT. For more issues with the AI-IMO results see Terry Tao's comments here.
2) ChatGPT is really good when the answer to the question is on the web someplace or can even be reconstructed from what's on the web. But if a problem, even an easy one, is new to the web, it can hallucinate (it didn't do that on my problem but it did on muffin problems) or miss some cases (it did that on my problem).
3) It's only human. (It pretty much says this.)
4) The next version or even the paid version is better! Lance ran it on his paid-for ChatGPT and it wrote a program to brute force it and got all 7 solutions.
5) I said ChatGPT got the problem wrong. If a student had submitted the solution it would get lots of partial credit since the solution took the right approach and only missed a few cases. So should I judge ChatGPT more harshly than a student? Yes.
The question still stands: how come ChatGPT could not do this well-defined, simple math problem?
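For the record, the brute-force route (item 4 above) really is only a few lines of Python. This is a sketch of the kind of program Lance's ChatGPT session produced, not his actual code, which is not shown in the post:

```python
# Brute-force check of ABCDE + BCDE + CDE + DE + E = 20320
# over all digit choices, with A taken to be nonzero.
solutions = []
for A in range(1, 10):
    for B in range(10):
        for C in range(10):
            for D in range(10):
                for E in range(10):
                    abcde = 10000*A + 1000*B + 100*C + 10*D + E
                    bcde = 1000*B + 100*C + 10*D + E
                    cde = 100*C + 10*D + E
                    de = 10*D + E
                    if abcde + bcde + cde + de + E == 20320:
                        solutions.append((A, B, C, D, E))

print(len(solutions))   # prints 7
print(solutions)
```

The search space is only 9 * 10^4 tuples, so this runs instantly and confirms the seven hand-found solutions.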
What model did you select? Because GPT-5.2 Thinking got it right for me.
Upon getting your comment I brought up the free ChatGPT that I use and asked it `what version of ChatGPT are you' - it said it was GPT-5.2. However, what I described in my post I did a while back, so I suspect it was 5.1. That is good news- it's getting better!
The non-reasoning versions are pretty much useless for math. You can compare it to asking a human to answer your math questions, but with the restriction that they need to answer immediately based on their gut feelings.
The models used by actual mathematicians cost $250 per month and can take 30 minutes or longer to answer a question. There are even better non-commercial models.
Which models are we talking about?
This article by professor of mathematics Daniel Litt gives a great overview of the current state of AI for mathematics: https://www.daniellitt.com/blog/2026/2/20/mathematics-in-the-library-of-babel
> How come ChatGPT could not do this well defined simple math problem.
Because ChatGPT (like all LLMs) does not understand (math or anything else).
It is like the student in college who turns in homework they have copied from their friends. Sometimes their homework will be correct.
See the YouTube video "That time we asked every #ai if we should walk to the car wash".
Me: "I need to wash my car, and the car wash is 100 meters away. Should I walk or drive?"
Gemini: "Unless you’ve been hitting the gym hard enough to carry a 4,000-pound vehicle on your back, you should probably drive."
Indeed, copying your homework means you sometimes get the homework right. But, you don't learn or understand.
There used to be a time, up until a year ago, when LLM experts thought that these models would scale to solve anything.
Nowadays the consensus has shifted: they keep making mistakes, and there is no way to fix this by scaling alone.
They are probabilistic output generators; they don't think. That is combined with a recursive self-reflection loop on top. Still, they make errors.
Tool use, like writing Python code, is a different matter. In that case you are not really asking it to solve the problem but to translate it into a Python program. These systems are very good at translation. So no surprise there.
If you really want to test them hard, you should tell them not to use tools like the Python code interpreter. That will show you what reasoning the model itself is capable of.
The ABCDE problem is a small problem. If you turn off tools and ask it to solve a more difficult version, say with 10 letters, even with pen and paper it will likely fail repeatedly, where a typical K-12 student with good math skills would not.
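To make the 10-letter remark concrete: in the n-letter version of the puzzle (the number plus all of its proper suffixes summing to a target), digit i, counting from the left, gets coefficient (i+1) * 10^(n-1-i). A recursive search with bound pruning then handles n = 10 instantly, where naive brute force over 10^10 tuples would not. This generalization and its interface are my own framing, not the commenter's:

```python
def solve(n, target):
    """All digit tuples (d_0, ..., d_{n-1}), leading digit nonzero, whose
    n-digit number plus all proper suffixes sums to target."""
    coeffs = [(i + 1) * 10 ** (n - 1 - i) for i in range(n)]
    # tail_max[i] = largest value attainable from positions i..n-1; used to prune
    tail_max = [0] * (n + 1)
    for i in range(n - 1, -1, -1):
        tail_max[i] = tail_max[i + 1] + 9 * coeffs[i]

    solutions = []

    def rec(i, remaining, digits):
        if i == n:
            if remaining == 0:
                solutions.append(tuple(digits))
            return
        lo = 1 if i == 0 else 0          # leading digit nonzero, as in the post
        for d in range(lo, 10):
            rest = remaining - d * coeffs[i]
            if 0 <= rest <= tail_max[i + 1]:
                rec(i + 1, rest, digits + [d])

    rec(0, target, [])
    return solutions

# Sanity check against the original puzzle: 5 letters, target 20320.
print(len(solve(5, 20320)))   # prints 7
```

For n = 5 the coefficients are 10000, 2000, 300, 40, 5, matching the original equation, and the pruning makes the n = 10 case (with whatever target one picks) run in well under a second.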
This is not surprising at all if you have a sense of how LLMs actually work. There are good explanations of why LLMs keep making mistakes on simple cases by people like Yann LeCun. I highly recommend checking them out if you have not seen the argument.
The IMO one is a much more complicated system; it is not comparable even to the most expensive version of ChatGPT. Those systems are more like AlphaGo, with a proof checker plus branching, backtracking, and tree search guided by lots of hand-crafted heuristics.
Look, the architecture and mechanism of how our brain works remain unknown as of yet, or are you telling us that you know how your thoughts are generated? It does not seem far-fetched to argue that the firing of synapses, and the way things “click” and interact, are most likely probabilistic in our brain.
Anon 5:04
Not sure how that is relevant. We don't need to understand the human brain to understand the shortcomings of these systems.
It is relevant because that's what we are comparing it to, assuming mathematicians know how to employ reasoning correctly. Mathematicians are not aliens but human beings with brains. So, kinda relevant.
We might not know how the human brain works, but that is not needed to understand the shortcomings of the current models.
That doesn't mean AI can't, in the future, achieve performance similar to humans in many tasks like mathematics, but the consensus that is emerging is that the LLM architecture has a fundamental shortcoming.
You can see the assessments of experts on the topic like Yann LeCun, and more recently even former proponents of scaling being sufficient for AGI, like Ilya Sutskever, and even commercial lab leaders like Google DeepMind's CEO Demis Hassabis.
It used to be that when people criticized shortcomings of the LLM architecture, the hypers would reply defensively, in a manner similar to your comment.
But more recently, now that Ilya and Demis themselves admit that the issue of LLMs failing on simple cases is not fixable by more scaling and needs fundamentally new ideas, I think we should consider the issue settled.
Last week Claude repeatedly confused the unit 1-disk and the unit interval, in different sessions. And yet it has proved nontrivial theorems for me. These tend to be things that involve lots of bookkeeping. It also lies on most linear algebra problems that have resisted my attempts.
I'm hard-pressed to accept that brute-forcing through 10^5 candidates to fit an equation is a measure of intelligence (I initially thought {A, ..., E} were bona fide integers and was like wow, this sure looks like a breakthrough result; then nearly fell out of my chair when I realized they're just digits!).
What then fits the bill? It is pretty easy to illustrate using Liu Hui's π algorithm (https://en.wikipedia.org/wiki/Liu_Hui%27s_%CF%80_algorithm): the idea of approximating circles with regular polygons was well established long before his time, but working around the requirement to calculate square roots of irrationals (extremely tedious in his time, and even today with pen and paper) to achieve 3.141024 < π < 3.142704 was recognized as a landmark achievement whose ideas were further developed by subsequent mathematicians.
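Those bounds are easy to replay today. For polygons inscribed in the unit circle, the side-doubling recurrence is s_2n = sqrt(2 - sqrt(4 - s_n^2)), the area of the inscribed 2n-gon is n*s_n/2, and Liu Hui's inequality A_2n < π < A_2n + (A_2n - A_n) recovers roughly his 192-gon bounds. A sketch in floating point, where he had to extract each square root by hand:

```python
import math

s, n = 1.0, 6               # side of the inscribed hexagon, unit circle
areas = []                  # areas[k] = area of the inscribed (12 * 2**k)-gon
for _ in range(5):          # five doublings: 6 -> 192 sides
    areas.append(n * s / 2)             # area of the 2n-gon is n * s_n / 2
    s = math.sqrt(2 - math.sqrt(4 - s * s))
    n *= 2

a96, a192 = areas[-2], areas[-1]
lower, upper = a192, a192 + (a192 - a96)    # Liu Hui's inequality
print(lower, upper)         # roughly 3.14103 < pi < 3.14271
```

The output matches his 3.141024 < π < 3.142704 up to the rounding he had to do in his hand calculations.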
I wonder how amazed Liu Hui would be by a machine's ability to compute 105 trillion digits; he might scratch his head over the point of the actual digits themselves, but he would certainly appreciate the algorithms, techniques, and technology doing the computation (all developed by humans without AI).
Are two or more distinct letters allowed to represent the same digit?
This comment has been removed by the author.
YES, I did allow different letters to map to the same digit.
Bill asked me about this back in November. Probably shouldn't post about what ChatGPT can't do if you wait three months.
Is writing a program to search for the digits cheating? The LLM is just doing what I asked it to do. But I asked ChatGPT to solve it logically and it did just fine.
Maybe someone posted the logical approach online and the AI crawled it (the problem was posted on LinkedIn 3 months ago and so has been disseminated widely). I don't have the information to prove or disprove this contention.
Sorry Lance, you didn't ask an LLM; you asked a much more complicated system that is using an LLM as a part of it.
They are not the same.
Over time it is expected that they will get better. But the fundamental problem that keeps LLMs from solving such problems makes it easy to generate simple problems that they fail on.
I would suggest you talk to some folks working on improving the quality of these LLM-based systems these days. It is essentially a lot of small hacks, not scaling, and that is the reason they perform well on benchmarks but then fail on simple instances of the problems. You can get them to do 80% of typical simple tasks, but it is exponentially hard to get them to do 99% of typical simple tasks.
We see this in building actual software using these systems, like customer-support question answering based on a small fixed set of documents.
Hold on. Wait. This is not ChatGPT; it seems to be Gemini. What am I missing?
ChatGPT also failed to give a (Java) regular expression for all words containing exactly one occurrence of "::".
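For the record, such a regex exists. One construction (mine, not ChatGPT's) uses negative lookaheads, and the identical pattern syntax works in java.util.regex. Here "occurrence" counts starting positions, so overlapping cases like ":::" count as two and are rejected:

```python
import re

# Exactly one occurrence of "::": no "::" may start anywhere in the prefix,
# then the single "::", then no ':' immediately after it (rules out overlaps
# such as ":::"), and no "::" may start anywhere in the suffix.
EXACTLY_ONE = re.compile(r"(?:(?!::).)*::(?!:)(?:(?!::).)*")

for s in ["a::b", "::", "x::y::z", ":::", "ab"]:
    print(s, bool(EXACTLY_ONE.fullmatch(s)))
# a::b and :: match; x::y::z, :::, and ab do not
```

In Java the equivalent would be s.matches("(?:(?!::).)*::(?!:)(?:(?!::).)*"), since String.matches anchors the pattern implicitly, just as re.fullmatch does here.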
I have successfully missed the solutions with E=0, since "duh, 4D must be divisible by 4 and 2 isn't" :/
My take would be that reasoning is a very brittle process (one mistake crumbles the whole thing) and LLM-based reasoning tools don't guarantee success (each being, on its own, a probabilistic approximator).
One interesting way to approach the question of reliability is to ask what that expectation is based on. Most likely we're all thinking of how computers can reliably do computations, also a notoriously brittle process. And that points toward the main difference: we have already gotten computation into a suitable form that achieves perfect reliability. Transforming reasoning into a "perfect form," such as a Lean-checked result, would likewise alleviate failures, alas at much higher cost, until we figure out better forms for it.
https://www-cs-faculty.stanford.edu/%7Eknuth/papers/claude-cycles.pdf