A, B, C, D, E are digits (the poster said A could be 0 but I took A to be nonzero) such that
ABCDE + BCDE + CDE + DE + E = 20320.
(CLARIFICATION ADDED LATER: We allow two letters to map to the same digit.)
I solved it completely by hand. You can try it yourself or look at my solution which is here. I found seven solutions.
I THEN asked ChatGPT to give me all solutions to see if I missed any.
I had it backwards. ChatGPT missed some solutions. The entire exchange between chatty and me is here.
I asked it how it could get it wrong and how I can trust it. It responded to that and follow-up questions intelligently.
Note that the problem is NOT a Putnam problem or anything of the sort. But I've read that AI can solve Putnam problems. SO, without an ax to grind, I am curious: how come ChatGPT got the ABCDE problem wrong?
Speculative answers
1) The statement that AI has solved IMO problems refers to an AI that was specially trained for such competition problems, not the free ChatGPT. For more issues with the AI-IMO results see Terry Tao's comments here.
2) ChatGPT is really good when the answer to the question is on the web someplace or can even be reconstructed from what's on the web. But if a problem, even an easy one, is new to the web, it can hallucinate (it didn't do that on my problem but it did on muffin problems) or miss some cases (it did that on my problem).
3) It's only human. (It pretty much says this.)
4) The next version or even the paid version is better! Lance ran it on his paid-for ChatGPT and it wrote a program to brute force it and got all 7 solutions.
5) I said ChatGPT got the problem wrong. If a student had submitted the solution it would get lots of partial credit since the solution took the right approach and only missed a few cases. So should I judge ChatGPT more harshly than a student? Yes.
The question still stands: how come ChatGPT could not do this well-defined, simple math problem?
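For the record, the brute-force route (item 4 above) really is only a few lines of Python. This is a sketch of the kind of program Lance's ChatGPT session produced, not his actual code, which is not shown in the post:

```python
# Brute-force check of ABCDE + BCDE + CDE + DE + E = 20320
# over all digit choices, with A taken to be nonzero.
solutions = []
for A in range(1, 10):
    for B in range(10):
        for C in range(10):
            for D in range(10):
                for E in range(10):
                    abcde = 10000*A + 1000*B + 100*C + 10*D + E
                    bcde = 1000*B + 100*C + 10*D + E
                    cde = 100*C + 10*D + E
                    de = 10*D + E
                    if abcde + bcde + cde + de + E == 20320:
                        solutions.append((A, B, C, D, E))

print(len(solutions))   # prints 7
print(solutions)
```

The search space is only 9 * 10^4 tuples, so this runs instantly and confirms the seven hand-found solutions.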
What model did you select? Because GPT-5.2 Thinking got it right for me.
Upon getting your comment I brought up the free ChatGPT that I use and asked it `what version of ChatGPT are you' - it said it was GPT-5.2. However, what I described in my post I did a while back, so I suspect it was 5.1. That is good news- it's getting better!
The non-reasoning versions are pretty much useless for math. You can compare it to asking a human to answer your math questions, but with the restriction that they need to answer immediately based on their gut feelings.
The models used by actual mathematicians cost $250 per month and can take 30 minutes or longer to answer a question. There are even better non-commercial models.
Which models are we talking about?
This article by professor of mathematics Daniel Litt gives a great overview of the current state of AI for mathematics: https://www.daniellitt.com/blog/2026/2/20/mathematics-in-the-library-of-babel
> How come ChatGPT could not do this well defined simple math problem.
Because ChatGPT (like all LLMs) does not understand (math or anything else).
It is like the student in college who turns in homework they have copied from their friends. Sometimes their homework will be correct.
See the YouTube video "That time we asked every #ai if we should walk to the car wash".
Me: "I need to wash my car, and the car wash is 100 meters away. Should I walk or drive?"
Gemini: "Unless you’ve been hitting the gym hard enough to carry a 4,000-pound vehicle on your back, you should probably drive."
Indeed, copying your homework means you sometimes get the homework right. But, you don't learn or understand.
There used to be a time, up until a year ago, when LLM experts thought that these models would scale to solve anything.
Nowadays the consensus has shifted: they keep making mistakes, and there is no way to fix this by scaling alone.
They are probabilistic output generators; they don't think. That is combined with a recursive self-reflection loop on top. Still, they make errors.
Tool use, like writing Python code, is a different matter. In that case you are not really asking it to solve the problem but to translate it into a Python program. These systems are very good at translation. So no surprise there.
If you really want to test them hard, you should tell them not to use tools like the Python code interpreter. That will show you what reasoning the model itself is capable of.
The ABCDE problem is a small problem. If you turn off tools and ask it to solve a more difficult version, say with 10 letters, even with pen and paper it will likely fail repeatedly, where a typical K-12 student with good math skills would not.
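To make the 10-letter remark concrete: in the n-letter version of the puzzle (the number plus all of its proper suffixes summing to a target), digit i, counting from the left, gets coefficient (i+1) * 10^(n-1-i). A recursive search with bound pruning then handles n = 10 instantly, where naive brute force over 10^10 tuples would not. This generalization and its interface are my own framing, not the commenter's:

```python
def solve(n, target):
    """All digit tuples (d_0, ..., d_{n-1}), leading digit nonzero, whose
    n-digit number plus all proper suffixes sums to target."""
    coeffs = [(i + 1) * 10 ** (n - 1 - i) for i in range(n)]
    # tail_max[i] = largest value attainable from positions i..n-1; used to prune
    tail_max = [0] * (n + 1)
    for i in range(n - 1, -1, -1):
        tail_max[i] = tail_max[i + 1] + 9 * coeffs[i]

    solutions = []

    def rec(i, remaining, digits):
        if i == n:
            if remaining == 0:
                solutions.append(tuple(digits))
            return
        lo = 1 if i == 0 else 0          # leading digit nonzero, as in the post
        for d in range(lo, 10):
            rest = remaining - d * coeffs[i]
            if 0 <= rest <= tail_max[i + 1]:
                rec(i + 1, rest, digits + [d])

    rec(0, target, [])
    return solutions

# Sanity check against the original puzzle: 5 letters, target 20320.
print(len(solve(5, 20320)))   # prints 7
```

For n = 5 the coefficients are 10000, 2000, 300, 40, 5, matching the original equation, and the pruning makes the n = 10 case (with whatever target one picks) run in well under a second.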
This is not surprising at all if you have a sense of how LLMs actually work. There are good explanations of why LLMs keep making mistakes on simple cases by people like Yann LeCun. I highly recommend checking them out if you have not seen the argument.
The IMO one is a much more complicated system; it is not comparable even to the most expensive version of ChatGPT. Those systems are more like AlphaGo, with a proof checker plus branching, backtracking, and tree search guided by lots of hand-crafted heuristics.
Look, the architecture and mechanism of how our brain works remain unknown as of yet, or are you telling us that you know how your thoughts are generated? It does not seem far-fetched to argue that the firing of synapses, and the way things “click” and interact, are most likely probabilistic in our brain.
Anon 5:04
Not sure how that is relevant. We don't need to understand the human brain to understand the shortcomings of these systems.
It is relevant because that's what we are comparing it to, assuming mathematicians know how to employ reasoning correctly. Mathematicians are not aliens but human beings with brains. So, kinda relevant.
We might not know how the human brain works, but that is not needed to understand the shortcomings of the current models.
That doesn't mean AI can't, in the future, achieve performance similar to humans in many tasks like mathematics, but the consensus that is emerging is that the LLM architecture has a fundamental shortcoming.
You can see the assessments of experts on the topic like Yann LeCun, and more recently even former proponents of scaling being sufficient for AGI, like Ilya Sutskever, and even commercial lab leaders like Google DeepMind's CEO Demis Hassabis.
It used to be that when people criticized shortcomings of the LLM architecture, the hypers would reply defensively, in a manner similar to your comment.
But more recently, now that Ilya and Demis themselves admit that the issue of LLMs failing on simple cases is not fixable by more scaling and needs fundamentally new ideas, I think we should consider the issue settled.
Last week Claude repeatedly confused the unit 1-disk and the unit interval, in different sessions. And yet it has proved nontrivial theorems for me. These tend to be things that involve lots of bookkeeping. It also lies on most linear algebra problems that have resisted my attempts.
I'm hard-pressed to accept that brute-forcing through 10^5 candidates to fit an equation is a measure of intelligence (I initially thought {A, ..., E} were bona fide integers and was like wow, this sure looks like a breakthrough result; then nearly fell out of my chair when I realized they're just digits!).
What then fits the bill? It is pretty easy to illustrate using Liu Hui's π algorithm (https://en.wikipedia.org/wiki/Liu_Hui%27s_%CF%80_algorithm): the idea of approximating circles with regular polygons was well established long before his time, but working around the requirement to calculate square roots of irrationals (extremely tedious in his time, and even today with pen and paper) to achieve 3.141024 < π < 3.142704 was recognized as a landmark achievement whose ideas were further developed by subsequent mathematicians.
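Those bounds are easy to replay today. For polygons inscribed in the unit circle, the side-doubling recurrence is s_2n = sqrt(2 - sqrt(4 - s_n^2)), the area of the inscribed 2n-gon is n*s_n/2, and Liu Hui's inequality A_2n < π < A_2n + (A_2n - A_n) recovers roughly his 192-gon bounds. A sketch in floating point, where he had to extract each square root by hand:

```python
import math

s, n = 1.0, 6               # side of the inscribed hexagon, unit circle
areas = []                  # areas[k] = area of the inscribed (12 * 2**k)-gon
for _ in range(5):          # five doublings: 6 -> 192 sides
    areas.append(n * s / 2)             # area of the 2n-gon is n * s_n / 2
    s = math.sqrt(2 - math.sqrt(4 - s * s))
    n *= 2

a96, a192 = areas[-2], areas[-1]
lower, upper = a192, a192 + (a192 - a96)    # Liu Hui's inequality
print(lower, upper)         # roughly 3.14103 < pi < 3.14271
```

The output matches his 3.141024 < π < 3.142704 up to the rounding he had to do in his hand calculations.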
I wonder how amazed Liu Hui would be by a machine's ability to compute 105 trillion digits; he might scratch his head over the point of the actual digits themselves, but he would certainly appreciate the algorithms, techniques, and technology doing the computation (all developed by humans without AI).
Are two or more distinct letters allowed to represent the same digit?
This comment has been removed by the author.
YES, I did allow different letters to map to the same digit.
Bill asked me about this back in November. Probably shouldn't post about what ChatGPT can't do if you wait three months.
Is writing a program to search for the digits cheating? The LLM is just doing what I asked it to do. But I asked ChatGPT to solve it logically and it did just fine.
Maybe someone posted the logical approach online and the AI crawled it (the problem was posted on LinkedIn 3 months ago and so has been disseminated widely). I don't have the information to prove or disprove this contention.
Sorry Lance, you didn't ask an LLM; you asked a much more complicated system that is using an LLM as a part of it.
They are not the same.
Over time it is expected that they will get better. But the fundamental problem that keeps LLMs from solving such problems makes it easy to generate simple problems that they fail on.
I would suggest you talk to some folks working on improving the quality of these LLM-based systems these days. It is essentially a lot of small hacks, not scaling, and that is the reason they perform well on benchmarks but then fail on simple instances of the problems. You can get them to do 80% of typical simple tasks, but it is exponentially hard to get them to do 99% of typical simple tasks.
We see this in building actual software using these systems, like customer-support question answering based on a small fixed set of documents.
Hold on. Wait. This is not ChatGPT; it seems to be Gemini. What am I missing?
ChatGPT also failed to give a (Java) regular expression for all words containing exactly one occurrence of "::".
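For the record, such a regex exists. One construction (mine, not ChatGPT's) uses negative lookaheads, and the identical pattern syntax works in java.util.regex. Here "occurrence" counts starting positions, so overlapping cases like ":::" count as two and are rejected:

```python
import re

# Exactly one occurrence of "::": no "::" may start anywhere in the prefix,
# then the single "::", then no ':' immediately after it (rules out overlaps
# such as ":::"), and no "::" may start anywhere in the suffix.
EXACTLY_ONE = re.compile(r"(?:(?!::).)*::(?!:)(?:(?!::).)*")

for s in ["a::b", "::", "x::y::z", ":::", "ab"]:
    print(s, bool(EXACTLY_ONE.fullmatch(s)))
# a::b and :: match; x::y::z, :::, and ab do not
```

In Java the equivalent would be s.matches("(?:(?!::).)*::(?!:)(?:(?!::).)*"), since String.matches anchors the pattern implicitly, just as re.fullmatch does here.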
I have successfully missed the solutions with E=0, since "duh, 4D must be divisible by 4 and 2 isn't" :/
My take would be that reasoning is a very brittle process (one mistake crumbles the whole thing) and LLM-based reasoning tools don't guarantee success (each being, on its own, a probabilistic approximator).
One interesting way to approach the question of reliability is to ask what that expectation is based on. Most likely we're all thinking of how computers can reliably do computations, also a notoriously brittle process. And that points toward the main difference: we have already gotten computation into a suitable form that achieves perfect reliability. Transforming reasoning into a "perfect form," such as a Lean-checked result, would likewise alleviate failures, alas at much higher cost, until we figure out better forms for it.
https://www-cs-faculty.stanford.edu/%7Eknuth/papers/claude-cycles.pdf