Wednesday, September 17, 2025

What is "PhD-Level Intelligence"?

When announcing OpenAI's latest release last month, Sam Altman said "GPT-5 is the first time that it really feels like talking to an expert in any topic, like a PhD-level expert." Before we discuss whether GPT-5 got there, what does "PhD-level intelligence" even mean?

We could just dismiss the idea, but I'd rather try to formulate a reasonable definition based on what I would expect from a good PhD student. It's not about knowing stuff, which we can always look up, but about the ability to talk and engage about current research. Here is my suggestion.

The ability to understand a (well-presented) research paper or talk in the field.

The word "field" has narrowed over time as knowledge has become more specialized, but since the claim is that GPT-5 is an expert over all fields, that doesn't matter. The word "understand" causes more problems, it is hard to define for humans let alone machines.

In many PhD programs, there's an oral exam: we give the candidate a list of research papers in advance, and they are expected to answer questions about those papers. If we claim an LLM has "PhD-level" knowledge, I'd expect the LLM to pass this test.

Does GPT-5 get there? I did an experiment with two recent papers, one showing Dijkstra's algorithm is optimal and another showing Dijkstra is not optimal. I used the GPT-5 Thinking model, and GPT-5 Pro on the last question about new directions. The answers were a little more technical than I would have liked, but it would likely have passed the oral exam. A good PhD student may work harder to get a more intuitive picture of a paper in order to understand it, and later on extend it.

You could ask for far more: getting a PhD requires significant original research, and LLMs for the most part haven't gotten there (yet). I've not had luck getting any large language model to make real progress on open questions, and I haven't seen many successful examples from other people trying to do the same.

So while large language models might have PhD-level expertise, they can't replace PhD students who actually do the research.

19 comments:

  1. I think the purpose of the oral exam is to test understanding. This works pretty well for humans, since humans can't remember lots of stuff without understanding it. By the way, when I took my orals, they were based on entire courses, not specific papers.

  2. I think it's reasonably easy to define: PhD-level intelligence is an entity that can be directed to write a PhD thesis (I'll take math/TCS as the illustrating example; the story for the humanities is an exercise for the reader). You tell the bot "this is some research paper, try to generalize it to this case", then the bot asks the kind of clarifying questions a bright student might ask, then the bot writes the paper with the generalized theorem.

    Of course, this would depend on the PhD advisor, as a Fields Medalist might ask the student to make less "incremental"-ish contributions, but I would be satisfied with the minimum-effort PhD thesis that contains results publishable in non-predatory math journals.

  3. The newer Dijkstra paper cites the older one, so I would expect an LLM to adequately navigate the tension there.

  4. Is there a sentence missing from the paragraph that begins with "Does"? There is a period after the word "about" that makes me wonder.

  5. I similarly interpret the "PhD-level intelligence" claims as being a bit overblown... though not by much. Similar to your observation, I think GPT5 and Claude Opus 4.1 are as skilled as any PhD student in what they can accomplish over a very short time horizon. But the key with these models now is getting them to be coherent over a long time horizon. Even the best PhD student requires several years to hammer away at a few problems in a coherent way, slowly learning and building ideas. In any 30-minute conversation, GPT5 and Claude 4 seem to me to be as insightful and skilled as any good PhD student. But the ability to focus on one single problem for months to produce a paper, and then to repeat that a few more times to keep learning and cement the skills, is currently out of reach.

    It's similar to Claude Code being very useful for implementing certain features in software quickly, and with guidance from a human software engineer who's paying close attention and course-correcting, one can build a large complex app faster than one human alone. But currently it seems to be out of reach for the LLM to build the large, complex app completely autonomously, for similar reasons that it cannot focus long enough on one mathematical problem to do original paper-worthy research.

    I know this "time-horizon" idea is a focus of the major labs (https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/) so I'll be curious if the next versions will continue making progress on time-horizons. Even if current models lose focus after a short time, certainly I don't see any argument why it's fundamentally difficult to scale time horizons to months, though maybe fundamentally difficult with the current transformer architecture, with its quadratic dependence of computation speed on context window length. So perhaps some new architectural discovery is needed for this problem to be solved.

  6. Hard to say what entails a good oral exam. Recall Terry Tao's disappointing performance during his oral exam? What was he asked, and why was it a disappointment? Would ChatGPT have performed better?

    There's a real risk of performance and brain-activity atrophy in using ChatGPT; we've seen analogous technologies that supposedly "help/assist/support" tedious or menial tasks, and really all they did was rob us of certain brain activities that have critical (yet unquantifiable) second- and third-order impacts on critical/creative thinking.

    Getting back to your point, I'd assume that ChatGPT can actually help the clueless PhD student in attacking open problems. Surely it won't provide a flawless or even correct resolution of famous conjectures, but smaller, perhaps more concrete, problems it might be able to crack and hence deprive you of your mental workout! Welcome, Miss Atrophia!

    What about "Solitude is the true school of genius"... with Chat in the room, did this phrase walk out of it?

    Replies
    1. I believe Terry Tao admitted that he wasn't properly prepared for his orals.

    2. I'm not sure about the actual validity of this statement - but you are correct - he indeed shared this view retrospectively.
      It's just a hard pill to swallow for someone of his calibre/talent (even back then, as a PhD candidate).
      Be that as it may, ChatGPT is always ready: an all-weather, proven, robust PhD oral exam killer!

  7. "The word "understand" causes more problems, it is hard to define for humans let alone machines."

    Sure. But. We normally think of "understanding", at its absolute minimum, to be some sort of _symbolic_ reasoning (that is, reasoning about named concepts*), and LLMs don't do that. If you ask an LLM to multiply 2 integers, the probability of the answer being wrong increases with the length of the integers. Huh? Is this a joke? The LLM technology can't deal with multiplication as a concept, because it doesn't deal with concepts (it deals with tokens). So they look up the answer. In no reasonable sense of the word "understand" do LLMs "understand multiplication" (a sketch of the kind of explicit procedure at issue appears at the end of this comment).

    Ditto on rotations. (See the discussion of tic tac toe on Andrew Gelman's blog.) Long story short: if you can't reason about rotations, you can't do group theory, which is rather basic math, so it's insane to claim that LLMs "solve math problems".

    As always, if you ask "What can X do?" (or "Can X do Y?") the answer should always be informed by a basic understanding of the operation of X, and, in particular, shouldn't be something X cannot, in principle, do. (Or has been demonstrated not to do.)

    *: Again, back in the 70s and 80s we thought we were figuring out how to do this. It turned out to be harder than we thought. The historical position of the LLM technology is to get around this failure without actually doing the work of figuring out how to do symbolic conceptual reasoning. Which is why some of us see it as an off-ramp from any sort of reasonable path towards progress on understanding what understanding actually is.
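
    For concreteness, a minimal sketch of the kind of explicit, digit-by-digit procedure being contrasted with token prediction here: grade-school long multiplication, whose number of steps grows with the length of the inputs, checked against Python's built-in integer arithmetic.

      # Grade-school long multiplication as an explicit, step-by-step procedure:
      # take one digit of b at a time, form a partial product, shift, and add.

      def long_multiply(a: int, b: int) -> int:
          digits_of_b = [int(d) for d in str(b)][::-1]   # least-significant first
          total = 0
          for place, digit in enumerate(digits_of_b):
              partial = a * digit                   # single-digit partial product
              total += partial * (10 ** place)      # shift by the digit's place value
          return total

      for a, b in [(1234, 5678), (31415926535, 2718281828)]:
          assert long_multiply(a, b) == a * b       # check against built-in bignums
          print(f"{a} x {b} = {long_multiply(a, b)}")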

    Replies
    1. If we required PhD students to multiply large numbers in their heads, most if not all would fail. But we let them use calculators, and LLMs that offload these calculations multiply just fine (a sketch of this offloading pattern appears after these replies). So I'm not sure why you consider multiplication some sort of test of mathematical understanding.

    2. I think Lance has made an argument here, but I equally think the post by E makes a nuanced argument. The point is not that PhDs are good, or even perfect, at multiplying large numbers, nor that they are outsourcing unquantifiable yet crucial mental activity to devices... but that they have actually exercised these mental faculties initially, and that "they" are present and can be engaged at will, if need be.

    3. But it's not the LLM that offloads the calculation; it's a kludged front/rear end that checks for LLM stupidities. The LLM can't, in principle, even do that, because of the way it works.

      But it is the grad student (or me as early as 1973) who knows how to find a bignum implementation if needed.

      (Remember, there's a humongous amount of human work that goes into the chatbot to prevent the underlying LLM from going off the rails. There's a whole industry of "jailbreaking", inventing cutesy prompts that get around those kludges to persuade the LLM to do something someone thought to be problematic.)

    4. We consider multiplication a test because we expect Ph.D. students to understand multiplication. Not only do LLMs not understand it, but they don't realize that they don't understand it. In fact, from what we know of their architecture, it is extremely unlikely that they could understand it. They are undoubtedly very good at the imitation game (Turing's game in his 1950 paper). So, we should be careful when playing that game with them.

    5. David M., in an otherwise sensible post, wrote:

      "They are undoubtedly very good at the imitation game (Turing's game in his 1950 paper). "

      https://courses.cs.umbc.edu/471/papers/turing.pdf

      I wonder if anyone has actually tried to get a chatbot to play this game. (I suppose the LLM folks could, as they do for everything else, provide a ton of faked data for folks playing this game, so the LLM could regurgitate decent answers.)

      "We now ask the question, "What will happen when a machine takes the part of A in this
      game?" Will the interrogator decide wrongly as often when the game is played like this as
      he does when the game is played between a man and a woman? These questions replace
      our original, "Can machines think?" "

    6. It turns out I'm wrong about LLM front/rear ends. There have been chatbots that do that (Bing's early thingy), but nowadays what they do is provide a ton of faked data so the LLM itself finds the right answer since the probabilities of the faked stuff are higher than what it would have randomly generated otherwise. Just as ridiculous, but different.
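
      To make the offloading pattern in this thread concrete, here is a minimal sketch with a hard-coded stub in place of a real model: the host program scans the model's text for a structured calculator request, does the arithmetic exactly, and splices the result back in. The model_reply stub and the CALC(...) convention are assumptions for illustration, not any vendor's actual interface.

        import re

        # Sketch of tool offloading: the host program, not the model, does the math.
        # model_reply is a hard-coded stand-in for a real model call.

        def model_reply(prompt: str) -> str:
            return "To answer, I need CALC(31415926535 * 2718281828)."

        def run_with_calculator(prompt: str) -> str:
            reply = model_reply(prompt)
            match = re.search(r"CALC\((\d+)\s*\*\s*(\d+)\)", reply)
            if match:                                    # offload the multiplication
                x, y = int(match.group(1)), int(match.group(2))
                return reply.replace(match.group(0), str(x * y))
            return reply                                 # nothing to offload

        print(run_with_calculator("What is 31415926535 times 2718281828?"))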

  8. The commercially available ones might not, but private ones, which are much more compute-intensive, are getting there, e.g. DeepMind's co-scientist.

    The idea behind them is to generate a bunch of ideas and then do a tree search based on them. It is essentially AlphaGo's algorithm adapted to use an LLM (a toy sketch of this propose-and-prune loop appears at the end of this comment). You still need a verifier to discard bad branches, but overall, with a lot of compute, I would not be surprised if they can solve new problems.

    As for whether it is going to be cheaper than hiring a PhD student, they don't seem to be close to that. But costs in AI have been going down fast, so who knows what we will have 2 years from now.

    The same is true of programming, if you can have a verifier. The problem, of course, is that program verification is computationally hard, and proof search for correctness using the same technique is still very expensive.

    When DeepMind says they used their systems to solve some new problem, what they mean is that they spent a large chunk of Google's massive compute resources for a few months, with hundreds of top-notch engineers and ML researchers helping it to have a few successful runs.
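
    A minimal sketch of the propose-and-prune loop described above, with stand-in functions: propose plays the role of the LLM suggesting candidate next steps, verify plays the role of the verifier, and only the best-scoring branches survive each round. The toy target and scoring are assumptions purely for illustration.

      # Toy propose-and-prune tree search: propose stands in for an LLM suggesting
      # next steps, verify for a verifier; weak branches are discarded each round.

      TARGET = "dijkstra"

      def propose(partial: str) -> list[str]:
          # Stand-in for the LLM: extend a partial solution in several candidate ways.
          return [partial + c for c in "abcdijkrst"]

      def verify(candidate: str) -> int:
          # Stand-in verifier: length of the correct prefix of the target.
          score = 0
          for got, want in zip(candidate, TARGET):
              if got != want:
                  break
              score += 1
          return score

      def beam_search(beam_width: int = 3) -> str:
          beam = [""]
          for _ in range(len(TARGET)):
              candidates = [c for partial in beam for c in propose(partial)]
              candidates.sort(key=verify, reverse=True)   # prune weak branches
              beam = candidates[:beam_width]
              if beam[0] == TARGET:
                  break
          return beam[0]

      print(beam_search())   # reaches "dijkstra" because the verifier guides the search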

  9. ChatGPT 5 was pretty underwhelming overall.

    There is a lot of hype in the industry right now; after all, they are making billions from investors by selling the dream of AGI.

    The model quality itself seems to have mostly plateaued over the past year; GPT-5 is not much better than GPT-4 or o3.

    That is why there is a lot of interest in using the models in certain new ways, like thinking models (which are essentially the same model, generating output and then being used to comment on what it generated in a loop, etc.).

  10. "Logic will get you from A to B. Imagination will take you everywhere."
    Albert Einstein.

    Can AI imagine?
