Wednesday, September 17, 2025

What is "PhD-Level Intelligence"?

When announcing OpenAI's latest release last month, Sam Altman said "GPT-5 is the first time that it really feels like talking to an expert in any topic, like a PhD-level expert." Before we discuss whether GPT-5 got there, what does "PhD-level intelligence" even mean?

We could just dismiss the idea, but I'd rather try to formulate a reasonable definition based on what I would expect from a good PhD student. It's not about knowing facts, which we can always look up, but about the ability to talk about and engage with current research. Here is my suggestion.

The ability to understand a (well-presented) research paper or talk in the field.

The word "field" has narrowed over time as knowledge has become more specialized, but since the claim is that GPT-5 is an expert over all fields, that doesn't matter. The word "understand" causes more problems: it is hard to define for humans, let alone machines.

In many PhD programs, there's an oral exam for which we give the candidate a list of research papers in advance, and they are expected to answer questions about those papers. If we claim an LLM has "PhD-level" knowledge, I'd expect the LLM to pass this test.

Does GPT-5 get there? I did an experiment with two recent papers, one showing that Dijkstra's algorithm is optimal and another showing that it is not. I used the GPT-5 Thinking model, and GPT-5 Pro for the last question about new directions. The answers were a little more technical than I would have liked, but it would likely have passed the oral exam. A good PhD student might work harder to build an intuitive picture of the paper in order to understand it and, later on, extend it.

You could ask for far more: getting a PhD requires significant original research, and LLMs for the most part haven't gotten there (yet). I've not had luck getting any large language model to make real progress on open questions, and I haven't seen many successful examples from other people trying to do the same.

So while large language models might have PhD-level expertise, they can't replace PhD students who actually do the research.

6 comments:

  1. I think the purpose of the oral exam is to test understanding. This works pretty well for humans, since humans can't remember lots of stuff without understanding it. By the way, when I took my orals, it was based on entire courses, not specific papers.

  2. I think it's reasonably easy to define it: PhD-level intelligence is an entity who can be directed to write a PhD thesis (I'll take math/TCS as the illustrating example, the story for humanities is an exercise for the reader). You tell the bot "this is some research paper, try to generalize it to this case", then the bot asks the kinda clarifying questions a bright student might ask, then the bot writes the paper with the generalized theorem.

    Of course, this would depend on the PhD advisor, as a Fields Medalist might ask the student to do less "incremental"-ish contributions, but I would be satisfied with the minimum effort PhD thesis that contains results publishable in non-predatory math journals.

  3. Is there a sentence missing from the paragraph that begins with "Does"? There is a period after the word "about" that makes me wonder.

  4. I similarly interpret the "PhD-level intelligence" claims as being a bit overblown... though not by much. Similar to your observation, I think GPT-5 and Claude Opus 4.1 are as skilled as any PhD student in what they can accomplish over a very short time horizon. But the key with these models now is getting them to be coherent over a long time horizon. Even the best PhD student requires several years to hammer away on a few problems in a coherent way, slowly learning and building ideas. In any 30-minute conversation, GPT-5 and Claude 4 seem to me to be as insightful and skilled as any good PhD student. But the ability to focus on one single problem for months to produce a paper, and then to repeat that a few more times to keep learning and cement the skills, is currently out of reach.

    It's similar to Claude Code being very useful for implementing certain features in software quickly, and with guidance from a human software engineer who's paying close attention and course-correcting, one can build a large complex app faster than one human alone. But currently it seems to be out of reach for the LLM to build the large, complex app completely autonomously, for similar reasons that it cannot focus long enough on one mathematical problem to do original paper-worthy research.

    I know this "time-horizon" idea is a focus of the major labs (https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/) so I'll be curious whether the next versions will continue making progress on time horizons. Even if current models lose focus after a short time, I don't see any argument for why it's fundamentally difficult to scale time horizons to months, though it may be fundamentally difficult with the current transformer architecture, whose computation cost grows quadratically with context window length. So perhaps some new architectural discovery is needed for this problem to be solved.

  5. Hard to say what entails a good oral exam. Recall Terry Tao's disappointing performance during his oral exam? What was he asked, and why was it a disappointment? Would ChatGPT have performed better?

    There's a real risk of performance and brain activity atrophy using ChatGPT; we've seen analogous technologies that supposedly "help/assist/support" tedious or menial tasks and really all they did is actually rob us of certain brain activities that have critical (yet un-quantifiable) second and third order impact on critical/creative thinking.

    Getting back to your point, I'd assume that ChatGPT can actually help the clueless PhD student in attacking open problems. Surely it won't provide a flawless or even correct resolution of famous conjectures, but "smaller problems", perhaps more concrete ones, it might be able to crack, and hence deprive you of your mental workout! Welcome, Miss Atrophia!

    What about Solitude is the true school of genius ... in which Chat did this phrase walk out of the room?
