ChatGPT answers physics questions like a confused C student

The first thing you’ll notice when you ask ChatGPT a question is how smart and knowledgeable its answer sounds. It identifies the proper topic, speaks in intelligible sentences, and employs the expert tone of an educated human. The million-dollar question is: Does the AI give correct answers?

While ChatGPT (or any other chatbot) is obviously not sentient, its output is reminiscent of a person in certain ways. That’s not surprising, given that it mimics human language patterns. I’ve described ChatGPT as a parrot watching a million years of soap operas. The AI is very good at stringing together sentences simply because it has seen so many of them — it just doesn’t understand them.

Still, given its demonstrated abilities, such as acing a microbiology quiz, I asked ChatGPT a battery of physics questions, ranging from relatively simple undergraduate subjects to specialized expert topics. I wasn’t interested in its ability to recite information or crunch numbers. (You can ask WolframAlpha or a search engine to do that.) Instead, I wanted to see if ChatGPT could interpret and give useful responses to the kinds of questions that a human specialist might be expected to answer.

A mediocre C student

All told, ChatGPT’s performance wasn’t up to par for an expert. It reminded me of a hardworking C student: one who doesn’t understand the material, but memorizes very well and puts in extra effort to eke out credit and pass the class. Let’s look at this in more detail.

The AI usually begins by regurgitating your question using more words or redefining the term you asked it about. (Thanks, but I have 50 exams to grade, so please don’t waste my time.) It later re-regurgitates, forming a miniature conclusion. (Now I’m getting irritated. A strong student gives concise, correct answers. A weaker student stumbles through long answers with convoluted explanations.)

In response to a simple question, ChatGPT generally produced three or four paragraphs of output. This usually contained the right answer, which was impressive. However, it sometimes included additional wrong answers. It also often contained extraneous details, related but unimportant facts, and definitions of partially irrelevant terms. The breadth of concepts it absorbed from its training is remarkable, but the links between them are often nebulous. It can tell you what, but not why.

If I asked you why it was dark in here, and you said, “Because the light is off,” you’d be correct, but you’re not really telling me anything useful. I hope you wouldn’t go on to tell me about the definition of light, how light can be measured, and what colors make up light before summarizing that something that’s dark isn’t light. But that’s the sort of answer ChatGPT would provide.

ChatGPT’s word salad

When asked a harder question, ChatGPT tries to score points by shotgunning you with answer pellets. Each answer says a modest amount, using a lot of unnecessary words. In this way, the AI reminds me of a student who lacks full conceptual understanding and gives multiple explanations, elaborated in confusing ways, hoping to hit on something correct for partial credit and win extra points for effort.

ChatGPT’s response to each of my difficult questions consisted of a mix of correct answers, partially correct answers with incorrect portions, answers that stated factual information but didn’t ultimately explain anything, answers that might be true but were irrelevant, and answers that were dead wrong. The wrong answers included full explanations that sounded reasonable but were total nonsense on close reading.

Confoundingly, I cannot predict when the AI will give a right answer or a wrong one. It can give a confused response to a simple question and an impressive reply to an arcane query. ChatGPT also throws extraneous related information on top for brownie points, but often this just gets it into trouble.

Confident but wrong

More than once, I received an answer in which the AI started by giving a correct definition. (Usually, it was restating the Wikipedia entry related to the topic, which is the student equivalent of rote memorization.) Then the AI would elaborate but say something completely wrong or backward. This reinforces my impression that the model is well trained on which concepts are linked together, but is unable to capture the nature of those relationships.

For example, ChatGPT knows A is related to B. However, it often doesn’t know if A implies B, or if A precludes B. It may mistake whether A and B are directly correlated or inversely correlated. Possibly A and B are just similar topics with no relevant relationship, but when asked about A, it tells you about A and then yammers on about B.

Beyond tabulating right and wrong scores, human factors matter in a human evaluation of the AI. It’s easy to overestimate ChatGPT’s ability because of its writing and tone. The answers are written well, read coherently, and give the impression of authority. If you don’t know the true answer to your own question, ChatGPT’s answer will make you believe that it knows.

This is troubling. If someone is a fool and talks like one, we can easily tell; if someone is a fool but well spoken, we might start to believe them. To be sure, ChatGPT could give you the right answer or useful information. But it could just as eloquently and convincingly give you a wrong answer, a convenient or malicious lie, or propaganda embedded in its training data or inserted by human hands. ChatGPT may be a C student, but C students run the world.