Testing ChatGPT 4.5

On this glorious day of the Turing Award being presented to two reinforcement learning nerds, I went back to some of ChatGPT’s failures, feeding the same prompts to the 4.5 version of our future robot overlord.

December 21, 2024: ChatGPT tries to figure out time zones; Today: correct answer!

December 14, 2024: LLM failure on a simple question (“What are examples of museums named after two people with different last names?”); Today: failure once again.

August 2024: ChatGPT 4o tackles the challenge of AC ducts sweating in an attic; Today: complete failure. It concludes that if you put 50-degree air inside an R-30-insulated duct in a warm attic, the outside of the duct will be at 50.8 degrees F and, therefore, the duct will sweat.
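For a sanity check on that 50.8-degree figure, here is a rough steady-state sketch in Python. Everything beyond the 50-degree supply air and the R-30 insulation is my assumption: a 120-degree attic, an outside air-film resistance of about R-0.68 (a commonly quoted still-air value), a flat-wall approximation that ignores the duct's round geometry, and an inner insulation face sitting right at the supply-air temperature. The function name is just for illustration.

```python
def duct_surface_temp_f(attic_f, supply_f, r_insulation, r_air_film=0.68):
    """Estimate the outer-surface temperature (deg F) of an insulated duct.

    Steady-state, one-dimensional heat flow through two resistances in
    series: the insulation, then the air film on the attic side.
    r_air_film=0.68 is an assumed still-air value, not a measured one.
    """
    total_r = r_insulation + r_air_film
    # The temperature drop across each layer is proportional to its share
    # of the total resistance, so the thin air film gets a tiny drop.
    return attic_f - (attic_f - supply_f) * r_air_film / total_r


# Assumed 120 F attic; 50 F supply air and R-30 are from the prompt.
surface = duct_surface_temp_f(attic_f=120.0, supply_f=50.0, r_insulation=30.0)
print(f"Estimated duct surface temperature: {surface:.1f} F")
```

Under these assumptions the outside of the duct lands within a couple of degrees of the attic air, nowhere near 50.8, because nearly all of the temperature drop happens inside the insulation. Sweating requires a surface colder than the dew point of the surrounding air, so a surface this close to attic temperature should stay dry in all but near-saturated air.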

The latest version of ChatGPT thinks that pit bulls are, in general, more dangerous than golden retrievers. But it adds an “important nuance”:

Individual temperament, training, socialization, and responsible ownership significantly impact dog behavior.

I followed up with

You’re saying, then, that your chances of being killed by your pet golden retriever are low, but never zero?

and ChatGPT agreed, highlighting “but never zero”. Asked for an example, ChatGPT claimed “A notable fatal incident involving a Golden Retriever occurred in 2012, when an 8-month-old infant in South Carolina was tragically killed by a Golden Retriever.” I found the story:

… found dead in his family’s mobile home …. The baby was in a swing when Lucky, a golden retriever-Labrador mix, bit the child several times and tore off his legs, authorities said. The child’s father, Quintin, was in the home at the time, police said. He was in another room asleep with the family’s 3-year-old and their other dog. The baby was discovered when his mother, Chantel, came home after taking their seven-year-old to a doctor’s appointment, The Post and Courier reported.

Here’s a photo of what a Goldador is supposed to look like:

Based on this photo, I’m not convinced that the mostly peaceful animal is a golden-lab, though a lot of puppies do love to bite arms, hands, legs, and feet!

Let’s try some image generation… “generate a picture of failed flying machine design circa 1900 based on the principle of wing flapping”

This can be considered a fail due to the apparent rigidity of the structure.

3 thoughts on “Testing ChatGPT 4.5”

  1. The response from ChatGPT regarding the “flying machine” is a success. It didn’t include any people of color.

  2. Out of curiosity I tried “What are examples of museums named after two people with different last names?” and a few models got accurate results: Grok with “think” selected got 2, Gemini 2 models got 2, OpenAI’s o3-mini-high got 2, OpenAI’s o1 model got 5, Claude 3.7 using extended thinking got 7, and DeepSeek with both “DeepThink” and “search” selected got 12. I didn’t waste an OpenAI “Deep Research” test on it, and Gemini’s “Deep Research” is only a 1.5 model and the results weren’t accurate.
