How to Identify and Address Confident Errors in Large Language Models: A Case Study on the 'Strawberry' Problem

From Fonarow, the free encyclopedia of technology

Overview

Large language models (LLMs) like ChatGPT have revolutionized how we interact with AI, yet they remain prone to a peculiar type of failure: confident mistakes. These are errors delivered with such certainty that they can easily mislead users. A classic example is ChatGPT's historical inability to correctly count the number of 'R's in the word "strawberry." While OpenAI has recently improved this specific behavior, many other similar pitfalls persist. This tutorial guides you through understanding why these mistakes happen, how to test for them, and how to critically evaluate AI outputs.


Prerequisites

Before diving in, ensure you have:

  • A basic understanding of what large language models are (e.g., transformer-based neural networks).
  • Access to a current version of ChatGPT (free or paid) or another similar LLM-based chatbot for hands-on testing.
  • Familiarity with prompt engineering – how to structure inputs to get reliable responses.
  • A critical mindset – the most important tool for spotting confident errors.

Step-by-Step Instructions

Step 1: Reproduce the Classic 'Strawberry' Test

To see the issue firsthand, ask the AI: "How many 'R's are in the word 'strawberry'?" In older versions, ChatGPT would often answer "2" instead of the correct "3" (the word is spelled s-t-r-a-w-b-e-r-r-y, with R's at positions 3, 8, and 9). Test with your current version to compare. Record the response and note how confidently it is stated (e.g., "There are 2 R's in strawberry").
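
Before querying the model, it helps to establish the ground truth deterministically. A minimal Python sketch (the helper name count_letter is just for illustration):

    def count_letter(word: str, letter: str) -> int:
        """Count occurrences of a letter in a word, case-insensitively."""
        return word.lower().count(letter.lower())

    print(count_letter("strawberry", "r"))  # 3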

Step 2: Expand the Test with Other Letter-Counting Prompts

Try variations:

  • "How many 'E's in 'elephant'?" (Actually 2, but models may miscount due to tokenization).
  • "Count the 'L's in 'lollipop'." (Correct: 3).
  • "How many 'T's in 'that'?" (Correct: 2).

Log the answers. This reveals a pattern: LLMs often fail at simple, deterministic tasks because they process text as chunks (tokens) rather than individual characters.
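
To make this step systematic, keep a small battery of words with programmatically computed ground truth and log the model's answer next to it. A sketch in Python; ask_model is a placeholder here – paste each prompt into your chatbot and type the reply back in:

    def ask_model(prompt: str) -> str:
        # Placeholder: paste the prompt into your chatbot, then type its reply here.
        return input(prompt + "\nModel's answer: ")

    tests = [("strawberry", "r"), ("elephant", "e"), ("lollipop", "l"), ("that", "t")]

    for word, letter in tests:
        expected = word.count(letter)  # deterministic ground truth
        prompt = f"How many '{letter.upper()}'s are in the word '{word}'?"
        answer = ask_model(prompt)
        print(f"{word}: expected {expected}, model said {answer}")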

Step 3: Analyze Why the Errors Occur

LLMs do not see letters directly. They tokenize input into subword units. For example, "strawberry" might be split into tokens like "straw" and "berry" or "stra", "w", "berry". The model then predicts the next token based on probability, not by counting. When asked directly to count, the model attempts to generate an answer from patterns in training data, but because counting precise characters is not a natural task for it, it guesses with high confidence – often incorrectly.
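
You can inspect this splitting yourself with OpenAI's open-source tiktoken library (pip install tiktoken). The exact split depends on the encoding, so treat the printed chunks as illustrative rather than a fixed fact:

    import tiktoken  # pip install tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # encoding used by several OpenAI models
    tokens = enc.encode("strawberry")
    print([enc.decode([t]) for t in tokens])    # subword chunks, not individual letters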

Step 4: Test for Other Confident Mistakes Beyond Counting

Letter counting is just one example. Try:

  • Mathematical reasoning: Ask "What is 2+2?" (trivial), but then "What is 2,345 * 1,231?" – the model may give a plausible-looking but wrong product.
  • Factual claims: Query "Who won the Super Bowl in 2020?" (correct: Kansas City Chiefs), but then "Who won in 1920?" – the model might invent a game that didn't exist.
  • Citation fabrication: Ask ChatGPT to provide a specific scientific reference. It may generate a fake paper title and author with convincing detail.

Record your prompts and responses to see how often the model sounds confident even when wrong.
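
Arithmetic claims are the easiest to verify, because Python integers are exact. A quick cross-check for the multiplication above (model_claim is a hypothetical value reported by the model):

    product = 2345 * 1231
    print(f"2,345 * 1,231 = {product:,}")  # 2,886,695

    model_claim = 2886695  # hypothetical answer reported by the model
    print("model correct" if model_claim == product else "model wrong")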

Step 5: Apply Critical Evaluation Techniques

To avoid being misled by confident mistakes:

  1. Verify with external sources: Use a search engine or calculator to double-check facts and calculations.
  2. Ask for reasoning: Prompt the AI to show its working: "Explain step by step how you counted the R's." Often, the reasoning reveals logical gaps (e.g., it might spell out s-t-r-a-w-b-e-r-r-y and claim "letters 3, 4, and 8 are R's" – letters 3 and 8 are indeed R's, but letter 4 is an A, and it missed the R at position 9).
  3. Cross-check with different models: Ask the same question to Gemini, Claude, or open-source models to see if errors are consistent.
  4. Use explicit instructions: For counting tasks, specify "Please treat each character individually, including spaces and punctuation. Write out each character and then count." This forces the model into a more reliable step-by-step mode; a reusable template is sketched below.
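
The fourth technique can be packaged as a reusable prompt template. A sketch, assuming you paste the generated prompt into whichever chatbot you are testing:

    def character_count_prompt(word: str, letter: str) -> str:
        """Build a prompt that forces character-by-character enumeration."""
        return (
            f"Spell the word '{word}' one character per line, numbering each line. "
            f"Then list the line numbers whose character is '{letter}', "
            f"and finally state the total count."
        )

    print(character_count_prompt("strawberry", "r"))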

Step 6: Understand the Root Cause – Overconfidence in Training

LLMs are trained to predict the next word, not to introspect about uncertainty. They have no built-in mechanism for saying "I don't know." Instead, they are optimized to produce fluent, confident-sounding responses. The famous strawberry error persisted for as long as it did because even when the model fails, it still generates an answer that fits the pattern of responses seen in training data. OpenAI's recent fix for the strawberry case likely involved fine-tuning on a dataset with explicit character-counting examples, but this addresses only the symptom, not the underlying architecture.


Step 7: Evaluate the ‘Victory Lap’ – What OpenAI Fixed and What Remains

OpenAI boasted that the strawberry letter counting now works correctly in their latest model. Test it: ask directly, and you'll likely get the correct "3". However, as social media responses have shown, other mistakes persist. For instance, ask for the number of 'R's in 'strawberries' (still 3), or in 'irreverent' (also 3), or in a nonsense word. The model may still fail on unfamiliar or more complex counting tasks. More critically, errors in reasoning (e.g., logic puzzles, code bugs) remain abundant. This highlights that fixing one benchmark example does not solve the confidence problem.
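
To probe beyond any memorized fix, generate nonsense words whose letter counts cannot appear in training data. A small sketch using Python's standard library:

    import random
    import string

    def nonsense_word(length: int = 12) -> str:
        """Generate a random lowercase pseudo-word the model has never seen."""
        return "".join(random.choice(string.ascii_lowercase) for _ in range(length))

    word = nonsense_word()
    print(f"Ask the model: How many 'R's are in '{word}'?")
    print(f"Ground truth: {word.count('r')}")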

Common Mistakes When Working with LLMs

Mistake 1: Assuming Fluency Equals Accuracy

Just because the model writes a coherent paragraph does not mean its facts are correct. Always double-check any quantitative or factual claim, especially if it sounds surprising.

Mistake 2: Using LLMs for Deterministic Tasks Without Guardrails

Do not rely on ChatGPT for tasks that require exactness, such as word counts, character counts, or arithmetic. Use dedicated tools (a word processor's count feature, a calculator) instead.
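
One way to enforce this guardrail in an application is to intercept deterministic questions before they ever reach the model and answer them with ordinary code. A minimal sketch of the idea; the regular expression and routing logic are illustrative, not production-ready:

    import re

    def answer(question: str) -> str:
        """Route letter-counting questions to exact code; defer the rest."""
        m = re.match(r"how many '?(\w)'?s? (?:are )?in '?(\w+)'?\??", question, re.IGNORECASE)
        if m:
            letter, word = m.group(1), m.group(2)
            return str(word.lower().count(letter.lower()))
        return "DEFER_TO_LLM"  # open-ended questions still go to the model

    print(answer("How many 'R's in 'strawberry'?"))  # 3
    print(answer("Summarize this article."))         # DEFER_TO_LLM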

Mistake 3: Not Prompting for Step-by-Step Reasoning

Asking for a direct answer increases the chance of a hallucinated result. By requesting chain-of-thought reasoning, you often get a more reliable outcome – though not guaranteed.

Mistake 4: Believing That Recent Fixes Make the Model Perfect

The strawberry fix is an isolated patch. General overconfidence remains a feature, not a bug. Treat all outputs with healthy skepticism, especially for high-stakes decisions.

Summary

Confident mistakes in LLMs like ChatGPT stem from their token-based processing and training that rewards plausible-sounding answers over accurate self-assessment. The classic strawberry letter-counting failure illustrates this well, and while OpenAI has patched that specific query, the underlying weakness persists. By testing systematically, analyzing why errors occur, and applying critical evaluation techniques, you can mitigate the risk of being misled. Always remember: an AI's confidence does not equal correctness.