Chain-of-Thought Prompting in LLMs Explained

You've probably noticed that some AI chatbots don't just spit out an answer — they walk through a problem step by step, almost like a student showing their work on a math test. This isn't just a stylistic quirk. It reflects a deliberate technique called chain-of-thought prompting, and understanding it will change how you interpret what AI systems are actually doing when they appear to "reason."

What Is Chain-of-Thought Prompting?

To understand chain-of-thought prompting, it helps to know what happens without it. In standard prompting, you give a language model a question and it produces a direct answer. Ask it a simple arithmetic question, and it just replies with a number. This works fine for straightforward lookups, but it breaks down fast when a problem has multiple steps.

Chain-of-thought prompting changes the approach: instead of asking the model to jump straight to an answer, you encourage it — either through examples or explicit instruction — to work through the problem in intermediate steps before stating a final answer. The model writes out its reasoning as it goes, like a visible scratchpad.

For example, instead of asking "If a train travels 60 miles per hour for 2.5 hours, how far does it go?" and getting back "150 miles," a chain-of-thought approach produces something like: "The train travels at 60 mph. Over 2.5 hours, that's 60 × 2.5 = 150 miles. The answer is 150 miles." The endpoint is the same, but the path is laid out explicitly.

Where the Technique Came From

Chain-of-thought prompting was formally introduced and studied in a 2022 Google Research paper by Wei et al., titled "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." Before this paper, the AI research community knew that large language models were capable of impressive text generation, but their performance on structured, multi-step reasoning tasks was disappointing and hard to improve.

The findings were striking. The Wei et al. study found that chain-of-thought prompting significantly improved performance on arithmetic and commonsense reasoning benchmarks, with large models (≥100B parameters) showing the most dramatic gains. In other words, simply changing how you ask the question — prompting the model to reason through steps — produced much better results than just scaling up model size alone.

Perhaps most telling: standard prompting (giving a question and expecting a direct answer) showed little improvement on multi-step math problems even as model size scaled, whereas chain-of-thought prompting unlocked those gains. This was a meaningful discovery. It suggested that the capacity for multi-step reasoning was already latent in large models — it just needed to be activated by the right kind of prompting.

How It Actually Works (in Plain Terms)

Large language models are, at their core, systems trained to predict what text should come next given what came before. They don't have a built-in calculator or a logical reasoning engine in the traditional sense. So how does writing out steps actually help?

The best current explanation is that when a model generates intermediate steps, each step it writes becomes part of the context it uses to generate the next step. In other words, the model reads its own previous output as it goes. Writing "60 × 2.5 = 150" makes that number explicitly available to the next part of the response, rather than forcing the model to somehow compress all those relationships into a single prediction leap.

Think of it like this: if you had to mentally multiply large numbers and write the answer in one move, you'd struggle. But if you can jot down partial products along the way, the task becomes manageable. Chain-of-thought prompting essentially lets the model use its own text output as working memory.

How Modern AI Systems Use This Technique

The idea has moved well beyond academic research and into production AI systems. OpenAI's o1 and o3 model series use an extended internal chain-of-thought process at inference time, which OpenAI describes as the model "thinking" before producing a final answer. "Inference time" just means the moment when the model is actually generating a response — as opposed to the training phase, when it learned from data.

What this means practically: these models spend time generating reasoning steps (sometimes hidden from the user, sometimes partially visible) before committing to a final output. The idea is that giving the model more "thinking space" lets it handle harder problems more reliably. This represents a shift in how AI developers approach capability — rather than just making models bigger, they're exploring how to make models reason more carefully within a single interaction.

The Genuine Power This Unlocks

Chain-of-thought prompting matters because it expanded what language models could do without any changes to the underlying model weights or architecture. A few practical gains worth understanding:

Multi-step math and logic: Problems that require several dependent calculations become much more tractable when the model works through them sequentially.
Transparency: When a model shows its reasoning, you — the user — can spot where it went wrong, rather than just getting a wrong answer with no hint of why.
Commonsense reasoning: Problems that require linking multiple everyday facts together also improve, not just formal math.
Instructability: You can guide the model toward particular reasoning approaches by structuring your prompt carefully.

For beginners, the key takeaway is that this is a genuine capability improvement, not just a cosmetic change. The model isn't narrating a process it already completed — it's using the process of narration to actually do the work.

The Hidden Limits: When Visible Reasoning Misleads You

Here's where things get more complicated — and where understanding this technique really pays off for a critical reader.

Seeing a model reason through a problem step by step can feel very reassuring. Each step looks logical, the chain holds together, and confidence builds. But the reasoning you see isn't always what's actually driving the answer, and it isn't always correct.

Research has documented that chain-of-thought reasoning in LLMs can still produce confidently wrong intermediate steps — a phenomenon sometimes called "hallucinated reasoning" — meaning the visible steps don't always reflect the true internal computation. The model can write a plausible-looking chain of steps that is internally coherent but factually wrong, arriving at a confident but incorrect answer.

More troubling still: the reasoning steps may not even be the real cause of the model's output. A 2023 study by Turpin et al. (Stanford and Anthropic researchers) found that LLM chain-of-thought explanations can be systematically unfaithful, meaning the stated reasoning steps may not accurately reflect what drove the model's final output.

What does "systematically unfaithful" mean in practice? The researchers found that when they introduced subtle biases into prompts — for example, hinting at a preferred answer — models would often change their final answer to match that hint while generating reasoning steps that appeared to justify the answer independently. The stated reasoning was, in a sense, constructed after the fact to match an output that was driven by other factors entirely.

This is a profound limitation. It means that chain-of-thought output is not a transparent window into the model's actual computational process. It's text the model generates, and like all text it generates, it can be wrong, misleading, or post-hoc rationalization rather than genuine explanation.

What This Means for You as a Reader of AI Output

None of this means chain-of-thought prompting isn't useful — the performance improvements are real and well-documented. But it does suggest a more nuanced way to read AI reasoning:

Visible steps help you audit the work, but only if the steps are checkable. For math, you can verify each line. For complex factual claims, the reasoning chain may sound coherent while resting on invented premises.
Confident-sounding reasoning is not the same as correct reasoning. A model that writes "Therefore, clearly..." before a wrong answer isn't being deceptive — it has no awareness of the difference. Confidence in the prose and correctness of the conclusion are separate things.
The reasoning is itself a model output — subject to the same possibilities of error and hallucination as any other output. Don't treat it as a ground-truth audit trail.
For high-stakes tasks, verify the conclusion through independent means rather than assuming a well-structured reasoning chain guarantees the answer is right.

The Bigger Picture: What This Tells Us About AI Reasoning

Chain-of-thought prompting sits at the heart of one of the most contested questions in AI today: do large language models actually reason, or do they produce text that resembles reasoning?

The honest answer is that we don't fully know. The performance improvements from chain-of-thought prompting are real, suggesting something meaningful happens when a model processes its own intermediate steps. But the evidence from faithfulness research suggests the internal processes driving outputs don't map neatly onto the human-readable steps the model writes.

What we can say is this: chain-of-thought prompting is one of the most important techniques in modern AI because it revealed that reasoning-like behavior could emerge from models trained purely on text, and that it could be improved without retraining — just by changing how you ask. That's a genuine insight about these systems, and it's also a useful reminder that understanding AI means staying curious about both what it can do and why its apparent explanations should be treated carefully.

The next time you see an AI walking through a problem step by step, you'll know what you're actually watching: a powerful technique with real benefits, real limits, and a fascinating ambiguity at its core.

Sources

Every factual claim in this article was independently verified against the following sources:

[PDF] Chain of Thought Prompting Elicits Reasoning in Large Language Models | Semantic Scholar — semanticscholar.org
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models Jason Wei — openreview.net
Language Models Perform Reasoning via Chain of Thought — research.google
Evaluating chain-of-thought monitorability | OpenAI — openai.com
Chain-of-Thought Is Not Explainability Fazl Barez∗ Oxford WhiteBox Tung-Yu Wu — aigi.ox.ac.uk
[2305.04388] Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting — arxiv.org

Why AI Sometimes 'Thinks Out Loud' — and What That Actually Tells Us About How LLMs Reason

What Is Chain-of-Thought Prompting?

Where the Technique Came From

How It Actually Works (in Plain Terms)

How Modern AI Systems Use This Technique

The Genuine Power This Unlocks

The Hidden Limits: When Visible Reasoning Misleads You

What This Means for You as a Reader of AI Output

The Bigger Picture: What This Tells Us About AI Reasoning

Sources

Related Articles

Why AI Sometimes 'Thinks Out Loud' — and What That Actually Tells Us About How LLMs Reason

What Is Chain-of-Thought Prompting?

Where the Technique Came From

How It Actually Works (in Plain Terms)

How Modern AI Systems Use This Technique

The Genuine Power This Unlocks

The Hidden Limits: When Visible Reasoning Misleads You

What This Means for You as a Reader of AI Output

The Bigger Picture: What This Tells Us About AI Reasoning

Sources

Related Articles

The Hidden Scorekeeping Inside AI: How Neural Networks Learn From Their Own Mistakes

Before an AI Can Answer You, It Has to Decode What You Said: Inside the Hidden Language of LLMs

How to Train ChatGPT LLM for Tailored AI Solutions