Advanced Chain-of-Thought Mastery

Stack CoT with role prompting and self-critique to reach 60%+ accuracy on multi-evidence reasoning.

THE MINDSET SHIFT

“Naive Chain-of-Thought looks impressive. It writes paragraphs of reasoning. And it picks the wrong answer roughly 4 times out of 10 — because the model still pattern-matches the most familiar story instead of weighing the evidence.”
— SHE · YOUR AI GUIDE

Wei et al. (2022) reported that Chain-of-Thought prompting raises reasoning accuracy by 43% on hard tasks. That headline number sent everyone scrambling to bolt "think step by step" onto every prompt. The trap: CoT doesn't make the model think harder; it makes the model write more. Two very different things.

Kahneman's dual-process theory explains why. System 1 (fast, pattern-matching, intuitive) generates the obvious first answer. System 2 (slow, deliberate, evidence-weighing) catches System 1's mistakes. Naive CoT fakes System 2 by writing prose, but the underlying inference still runs on System 1: the model commits to a story in the first paragraph and rationalizes the rest.

The production fix is structural. You scaffold the reasoning into discrete checkpoints, force a confidence score at each step, then add an explicit self-critique pass that asks "where am I weighting familiarity over evidence?" This is the move that takes you from the CoT you would expect of an 8B model to reasoning closer to a frontier model's, on the same model.
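The three pieces of that fix (checkpoints, per-step confidence, a self-critique pass) can be sketched as two prompts run back to back. This is a minimal illustration, not a reference implementation: the checkpoint wording, the 0-100 confidence scale, the threshold of 70, and the `call_model` function are all assumptions introduced here. `call_model` stands in for whatever client you use to send a prompt and get text back.

```python
import re
from typing import Callable, List

# Hypothetical checkpoint wording; the text only calls for "discrete checkpoints".
CHECKPOINTS = [
    "List every piece of evidence given in the question.",
    "Say what each piece of evidence supports or rules out.",
    "Name at least two candidate answers.",
    "Choose the candidate the evidence supports best.",
]

# The self-critique question quoted in the text.
CRITIQUE_QUESTION = "Where am I weighting familiarity over evidence?"

# Matches lines such as "Step 2 confidence (0-100): 55" in a model reply.
CONFIDENCE_RE = re.compile(r"Step (\d+) confidence \(0-100\):\s*(\d+)")


def build_reasoning_prompt(question: str) -> str:
    """First pass: numbered checkpoints, each followed by a confidence slot."""
    lines = [f"Question: {question}", "", "Work through each step in order."]
    for i, step in enumerate(CHECKPOINTS, start=1):
        lines.append(f"Step {i}: {step}")
        lines.append(f"Step {i} confidence (0-100):")
    lines.append("Draft answer:")
    return "\n".join(lines)


def low_confidence_steps(response: str, threshold: int = 70) -> List[int]:
    """Return the step numbers whose reported confidence is below threshold."""
    return [
        int(step)
        for step, score in CONFIDENCE_RE.findall(response)
        if int(score) < threshold
    ]


def build_critique_prompt(question: str, draft: str, weak_steps: List[int]) -> str:
    """Second pass: ask the model to audit its own draft reasoning."""
    focus = ", ".join(str(n) for n in weak_steps) or "none flagged"
    return "\n".join([
        f"Question: {question}",
        "Your earlier reasoning:",
        draft,
        f"Steps with low confidence: {focus}",
        CRITIQUE_QUESTION,
        "Revise any step that leaned on the familiar story, then give a final answer.",
    ])


def reason_with_critique(question: str, call_model: Callable[[str], str]) -> str:
    """Run both passes; call_model is any function mapping prompt text to reply text."""
    draft = call_model(build_reasoning_prompt(question))
    weak = low_confidence_steps(draft)
    return call_model(build_critique_prompt(question, draft, weak))
```

Keeping the critique as a separate call, rather than appending it to the first prompt, means the model reads its draft as input instead of continuing the story it has already committed to. Whether that separation helps on a given task is something to measure rather than assume.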

“Naive CoT improves accuracy 43% on reasoning benchmarks”

Wei et al., Chain-of-Thought Prompting, NeurIPS 2022

“Self-critique + CoT adds another 17–22 points on multi-evidence tasks”

Madaan et al., Self-Refine, NeurIPS 2023

“CoT REDUCES accuracy on simple lookup tasks by ~9%”

Sprague et al., To CoT or Not to CoT, 2024