Two papers came out the same month proposing to fix the same problem in prompting frameworks. One said the fix was to throw structure away. The other said the fix was to add more of it. Both reported gains. I wanted to understand how they could both be right.

The paper that throws structure away is Atom of Thoughts (AoT) [1], and its central claim is that accumulated reasoning history is a liability rather than an asset. That's a sharper claim than it sounds. It's also, it turns out, a contested one. Two independent evaluations landed after the paper, and a concurrent paper from the same month solves the same stated problem with the opposite move. So this post is less "here is how AoT works" and more "here is what happens when you stack AoT against its critics."

The mechanics

The AoT loop is short to state [1]. Take the current question. Decompose it into a directed acyclic graph of subquestions, some independent, some dependent. Solve the independent ones. Contract the whole thing into a new, self-contained atomic question that folds the solved answers in as known conditions. Throw the DAG away. Repeat.

In the paper's words: "This temporary structure is later discarded to eliminate historical dependencies, enabling the Markovian transition." [1]

That is the whole thesis in one line. Each state transition is memoryless. The only thing carried forward is the reformulated question, not the reasoning that produced it. Compare this to vanilla CoT, where every previous step sits in context and every next step conditions on all of it. Or to ToT and GoT, where the search state is the accumulated structure itself. AoT is the only one of these that explicitly says the history is a liability.
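The loop above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the three LLM calls (decompose, solve, contract) are replaced with hypothetical string functions over nested arithmetic, where the innermost parenthesised groups play the role of independent subquestions. All function names here are assumptions made for the example.

```python
import re
from math import prod
from typing import Callable


def aot_loop(question: str,
             decompose: Callable[[str], list[str]],
             solve: Callable[[str], str],
             contract: Callable[[str, dict[str, str]], str],
             max_steps: int = 10) -> str:
    """Iterate decompose -> solve -> contract, carrying only the question forward."""
    for _ in range(max_steps):
        independent = decompose(question)   # temporary structure for this step only
        if not independent:                 # nothing left to split: question is atomic
            break
        answers = {sub: solve(sub) for sub in independent}
        question = contract(question, answers)  # new self-contained question
        # `independent` and `answers` are rebuilt next iteration; no history is kept
    return solve(question)


# Toy stand-ins for the LLM calls (hypothetical, for illustration only).
def decompose_parens(question: str) -> list[str]:
    """Innermost parenthesised groups depend on nothing else, so treat them as independent."""
    return re.findall(r"\(([^()]*)\)", question)


def solve_flat(expr: str) -> str:
    """Evaluate a flat expression of non-negative integers joined by + and *."""
    return str(sum(prod(int(f) for f in term.split("*")) for term in expr.split("+")))


def contract_parens(question: str, answers: dict[str, str]) -> str:
    """Fold solved answers back into the question as known values."""
    for sub, ans in answers.items():
        question = question.replace(f"({sub})", ans)
    return question


result = aot_loop("((1+2)*3+4)*(5+6)", decompose_parens, solve_flat, contract_parens)
print(result)  # prints 143
```

The state passes through "((1+2)*3+4)*(5+6)", then "(3*3+4)*11", then "13*11". At each step the only thing that survives is the rewritten question string, which is the memoryless property in miniature.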

The lineage argument

The paper positions itself inside a lineage. GoT critiqued ToT for being too rigidly tree-shaped. AoT critiques ToT, GoT, and CoT together for a more fundamental sin: all of them let history pile up. The fix, in AoT's telling, is to collapse each step into a fresh atomic question and forget how you got there.

Here is where it gets complicated. AGoT, published the same month as AoT [2], identifies the same failure mode (history and structure getting in the way) and moves in the opposite direction. It keeps a DAG, allows recursive nesting of DAGs inside complex nodes, and never discards anything. On shuffled GPQA Diamond with GPT-4o it reports a +46.2% absolute improvement over direct I/O, which the authors note is in the neighborhood of what reinforcement learning distillation buys you.

Two papers, same month, same stated problem, opposite answers. I don't think this is coincidence. I think it's a sign the problem is not well defined yet. "Accumulated history hurts reasoning" is a claim that admits at least two technical responses: throw history away, or organize it so carefully that the model can navigate it. Neither paper really engages with the other's move.

What the paper claims

The AoT paper's own numbers are real. Using gpt-4o-mini, AoT lifts MATH from 78.3 to 83.6, HotpotQA from 67.2 to 80.6, and LongBench from 57.6 to 68.5 over plain CoT [1]. On MATH it edges out the structured alternatives: 82.0 for ToT, 82.3 for GoT, 82.6 for FoT, 83.6 for AoT. The gains are bigger and more uniform on multi-hop QA than on anything else.
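To make the size of those gains easier to compare, here is the subtraction spelled out. The scores are the ones quoted above; the dictionary names are just labels for this example.

```python
# Scores quoted above for gpt-4o-mini.
cot = {"MATH": 78.3, "HotpotQA": 67.2, "LongBench": 57.6}
aot = {"MATH": 83.6, "HotpotQA": 80.6, "LongBench": 68.5}

# Absolute gain of AoT over plain CoT, in points.
gains = {task: round(aot[task] - cot[task], 1) for task in cot}

# AoT's margin over each structured alternative on MATH.
math_baselines = {"ToT": 82.0, "GoT": 82.3, "FoT": 82.6}
math_margins = {name: round(aot["MATH"] - score, 1) for name, score in math_baselines.items()}

print(gains)         # {'MATH': 5.3, 'HotpotQA': 13.4, 'LongBench': 10.9}
print(math_margins)  # {'ToT': 1.6, 'GoT': 1.3, 'FoT': 1.0}
```

The margin over the other structured frameworks on MATH is about a point, while the gain over CoT on HotpotQA is more than thirteen.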

The plugin result is the one I find most interesting. If you run a single AoT decomposition-contraction cycle and then hand the contracted question to Forest of Thoughts with n=2 parallel trees, you match FoT with n=8 at roughly a quarter of the compute. This is a different claim from "AoT is the best standalone framework." It's saying that the atomic restructuring, on its own, takes enough load off downstream search that you can shrink the search itself.
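The "roughly a quarter" figure is easy to sanity-check if you assume compute scales linearly with the number of parallel trees. That linearity is my assumption, not something stated above, and so is the `aot_overhead_in_trees` parameter, which is a hypothetical knob for pricing the single AoT cycle in units of one tree.

```python
def relative_cost(n_trees_plugin: int,
                  n_trees_baseline: int,
                  aot_overhead_in_trees: float = 0.0) -> float:
    """Cost of (one AoT cycle + smaller search) relative to the larger search.

    Assumes cost is linear in the number of trees; the overhead term is a
    hypothetical stand-in for the decomposition-contraction cycle.
    """
    return (n_trees_plugin + aot_overhead_in_trees) / n_trees_baseline


print(relative_cost(2, 8))       # 0.25
print(relative_cost(2, 8, 0.5))  # 0.3125
```

Under that assumption the ratio stays near a quarter as long as the AoT cycle costs less than about one tree's worth of compute.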

The paper's ablations back the Markov framing harder than I expected. Removing the DAG entirely (keeping decomposition, just without dependency structure) drops MATH to 82.7, which is worse than either removing decomposition altogether or keeping both. The authors put this cleanly:

"This reveals a critical insight: imperfect structural guidance can be more detrimental than no guidance at all."

They also name two failure modes of bad contraction. "Destruction of Parallelism" is when merging results from independent subquestions produces an answer that resolves a subproblem instead of the original question. "Destruction of Independence" is when a dependent subquestion loses the pointer back to its prerequisite and gets answered in isolation. Both failures are about what you lose when contraction goes wrong.
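The second failure mode can be made concrete with a small check. This is a crude heuristic of my own, not anything from the paper: it flags a dependent subquestion when the answer to one of its prerequisites does not appear verbatim in the contracted question. The `Subquestion` class, the `lost_prerequisites` function, and the example questions are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Subquestion:
    text: str
    depends_on: list[int] = field(default_factory=list)  # indices of prerequisites


def lost_prerequisites(dag: list[Subquestion],
                       answers: dict[int, str],
                       contracted: str) -> list[int]:
    """Return indices of dependent subquestions whose solved prerequisite
    answers are missing from the contracted question (plain substring check)."""
    lost = []
    for i, sub in enumerate(dag):
        missing = [p for p in sub.depends_on
                   if p in answers and answers[p] not in contracted]
        if missing:
            lost.append(i)
    return lost


# Hypothetical two-hop example.
dag = [
    Subquestion("Who directed the film?"),
    Subquestion("Where was that director born?", depends_on=[0]),
]
answers = {0: "Jane Doe"}

good = "Given that the film was directed by Jane Doe, where was Jane Doe born?"
bad = "Where was that director born?"

print(lost_prerequisites(dag, answers, good))  # []
print(lost_prerequisites(dag, answers, bad))   # [1]
```

A substring match would miss paraphrased answers, so treat this as a way to picture the failure, not as a reliable detector.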

The paper also reports that reasoning models improve. o3-mini on AIME goes from 79.6 (CoT) to 83.0 (AoT). DeepSeek-R1 on AIME goes from 78.3 to 81.7. On LongBench the reasoning-model gains are even larger (o3-mini 56.3 to 65.3, DeepSeek-R1 55.1 to 67.9).

What the replications say

Two independent evaluations tell a different story, at least for reasoning models.

The IJIRT paper from November 2025 [3] replicates the non-reasoning gains. The authors confirm that smaller and instruction-following models benefit from AoT on MATH, GSM8K, MMLU, and HotpotQA. Then they try the same framework on DeepSeek-R1, Grok 3 Beta, and o3-mini and find something the original paper doesn't report: AoT degrades performance. On a 50-question MMLU slice, DeepSeek-R1 with AoT scored 66.0, versus 96.0 for non-reasoning models with the same framework.

Their hypothesis is the part I can't stop thinking about:

"Instead of genuinely decomposing the problem as instructed, they might solve it internally first and then struggle to reverse-engineer the sub-questions and dependencies to match the prompt's requirements, leading to errors or inefficient processing."

If you've read the CoT post, this is the Turpin echo. Turpin showed that models produce CoT explanations that look like the reasoning path but are actually post-hoc justifications of a conclusion already reached [4]. IJIRT is suggesting the same failure mode one structural level up. The model isn't genuinely running the AoT decomposition. It has an internal answer and is performing the decomposition format on top. The framework thinks it's scaffolding computation. The model is performing scaffolding.

The JMIR 2026 paper [5] is more domain-specific but lands in the same place. They tested CoT, AoT, and a retrieval-augmented approach called RoD on predicting 12-week remission in depressive disorder. RoD (retrieval of research evidence plus the model's reasoning) won across every metric. AoT and CoT showed minimal improvement or slight degradation relative to plain zero-shot prompting. Their explanation:

"This divergence suggests that for clinical pattern-recognition tasks, the decomposition of reasoning steps alone (as in CoT/AoT) may introduce unnecessary complexity without meaningful benefit."

Different domain, different failure mode. Here it isn't the reasoning model reverse-engineering the decomposition. It's that clinical pattern recognition isn't actually a decomposable problem in the way AoT assumes. What helps is grounding the model in relevant evidence, not restructuring the reasoning process.

So we have the paper saying AoT helps reasoning models, and two independent evaluations saying it doesn't, or that something else helps more. I don't know how to pick a winner here and I don't think I should. The two evaluations test different things (general reasoning benchmarks vs clinical prediction) with different setups, and neither is the kind of large-scale systematic reproduction that would settle the question. What they do is rhyme. Both point at the same thing: the structure AoT imposes might not be doing what the paper claims it's doing, especially on stronger models.

When it helps, when it doesn't

Putting the evidence together, the pattern I see is:

Consistent wins. Smaller or non-reasoning models on multi-hop QA. The plugin mode as a preprocessor for other test-time scaling methods. These hold up in both the paper and the independent replication.

Contested. Reasoning models. The paper says gains. IJIRT says losses. JMIR says essentially no effect in a different domain.

Open question. Whether the Markov property is essential to AoT's wins, or whether it's one reasonable choice among several. AGoT gets comparable-sized gains by doing the opposite, which makes the Markov claim feel more like a design preference than a mechanism.

Open question. Whether AoT's durable contribution is the standalone framework or the plugin. The plugin result is cleaner (lower compute for comparable accuracy is hard to argue with) and it doesn't depend on the Markov framing being right.

Same shape as the CoT problem

The thing that made me want to write this post is where I ended up, which is basically where I ended up with CoT, one level up.

With CoT, the problem was: I can't tell from the outside whether a given CoT trace is doing real computation or producing a post-hoc rationalization. The visible trace looks the same in both cases.

With AoT, the problem is: I can't tell from the outside whether the decomposition-contraction structure is doing real computational work on a given problem, or whether the model is reverse-engineering the structure after the fact. IJIRT's qualitative observation of Gemini 2.0 Flash Thinking (generating answers first, then retroactively fitting the decomposition) is exactly this failure mode. The trace looks like the framework is working. Performance says it isn't.

This isn't a unique problem with AoT. It's a general problem with any prompting framework that imposes structure on a model capable enough to simulate the structure without actually using it. The stronger the model, the more this matters. AoT's wins on smaller models and losses on reasoning models might be exactly what you'd predict: the smaller model needs the scaffolding, the reasoning model performs the scaffolding.

If that's right, then the interesting question for the next round of this family of papers isn't "which structure is best" but "how do we know whether the model is actually using the structure we impose on it, versus pattern-matching on the format." I don't have a good answer. I don't think anyone does yet. But it's the same gap the CoT post ended on, and I think it's where the interesting work is.

Where I landed

AoT is a cleaner abstraction than CoT, ToT, or GoT. The Markov framing is genuinely novel and the plugin result is a real contribution. I'd use AoT-as-preprocessor for multi-hop QA pipelines over smaller models without hesitation.

For reasoning models, I'd be cautious. The paper and the replications disagree, and the replications' hypothesis (the model is reverse-engineering the decomposition, not performing it) is plausible enough that I want more evidence before trusting the framework there.

And the bigger question (are prompting frameworks doing computational work or performing computational theater on sufficiently strong models) is still open. AoT is the most serious attempt I've seen to address it structurally. It's also the clearest demonstration that we don't have a good way to check whether the attempt worked.

References

  1. Teng et al., "Atom of Thoughts for Markov LLM Test-Time Scaling," arXiv preprint, February 2025 (v1), last revised December 2025 (v4). https://arxiv.org/abs/2502.12018

  2. Pandey, Ghukasyan, Goktas, and Radha, "Adaptive Graph of Thoughts: Test-Time Adaptive Reasoning Unifying Chain, Tree, and Graph Structures," arXiv preprint, February 2025. https://arxiv.org/abs/2502.05078

  3. Patil, Patil, Patil, Patange, Patel, Patil, Patil, and Khan, "Evaluating Atom of Thoughts Across Diverse Language Models: A Framework for Enhancing Non-reasoning LLMs Performance," International Journal of Innovative Research in Technology, Volume 12 Issue 6, November 2025.

  4. Turpin et al., "Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting," NeurIPS 2023. https://arxiv.org/abs/2305.04388

  5. Park, Kang, Jeon, Kang, Kim, Kim, and Lee, "Prediction of 12-Week Remission in Patients With Depressive Disorder Using Reasoning-Based Large Language Models: Model Development and Validation Study," JMIR Mental Health, January 2026. https://mental.jmir.org/2026/1/e83352