
Researchers discover why Chain of Thought works

Uncovering why CoT works so well!
Models such as GPT and BERT have demonstrated exceptional abilities in natural language processing, but they have limitations—particularly when solving problems that require serial computation, where each step depends on the result of the previous one. An intriguing technique called Chain of Thought (CoT) can enhance these models, making them capable of handling tasks that require complex, step-by-step reasoning.
A fundamental challenge for transformers is their inherent design as parallel processing models. Transformers process all tokens in parallel, allowing them to perform tasks like language modeling and classification efficiently. However, serial tasks, where operations need to be performed one step at a time, are more difficult for these models to handle. These tasks require a transformer to remember intermediate steps and build upon them, which is not something transformers do well in their standard form.

Parallel vs. Serial Computation

To understand why transformers struggle with serial tasks, it’s important to contrast parallel and serial computation. In parallel computation, different parts of a task are handled simultaneously, making it well-suited for tasks where the individual components can be processed independently of each other. For instance, tasks like counting words in a document can be processed in parallel, as each word can be counted independently. However, in serial computation, each step relies on the previous one, like calculating the factorial of a number, where each number is multiplied sequentially. Transformers, without modifications, excel at parallel tasks but face difficulties with serial computation.
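To make the contrast concrete, here is a minimal Python sketch (an illustration, not code from the paper) of a task that parallelizes trivially, counting words, next to one that does not, computing a factorial, where every multiplication needs the running product from the step before.

```python
from collections import Counter

# Parallel-friendly: each word can be counted independently, so the work
# could be split across many workers and merged at the end.
def word_counts(text: str) -> Counter:
    return Counter(text.split())

# Inherently serial: step k needs the result of step k-1,
# so the multiplications cannot all happen at once.
def factorial(n: int) -> int:
    result = 1
    for k in range(2, n + 1):
        result *= k          # depends on the running product so far
    return result

print(word_counts("the cat sat on the mat"))  # Counter({'the': 2, ...})
print(factorial(5))                           # 120
```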

Why Transformers Are Limited Without CoT

The limitations of transformers without CoT can be explained by their computational power, which is comparable to that of AC0 circuits. In circuit complexity, AC0 and TC0 are two classes of Boolean circuits that help define the computational limits of parallel systems. AC0 circuits are built from AND, OR, and NOT gates arranged in constant depth, and they are efficient at highly parallel tasks. However, a classic lower bound shows that such circuits cannot solve even basic counting problems like parity checking, that is, determining whether the number of 1s in a binary string is odd or even, because no fixed number of layers of AND, OR, and NOT gates can aggregate information across the entire input the way a step-by-step running count can.
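For intuition, parity is trivial to compute with a running state, as in the short sketch below (an illustration, not code from the paper), yet that running state is exactly what a constant-depth AC0 circuit cannot maintain.

```python
# Parity: is the number of 1s in a bit string odd? Easy with a running
# state, but provably out of reach for constant-depth AND/OR/NOT circuits.
def parity(bits: str) -> int:
    p = 0
    for b in bits:
        p ^= int(b)   # each step updates state based on the previous step
    return p          # 1 if the count of 1s is odd, 0 if even

print(parity("1011001"))  # 0 (four 1s, so even)
```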
TC0 circuits, on the other hand, are slightly more powerful because they include MAJORITY gates, which allow them to perform threshold-based decisions, such as counting how many inputs meet a certain condition. Even so, TC0 circuits are still limited in their ability to handle deeply serial tasks.
Constant-depth transformers (those with a fixed number of layers) behave similarly to AC0 circuits because they rely heavily on parallelism within each layer. As a result, these transformers are great for parallel tasks but struggle with problems that require sequential processing or maintaining intermediate results.

How Chain of Thought Overcomes These Limits

Chain of Thought allows transformers to break down complex problems into multiple intermediate reasoning steps before producing a final answer. Rather than trying to solve a problem in one pass, CoT enables transformers to generate intermediate outputs that represent partial solutions to the task at hand. This process simulates serial computation by allowing the transformer to perform multiple steps in sequence, improving its ability to solve inherently serial problems.
For example, when faced with a problem like permutation composition, which requires applying permutations one by one in a specific order, a transformer without CoT would struggle to maintain the state required to apply each permutation in sequence. CoT helps by breaking the problem down: after each permutation, the transformer generates an intermediate result, which then serves as the input for the next permutation. This allows the model to keep track of intermediate steps and apply each transformation in the correct order.
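As a rough illustration (a toy sketch, not the paper's actual data format or model), the loop below mimics this behavior: each pass applies one permutation and writes out the intermediate arrangement, playing the role of one CoT step.

```python
# Permutation composition solved "step by step", the way CoT lets a
# transformer emit one intermediate result at a time.
# Each permutation is a tuple p where p[i] names the source position.
def apply(perm: tuple[int, ...], state: tuple[int, ...]) -> tuple[int, ...]:
    return tuple(state[i] for i in perm)

perms = [(1, 2, 0), (2, 0, 1), (0, 2, 1)]   # hypothetical input sequence
state = (0, 1, 2)                           # start from the identity

for step, p in enumerate(perms, 1):
    state = apply(p, state)                 # intermediate "CoT" output
    print(f"after permutation {step}: {state}")
```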

How Many CoT Steps Are Needed?

CoT significantly enhances a transformer's power, but the number of CoT steps plays a crucial role in determining how much improvement can be achieved. The authors of the paper demonstrate that if the number of CoT steps grows only logarithmically with the input size (e.g., T(n) = O(log n)), the transformer remains limited to AC0-like behavior. In other words, even with a few intermediate steps, the model still cannot perform complex serial computations because a logarithmic number of CoT steps is insufficient to simulate true sequential reasoning.
However, when the number of CoT steps grows polynomially with the input size (e.g., T(n) = O(n^k)), the transformer’s power increases dramatically. It can now simulate any problem solvable by a P/poly circuit, which includes tasks that require multiple intermediate reasoning steps. In computational complexity theory, the class P/poly refers to problems that can be solved by circuits of polynomial size, meaning the number of logic gates in the circuit grows polynomially with the input size. With polynomially many CoT steps, transformers can handle deeply serial problems such as iterated squaring (repeatedly squaring a number), which requires maintaining and updating a running state across several steps.

Datasets Used in the Paper and How CoT is Applied

To test how well CoT helps transformers solve serial tasks, the authors designed several synthetic datasets that simulate different types of serial computation problems. These include tasks like modular addition, permutation composition, iterated squaring, and the circuit value problem.
In the modular addition task, the model is given a sequence of numbers and must compute their sum modulo a constant, such as 7. This task is a relatively simple one, and transformers without CoT perform well because it is parallelizable. However, even for this task, CoT can help by breaking the problem into intermediate steps, such as calculating partial sums before applying the modulo operation.
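A small sketch of what those intermediate steps look like (the modulus 7 and the helper name sum_mod are illustrative choices, not taken from the paper):

```python
# Modular addition with CoT-style partial results: keep a running sum mod 7
# instead of summing everything first and reducing only at the end.
def sum_mod(numbers: list[int], modulus: int = 7) -> int:
    total = 0
    for x in numbers:
        total = (total + x) % modulus   # each partial sum is an intermediate step
        print(f"partial sum mod {modulus}: {total}")
    return total

print(sum_mod([5, 12, 9, 3]))  # partial sums 5, 3, 5, then final answer 1
```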
For the permutation composition task, the model is required to apply a series of permutations to a sequence of numbers, one step at a time. This task is inherently serial because each permutation depends on the result of the previous one. Without CoT, the transformer would attempt to solve the problem in one step, making it difficult to track intermediate results. With CoT, the model generates intermediate permutations after each step, making it easier to solve the problem accurately.
In the iterated squaring task, the model must repeatedly square a number modulo a prime, which is a deeply serial task. Without CoT, the model would struggle to maintain the intermediate results needed to perform the squaring operation multiple times. CoT allows the model to generate each intermediate squared value, which is then used in the next step, significantly improving the model's ability to handle the task.
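A brief sketch of the computation itself (the prime 97 and the function name are illustrative, not the paper's setup) makes the serial dependence obvious: each squared value feeds the next step.

```python
# Iterated squaring modulo a prime: each value depends on the previous one,
# so the T squarings cannot be collapsed into a single parallel step.
def iterated_square(x: int, steps: int, p: int = 97) -> int:
    for t in range(steps):
        x = (x * x) % p      # intermediate value, analogous to one CoT step
        print(f"step {t + 1}: {x}")
    return x

print(iterated_square(3, 4))  # 3 -> 9 -> 81 -> 62 -> 61 (mod 97)
```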
Finally, in the circuit value problem, the model is tasked with evaluating a Boolean circuit gate-by-gate. This problem is a classic example of a serial computation task because each gate’s output depends on the previous gates' outputs. With CoT, the transformer can evaluate each gate one at a time, using the result of each intermediate step to inform the next, making it possible to solve the problem step-by-step rather than all at once.
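As a final illustration, the toy evaluator below (a sketch using binary AND, OR, and XOR gates only, not the paper's exact formulation) walks a small circuit in topological order, computing one gate per step exactly as a CoT trace would.

```python
# Circuit value problem evaluated gate by gate: each gate reads values that
# were produced by earlier gates, mirroring one CoT step per gate.
def eval_circuit(inputs: dict[str, bool],
                 gates: list[tuple[str, str, str, str]]) -> bool:
    values = dict(inputs)
    ops = {"AND": lambda a, b: a and b,
           "OR":  lambda a, b: a or b,
           "XOR": lambda a, b: a != b}
    for name, op, a, b in gates:             # gates listed in topological order
        values[name] = ops[op](values[a], values[b])
        print(f"{name} = {values[name]}")
    return values[gates[-1][0]]              # output of the final gate

inputs = {"x1": True, "x2": False, "x3": True}
gates = [("g1", "AND", "x1", "x2"),
         ("g2", "OR",  "g1", "x3"),
         ("g3", "XOR", "g2", "x1")]
print(eval_circuit(inputs, gates))  # g1=False, g2=True, g3=False
```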

Conclusion

Chain of Thought is a powerful technique that allows transformers to move beyond their limitations in parallel computation and tackle inherently serial tasks. By breaking problems into intermediate reasoning steps, CoT enables transformers to handle tasks like permutation composition, iterated squaring, and circuit evaluation, which would otherwise be difficult or impossible for them to solve. The theoretical and empirical results in the paper demonstrate that with polynomially many CoT steps, transformers can simulate complex, multi-step reasoning tasks, unlocking new possibilities for solving serial problems that require deep reasoning.


Tags: ML News