A deep dive into the novel Chain-of-Draft (CoD) Prompting that outperforms Chain-of-Thought (CoT) Prompting while reducing LLM inference cost and latency like never before.

Reasoning LLMs are a hot topic in AI research today.

We started all the way from GPT-1 to arrive at advanced reasoners like Grok-3.

This journey has been remarkable, with some really important reasoning approaches discovered along the way.

One of them has been Chain-of-Thought (CoT) Prompting (Few-shot and Zero-shot), leading to much of the LLM reasoning revolution that we see today.

Excitingly, there’s now an even better technique published by researchers from Zoom Communications.

This technique, called Chain-of-Draft (CoD) Prompting, matches or even outperforms CoT Prompting in accuracy while using as little as 7.6% of the reasoning tokens when answering a query.

Comparison of Accuracy and Token Usage when Claude 3.5 Sonnet is prompted using direct answer (Standard), Chain of Thought (CoT), and Chain of Draft (CoD) to solve tasks in different reasoning domains

This is a big win for today’s reasoning LLMs, which are very verbose, require lots of computational time, and come with high latency, a bottleneck in many time-critical real-world applications.

In this story, we take a deep dive into how Chain-of-Draft (CoD) Prompting works and how you can use it to make your LLMs more accurate and token-efficient than ever.

But First, Let’s Talk Prompting

Researchers have constantly discovered new behaviours in LLMs.

The Transformer architecture led us to the Generative Pre-trained Transformer (GPT), and we soon discovered that scaling it up to GPT-2 (1.5 billion parameters) made it act as an unsupervised multi-task learner, performing multiple tasks without supervised fine-tuning on task-specific datasets.

With further scaling to GPT-3 (175 billion parameters), it was found that the model could quickly adapt and perform well on new tasks with only a few examples provided in the input prompt (Few-shot Prompting).

It was then discovered that breaking down problem-solving into intermediate reasoning steps and prompting a large language model (LLM) to generate these steps led to achieving state-of-the-art performance in arithmetic, commonsense, and symbolic reasoning tasks.

This approach is called Chain-of-Thought (CoT) Prompting.

Example of standard and Chain-of-Thought prompting (Image from ArXiv research paper titled ‘Chain-of-Thought Prompting Elicits Reasoning in Large Language Models’)
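To make this concrete, here is a minimal sketch of few-shot CoT prompting using the OpenAI Python SDK. The model name is an illustrative choice, and the worked example is the classic tennis-ball problem from the CoT paper:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# One worked example containing intermediate reasoning steps (the "chain of
# thought"), followed by the question we actually want answered.
cot_prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls.
5 + 6 = 11. The answer is 11.

Q: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12
lollipops. How many lollipops did Jason give to Denny?
A:"""

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{"role": "user", "content": cot_prompt}],
)
# The model imitates the example and reasons step by step before answering.
print(response.choices[0].message.content)
```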

Following CoT, it was soon found that LLMs were Zero-shot reasoners.

Unlike the original CoT prompting approach, they need not be prompted with few-shot reasoning examples to perform better.

Simply adding the phrase “Let’s think step by step” to a prompt could make them reason step-by-step while solving a problem.

This approach is called Zero-shot Chain of Thought Prompting.

Comparison between standard Zero-shot and Few-shot prompting, original CoT prompting (shown as “(b) Few-shot-CoT”) and Zero-shot CoT prompting (Image from ArXiv research paper titled ‘Large Language Models are Zero-Shot Reasoners’)
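A minimal Zero-shot CoT sketch under the same assumptions (OpenAI Python SDK, illustrative model name); the juggler question is the one used in the paper:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

question = ("A juggler can juggle 16 balls. Half of the balls are golf balls, "
            "and half of the golf balls are blue. How many blue golf balls "
            "are there?")

# No worked examples; the trigger phrase alone elicits step-by-step reasoning.
zero_shot_cot_prompt = f"Q: {question}\nA: Let's think step by step."

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{"role": "user", "content": zero_shot_cot_prompt}],
)
print(response.choices[0].message.content)
```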

Researchers then realised that greedily decoding a single chain of reasoning towards the answer was not enough.

Complex reasoning tasks can have multiple reasoning paths that reach a correct answer, and if many sampled paths converge on the same answer, we can be more confident that this final answer is correct.

This led to a new decoding strategy called Self-Consistency, which samples multiple reasoning paths from the model and chooses the most consistent final answer among them.

Greedy Decoding in CoT prompting vs. Self-Consistency (Image from ArXiv research paper titled ‘Self-Consistency Improves Chain of Thought Reasoning in Language Models’)
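Self-Consistency is straightforward to sketch in code. The version below assumes the OpenAI Python SDK, an illustrative model name, and that the prompt instructs the model to end its response with “The answer is <X>.”:

```python
import re
from collections import Counter

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def self_consistent_answer(cot_prompt: str, n_paths: int = 5) -> str:
    """Sample several CoT reasoning paths and majority-vote on the answer."""
    answers = []
    for _ in range(n_paths):
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model choice
            messages=[{"role": "user", "content": cot_prompt}],
            temperature=0.7,  # >0 so that the sampled reasoning paths differ
        )
        text = response.choices[0].message.content
        # Assumes the prompt asks the model to end with "The answer is <X>."
        match = re.search(r"answer is\s*([^\s.]+)", text, re.IGNORECASE)
        if match:
            answers.append(match.group(1))
    # The most frequent (i.e. most consistent) final answer wins the vote.
    return Counter(answers).most_common(1)[0][0]
```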

Prompting Architectures Make Their Way

Following this approach of considering multiple reasoning paths in problem-solving, the Tree-of-Thoughts (ToT) framework was introduced, which explores the solution space using a tree-like thought process.

Tree-of-Thought framework (Image from ArXiv research paper titled ‘Large Language Model Guided Tree-of-Thought’)

It uses language sequences called “Thoughts” as intermediate steps when solving a problem. These are evaluated and explored using search algorithms with lookahead and backtracking when needed.

Comparison of various reasoning approaches (Image from ArXiv research paper titled ‘Tree of Thoughts: Deliberate Problem Solving with Large Language Models’)
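The core search loop can be sketched as below; `propose` and `evaluate` are hypothetical LLM-backed helpers that generate candidate next thoughts and score partial solutions, and the beam-style pruning stands in for backtracking:

```python
def tree_of_thoughts(question, propose, evaluate, breadth=3, depth=3, beam=2):
    """Minimal breadth-first Tree-of-Thoughts sketch.

    propose(question, thoughts)  -> list of candidate next thoughts (LLM call)
    evaluate(question, thoughts) -> score of a partial solution (LLM call)
    """
    frontier = [[]]  # each state is the list of thoughts generated so far
    for _ in range(depth):
        # Lookahead: expand every surviving state with candidate next thoughts.
        candidates = [
            state + [thought]
            for state in frontier
            for thought in propose(question, state)[:breadth]
        ]
        # Keep only the most promising partial solutions; dropping weakly
        # scored branches plays the role of backtracking.
        candidates.sort(key=lambda s: evaluate(question, s), reverse=True)
        frontier = candidates[:beam]
    return frontier[0]  # the most promising complete chain of thoughts
```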

The tree structure was later generalised to a graph, leading to the Graph-of-Thoughts framework, which models the solution space more flexibly.

Graph-of-Thought compared with other reasoning approaches (Image from ArXiv research paper titled ‘Graph of Thoughts: Solving Elaborate Problems with Large Language Models’)

But this is not all!

Prompting is not the only way to help LLMs reason better; there are many other techniques as well.

But What About Latency?

Exploring the reasoning space is a computationally expensive task that increases response latency.

A workaround to reduce latency called Skeleton-of-Thought (SoT) was introduced, which first guides the LLM to generate a skeleton (outline) of the answer.

It then completes the contents of each skeleton point in parallel, using parallel API calls or batched decoding.

Overview of Skeleton-of-Thought (SoT) in comparison to standard decoding (Image from ArXiv research paper titled ‘Skeleton-of-Thought: Prompting LLMs for Efficient Parallel Generation’)
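Here is a minimal two-stage SoT sketch using the async OpenAI Python SDK; the model name and the outline/expansion prompts are illustrative:

```python
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI()  # assumes OPENAI_API_KEY is set in the environment
MODEL = "gpt-4o-mini"   # illustrative model choice

async def ask(prompt: str) -> str:
    response = await client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

async def skeleton_of_thought(question: str) -> str:
    # Stage 1: generate a short skeleton (outline) of the answer.
    skeleton = await ask(
        "Give a concise outline (3-5 numbered points, a few words each) "
        f"for answering this question: {question}"
    )
    points = [line for line in skeleton.splitlines() if line.strip()]
    # Stage 2: expand all skeleton points in parallel, cutting latency.
    expansions = await asyncio.gather(
        *(
            ask(f"Question: {question}\nExpand this outline point into "
                f"one or two sentences: {point}")
            for point in points
        )
    )
    return "\n".join(expansions)

print(asyncio.run(skeleton_of_thought("Why is the sky blue?")))
```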

Reasoning models can also overthink simple problems, generating unnecessary reasoning tokens, leading to high query-to-response time.

Generated tokens on the question — “What is the answer of 2 plus 3?” (Image from ArXiv research paper titled ‘Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs’)

Isn’t it crazy how the QwQ-32B-Preview model reasons to solve this simple problem of adding 2 and 3?

QwQ-32B-Preview overthinking on a simple arithmetic problem (Image from ArXiv research paper titled ‘Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs’)

Researchers have tried to address this issue by limiting the reasoning token budget, but LLMs often fail to adhere to it.

An additional LLM has also been used to dynamically estimate the token budget for different problems based on their complexity before answering them, but this further increases response latency.

Overview of Token-Budget-Aware LLM Reasoning (TALE) with Estimation and Prompting (Image from ArXiv research paper titled ‘Token-Budget-Aware LLM Reasoning’)
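A rough sketch of this two-stage estimate-then-prompt idea follows; the exact prompts and model name are illustrative, not the paper’s own:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
MODEL = "gpt-4o-mini"  # illustrative model choice

def budgeted_answer(question: str) -> str:
    # Stage 1: a separate LLM call estimates a token budget for this
    # question; this extra call is what adds response latency.
    estimate = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content":
            "Estimate how many reasoning tokens are needed to answer the "
            f"following question. Reply with a single integer.\n{question}"}],
    )
    budget = estimate.choices[0].message.content.strip()
    # Stage 2: the estimated budget is injected into the reasoning prompt.
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content":
            f"Let's think step by step and use less than {budget} tokens:\n"
            f"{question}"}],
    )
    return response.choices[0].message.content
```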

Could we combine all of these insights and somehow simplify them to reach a single approach?

Here Comes “Chain-of-Draft” Prompting

Going back to the basics, Chain-of-Thought (CoT) is a really amazing prompting approach for better LLM reasoning.

However, it is verbose, with an LLM producing thousands of reasoning tokens before reaching an answer.

This is very different from how humans think and reason.

Instead of reasoning in extensively verbose language, we usually jot down the most essential intermediate points (drafts) when thinking.

This is what Chain-of-Draft (CoD) Prompting is inspired by.

It simply asks the model to think step-by-step and to limit each reasoning step to five words at most.

To ensure the model understands this, the researchers manually wrote few-shot examples of such chains of drafts and included them in the prompt.

Surprisingly, this five-word limit is not enforced in any way; the model is simply given it as a general guideline in the prompt.

This contrasts with standard few-shot prompting, where query-response pairs are given in the prompt and the model is asked to return the final answer directly, without any reasoning or explanation.

This is also different from Chain-of-Thought prompting, where intermediate reasoning steps are given in the query-response pairs in the prompt, and the model is asked to answer the question.

The difference between these approaches is best appreciated with a concrete example, where an LLM is asked to solve a simple arithmetic problem.
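Here are the three styles side by side on the lollipop problem, with system prompts closely following those in the CoD paper; the responses shown in comments are typical, abridged outputs:

```python
QUESTION = ("Jason had 20 lollipops. He gave Denny some lollipops. Now Jason "
            "has 12 lollipops. How many lollipops did Jason give to Denny?")

# Standard few-shot prompting: the final answer only.
STANDARD_PROMPT = ("Answer the question directly. Do not return any "
                   "preamble, explanation, or reasoning.")
# Typical response: "8"

# Chain-of-Thought: full, verbose intermediate reasoning.
COT_PROMPT = ("Think step by step to answer the following question. Return "
              "the answer at the end of the response after a separator ####.")
# Typical response: several sentences of reasoning ("Jason started with 20
# lollipops... 20 - 12 = 8..."), then "#### 8"

# Chain-of-Draft: minimal drafts, at most five words per step.
COD_PROMPT = ("Think step by step, but only keep a minimum draft for each "
              "thinking step, with 5 words at most. Return the answer at the "
              "end of the response after a separator ####.")
# Typical response: "20 - x = 12; x = 20 - 12 = 8  #### 8"
```

CoD’s draft keeps only the equation that matters, so the answer arrives in a handful of tokens instead of several paragraphs.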
