
Overview

Extended thinking allows the LLM to engage in deeper reasoning before responding. When enabled, the model thinks through problems step-by-step internally, and you can see the thinking process in real-time via streaming.
Extended thinking is supported by both AnthropicProvider and GeminiProvider.

Enabling Extended Thinking

let config = AgentConfig::new("You are a thoughtful assistant.")
    .with_thinking(16000);  // Budget in tokens
The budget caps how many tokens the model can use for thinking; the model may use fewer than the budget depending on problem complexity.
Problem Type            | Recommended Budget
Moderate complexity     | 8,000-16,000
Complex reasoning       | 16,000-32,000
Very difficult problems | 32,000-64,000
Larger budgets increase API costs. Choose the smallest budget that solves your problem.
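
The guidance above can be encoded as a small helper that starts from the low end of each recommended range. This is an illustrative sketch, not an SDK API; the function name and category labels are hypothetical.

```rust
/// Illustrative helper: pick a starting thinking budget from the
/// recommended ranges above. Not an SDK API; tune for your workload.
fn recommended_budget(problem: &str) -> u32 {
    match problem {
        "moderate" => 8_000,
        "complex" => 16_000,
        "very_difficult" => 32_000,
        // Default to the smallest recommended budget to keep costs down.
        _ => 8_000,
    }
}

fn main() {
    // e.g. AgentConfig::new("...").with_thinking(recommended_budget("complex"))
    println!("{}", recommended_budget("complex"));
}
```

Starting low and raising the budget only when the model's answers fall short keeps API costs in check.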

Processing Thinking Output

When extended thinking is enabled, you receive additional chunk types:
let mut rx = handle.subscribe();
handle.send_input("Solve this complex problem...").await?;

while let Ok(chunk) = rx.recv().await {
    match chunk {
        OutputChunk::ThinkingDelta(text) => {
            // Stream thinking in real-time
            ui.append_thinking(&text);
        }

        OutputChunk::ThinkingComplete(_full_thinking) => {
            // Thinking phase complete
            ui.show_thinking_complete();
        }

        OutputChunk::TextDelta(text) => {
            // Regular response text
            ui.append_response(&text);
        }

        OutputChunk::Done => break,
        _ => {}
    }
}
Thinking blocks are also preserved in conversation history as ContentBlock::Thinking.
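
Because thinking blocks survive in history, you can recover them after the fact. Here is a minimal sketch using a stand-in ContentBlock enum; the variant names follow this page, but the real SDK type may carry more variants and differently shaped fields.

```rust
// Stand-in for the SDK's ContentBlock. Variant names follow the docs,
// but the real type's shape may differ.
enum ContentBlock {
    Thinking(String),
    Text(String),
}

/// Collect all thinking text out of one conversation turn.
fn thinking_text(blocks: &[ContentBlock]) -> String {
    blocks
        .iter()
        .filter_map(|b| match b {
            ContentBlock::Thinking(t) => Some(t.as_str()),
            _ => None,
        })
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let turn = vec![
        ContentBlock::Thinking("First, restate the problem...".into()),
        ContentBlock::Text("The answer is 42.".into()),
    ];
    println!("{}", thinking_text(&turn));
}
```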

Usage Tracking

Thinking tokens are reported separately in the usage metadata:
// After receiving a response
let response = handle.get_last_response().await?;

println!("Input tokens: {}", response.usage.input_tokens);
println!("Output tokens: {}", response.usage.output_tokens);

if let Some(thinking_tokens) = response.usage.thoughts_token_count {
    println!("Thinking tokens: {}", thinking_tokens);
}
Gemini models: Thinking tokens are counted as part of the response. You’re charged for both output tokens and thinking tokens.
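
Since Gemini bills for both, the billed output for a response is the sum of output tokens and thinking tokens (when present). A sketch, with an illustrative function name:

```rust
/// Illustrative: total billed output for a Gemini response is the
/// output tokens plus any reported thinking tokens.
fn billed_output_tokens(output_tokens: u32, thoughts_token_count: Option<u32>) -> u32 {
    output_tokens + thoughts_token_count.unwrap_or(0)
}

fn main() {
    // A response with 120 output tokens and 800 thinking tokens
    // bills as 920 output-side tokens.
    println!("{}", billed_output_tokens(120, Some(800)));
}
```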

Provider-Specific Behavior

AnthropicProvider

  • Uses the thinking budget directly as token count
  • Supports budgets from 1 to 64,000 tokens

GeminiProvider

  • Gemini 3 models (gemini-3-flash, gemini-3-pro):
    • Budget automatically converted to thinking levels (minimal, low, medium, high)
    • Conversion ranges:
      • 0-512 → minimal
      • 513-2048 → low
      • 2049-8192 → medium
      • 8193+ → high
  • Gemini 2.5 models (gemini-2.5-flash, gemini-2.5-pro):
    • Uses numeric budget directly (0-24,576 tokens)
    • Set to 0 to disable thinking
The conversion is automatic based on the model name: you use the same API regardless of provider.
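
The Gemini 3 budget-to-level conversion can be sketched as a plain function over the documented ranges. The thresholds come from the list above; the function name is illustrative, not part of the SDK.

```rust
/// Map a numeric thinking budget to a Gemini 3 thinking level.
/// Thresholds mirror the documented conversion ranges; the name
/// `budget_to_level` is illustrative, not an SDK API.
fn budget_to_level(budget: u32) -> &'static str {
    match budget {
        0..=512 => "minimal",
        513..=2048 => "low",
        2049..=8192 => "medium",
        _ => "high",
    }
}

fn main() {
    // A 16,000-token budget (the example above) maps to "high".
    println!("{}", budget_to_level(16_000));
}
```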

Limitations

  • Non-streaming mode: Interrupts are not detected during the thinking phase when streaming is disabled.
  • Budget is a maximum: The model may use fewer tokens than the budget for simpler problems.
  • Tool calls: Extended thinking applies to LLM responses, not during tool execution.

Next Steps

  • Prompt Caching: Reduce costs with caching
  • Streaming & History: Handle thinking output