
Overview

Extended thinking allows the LLM to engage in deeper reasoning before responding. When enabled, the model thinks through problems step-by-step internally, and you can see the thinking process in real-time via streaming.
Extended thinking is supported by both AnthropicProvider and GeminiProvider.

Enabling Extended Thinking

let config = AgentConfig::new("You are a thoughtful assistant.")
    .with_thinking(16000);  // Budget in tokens
The budget caps how many tokens the model can use for thinking; the model may use fewer than the budget depending on problem complexity.
Problem Type            | Recommended Budget
Moderate complexity     | 8,000-16,000
Complex reasoning       | 16,000-32,000
Very difficult problems | 32,000-64,000
Larger budgets increase API costs. Choose the smallest budget that solves your problem.
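
The guidance above can be encoded as a small helper that starts from the low end of each recommended range. This is an illustrative sketch, not an SDK API; the function name and category labels are hypothetical.

```rust
/// Illustrative helper: pick a starting thinking budget from the
/// recommended ranges above. Not an SDK API; tune for your workload.
fn recommended_budget(problem: &str) -> u32 {
    match problem {
        "moderate" => 8_000,
        "complex" => 16_000,
        "very_difficult" => 32_000,
        // Default to the smallest recommended budget to keep costs down.
        _ => 8_000,
    }
}

fn main() {
    // e.g. AgentConfig::new("...").with_thinking(recommended_budget("complex"))
    println!("{}", recommended_budget("complex"));
}
```

Starting low and raising the budget only when the model's answers fall short keeps API costs in check.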

Processing Thinking Output

When extended thinking is enabled, you receive additional chunk types:
let mut rx = handle.subscribe();
handle.send_input("Solve this complex problem...").await?;

while let Ok(chunk) = rx.recv().await {
    match chunk {
        OutputChunk::ThinkingDelta(text) => {
            // Stream thinking in real-time
            ui.append_thinking(&text);
        }

        OutputChunk::ThinkingComplete(_full_thinking) => {
            // Thinking phase complete
            ui.show_thinking_complete();
        }

        OutputChunk::TextDelta(text) => {
            // Regular response text
            ui.append_response(&text);
        }

        OutputChunk::Done => break,
        _ => {}
    }
}
Thinking blocks are also preserved in conversation history as ContentBlock::Thinking.
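
Because thinking blocks survive in history, you can recover them after the fact. Here is a minimal sketch using a stand-in ContentBlock enum; the variant names follow this page, but the real SDK type may carry more variants and differently shaped fields.

```rust
// Stand-in for the SDK's ContentBlock. Variant names follow the docs,
// but the real type's shape may differ.
enum ContentBlock {
    Thinking(String),
    Text(String),
}

/// Collect all thinking text out of one conversation turn.
fn thinking_text(blocks: &[ContentBlock]) -> String {
    blocks
        .iter()
        .filter_map(|b| match b {
            ContentBlock::Thinking(t) => Some(t.as_str()),
            _ => None,
        })
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let turn = vec![
        ContentBlock::Thinking("First, restate the problem...".into()),
        ContentBlock::Text("The answer is 42.".into()),
    ];
    println!("{}", thinking_text(&turn));
}
```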

Usage Tracking

Thinking tokens are reported separately in the usage metadata:
// After receiving a response
let response = handle.get_last_response().await?;

println!("Input tokens: {}", response.usage.input_tokens);
println!("Output tokens: {}", response.usage.output_tokens);

if let Some(thinking_tokens) = response.usage.thoughts_token_count {
    println!("Thinking tokens: {}", thinking_tokens);
}
Gemini models: Thinking tokens are counted as part of the response. You’re charged for both output tokens and thinking tokens.
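
Since Gemini bills for both, the billed output for a response is the sum of output tokens and thinking tokens (when present). A sketch, with an illustrative function name:

```rust
/// Illustrative: total billed output for a Gemini response is the
/// output tokens plus any reported thinking tokens.
fn billed_output_tokens(output_tokens: u32, thoughts_token_count: Option<u32>) -> u32 {
    output_tokens + thoughts_token_count.unwrap_or(0)
}

fn main() {
    // A response with 120 output tokens and 800 thinking tokens
    // bills as 920 output-side tokens.
    println!("{}", billed_output_tokens(120, Some(800)));
}
```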

Provider-Specific Behavior

AnthropicProvider

  • Uses the thinking budget directly as token count
  • Supports budgets from 1 to 64,000 tokens

GeminiProvider

  • Gemini 3 models (gemini-3-flash, gemini-3-pro):
    • Budget automatically converted to thinking levels (minimal, low, medium, high)
    • Conversion ranges:
      • 0-512 → minimal
      • 513-2048 → low
      • 2049-8192 → medium
      • 8193+ → high
  • Gemini 2.5 models (gemini-2.5-flash, gemini-2.5-pro):
    • Uses numeric budget directly (0-24,576 tokens)
    • Set to 0 to disable thinking
The conversion is automatic based on the model name: you use the same API regardless of provider.
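
The Gemini 3 budget-to-level conversion can be sketched as a plain function over the documented ranges. The thresholds come from the list above; the function name is illustrative, not part of the SDK.

```rust
/// Map a numeric thinking budget to a Gemini 3 thinking level.
/// Thresholds mirror the documented conversion ranges; the name
/// `budget_to_level` is illustrative, not an SDK API.
fn budget_to_level(budget: u32) -> &'static str {
    match budget {
        0..=512 => "minimal",
        513..=2048 => "low",
        2049..=8192 => "medium",
        _ => "high",
    }
}

fn main() {
    // A 16,000-token budget (the example above) maps to "high".
    println!("{}", budget_to_level(16_000));
}
```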

Limitations

  • Non-streaming mode: Interrupts are not detected during the thinking phase when streaming is disabled.
  • Budget is a maximum: The model may use fewer tokens than the budget for simpler problems.
  • Tool calls: Extended thinking applies to LLM responses, not during tool execution.

Next Steps

  • Prompt Caching: Reduce costs with caching
  • Streaming & History: Handle thinking output