Commit c4d6d1d

Merge pull request #107 from owndev/copilot/add-thinking-levels-budgets

Add thinking levels and budgets support for Gemini models

2 parents: d2d3c13 + 3e30764

File tree

3 files changed: +303 −18 lines

docs/google-gemini-integration.md

Lines changed: 136 additions & 1 deletion
@@ -26,7 +26,10 @@ This integration enables **Open WebUI** to interact with **Google Gemini** model

> Streaming is automatically disabled for image generation models to prevent chunk size issues.

- **Thinking Support**
  Supports reasoning and thinking steps, allowing models to break down complex tasks. Includes configurable thinking levels for Gemini 3 Pro ("low"/"high") and thinking budgets (0-32768 tokens) for other thinking-capable models.

  > [!NOTE]
  > **Thinking Levels vs Thinking Budgets**: Gemini 3 Pro models use `thinking_level` ("low" or "high"), while other models like Gemini 2.5 use `thinking_budget` (token count). See the [Gemini Thinking Documentation](https://ai.google.dev/gemini-api/docs/thinking) for details.

- **Multimodal Input Support**
  Accepts both text and image data for more expressive interactions with configurable image optimization.
@@ -123,6 +126,20 @@ GOOGLE_IMAGE_UPLOAD_FALLBACK=true

# Default: true
GOOGLE_INCLUDE_THOUGHTS=true

# Thinking budget for Gemini 2.5 models (not used for Gemini 3 models)
# -1 = dynamic (model decides), 0 = disabled, 1-32768 = fixed token limit
# Default: -1 (dynamic)
# Note: Gemini 3 models use GOOGLE_THINKING_LEVEL instead
GOOGLE_THINKING_BUDGET=-1

# Thinking level for Gemini 3 models only
# Valid values: "low", "high", or empty string for model default
# - "low": Minimizes latency and cost, suitable for simple tasks
# - "high": Maximizes reasoning depth, ideal for complex problem-solving
# Default: "" (empty, uses model default)
# Note: This setting is ignored for non-Gemini 3 models
GOOGLE_THINKING_LEVEL=""

# Enable streaming responses globally
# Default: true
GOOGLE_STREAMING_ENABLED=true
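As an illustration only (the helper name and return shape below are assumptions, not part of the pipeline's actual code), the two thinking variables above could be resolved into per-model settings along these lines:

```python
import os


def resolve_thinking_settings(model: str) -> dict:
    """Hypothetical sketch: pick the right thinking setting per model family.

    Mirrors the rules documented above: Gemini 3 models use
    GOOGLE_THINKING_LEVEL, other thinking-capable models use
    GOOGLE_THINKING_BUDGET.
    """
    if model.startswith("gemini-3"):
        level = os.getenv("GOOGLE_THINKING_LEVEL", "")
        # An empty string means "use the model default", so send nothing.
        return {"thinking_level": level} if level in ("low", "high") else {}
    budget = int(os.getenv("GOOGLE_THINKING_BUDGET", "-1"))
    # Clamp to the documented range: -1 (dynamic), 0 (off), 1-32768 (fixed).
    budget = max(-1, min(budget, 32768))
    return {"thinking_budget": budget}
```

Note how the empty-string default for `GOOGLE_THINKING_LEVEL` falls through to an empty config, leaving the choice to the model.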
@@ -227,3 +244,121 @@ To use this filter, ensure it's enabled in your Open WebUI configuration. Then,

## Native tool calling support

Native tool calling is enabled/disabled via the standard 'Function calling' Open WebUI toggle.

## Thinking Configuration

The Google Gemini pipeline supports advanced thinking configuration to control how much reasoning and computation is applied by the model.

> [!NOTE]
> For detailed information about thinking capabilities, see the [Google Gemini Thinking Documentation](https://ai.google.dev/gemini-api/docs/thinking).

### Thinking Levels (Gemini 3 models)

Gemini 3 models support the `thinking_level` parameter, which controls the depth of reasoning:

- **`"low"`**: Minimizes latency and cost; suitable for simple tasks, chat, or high-throughput APIs.
- **`"high"`**: Maximizes reasoning depth; ideal for complex problem-solving, code analysis, and agentic workflows.

> [!NOTE]
> Gemini 3 models use `thinking_level` and do **not** use `thinking_budget`. The thinking budget setting is ignored for Gemini 3 models.

Set via environment variable:
```bash
# Use low thinking level for faster responses
GOOGLE_THINKING_LEVEL="low"

# Use high thinking level for complex reasoning
GOOGLE_THINKING_LEVEL="high"
```

**Example API Usage:**

```python
from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-3-pro-preview",
    contents="Provide a list of 3 famous physicists and their key contributions",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_level="low")
    ),
)

print(response.text)
```

### Thinking Budget (Gemini 2.5 models)

For Gemini 2.5 models, you can control the maximum number of tokens used during internal reasoning via `thinking_budget`:

- **`0`**: Disables thinking entirely for the fastest responses
- **`-1`**: Dynamic thinking (model decides based on query complexity); the default
- **`1-32768`**: Fixed token limit for reasoning

> [!NOTE]
> Gemini 3 models do **not** use `thinking_budget`. Use `GOOGLE_THINKING_LEVEL` for Gemini 3 models instead.

Set via environment variable:

```bash
# Disable thinking for fastest responses
GOOGLE_THINKING_BUDGET=0

# Use dynamic thinking (model decides)
GOOGLE_THINKING_BUDGET=-1

# Set a specific token budget for reasoning
GOOGLE_THINKING_BUDGET=1024
```
**Example API Usage:**

```python
from google import genai
from google.genai import types

client = genai.Client()

# Example with a specific thinking budget
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="Provide a list of 3 famous physicists and their key contributions",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=1024)
    ),
)
print(response.text)

# Turn off thinking entirely
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="What is 2+2?",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=0)
    ),
)
print(response.text)

# Use dynamic thinking (model decides based on query complexity)
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="Explain quantum computing",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=-1)
    ),
)
print(response.text)
```

### Model Compatibility

| Model | thinking_level | thinking_budget |
|-------|----------------|-----------------|
| gemini-3-* | ✅ Supported ("low", "high") | ❌ Not used |
| gemini-2.5-* | ❌ Not used | ✅ Supported (0-32768) |
| gemini-2.5-flash-image-* | ❌ Not supported | ❌ Not supported |
| Other models | ❌ Not used | ✅ May be supported |

filters/vertex_ai_search_tool.py

Lines changed: 0 additions & 1 deletion (trailing blank line removed at end of file)

@@ -41,4 +41,3 @@ def inlet(self, body: dict) -> dict:
        "vertex_ai_search enabled but vertex_rag_store not provided in params or VERTEX_AI_RAG_STORE env var"
    )
    return body
0 commit comments