veniceai · sabrinaaquino · Jun 26, 2026 · Jun 29, 2026
diff --git a/api-reference/api-spec.mdx b/api-reference/api-spec.mdx
@@ -1,34 +1,15 @@
 ---
 title: Introduction
-description: Venice API reference covering authentication, debugging, OpenAI compatibility, response headers, error handling, and the full list of supported endpoints.
+description: Reference documentation for the Venice API
 "og:title": "API Reference | Venice API Docs"
 "og:description": "Complete API reference including authentication, debugging, OpenAI compatibility, and response headers"
 ---
 
-The Venice API offers HTTP-based REST and streaming interfaces for building AI applications with uncensored models and private inference. You can create with text generation, image creation, embeddings, and more, all without restrictive content policies. Integration examples and SDKs are available in the [documentation](/overview/getting-started). Our API reference is also available as a [OpenAPI YAML spec.](https://api.venice.ai/doc/api/swagger.yaml)
+The Venice API is a REST API at `https://api.venice.ai/api/v1` for private inference across uncensored, frontier models. It implements the OpenAI specification, so existing OpenAI clients and SDKs work once you change the base URL, and it adds Venice-native features such as server-side web search, characters, and wallet payments. The full spec is also available as an [OpenAPI YAML file](https://api.venice.ai/doc/api/swagger.yaml).
 
-## Authentication
-
-The Venice API uses API keys for authentication. Create and manage your API keys in your [API settings](https://venice.ai/settings/api).
-
-
-All API requests require HTTP Bearer authentication:
+## Quickstart
 
-```
-Authorization: Bearer VENICE_API_KEY
-```
-
-<Note>
-Your API key is a secret. Do not share it or expose it in any client-side code.
-</Note>
-
-## OpenAI Compatibility
-
-Venice's API implements the OpenAI API specification, ensuring compatibility with existing OpenAI clients and tools. This allows you to integrate with Venice using the familiar OpenAI interface while accessing Venice's unique features and uncensored models.
-
-### Setup
-
-Configure your client to use Venice's base URL (`https://api.venice.ai/api/v1`) and make your first request:
+Point any OpenAI-compatible client at Venice's base URL (`https://api.venice.ai/api/v1`). Create and manage keys in your [API settings](https://venice.ai/settings/api).
 
 <CodeGroup>
 ```bash curl
@@ -75,7 +56,28 @@
 ```
 </CodeGroup>
 
-## Venice-Specific Features
+## Authentication
+
+All API requests require HTTP Bearer authentication:
+
+```
+Authorization: Bearer VENICE_API_KEY
+```
+
+<Note>
+Your API key is a secret. Do not share it or expose it in any client-side code.
+</Note>
+
+## Differences from OpenAI
+
+Venice is OpenAI-compatible. The main additions and differences:
+
+- **`venice_parameters`** — Venice-only request options (web search, scraping, character personas, thinking controls). [Reference below](#venice-parameters).
+- **System prompts** — Your system prompt is appended to Venice defaults tuned for natural, uncensored output; disable the defaults with `include_venice_system_prompt: false`. [Details below](#system-prompts).
+- **Models** — Use Venice model IDs directly rather than OpenAI mappings. [Browse models](/models/overview).
+- **Response headers** — Balance, rate-limit, model, and content-safety metadata on every response. [Reference below](#response-headers).
+- **Private inference** — TEE-backed and end-to-end-encrypted model options. [Privacy models](/guides/features/tee-e2ee-models).
+- **Payments** — Credits, a daily DIEM allowance, or per-request USDC via x402. [x402 guide](/guides/integrations/x402-venice-api).
 
 ### System Prompts
 
@@ -99,7 +101,7 @@
      {"role": "system", "content": "Your custom system prompt"},
      {"role": "user", "content": "Why is the sky blue?"}
    ],
    "venice_parameters": {
      "include_venice_system_prompt": false
    }
  }'
@@ -151,7 +153,7 @@
 | `disable_thinking` | boolean | On supported reasoning models, disable thinking and strip the `<think></think>` blocks from the response | `false` |
 | `enable_web_search` | string | Enable web search for this request (`off`, `on`, `auto` - auto enables based on model's discretion)<br/>Additional usage-based pricing applies, see [pricing](/overview/pricing#web-search-and-scraping). | `off` |
 | `enable_web_scraping` | boolean | Enable web scraping of up to 5 URLs detected in the user message. Scraped content augments responses and bypasses web search. Only successfully scraped URLs are billed.<br/>Additional usage-based pricing applies, see [pricing](/overview/pricing#web-search-and-scraping). | `false` |
 | `enable_x_search` | boolean | Enable xAI's native search (web + X/Twitter) for supported Grok models (e.g., `grok-4-20-beta`). Provides higher quality search results by using xAI's search infrastructure. When enabled, Venice's standard web search is bypassed.<br/>Additional usage-based pricing applies, see [pricing](/overview/pricing#web-search-and-scraping). | `false` |
 | `enable_web_citations` | boolean | When web search is enabled, request that the LLM cite its sources using `[REF]0[/REF]` format | `false` |
 | `include_search_results_in_stream` | boolean | Experimental: Include search results in the stream as the first emitted chunk | `false` |
 | `return_search_results_as_documents` | boolean | Surface search results in an OpenAI-compatible tool call named `venice_web_search_documents` for LangChain integration | `false` |
@@ -171,14 +173,12 @@
 
 See [Prompt Caching](/guides/features/prompt-caching) for details on how caching works, billing, and best practices.
 
-## Response Headers Reference
+## Response Headers
 
-All Venice API responses include HTTP headers that provide metadata about the request, rate limits, model information, and account balance. In addition to error codes returned from API responses, you can inspect these headers to get the unique ID of a particular API request, monitor rate limiting, and track your account balance.
+All Venice API responses include HTTP headers with request, rate-limit, model, and account-balance metadata. In addition to error codes returned from API responses, you can inspect these headers to get the unique ID of a particular request, monitor rate limiting, and track your account balance.
 
 Venice recommends logging request IDs (`CF-RAY` header) in production deployments for more efficient troubleshooting with our support team, should the need arise.
 
-The table below provides a comprehensive reference of all headers you may encounter:
-
 | Header | Type | Purpose | When Returned |
 |--------|------|---------|---------------|
 | **Standard HTTP Headers** ||||
@@ -190,7 +190,7 @@
 | `CF-RAY` | string | Unique identifier for this API request, used for troubleshooting and support requests | Always |
 | `x-venice-version` | string | Current version/revision of the Venice API service (e.g., `20250828.222653`) | Always |
 | `x-venice-timestamp` | string | Server timestamp when the request was processed (ISO 8601 format) | When timestamp tracking is enabled |
 | `x-venice-host-name` | string | Hostname of the server that processed the request | Error responses and debugging scenarios |
 | **Model Information** ||||
 | `x-venice-model-id` | string | Unique identifier of the AI model used for the request (e.g., `venice-01-lite`) | Inference endpoints using AI models |
 | `x-venice-model-name` | string | Friendly/display name of the AI model used (e.g., `Venice Lite`) | Inference endpoints using AI models |
@@ -219,7 +219,7 @@
 | `x-venice-is-adult-model-content-violation` | string | Indicates if content violates adult model content policies (`true`/`false`) | Image generation endpoints |
 | `x-venice-contains-minor` | string | Indicates if image contains minors (`true`/`false`) | Image analysis endpoints with age detection |
 | **Client Information** ||||
 | `x-venice-middleface-version` | string | Version of the Venice middleface client | Requests from Venice middleface clients |
 | `x-venice-mobile-version` | string | Version of the Venice mobile app client | Requests from mobile applications |
 | `x-venice-request-timestamp-ms` | number | Client-provided request timestamp in milliseconds | When client provides timestamp in request |
 | `x-venice-control-instance` | string | Control instance identifier for debugging | Image generation endpoints for debugging |
@@ -227,16 +227,14 @@
 | `x-auth-refreshed` | string | Indicates authentication token was refreshed during request (`true`/`false`) | When authentication tokens are auto-refreshed |
 | `x-retry-count` | number | Number of retry attempts for the request | When request retries occur |
 
-### Important Notes
+<Accordion title="Notes and an example of accessing headers">
 
 - **Header Name Case**: HTTP headers are case-insensitive, but Venice uses lowercase with hyphens for consistency
 - **String Values**: Boolean values in headers are returned as strings (`"true"` or `"false"`)
 - **Numeric Values**: Large numbers and balance values may be returned as strings to prevent precision loss
 - **Optional Headers**: Not all headers are returned in every response; presence depends on the endpoint and request context
 - **Compression**: Use `Accept-Encoding: gzip, br` in requests to receive compressed responses where supported
 
-### Example: Accessing Response Headers
-
 ```javascript
 // After making an API request, access headers from the response object
 const requestId = response.headers.get('CF-RAY');
@@ -250,37 +248,29 @@
   console.warn(`Model Deprecation: ${deprecationWarning}`);
 }
 ```
+</Accordion>
 
 ## Best Practices
 
 1. **Rate Limiting**: Monitor `x-ratelimit-remaining-requests` and `x-ratelimit-remaining-tokens` headers and implement exponential backoff
 2. **Balance Monitoring**: Track `x-venice-balance-usd` and `x-venice-balance-diem` headers to avoid service interruptions
 3. **System Prompts**: Test with and without Venice's system prompts to find the best fit for your use case
 4. **API Keys**: Keep your API keys secure and rotate them regularly
 5. **Request Logging**: Log `CF-RAY` header values for troubleshooting with support
 6. **Model Deprecation**: Check for `x-venice-model-deprecation-warning` headers when using models
 
-## Differences from OpenAI's API
-
-While Venice maintains high compatibility with the OpenAI API specification, there are some key differences:
-
-1. **venice_parameters**: Additional configurations like `enable_web_search`, `character_slug`, and `strip_thinking_response` for extended functionality
-2. **System Prompts**: Venice appends your system prompts to defaults that optimize for uncensored responses (disable with `include_venice_system_prompt: false`)
-3. **Model Ecosystem**: Venice offers its own [model lineup](/overview/models) including uncensored and reasoning models - use Venice model IDs rather than OpenAI mappings
-4. **Response Headers**: Unique headers for balance tracking (`x-venice-balance-usd`, `x-venice-balance-diem`), model deprecation warnings, and content safety flags
-5. **Content Policies**: More permissive policies with dedicated uncensored models and optional content filtering
-
 ## API Stability
 
 Venice maintains backward compatibility for v1 endpoints and parameters. For model lifecycle policy, deprecation notices, and migration guidance, see [Deprecations](/overview/deprecations).
 
-## OpenAPI Specification & Raw Data
+## Next Steps
 
-For programmatic access to Venice API docs and data — including use with RAG (Retrieval-Augmented Generation) — the following resources are available:
+- [Quickstart guide](/overview/getting-started) — from API key to a working integration
+- [Endpoints](/api-reference/endpoint/chat/completions) — full reference with an interactive playground
+- [Models](/models/overview) — the full model catalog with pricing and capabilities
+- [Rate limits](/api-reference/rate-limiting) — per-model request and token limits
+- [OpenAPI spec (YAML)](https://api.venice.ai/doc/api/swagger.yaml) — the complete specification for codegen and RAG
 
-* [OpenAPI Spec (YAML)](https://api.venice.ai/doc/api/swagger.yaml) — the full API specification in YAML format
-* [API Docs Source](https://github.com/veniceai/api-docs/archive/refs/heads/main.zip) — all documentation pages (`.mdx` format) as a downloadable archive
-
----
+Questions or feedback? Join us on [Discord](https://discord.gg/askvenice).
 
 <sub>Request fields not listed in this documentation may be passed through but are not validated or guaranteed to work.</sub>
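
The header notes in `api-spec.mdx` say that booleans arrive as the strings `"true"` or `"false"`, that balances may be strings to avoid precision loss, and that not every header is present on every response. A minimal sketch of a reader that respects those three points might look like the following. The helper name `readVeniceHeaders` and the shape of the returned object are hypothetical; only the header names come from the page. It assumes a Fetch-style object whose `get(name)` returns `null` for a missing header.

```javascript
// Hypothetical helper (not part of any Venice SDK): normalizes a few
// response headers named in the docs. `headers` is any object with a
// Fetch-style get(name) method that returns null when a header is absent.
function readVeniceHeaders(headers) {
  // Boolean headers are documented as the strings "true" / "false".
  const toBool = (v) => (v === null ? null : v.toLowerCase() === 'true');
  // Remaining-request and remaining-token counts are treated as integers.
  const toInt = (v) => (v === null ? null : Number.parseInt(v, 10));

  return {
    requestId: headers.get('CF-RAY'),
    remainingRequests: toInt(headers.get('x-ratelimit-remaining-requests')),
    remainingTokens: toInt(headers.get('x-ratelimit-remaining-tokens')),
    // Balances stay as strings so no precision is lost in conversion.
    balanceUsd: headers.get('x-venice-balance-usd'),
    balanceDiem: headers.get('x-venice-balance-diem'),
    deprecationWarning: headers.get('x-venice-model-deprecation-warning'),
    adultContentViolation: toBool(
      headers.get('x-venice-is-adult-model-content-violation')
    ),
  };
}
```

With the Fetch API this would be called as `readVeniceHeaders(response.headers)`. Absent headers come back as `null` rather than `false` or `0`, so callers can tell "not returned" apart from a real value.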
diff --git a/api-reference/rate-limiting.mdx b/api-reference/rate-limiting.mdx
@@ -1,10 +1,18 @@
 ---
 title: "Rate Limits"
-description: "Venice API rate limits — per-tier request and token quotas, model-specific limits, headers exposing remaining capacity, and how to handle 429 responses."
+description: "Request and token rate limits for the Venice API."
 "og:title": "Rate Limits | Venice API Docs"
 ---
 
-Rate limits vary by model and tier. The default limits below are a useful reference, but the `/api_keys/rate_limits` API endpoint is the canonical way to fetch your current limits. You can check your exact limits anytime:
+The limits on this page apply to the **standard paid tier** — any funded Venice account using the API lands here. There is no separate lower API tier; [partners](#partner-tier) sit above this tier with higher limits.
+
+How limits are applied:
+
+- **Per model.** Each model resolves to a model-specific override if one exists, otherwise a default based on the model's **size class** and **type**.
+- **Text models enforce both limits.** A requests-per-minute (RPM) *and* a tokens-per-minute (TPM) limit apply — whichever you hit first returns a `429`.
+- **Video and music models are not rate-limited.** They're priced by usage/credits instead.
+
+The default limits below are a useful reference, but the `/api_keys/rate_limits` endpoint is the canonical way to fetch your current limits:
 
 <CardGroup cols={2}>
   <Card title="View Your Limits" icon="gauge-high" href="/api-reference/endpoint/api_keys/rate_limits?playground=open">
@@ -22,42 +30,53 @@
 
 ## Default Limits
 
-### Text Models
+### Text & Embedding Models
 
-Text models are grouped into tiers based on size. Each model card on the [Models page](/models/text) displays its tier badge.
+Text, reasoning, and embedding models are grouped into size classes. Each model card on the [Models page](/models/text) displays its size badge.
 
-| Tier | Requests/min | Tokens/min |
-|:-----|-------------:|-----------:|
-| XS | 500 | 1,000,000 |
-| S | 75 | 750,000 |
-| M | 50 | 750,000 |
-| L | 20 | 500,000 |
+| Size class | Requests/min | Tokens/min |
+|:-----------|-------------:|-----------:|
+| X-Small | 500 | 5,000,000 |
+| Small | 150 | 3,000,000 |
+| Medium | 100 | 2,000,000 |
+| Large | 100 | 2,000,000 |
 
-<Accordion title="Which models are in each tier?">
+Some upstream providers run on their own classes instead of the size-class defaults:
 
-**XS** `qwen3-4b` `llama-3.2-3b`
+| Provider class | Requests/min | Tokens/min |
+|:---------------|-------------:|-----------:|
+| DeepInfra | 150 | 10,000,000 |
+| Anthropic (direct) | 500 | 5,000,000 |
+| xAI (direct) | 500 | 10,000,000 |
+| OpenRouter | 1,000 | None |
+| Parasail | 1,000 | None |
 
-**S** `mistral-31-24b` `venice-uncensored`
+### Image Models
 
-**M** `zai-org-glm-5` `qwen3-next-80b` `google-gemma-3-27b-it`
+Covers image generation, upscaling, and inpainting.
 
-**L** `qwen3-235b-a22b-instruct-2507` `qwen3-235b-a22b-thinking-2507` `deepseek-ai-DeepSeek-R1` `grok-41-fast` `kimi-k2-thinking` `gemini-3-pro-preview` `hermes-3-llama-3.1-405b` `qwen3-coder-480b-a35b-instruct` `zai-org-glm-4.7` `openai-gpt-oss-120b`
+| Class | Requests/min |
+|:------|-------------:|
+| Default | 20 |
+| Bytedance | 50 |
+| Fal | 50 |
+| xAI grok-imagine | 120 |
+| xAI grok-imagine (pro) | 20 |
 
-</Accordion>
-
-### Other Models
+### Audio Models
 
 | Type | Requests/min |
 |:-----|-------------:|
-| Image | 20 |
-| Audio | 60 |
-| Embedding | 500 |
-| Video (queue) | 40 |
-| Video (retrieve) | 120 |
+| Text-to-speech (TTS) | 60 |
+| Speech-to-text (ASR) | 60 |
+
+### Video & Music Models
+
+Not rate-limited — these are billed by usage/credits.
 
 ## Handling Errors
 
 Failed requests (500, 503, 429) should be retried with exponential backoff.

 For 429 errors specifically, check the `x-ratelimit-reset-requests` header for the exact Unix timestamp when you can retry. Most HTTP libraries have built-in retry mechanisms that handle this automatically.

@@ -84,21 +103,6 @@
 
 ## Partner Tier
 
-Partners get significantly higher rate limits:
-
-| Tier | Requests/min | Tokens/min |
-|:-----|-------------:|-----------:|
-| XS | 500 | 2,000,000 |
-| S | 150 | 1,500,000 |
-| M | 100 | 1,500,000 |
-| L | 60 | 1,000,000 |
-
-| Type | Requests/min |
-|:-----|-------------:|
-| Image | 60 |
-| Audio | 120 |
-| Embedding | 500 |
+Partners (`partner-tier-1`) sit above the standard tier with significantly higher limits, tuned to their specific usage.
 
 If you're consistently hitting your rate limits and your usage patterns show **sustained demand over time**, reach out to discuss partner access: [api@venice.ai](mailto:api@venice.ai).
-
-Partner tier limits can be adjusted based on your specific needs.
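
The "Handling Errors" section of `rate-limiting.mdx` recommends retrying 500, 503, and 429 responses with exponential backoff, and reading `x-ratelimit-reset-requests` on a 429. A minimal sketch of that policy follows. The function names, the 500 ms base delay, the 30 s cap, and the attempt limit are all illustrative choices, not Venice recommendations. The page calls the header value a Unix timestamp without stating the unit; this sketch assumes seconds, which should be verified against a real response.

```javascript
// Statuses the docs list as retryable.
const RETRYABLE_STATUSES = new Set([429, 500, 503]);

// Hypothetical helper: returns how long to wait (ms) before retrying,
// or null when the status should not be retried.
function retryDelayMs(status, headers, attempt, nowMs = Date.now()) {
  if (!RETRYABLE_STATUSES.has(status)) return null;

  if (status === 429) {
    // ASSUMPTION: the reset header is a Unix timestamp in seconds.
    const resetSeconds = Number(headers.get('x-ratelimit-reset-requests'));
    const untilReset = resetSeconds * 1000 - nowMs;
    if (Number.isFinite(untilReset) && untilReset > 0) return untilReset;
  }

  // Exponential backoff: 500 ms, 1 s, 2 s, ... capped at 30 s.
  return Math.min(30000, 500 * 2 ** attempt);
}

// Hypothetical wrapper around the global fetch (Node 18+ or browsers).
async function fetchWithRetry(url, options, maxAttempts = 5) {
  for (let attempt = 0; ; attempt++) {
    const response = await fetch(url, options);
    const delay = retryDelayMs(response.status, response.headers, attempt);
    // Return on success, on a non-retryable status, or when out of attempts.
    if (delay === null || attempt + 1 >= maxAttempts) return response;
    await new Promise((resolve) => setTimeout(resolve, delay));
  }
}
```

Keeping the delay calculation in a pure function makes the policy easy to unit test without network calls. A production client would likely add random jitter to the backoff so that many clients do not retry in lockstep.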