231 changes: 231 additions & 0 deletions docs/native-attachments-design.md
@@ -0,0 +1,231 @@
# Design: Native LLM Attachments over the Private (OHTTP) Path

## Status

Proposal. Spans three repos: `chat-app` (browser), `chat-api` (relay), `tee-gateway`
(enclave). The bulk of the change lands in `tee-gateway`.

## Motivation

Today attachments are handled by **server-side parsing in `chat-api`**:

- `chat-api/src/core/attachments.py` downloads each attachment and runs PyMuPDF /
python-docx to extract **plain text**, then injects that text into the prompt.
- Images are classified by content-type and passed through as URLs.

This is the wrong layer to solve the problem:

1. **It throws away everything the models do natively.** Modern Claude / GPT /
Gemini ingest PDFs and images directly — layout, tables, figures, charts,
handwriting, embedded images. Flattening a PDF to `page.get_text()` loses all
of that and feeds the model a worse input than it could handle itself.
2. **It only works on the non-private path.** The parsing in `attachments.py` is
invoked exclusively from the regular `POST /api/v1/chat` handler. On the
**OHTTP path**, `chat-api` is a dumb relay — it forwards opaque ciphertext to
the enclave and never sees the body — so attachments are simply not processed.
Worse, in the enclave `llm_backend.convert_messages` flattens multimodal
content parts to text only (`"".join(part.get("text", "") ...)`), so any
`image_url` part is **silently dropped** before it reaches the provider.

Net result: **attachments and privacy are currently mutually exclusive.**
Attachments only work on the route where `chat-api` reads the plaintext, and the
private route drops them.

## Goal

Send attachments to the model **natively**, on the **private (OHTTP) path**:

- No server-side text extraction. The file bytes reach the model as a native
image/document content part.
- `chat-api` and Cloudflare never see attachment plaintext (same trust boundary
as the message text already enjoys on OHTTP).
- The enclave converts the inner request's multimodal content into each
provider's native format via LangChain.

## Trust boundary (what this does and does not hide)

- **Hidden from:** the browser→relay transport, `chat-api`, the OHTTP relay,
Cloudflare/R2. They see only HPKE ciphertext.
- **Visible to:** the enclave (it decrypts — that's the trust anchor) and the
**upstream LLM provider** (OpenAI/Anthropic/Google/xAI/ByteDance), which
receives the attachment as part of the completion request. This is identical
to how message *text* is already handled: whatever you send the model, the
model provider sees. Fully provider-blind attachments would require the model
to run inside the TEE and are out of scope here.

## Transport: how the attachment reaches the enclave

### Phase 1 — inline base64 (recommended starting point)

The browser embeds the file directly in the message content as a standard
OpenAI-style content part, inside the HPKE-encrypted OHTTP payload:

```jsonc
{
  "model": "claude-sonnet-4-6",
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Summarize this contract." },
        { "type": "image_url",
          "image_url": { "url": "data:image/png;base64,iVBORw0K..." } },
        { "type": "file",
          "file": { "filename": "contract.pdf",
                    "file_data": "data:application/pdf;base64,JVBERi0..." } }
      ]
    }
  ]
}
```

- Pros: nothing outside the enclave/provider ever sees the bytes; no R2 round
trip; no presigned-URL machinery; no SSRF surface.
- Cons: base64 inflates ~33%; bounded by request/OHTTP size limits; no
persistence (re-sent each turn). Fine for the common case (a few MB of PDF or
an image). Enforce a hard per-request attachment-bytes cap in the enclave.
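
The ~33% figure follows from base64 mapping every 3 input bytes to 4 output characters. A small sketch of that arithmetic; the function name and the 5 MiB example size are illustrative assumptions, not values from this design:

```python
import base64
import math


def base64_encoded_len(raw_len: int) -> int:
    """Length of padded base64 output for raw_len input bytes (4 chars per 3 bytes)."""
    return 4 * math.ceil(raw_len / 3)


# Illustrative size only; the real per-request cap is still an open question.
raw = bytes(5 * 1024 * 1024)
encoded = base64.b64encode(raw)

# The closed-form length matches what the encoder actually produces.
assert len(encoded) == base64_encoded_len(len(raw))

# Fractional growth of the payload, roughly one third.
overhead = len(encoded) / len(raw) - 1
```

When choosing the enclave cap, this means a limit expressed in decoded bytes corresponds to a request body about a third larger on the wire.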

### Phase 2 — encrypted blob in R2 (only if large files / persistence needed)

Browser client-side-encrypts the file (AES-GCM), uploads **ciphertext** to R2
(Cloudflare sees only ciphertext), and includes inside the OHTTP payload an R2
reference plus the AES key **wrapped to the TEE attestation/HPKE public key**.
The enclave fetches the ciphertext and decrypts internally. Defer until Phase 1
limits become a real constraint.

> Note: do **not** go back to plaintext-in-R2 + presigned URLs. That reintroduces
> the public-bearer-token leak and the SSRF surface in `attachments.py`.

## Enclave changes (`tee-gateway`) — the core of the work

### 1. `convert_messages` must preserve multimodal content

`llm_backend.py:248-255` currently does:

```python
elif role == "user":
    if isinstance(content, list):
        content = "".join(
            part.get("text", "") if isinstance(part, dict) else str(part)
            for part in content
        )
    langchain_messages.append(HumanMessage(content=content))
```

Replace the flattening with a converter that maps the inbound OpenAI-style
content parts to **LangChain v1 standard content blocks**
(`langchain_core.messages.content`: `ImageContentBlock`, `FileContentBlock`). Building the
*standard* blocks (rather than raw OpenAI `image_url`/`file` dicts) is important:
each provider package translates them into its own native API, so one code path
covers Anthropic, OpenAI, Gemini, and xAI uniformly.

- `text` → `{"type": "text", "text": ...}`
- image (base64 data URI or https) →
`{"type": "image", "base64": ..., "mime_type": "image/png"}` (or `"url": ...`)
- document/PDF (base64) →
`{"type": "file", "base64": ..., "mime_type": "application/pdf",
"filename": "<original name>"}`

Keep a `HumanMessage` with a **list** content when parts are present; only
collapse to a plain string when the message is text-only (preserves current
behavior for the no-attachment case).
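
A minimal sketch of the mapping above, operating on plain dicts. The helper names `split_data_uri`, `to_standard_block`, and `convert_user_content` are hypothetical (the diff imports a `_convert_content_part` helper whose body is not shown here), and raising on unknown part types is an assumption of this sketch rather than the PR's confirmed behavior:

```python
from typing import Any, Optional, Union


def split_data_uri(uri: str) -> Optional[tuple[str, str]]:
    """Return (mime_type, base64_payload) for a base64 data URI, else None."""
    if not uri.startswith("data:") or ";base64," not in uri:
        return None
    mime_type, payload = uri[len("data:"):].split(";base64,", 1)
    return mime_type, payload


def to_standard_block(part: Any) -> dict[str, Any]:
    """Map one OpenAI-style content part to a standard content block dict."""
    if isinstance(part, str):
        return {"type": "text", "text": part}
    kind = part.get("type")
    if kind == "text":
        return {"type": "text", "text": part.get("text", "")}
    if kind == "image_url":
        url = part.get("image_url", {}).get("url", "")
        parsed = split_data_uri(url)
        if parsed is None:
            # Not a data URI: keep it as a URL reference.
            return {"type": "image", "url": url}
        mime_type, payload = parsed
        return {"type": "image", "base64": payload, "mime_type": mime_type}
    if kind == "file":
        file = part.get("file", {})
        parsed = split_data_uri(file.get("file_data", ""))
        if parsed is None:
            raise ValueError("file part must carry a base64 data URI")
        mime_type, payload = parsed
        block = {"type": "file", "base64": payload, "mime_type": mime_type}
        if file.get("filename"):
            block["filename"] = file["filename"]  # carry the original name
        return block
    # Assumption: fail loudly instead of silently dropping the part.
    raise ValueError(f"unsupported content part type: {kind!r}")


def convert_user_content(content: Union[str, list]) -> Union[str, list]:
    """Keep list content when attachments exist; collapse text-only to a string."""
    if isinstance(content, str):
        return content
    blocks = [to_standard_block(part) for part in content]
    if all(block["type"] == "text" for block in blocks):
        return "".join(block["text"] for block in blocks)
    return blocks
```

The result of `convert_user_content` would then be passed as `HumanMessage(content=...)`, so text-only requests keep producing a plain string as they do today.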

**Verified** against the pinned versions (see "Dependency check" below): a
`HumanMessage` carrying these standard blocks converts correctly outbound —
Anthropic emits `{"type":"document","source":{"type":"base64","media_type":
"application/pdf",...}}`, OpenAI emits `{"type":"file","file":{"file_data":
"data:application/pdf;base64,...","filename":...}}`. **Carry the original
`filename`** on file blocks — OpenAI requires one and otherwise substitutes a
placeholder (`LC_AUTOGENERATED`).

### 2. No new dependencies (PCR constraint) — confirmed

Native handoff means the enclave does **not** parse PDFs/DOCX itself — it passes
the bytes to the provider. So we should **not** add PyMuPDF/python-docx to
`tee-gateway`.

**Dependency check (done).** The currently pinned versions already support
standard image *and* file (PDF) content blocks with base64, across every
provider we route to — so **this change needs no dependency bump and the PCR
measurements stay stable**:

| Package | Pinned | Native file/image support |
|---|---|---|
| `langchain-core` | 1.2.26 | Defines `ImageContentBlock` / `FileContentBlock` (base64, url, file_id, mime_type) |
| `langchain-anthropic` | 1.4.0 | `file` → `document` (defaults `application/pdf`); image → base64 source |
| `langchain-openai` | 1.1.12 | `file` → `file_data` data-URI / `input_file`; image → `image_url` |
| `langchain-google-genai` | 4.2.1 | document/image blocks supported |
| `langchain-xai` | 1.2.2 | subclass of `BaseChatOpenAI` → inherits OpenAI handling |

This was verified functionally (not just by reading types) by running the
Anthropic and OpenAI outbound message converters over a multimodal
`HumanMessage`. Per-model *acceptance* of PDFs still depends on the model itself
(see capability gating below).

### 3. Per-provider capability gating

Not every model accepts every modality. Extend `model_registry` with capability
flags (e.g. `supports_image`, `supports_pdf`) and reject with a clear 4xx in the
inner response when a request sends a modality the target model can't handle,
rather than silently dropping it as today.
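
One way the gating could look, as a sketch. The `ModelCapabilities` dataclass, the `CAPABILITIES` table, the model names in it, and `check_capabilities` are all hypothetical; `AttachmentValidationError` here is a stand-in with a `status` attribute like the one the controller diff reads, not the real class:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelCapabilities:
    supports_image: bool = False
    supports_pdf: bool = False


# Hypothetical entries; real flags would live in model_registry.
CAPABILITIES = {
    "example-vision-model": ModelCapabilities(supports_image=True, supports_pdf=True),
    "example-text-model": ModelCapabilities(),
}


class AttachmentValidationError(Exception):
    """Stand-in error type carrying the HTTP status to return."""

    def __init__(self, message: str, status: int = 400):
        super().__init__(message)
        self.status = status


def check_capabilities(blocks: list[dict], model: str) -> None:
    """Raise if any block uses a modality the target model does not accept."""
    # Unknown models get no capabilities, so attachments are rejected by default.
    caps = CAPABILITIES.get(model, ModelCapabilities())
    for block in blocks:
        if block["type"] == "image" and not caps.supports_image:
            raise AttachmentValidationError(f"{model} does not accept images")
        if block["type"] == "file":
            is_pdf = block.get("mime_type") == "application/pdf"
            if not (is_pdf and caps.supports_pdf):
                raise AttachmentValidationError(
                    f"{model} does not accept {block.get('mime_type')} files"
                )
```

Defaulting unknown models to "no attachments" is a design choice of the sketch: a missing registry entry fails closed instead of forwarding bytes the provider may reject.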

### 4. Request signing / hashing

`chat_controller.py` (~645-651) hashes user content via `str(msg.content)`. With
multimodal content that would hash megabytes of base64 and is not canonical.
Define a stable hashing rule, e.g. hash each attachment as
`sha256(mime_type || raw_bytes)` and include those digests (not the base64) in
the canonical request JSON that feeds `keccak256(requestHash ...)`. This keeps
signatures meaningful and bounded while still committing to the exact attachment
content.
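
The hashing rule above can be sketched as follows. The function names are hypothetical, the concatenation follows the `sha256(mime_type || raw_bytes)` rule literally, and the sorted-key JSON serialization is an assumption about what "canonical" means here:

```python
import base64
import hashlib
import json
from typing import Any


def attachment_digest(mime_type: str, b64_payload: str) -> str:
    """sha256(mime_type || raw_bytes), hex-encoded, over the decoded bytes."""
    raw = base64.b64decode(b64_payload, validate=True)
    return hashlib.sha256(mime_type.encode("utf-8") + raw).hexdigest()


def canonical_part(block: dict[str, Any]) -> dict[str, Any]:
    """Replace inline bytes with their digest; pass text through unchanged."""
    if block["type"] == "text":
        return {"type": "text", "text": block["text"]}
    entry = {"type": block["type"], "mime_type": block["mime_type"]}
    entry["sha256"] = attachment_digest(block["mime_type"], block["base64"])
    if block.get("filename"):
        entry["filename"] = block["filename"]
    return entry


def canonical_json(obj: Any) -> str:
    """Deterministic serialization: sorted keys, no insignificant whitespace."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))


blocks = [
    {"type": "text", "text": "Summarize this contract."},
    {
        "type": "file",
        "mime_type": "application/pdf",
        "base64": base64.b64encode(b"%PDF-1.7 example").decode("ascii"),
        "filename": "contract.pdf",
    },
]
# The string that would feed the request hash: digests only, no base64.
payload = canonical_json([canonical_part(b) for b in blocks])
```

The `keccak256` step is left out of the sketch; note that `hashlib.sha3_256` is standardized SHA-3 and is not the same function as the original Keccak-256.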

### 5. Limits & validation

- Hard cap on total attachment bytes per request (post-decode).
- Allowlist of accepted mime types per modality.
- Reject `image_url` values that are remote `https` URLs on the private path if
we want to guarantee the enclave makes no outbound fetch for user content
(Phase 1 = base64 only). Decide explicitly.
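
A sketch of these three checks applied to already-converted content blocks. The cap value, the allowlist contents, and the `enforce_limits` name are placeholders (the real cap is an open question below), and rejecting every `url` block is just one of the two options the last bullet leaves open:

```python
import base64
import binascii

# Placeholder values, not decided by this design.
MAX_ATTACHMENT_BYTES = 10 * 1024 * 1024
ALLOWED_MIME_TYPES = {
    "image": {"image/png", "image/jpeg"},
    "file": {"application/pdf"},
}


def decoded_size(b64_payload: str) -> int:
    """Size of the attachment after base64 decoding (the post-decode byte count)."""
    try:
        return len(base64.b64decode(b64_payload, validate=True))
    except binascii.Error as exc:
        raise ValueError("attachment is not valid base64") from exc


def enforce_limits(blocks: list[dict], cap: int = MAX_ATTACHMENT_BYTES) -> int:
    """Validate attachment blocks and return the total decoded byte count."""
    total = 0
    for block in blocks:
        if block["type"] == "text":
            continue
        if "url" in block:
            # Option chosen for this sketch: no outbound fetch for user content.
            raise ValueError("remote URLs are not accepted on the private path")
        allowed = ALLOWED_MIME_TYPES.get(block["type"], set())
        if block.get("mime_type") not in allowed:
            raise ValueError(f"mime type {block.get('mime_type')!r} is not allowed")
        total += decoded_size(block["base64"])
        if total > cap:
            raise ValueError("attachments exceed the per-request byte cap")
    return total
```

Checking the running total inside the loop lets the enclave stop at the first block that pushes the request over the cap instead of decoding every attachment first.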

## `chat-api` changes

- OHTTP path: **no change needed** to the relay itself — attachments ride inside
the encrypted payload it already forwards opaquely.
- Regular `POST /api/v1/chat` path: stop calling `load_documents` /
`is_image_url` and stop injecting extracted text. Either (a) build native
content parts here too, or (b) deprecate attachment support on the non-private
path and route all attachments through OHTTP. Recommend (b) for a single code
path.
- The presigned-URL / `attachments: string[]` machinery and `attachments.py`
become dead code for inference and can be removed once Phase 1 ships (R2 may
still be used for chat-history storage — that is a separate concern and should
be client-side-encrypted if kept).

## `chat-app` changes

- Replace "upload to R2 → store presigned URL → send URL in `attachments`" with:
read the file in the browser, base64-encode, and add a native `image_url` /
`file` content part to the outgoing (to-be-encrypted) message.
- Enforce client-side size/type limits matching the enclave caps; surface a clear
error when a file exceeds them.
- Drop the presigned-upload/download hooks from the send path.

## Rollout

1. Enclave: `convert_messages` multimodal support + capability flags + hashing +
limits (behind the existing OHTTP path). Ship and verify PCRs.
2. `chat-app`: send native base64 content parts on the OHTTP path.
3. Remove server-side parsing from `chat-api`; retire `attachments.py` and the
presigned-URL attachment flow.
4. (Optional, later) Phase 2 encrypted-R2-blob for large files.

## Open questions

- ~~Pinned `langchain-*` versions: do they already support `file` (PDF) content
blocks?~~ **Resolved:** yes, for all five pinned packages in the table; no dep
bump / PCR change needed (see Dependency check above).
- Hard size cap value for inline attachments, and the OHTTP request size ceiling.
- Keep or drop attachment support entirely on the non-private path?
- Source of truth for per-model `supports_image` / `supports_pdf` flags — note
`langchain-*` ships `ModelProfile` data (e.g. `langchain_xai/data/_profiles`)
that may already encode some of this.
49 changes: 46 additions & 3 deletions tee_gateway/controllers/chat_controller.py
@@ -1,3 +1,4 @@
+import hashlib
 import json
 import time
 import uuid
@@ -29,6 +30,9 @@
     get_chat_model_cached,
     convert_messages,
     extract_usage,
+    validate_attachments,
+    AttachmentValidationError,
+    _convert_content_part,
 )
 from tee_gateway.pricing import compute_session_cost

@@ -47,6 +51,13 @@ def create_chat_completion(body):
         connexion.request.get_json()
     )

+    # Reject attachments the target model can't handle, and enforce the size cap,
+    # before doing any provider work.
+    try:
+        validate_attachments(chat_request.messages, chat_request.model)
+    except AttachmentValidationError as e:
+        return {"error": "Invalid attachment", "message": str(e)}, e.status
+
     if chat_request.stream:
         return _create_streaming_response(chat_request)
     else:
@@ -636,6 +647,40 @@ def generate():
 # ---------------------------------------------------------------------------


+def _canonical_user_content(content) -> Any:
+    """Canonicalize user-message content for request hashing.
+
+    Plain-string content is returned unchanged. For multimodal content (a list of
+    parts), inline attachment bytes are replaced with a ``sha256`` digest so the
+    signed request commits to the exact attachment content without bloating the
+    hashed payload with megabytes of base64. URL / file_id references are kept
+    verbatim.
+    """
+    if isinstance(content, str):
+        return content
+    if not isinstance(content, list):
+        return str(content)
+
+    canonical = []
+    for part in content:
+        block = _convert_content_part(part)
+        if block is None:
+            continue
+        if block["type"] == "text":
+            canonical.append({"type": "text", "text": block.get("text", "")})
+            continue
+        entry = {"type": block["type"]}
+        if "base64" in block:
+            entry["sha256"] = hashlib.sha256(
+                block["base64"].encode("utf-8")
+            ).hexdigest()
+        for key in ("mime_type", "filename", "url", "file_id"):
+            if block.get(key):
+                entry[key] = block[key]
+        canonical.append(entry)
+    return canonical
+
+
 def _chat_request_to_dict(chat_request: CreateChatCompletionRequest) -> dict:
     """Serialize a CreateChatCompletionRequest to a canonical dict for hashing."""
     messages = []
@@ -646,9 +691,7 @@ def _chat_request_to_dict(chat_request: CreateChatCompletionRequest) -> dict:
             messages.append(
                 {
                     "role": "user",
-                    "content": msg.content
-                    if isinstance(msg.content, str)
-                    else str(msg.content),
+                    "content": _canonical_user_content(msg.content),
                 }
             )
         elif isinstance(msg, ChatCompletionRequestAssistantMessage):