Expose a tokenizer API (/tokenize + /decode), aligned with TEI/Cohere

## Feature request

Expose a tokenizer API on the inference server: `POST /tokenize` and `POST /decode`, aligned with the shape used by Text Embeddings Inference (TEI) — and equivalently by Cohere (`/tokenize` + `/detokenize`) and vLLM.

- `POST /tokenize` — `{model, inputs, add_special_tokens?}` → per input, a list of `{id, text, special, start, stop}` (with `start`/`stop` as character offsets) plus a token count.
- `POST /decode` — `{model, ids, skip_special_tokens?}` → text.
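To make the proposed shapes concrete, here is a minimal sketch of the payloads using a toy whitespace tokenizer as a stand-in for the served model's tokenizer. The field name `count`, the helpers `toy_tokenize` / `toy_decode`, and the convention that `stop` is exclusive are assumptions for illustration, not part of TEI's or Infinity's actual API.

```python
import re
from typing import Dict, List, TypedDict


class Token(TypedDict):
    id: int
    text: str
    special: bool
    start: int  # character offset where the token begins
    stop: int   # character offset where it ends (assumed exclusive)


class TokenizeResult(TypedDict):
    tokens: List[Token]
    count: int  # hypothetical name for the per-input token count


def toy_tokenize(text: str, vocab: Dict[str, int]) -> TokenizeResult:
    """Stand-in tokenizer: one token per whitespace-separated piece."""
    tokens: List[Token] = []
    for match in re.finditer(r"\S+", text):
        piece = match.group()
        # Assign the next free id to pieces not seen before.
        token_id = vocab.setdefault(piece, len(vocab))
        tokens.append(
            {
                "id": token_id,
                "text": piece,
                "special": False,
                "start": match.start(),
                "stop": match.end(),
            }
        )
    return {"tokens": tokens, "count": len(tokens)}


def toy_decode(ids: List[int], vocab: Dict[str, int]) -> str:
    """Stand-in for /decode: map ids back to pieces and join with spaces."""
    by_id = {token_id: piece for piece, token_id in vocab.items()}
    return " ".join(by_id[token_id] for token_id in ids)
```

The property clients would rely on is that `text[start:stop]` recovers each token's surface form, which is what makes offset-based truncation and chunking possible.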

These primitives cover three client scenarios. In priority order for me:

1. **Truncate** an input to *N* tokens before sending it.
2. **Chunk** one input into *K* segments of at most *N* tokens each.
3. **Tokenize** an input to get token ids and character offsets.

(1) and (2) are my main interest; (3) is the primitive they build on. No popular tokenizer API standardises chunking, so I'd suggest one of two options: ship the primitives only and let clients chunk from the offsets, or add a thin, clearly Infinity-specific convenience layer (e.g. `max_tokens` to head-truncate, `chunk_size`/`chunk_overlap` to split) that reuses the truncation logic already present for reranking. Happy to follow your preference here.
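As a rough illustration of the "primitives only" option, this is how a client could truncate and chunk from the offsets alone. The function names are hypothetical, the parameter names mirror the `max_tokens` / `chunk_size` / `chunk_overlap` ideas above, and the sketch assumes special tokens are excluded (`add_special_tokens` off) so that every token has a `(start, stop)` span with an exclusive `stop`.

```python
from typing import List, Sequence, Tuple

# (start, stop) character offsets of one token; stop is assumed exclusive.
Span = Tuple[int, int]


def truncate_to_tokens(text: str, spans: Sequence[Span], max_tokens: int) -> str:
    """Head-truncate: keep the text up to the end of the max_tokens-th token."""
    if max_tokens <= 0 or not spans:
        return ""
    if len(spans) <= max_tokens:
        return text  # already within budget
    return text[: spans[max_tokens - 1][1]]


def chunk_by_tokens(
    text: str, spans: Sequence[Span], chunk_size: int, chunk_overlap: int = 0
) -> List[str]:
    """Split text into segments of at most chunk_size tokens each."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for first in range(0, len(spans), step):
        window = spans[first : first + chunk_size]
        # Slice from the first token's start to the last token's stop.
        chunks.append(text[window[0][0] : window[-1][1]])
        if first + chunk_size >= len(spans):
            break  # this window already reached the last token
    return chunks
```

Slicing the original string keeps whitespace and casing intact, which a decode round trip may not. A server-side convenience layer would do the same work without the extra round trip.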

## Motivation

Clients that send long inputs have to guess how the served model's tokenizer treats them. Without a server-side tokenizer endpoint, every client has to ship and pin the matching `tokenizers`/`transformers` version per model, keep it in sync with the model Infinity actually serves, and reimplement chunking. That logic differs per model (BERT vs. XLM-R vs. a custom merge), so it drifts easily and is a common source of "works locally, breaks in prod" mismatches.

Infinity already holds the right tokenizer in process (`_infinity_tokenizer` in the SentenceTransformer and CrossEncoder backends) and already truncates by tokens for reranking (`truncate_texts_to_tokens`, plus the `max_query_tokens` / `max_tokens_per_doc` / `max_pair_tokens` budgets on `/rerank`). Exposing that capability lets clients reuse the exact tokenizer of the served model instead of duplicating it.

The intent is to match an existing popular API rather than invent a custom one. TEI is the closest precedent, since Infinity already positions itself as a TEI/OpenAI alternative; `/tokenize` + `/decode` carries the character offsets needed for chunking and lets existing TEI clients work unchanged. Cohere's `/tokenize` + `/detokenize` naming is an alternative that matches Infinity's existing Cohere-aligned `/rerank`.

Two open questions for you:

1. TEI-aligned `/tokenize` + `/decode`, or Cohere-aligned `/tokenize` + `/detokenize` (matching the existing `/rerank`)?
2. Primitives only, or include the truncate/chunk convenience layer?

## Your contribution

Yes — I'm happy to open a PR once the endpoint shape and the two questions above are agreed.

References: [TEI API](https://huggingface.github.io/text-embeddings-inference/), [Cohere /tokenize](https://docs.cohere.com/reference/tokenize), [vLLM tokenizer endpoints](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html).

